GENES Database

KEGG GENES is a collection of genes and proteins in complete genomes of cellular organisms and viruses. It is generated from publicly available resources, mostly NCBI RefSeq and GenBank, and annotated by KEGG in the form of KO (KEGG Orthology) assignments. The collection is supplemented with a KEGG original collection of functionally characterized proteins from published literature. Protein and RNA sequences of all GENES entries are subject to SSDB computation and KO assignment by KOALA tools (see annotation statistics).

Category                           Content                                   Data source        Organism code  Gene identifier      Genome identifier
KEGG organisms (Complete genomes)  Genes and proteins in cellular organisms  RefSeq or GenBank  <org>          GeneID or Locus_tag  T number
KEGG Viruses                       Genes and proteins in viruses             RefSeq             vg             GeneID               Taxonomy ID
                                   Mature peptides in viruses                                   vp             GeneID-no
Addendum                           Functionally characterized proteins       Publication        ag             ProteinID, etc.      N/A

<org>: three- or four-letter organism code for cellular organisms

The Addendum category is a PubMed-based collection of protein sequences whose functions have been experimentally characterized. These sequences are used to define new KOs that are not covered by complete genomes (see KO database).

The viral peptide (vp) category is a collection of mature peptides processed from genome-encoded polyproteins, which are not usually found as separate entries in public databases such as NCBI and UniProt. Viral mature peptides appear in KEGG pathway maps and as drug targets, and are given KOs.

Each GENES entry is identified by the combination of organism code and gene identifier in the form of
  • org:gene
such as hsa:351 for human amyloid beta gene.
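
The org:gene form can be split and checked programmatically. The following is a minimal sketch, not a KEGG-provided tool: the names `GenesId` and `parse_genes_id` are hypothetical, and the rule that a cellular organism code consists of three or four lowercase letters is an assumption based on the description of <org> and on the hsa example.

```python
import re
from typing import NamedTuple


class GenesId(NamedTuple):
    org: str        # organism code, or vg / vp / ag
    gene: str       # gene identifier part after the colon
    category: str   # GENES category inferred from the organism code


# Fixed codes for the non-cellular categories of GENES.
_SPECIAL_CODES = {
    "vg": "KEGG Viruses (genes and proteins)",
    "vp": "KEGG Viruses (mature peptides)",
    "ag": "Addendum",
}


def parse_genes_id(entry: str) -> GenesId:
    """Split an 'org:gene' string and infer its GENES category."""
    org, sep, gene = entry.strip().partition(":")
    if not sep or not org or not gene:
        raise ValueError(f"not in org:gene form: {entry!r}")
    if org in _SPECIAL_CODES:
        category = _SPECIAL_CODES[org]
    elif re.fullmatch(r"[a-z]{3,4}", org):
        # Assumption: cellular organism codes are 3-4 lowercase letters.
        category = "KEGG organisms"
    else:
        category = "unknown"
    return GenesId(org, gene, category)


if __name__ == "__main__":
    print(parse_genes_id("hsa:351"))
```

Only the first colon is treated as the separator, so any colon inside the gene identifier part is preserved.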


SSDB Database

KEGG SSDB (Sequence Similarity DataBase) is a computationally generated database of sequence similarity scores for all protein pairs, and likewise for all RNA pairs, in the GENES database. It also records best hits and bidirectional best hits (best-best hits) in pairwise genome comparisons. The computation is performed using the SSEARCH program.
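
To illustrate what best hits and bidirectional best hits mean, here is a small sketch that derives both from a table of pairwise scores between two genomes. It is not the SSDB implementation: the function names, the gene labels, and the scores are all made up, and a single score per gene pair is assumed (ties are resolved by whichever pair is seen first).

```python
from typing import Dict, List, Tuple

# (gene in genome A, gene in genome B) -> similarity score
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """For each query gene, return the target gene with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(scores)
    # Swap the key order to search from genome B back into genome A.
    reverse = best_hits({(t, q): s for (q, t), s in scores.items()})
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores for two genes in each genome.
    toy = {
        ("a1", "b1"): 900.0,
        ("a1", "b2"): 300.0,
        ("a2", "b1"): 850.0,
        ("a2", "b2"): 120.0,
    }
    print(best_hits(toy))
    print(bidirectional_best_hits(toy))
```

In this toy data both a1 and a2 have b1 as their best hit, but b1's best hit is a1, so only the pair (a1, b1) is a bidirectional best hit.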

For the purpose of gene annotation, SSDB is organized as a collection of "GFIT tables" that are useful for identifying possible orthologs and paralogs.


Gene Annotation Tools

The annotation of KEGG GENES involves assignment of KO identifiers (K numbers). Internally, this is done using the KOALA/KoAnn and GFIT annotation tools (see: KO Database). For outside users, the following automatic annotation servers are made available.
  • BlastKOALA: automatic KO assignment by BLASTP sequence similarity search
  • GhostKOALA: automatic KO assignment by GHOSTX sequence similarity search
  • KofamKOALA: automatic KO assignment by HMM profile search

For a single sequence, the following standard programs may be used to find matching sequences with assigned KOs.
  • BLAST: sequence similarity search by BLAST
  • FASTA: sequence similarity search by FASTA

Gene Identifier Conversion

KEGG GENES can be retrieved by giving identifiers of outside databases: NCBI-ProteinID (INSDC accession), NCBI-GeneID (Entrez Gene ID) and UniProt accession numbers.
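
As a sketch of how such a lookup might be scripted, the example below builds a request for the `conv` operation of the KEGG REST API and parses a tab-separated reply. The REST API is not described on this page, so the base URL, the database prefixes (`ncbi-geneid`, `ncbi-proteinid`, `uniprot`), and the two-column response format are assumptions to be checked against the KEGG API documentation. The function names and the sample reply are hypothetical.

```python
from typing import Dict, List
from urllib.parse import quote
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # assumed base URL of the KEGG REST API
OUTSIDE_DBS = {"ncbi-geneid", "ncbi-proteinid", "uniprot"}  # assumed prefixes


def conv_url(outside_db: str, accession: str) -> str:
    """Build a conv URL that maps one outside identifier to KEGG GENES."""
    if outside_db not in OUTSIDE_DBS:
        raise ValueError(f"unsupported outside database: {outside_db!r}")
    return f"{KEGG_REST}/conv/genes/{outside_db}:{quote(accession)}"


def parse_conv(text: str) -> Dict[str, List[str]]:
    """Parse 'outside_id<TAB>kegg_id' lines into a mapping."""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        outside_id, kegg_id = line.split("\t")
        mapping.setdefault(outside_id, []).append(kegg_id)
    return mapping


def convert(outside_db: str, accession: str) -> Dict[str, List[str]]:
    """Fetch and parse a conversion result (requires network access)."""
    with urlopen(conv_url(outside_db, accession), timeout=30) as resp:
        return parse_conv(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical reply text in the assumed format, parsed offline.
    sample = "ncbi-geneid:351\thsa:351\n"
    print(conv_url("ncbi-geneid", "351"))
    print(parse_conv(sample))
```

Keeping URL construction and parsing separate from the network call lets the parsing logic be checked without contacting the server.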


Last updated: May 21, 2021