Genome Annotation (KO Assignment)
Genome annotation in KEGG assigns KO (KEGG Orthology) identifiers, or K numbers, to both protein-coding and RNA genes in the genome. KO entries in the KO database represent functional orthologs, which are defined in a context-dependent manner as nodes of KEGG molecular networks: pathway maps in the PATHWAY database, Brite hierarchies and tables in the BRITE database, and KEGG modules in the MODULE database.
The GENES database consists of over 10,000 sequenced genomes of cellular organisms (called KEGG organisms), as well as viruses. It is supplemented by a publication-based (rather than genome-based) collection of functionally characterized proteins. Proteins and RNAs in the GENES database are internally annotated with K numbers. The current statistics by taxonomic group are as follows.

In general, each KO represents one or more sequence similarity groups. A sequence similarity search of a query genome against the entire GENES database, or a selected subset of it, can therefore be used to assign the most appropriate K numbers (see KO assignment tools). The assigned set of K numbers can then be used to reconstruct KEGG pathway maps, BRITE hierarchies and KEGG modules, enabling interpretation of high-level functions.
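The reconstruction step can be pictured as grouping the assigned K numbers by the pathways they belong to. The following is a minimal sketch under stated assumptions: the mapping `KO_TO_PATHWAYS`, its placeholder identifiers, and the function name `group_by_pathway` are all hypothetical and do not come from KEGG data or any KEGG tool.

```python
from collections import defaultdict

# Hypothetical toy mapping from K numbers to pathway identifiers.
# The identifiers are placeholders for illustration, not real KEGG entries.
KO_TO_PATHWAYS = {
    "K_A": ["pathway_1"],
    "K_B": ["pathway_1", "pathway_2"],
    "K_C": ["pathway_2"],
}


def group_by_pathway(assigned_kos, ko_to_pathways):
    """Group a genome's assigned K numbers by the pathways that contain them."""
    pathways = defaultdict(set)
    for ko in assigned_kos:
        # K numbers without a known pathway are skipped.
        for pathway in ko_to_pathways.get(ko, []):
            pathways[pathway].add(ko)
    # Sort for a stable, readable result.
    return {pathway: sorted(kos) for pathway, kos in pathways.items()}


if __name__ == "__main__":
    print(group_by_pathway({"K_A", "K_B"}, KO_TO_PATHWAYS))
```

A pathway that collects many of its member K numbers is likely present in the query genome, which is the basis for the higher-level interpretation described above.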
KO Assignment Methods
The internal procedure for assigning KOs to both proteins and RNAs in the GENES database is based on the SSEARCH results of pairwise genome comparisons stored in the SSDB database. For each gene in a genome, a GFIT table is generated from SSDB; it lists the top-hit genes in the other genomes. The KOALA (KEGG Orthology And Links Annotation) program processes GFIT tables to assign KOs automatically using a simple algorithm, internally called the newkoala algorithm to distinguish it from the previous version. As its measure of similarity between two sequences, the algorithm uses the modified identity score shown below:
identity * min(1, overlap*2/(aalen1+aalen2))
where "identity" is the identity score given by SSEARCH, "overlap" is the alignment length, and aalen1 and aalen2 are the lengths of the two sequences being compared.
The penalty for a difference in sequence length reflects the nature of KO grouping, which is based on entire sequences rather than on domains.
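The score can be written as a small function to show how the length penalty behaves. This is a sketch of the formula only, not KEGG's implementation; the function name `modified_identity` and the input validation are assumptions added for illustration.

```python
def modified_identity(identity, overlap, aalen1, aalen2):
    """Return identity * min(1, overlap*2/(aalen1+aalen2)).

    identity : identity score reported by the search program
    overlap  : alignment length
    aalen1, aalen2 : lengths of the two sequences being compared
    """
    if aalen1 <= 0 or aalen2 <= 0:
        raise ValueError("sequence lengths must be positive")
    # Fraction of the average sequence length covered by the alignment,
    # capped at 1 so that full-length alignments are not rewarded further.
    coverage = min(1.0, 2.0 * overlap / (aalen1 + aalen2))
    return identity * coverage


if __name__ == "__main__":
    # Equal lengths, full-length alignment: no penalty.
    print(modified_identity(0.8, 100, 100, 100))
    # One sequence three times longer: the score is halved.
    print(modified_identity(0.8, 100, 100, 300))
```

With two 100-residue sequences aligned over 100 positions, the factor is min(1, 200/200) = 1 and the identity is unchanged. If the second sequence has 300 residues, the factor drops to 200/400 = 0.5, so a short domain-level match to a much longer protein scores lower.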
The KOALA program runs every day to automatically assign and update KOs. In addition, two tools support manual annotation based on GFIT tables: the KoAnn (KO Annotation) tool, for examining the entire set of annotated genes for a given KO, and the Check GN tool, for checking the consistency of the entire GENES annotation. The consistency check runs every night and presents additional candidates and possible misannotations for human intervention.
Because the newkoala algorithm relies on the identity score, it can readily be applied not only to SSEARCH but also to other search programs. BlastKOALA is a web server for automatic KO assignment that now applies the newkoala algorithm to BLAST searches against a limited set of GENES data. BlastKOALA is also used internally for the initial annotation of new genomes, before SSDB computation and GFIT table generation are completed.
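To illustrate how the same score could be computed from BLAST output, the sketch below reads BLASTP tabular hits and keeps the best-scoring annotated hit per query. Several things here are assumptions rather than documented behavior: the input is taken to be produced with `-outfmt "6 qseqid sseqid pident length qlen slen"`, the percent identity is rescaled to a fraction, the `gene_to_ko` lookup is hypothetical, and the best-hit rule is a simplification, since the text does not describe how newkoala turns scores into a KO call.

```python
def modified_identity(identity, overlap, aalen1, aalen2):
    """identity * min(1, overlap*2/(aalen1+aalen2)), as defined above."""
    return identity * min(1.0, 2.0 * overlap / (aalen1 + aalen2))


def best_ko_per_query(blast_lines, gene_to_ko):
    """Pick, for each query, the KO of its highest-scoring annotated hit.

    blast_lines : tab-separated lines with the columns
                  qseqid, sseqid, pident, length, qlen, slen (assumed layout)
    gene_to_ko  : hypothetical dict mapping subject gene IDs to K numbers
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        qseqid, sseqid, pident, length, qlen, slen = line.split("\t")[:6]
        ko = gene_to_ko.get(sseqid)
        if ko is None:
            continue  # hits to unannotated genes carry no KO
        # BLAST reports percent identity; rescale to a 0..1 fraction (assumption).
        score = modified_identity(
            float(pident) / 100.0, int(length), int(qlen), int(slen)
        )
        if qseqid not in best or score > best[qseqid][1]:
            best[qseqid] = (ko, score)
    return best


if __name__ == "__main__":
    hits = [
        "q1\tg1\t90.0\t100\t100\t100",
        "q1\tg2\t95.0\t50\t100\t100",
    ]
    print(best_ko_per_query(hits, {"g1": "K_A", "g2": "K_B"}))
```

In the example, the second hit has the higher raw identity but covers only half of both sequences, so the full-length hit to `g1` wins once the length penalty is applied.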
| | KOALA | BlastKOALA |
| --- | --- | --- |
| Usage | Internal GENES annotation | Tentative annotation of new genomes; outside service of genome annotation |
| Search program | SSEARCH | BLASTP |
| Scoring | newkoala algorithm (using SSEARCH identity scores) | newkoala algorithm (using BLASTP identity scores) |
| Database | Entire GENES database sequences | KEGG reference genomes and functionally characterized sequences linked from KO references |
Reference
- Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457-D462 (2016).
- Kanehisa, M., Sato, Y., and Morishima, K.; BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726-731 (2016).
Last updated: December 1, 2025
