KEGG Annotation

Genome Annotation (KO Assignment)

Genome annotation in KEGG means assigning KO (KEGG Orthology) identifiers, or K numbers, to both protein-coding and RNA genes in a genome. KO entries in the KO database represent functional orthologs, which are defined in a context-dependent manner as nodes of KEGG molecular networks: pathway maps in the PATHWAY database, Brite hierarchies and tables in the BRITE database, and KEGG modules in the MODULE database.

The GENES database consists of over 10,000 sequenced genomes of cellular organisms (called KEGG organisms), as well as viruses. It is supplemented by a publication-based (rather than genome-based) collection of functionally characterized proteins. Proteins and RNAs in the GENES database are internally annotated with K numbers. The current statistics by taxonomic group are as follows.

Current statistics of KO assignment

In general, each KO represents one or more sequence similarity groups. Thus, a sequence similarity search of a query genome against the entire GENES database, or a selected subset of it, can be used to assign the most appropriate K numbers (see KO assignment tools). The assigned set of K numbers can then be used to reconstruct KEGG pathway maps, BRITE hierarchies and KEGG modules, enabling interpretation of high-level functions.

KO Assignment Methods

The internal procedure for assigning KOs to both proteins and RNAs in the GENES database is based on the SSEARCH results of pairwise genome comparisons stored in the SSDB database. For each gene in a genome, a GFIT table, which lists the top-hit genes in the other genomes, is generated from SSDB. The KOALA (KEGG Orthology And Links Annotation) program processes GFIT tables to assign KOs automatically using a simple algorithm, internally called the newkoala algorithm to distinguish it from the previous version. As a measure of similarity between two sequences, it uses the modified identity score shown below:

identity * min(1, overlap*2/(aalen1+aalen2))

where "identity" is the identity score given by SSEARCH, "overlap" is the alignment length, and aalen1 and aalen2 are the lengths of the two sequences being compared. The penalty for sequence length difference reflects the nature of KO grouping, which is based on entire sequences rather than domains.
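The modified identity score can be written as a small function to make the length penalty concrete. This is a minimal sketch, not KEGG's own code: the function name modified_identity and the example values are assumptions made for illustration, and identity is assumed to be a fraction between 0 and 1.

```python
def modified_identity(identity: float, overlap: int, aalen1: int, aalen2: int) -> float:
    """Identity score scaled down when the alignment covers little of the two sequences.

    Hypothetical helper: the name and signature are illustrative, not part of KEGG.
    """
    if aalen1 <= 0 or aalen2 <= 0:
        raise ValueError("sequence lengths must be positive")
    # overlap*2/(aalen1+aalen2) is the alignment length divided by the mean
    # sequence length; it is capped at 1 so long alignments get no bonus.
    coverage = min(1.0, overlap * 2 / (aalen1 + aalen2))
    return identity * coverage


# Made-up example values:
# a full-length alignment of two 200-residue proteins is not penalized,
full = modified_identity(0.9, 200, 200, 200)
# while a 100-residue alignment between a 100- and a 300-residue protein
# covers only half of the mean length, so the score is halved.
partial = modified_identity(0.8, 100, 100, 300)
```

In the second case the identity over the aligned region is high, but the score is reduced because the alignment does not span both sequences, which is the behavior expected of a grouping based on entire sequences.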

The KOALA program is used to automatically assign and update KOs every day. In addition, there are two tools for manual annotation based on GFIT tables. One is the KoAnn (KO Annotation) tool for examining the entire set of annotated genes for a given KO, and the other is the Check GN tool for checking the consistency of the entire GENES annotation. The consistency check is performed every night, presenting additional candidates and possible misannotations for human intervention.

Since the newkoala algorithm uses the identity score, it can readily be applied not only to SSEARCH but also to other programs. BlastKOALA is a web server for automatic KO assignment that now uses the newkoala algorithm with BLAST searches against a limited set of GENES data. BlastKOALA is also used internally for initial annotation of new genomes, before SSDB computation and GFIT table generation are completed.
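Because the score depends only on identity, alignment length and the two sequence lengths, it can be computed from the hits of any search program that reports those values. The sketch below illustrates this with a simple best-scoring-hit rule. It is not the newkoala algorithm, whose details are not given here; the Hit record, the best_ko function, the min_score parameter and all hit values are assumptions made for illustration. Identity is assumed to be a fraction, so a percentage such as the one in BLASTP tabular output would first need dividing by 100.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """One database hit for a query sequence (hypothetical record layout)."""
    ko: Optional[str]  # K number of the database sequence, None if unannotated
    identity: float    # fractional identity (0 to 1) from the search program
    overlap: int       # alignment length
    qlen: int          # query sequence length
    slen: int          # database sequence length


def score(hit: Hit) -> float:
    # Modified identity score: identity * min(1, overlap*2/(aalen1+aalen2))
    return hit.identity * min(1.0, hit.overlap * 2 / (hit.qlen + hit.slen))


def best_ko(hits: Iterable[Hit], min_score: float = 0.0) -> Optional[Tuple[str, float]]:
    """Return (K number, score) of the best-scoring annotated hit, or None.

    Simplified illustrative rule, not the actual newkoala procedure.
    """
    best: Optional[Tuple[str, float]] = None
    for hit in hits:
        if hit.ko is None:
            continue  # unannotated hits cannot suggest a K number
        s = score(hit)
        if s >= min_score and (best is None or s > best[1]):
            best = (hit.ko, s)
    return best


# Made-up hits for a 350-residue query.
hits = [
    Hit("K00001", 0.95, 60, 350, 60),    # short, high-identity domain hit
    Hit("K00002", 0.70, 340, 350, 345),  # near full-length, lower identity
    Hit(None, 0.99, 350, 350, 350),      # unannotated sequence, ignored
]
assigned = best_ko(hits)
```

With these values the near full-length hit wins over the short high-identity hit, since the coverage factor reduces the latter's score to well below its raw identity.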

	KOALA	BlastKOALA
Usage	Internal GENES annotation	Tentative annotation of new genomes; outside service of genome annotation
Search program	SSEARCH	BLASTP
Scoring	newkoala algorithm (using SSEARCH identity scores)	newkoala algorithm (using BLASTP identity scores)
Database	Entire GENES database sequences	KEGG reference genomes and functionally characterized sequences linked from KO references

Reference

Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457-D462 (2016).
Kanehisa, M., Sato, Y., and Morishima, K.; BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726-731 (2016).

Last updated: December 1, 2025
