Linking genomes to pathways by ortholog annotation

KO Database of Molecular Functions

In KEGG, molecular-level functions are stored in the KO (KEGG Orthology) database and associated with ortholog groups, so that experimental evidence obtained in a specific organism can be extended to other organisms. Genome annotation in KEGG is ortholog annotation: K numbers (KO identifiers) are assigned to individual genes in the GENES database. The original data, such as gene names and descriptions given by RefSeq or GenBank, are not updated, even if they are inconsistent with the KO assignment.
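
As a rough illustration of this policy, a gene entry can be modelled as an immutable source record plus a separately stored K number. This is only a sketch: the class names, field names, gene identifier and K number below are hypothetical placeholders chosen for illustration, not the actual KEGG GENES schema or real annotation data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """Gene name and description exactly as given by the source database."""
    source: str        # e.g. "RefSeq" or "GenBank"
    gene_name: str
    description: str


@dataclass
class GeneEntry:
    gene_id: str                 # hypothetical identifier
    original: SourceRecord       # frozen: annotation never rewrites it
    ko: Optional[str] = None     # K number, stored separately


def assign_ko(entry: GeneEntry, k_number: str) -> None:
    """Attach a K number without touching the original source record."""
    entry.ko = k_number


# Placeholder values, used only to show the shape of the data.
entry = GeneEntry("xxx:0001", SourceRecord("RefSeq", "geneA", "hypothetical protein"))
assign_ko(entry, "K00001")
```

After the call, `entry.ko` holds the K number while `entry.original.description` still reads "hypothetical protein", mirroring the situation in which the source description and the KO assignment disagree.
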

Major efforts have been initiated to associate each KO entry with experimental evidence of functionally characterized sequence data, now shown in the SEQUENCE subfield of the REFERENCE field. Furthermore, the genome-based collection of KEGG GENES has been expanded to allow individual protein data to be included in the addendum category. Eventually the KO database will cover all knowledge on functionally characterized protein sequences (see also KEGG Enzyme).

KEGG Mapping by the KO System

In general, the KO grouping of functional orthologs is defined in the context of KEGG molecular networks (KEGG pathway maps, BRITE hierarchies and KEGG modules), which are themselves represented as networks of nodes identified by K numbers. The relationships between KOs and the corresponding molecular networks are represented in the following KO system.
KEGG Orthology (KO)
The fact that functional information is associated with ortholog groups is a unique aspect of the KEGG resource. The sequence-similarity-based inference, as a generalization of a limited amount of experimental evidence, is predefined in KEGG. As implemented in BlastKOALA and other tools, the sequence similarity search against KEGG GENES is a search for the most appropriate K numbers. Once K numbers are assigned to genes in the genome, the KEGG pathway maps, BRITE hierarchies, and KEGG modules are automatically reconstructed, enabling biological interpretation of high-level functions.
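
The reconstruction step can be pictured as a lookup: each network is a set of K-number nodes, and a genome's genes are placed on whichever nodes match their assigned K numbers. The following is a minimal sketch under that assumption. The function name, gene identifiers, network names and K numbers are all hypothetical placeholders, and real KEGG mapping handles more than simple set membership.

```python
from collections import defaultdict


def map_to_networks(gene_to_ko, network_to_kos):
    """For each network (pathway map, BRITE hierarchy or module), list the
    genes whose assigned K numbers appear among that network's nodes."""
    # Invert the annotation: K number -> genes carrying it.
    ko_to_genes = defaultdict(list)
    for gene, ko in gene_to_ko.items():
        if ko is not None:          # skip genes without a K number
            ko_to_genes[ko].append(gene)

    mapped = {}
    for network, kos in network_to_kos.items():
        hits = {ko: sorted(ko_to_genes[ko]) for ko in kos if ko in ko_to_genes}
        if hits:                    # keep only networks with at least one node hit
            mapped[network] = hits
    return mapped


# Placeholder annotation and network definitions.
gene_to_ko = {"g1": "K00001", "g2": "K00002", "g3": None, "g4": "K00001"}
network_to_kos = {"map_A": {"K00001", "K00003"}, "map_B": {"K00004"}}

result = map_to_networks(gene_to_ko, network_to_kos)
```

Here `result` contains only `map_A`, with genes `g1` and `g4` on node `K00001`; `map_B` is absent because none of its K numbers was assigned to a gene.
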

The following interface provides access to some of the KEGG mapping functions (see also KEGG Annotation).

Genome Annotation in KEGG

Genome annotation in KEGG is essentially cross-species annotation that gives K numbers to orthologous genes in all available genomes. It is currently done as follows.
  1. Experimental evidence on known functions of genes and proteins is organized in the KO database, which is created together with the KEGG PATHWAY, KEGG BRITE, and KEGG MODULE databases.
  2. Gene catalogs of complete genomes are generated from RefSeq, GenBank and other public resources, and stored in the KEGG GENES database.
  3. Sequence similarity scores and best hit relations are computed from GENES by pairwise genome comparisons using SSEARCH, and stored in the KEGG SSDB database.
  4. For each gene in a genome the GFIT (Gene Function Identification Tool) table is created detailing the information about best-hit genes, including paralogs, in all other genomes.
  5. In the past, GFIT tables were used to manually assign K numbers with the GFIT tool, which is integrated with other tools, including the gene cluster tool for checking the consistency of operon-like structures and the ortholog table for checking the completeness of pathway modules and complexes.
  6. The KOALA (KEGG Orthology And Links Annotation) tool was developed in 2008 to computerize KEGG annotators' knowledge of using GFIT tables. KOALA processes all the GFIT tables at once and makes computational K number assignments.
  7. GFIT tables are continuously updated, and KOALA's computational assignments are automatically reflected for a selected set of well-curated K numbers (about 80%) in a newly determined genome, and also in the existing genomes that meet various other criteria.
  8. KOALA's computational assignments are repeated every two to three days, and a summary of discrepancies between its assignments and the current annotations is presented. Discrepancies are examined by annotators with the manual versions of the KOALA and GFIT tools.
  9. Annotation results can be mapped to KEGG pathways, BRITE hierarchies, and KEGG modules for inferring systemic functions of individual organisms, groups of organisms (e.g., pangenomes), and combinations of organisms (e.g., host-pathogen and human-microbiome relationships).
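
Steps 3, 4, 6 and 8 above can be sketched in a few lines. This is a deliberately simplified stand-in, not the actual SSDB, GFIT or KOALA logic: it keeps only the single top-scoring hit per genome (the real GFIT table also lists paralogs), and it assigns a K number by a score-weighted vote with an assumed threshold `min_share`. All function names, gene identifiers, scores and K numbers are hypothetical.

```python
from collections import defaultdict


def best_hits(scores):
    """scores: iterable of (query_gene, target_genome, target_gene, score).
    Returns {query_gene: {target_genome: (target_gene, score)}}, keeping
    the top-scoring target gene in each genome."""
    table = defaultdict(dict)
    for query, genome, target, score in scores:
        current = table[query].get(genome)
        if current is None or score > current[1]:
            table[query][genome] = (target, score)
    return dict(table)


def vote_ko(hits_for_gene, known_ko, min_share=0.5):
    """Sum best-hit scores per K number and return the winner if it holds
    at least min_share of the total score, otherwise None."""
    totals = defaultdict(float)
    for target, score in hits_for_gene.values():
        ko = known_ko.get(target)
        if ko is not None:
            totals[ko] += score
    if not totals:
        return None
    winner = max(totals, key=totals.get)
    if totals[winner] / sum(totals.values()) >= min_share:
        return winner
    return None


def discrepancies(computed, current):
    """Genes whose computed K number differs from the current annotation,
    reported as {gene: (current_ko, computed_ko)}."""
    return {gene: (current.get(gene), ko)
            for gene, ko in computed.items()
            if current.get(gene) != ko}


# Placeholder similarity scores and existing annotations.
scores = [
    ("q1", "orgA", "a1", 500), ("q1", "orgA", "a2", 300),
    ("q1", "orgB", "b1", 450), ("q1", "orgC", "c1", 120),
]
known_ko = {"a1": "K00001", "a2": "K00002", "b1": "K00001", "c1": "K00002"}

table = best_hits(scores)
computed = {gene: vote_ko(hits, known_ko) for gene, hits in table.items()}
report = discrepancies(computed, {"q1": "K00002"})
```

In this toy input the best hits of `q1` in `orgA` and `orgB` both carry `K00001`, which outweighs the single `K00002` hit in `orgC`, so the computed assignment is `K00001`. Because the current annotation was set to `K00002`, `q1` appears in `report` as a discrepancy for an annotator to review.
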
Read-only versions of the KEGG annotation tools are available for public view.
  • KOALA - linked from each KO entry page
  • GFIT - linked from each KEGG GENES entry page

Last updated: January 1, 2016
