Announcement: Data Source of KEGG GENES

The KEGG GENES database for prokaryotic genomes was created using the RefSeq FTP site until January 2014 or until it was updated. Since then, newly added prokaryotic genomes have all been taken from GenBank. The NCBI has recently released updated prokaryotic RefSeq genomes annotated by the NCBI Prokaryotic Genome Annotation Pipeline. Consequently, gene assignments and locus_tags differ from the current KEGG data, and the RefSeq annotation now contains WP_ records for sequences that are identical across species.
Except for the manually annotated RefSeq reference genomes (currently 122 genomes), all prokaryotic genomes previously taken from RefSeq are being updated with GenBank as the data source. Note that eukaryotic genomes, as well as viruses and plasmids, will continue to be taken from RefSeq, as summarized in the table. Together with this change, all protein-coding genes are associated with NCBI Protein IDs (INSDC accessions), and GI numbers are being eliminated.

Posted on August 26, 2015
Category                  Data source   Gene identifier
Prokaryotes (reference)   RefSeq        Locus_tag
Prokaryotes (the rest)    GenBank       Locus_tag
Viruses (vg)              RefSeq        GeneID
Plasmids (pg)             RefSeq        GeneID
Addendum (ag)             PubMed        ProteinID
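
The data-source table can also be held as a small lookup structure, for instance when a script needs to know which kind of identifier to expect for a given category. The following is only a sketch: the key names "prokaryote_reference" and "prokaryote_other", the SourceRule type, and the rule_for helper are hypothetical labels introduced here and are not part of KEGG; vg, pg, and ag are the codes given in the table.

```python
from typing import NamedTuple


class SourceRule(NamedTuple):
    data_source: str   # where the gene catalog is taken from
    identifier: str    # kind of identifier used for the gene entries


# Rows transcribed from the table above. The two prokaryote keys are
# hypothetical labels chosen for this sketch; vg, pg, ag come from the table.
RULES = {
    "prokaryote_reference": SourceRule("RefSeq", "Locus_tag"),
    "prokaryote_other": SourceRule("GenBank", "Locus_tag"),
    "vg": SourceRule("RefSeq", "GeneID"),
    "pg": SourceRule("RefSeq", "GeneID"),
    "ag": SourceRule("PubMed", "ProteinID"),
}


def rule_for(category: str) -> SourceRule:
    """Return the data source and identifier type for a table category."""
    try:
        return RULES[category]
    except KeyError:
        raise ValueError(f"no rule recorded for category {category!r}") from None
```

For example, rule_for("pg").identifier gives "GeneID", matching the Plasmids row.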

Gene Catalogs

KEGG GENES is a collection of gene catalogs for all complete genomes (see release history) generated from publicly available resources, mostly NCBI RefSeq and GenBank. They are subject to SSDB computation and KO assignment (gene annotation) by the KOALA tool. KEGG DGENES is a supplementary collection of gene catalogs for eukaryotic draft genomes, which are given automatic KO assignment by BlastKOALA with GENES used as the reference data set. KEGG MGENES contains gene catalogs for metagenomes of environmental samples (see also KEGG GENOME) with automatic annotation. The collections of viral genomes and plasmids in RefSeq are now included in KEGG GENES with standard annotation procedures. Their organism codes are vg (T40000) and pg (T20000), which may be considered pan-virus and pan-plasmid codes.
Category           DBGET    Remark
Complete genomes   GENES    High-quality genomes with KOALA and manual annotations
Plasmids           GENES    Plasmids with KOALA and manual annotations
Viruses            GENES    Viral genomes with KOALA and manual annotations
Addendum           GENES    Additional collection of functionally characterized genes
Draft genomes      DGENES   Draft genomes with automatic (BlastKOALA) annotation
Metagenomes        MGENES   Metagenomes with automatic (GhostKOALA) annotation

Gene Annotation

The annotation of KEGG GENES involves assignment of KO identifiers (K numbers). Internally, this is done using the KOALA and GFIT annotation tools based on the SSDB database (see: Genome Annotation in KEGG). The annotation of KEGG DGENES and MGENES is done automatically using the BlastKOALA and GhostKOALA programs, respectively, shown below.

BlastKOALA: automatic KO assignment by BLAST search
GhostKOALA: automatic KO assignment by GHOSTX search
BLAST: sequence similarity search by BLAST
FASTA: sequence similarity search by FASTA

Gene Name Conversion

KEGG GENES entries can be retrieved by giving identifiers of outside databases, such as NCBI-GeneID (Entrez Gene ID), NCBI-gi, and UniProt accession numbers. Cross-reference lists are available at the FTP site.
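
As an illustration of identifier conversion from a script, the sketch below builds a request URL for the "conv" operation of the public KEGG REST service and parses a two-column, tab-separated reply into a dictionary. Several things here are assumptions not stated on this page: the base URL https://rest.kegg.jp, the /conv/<target_db>/<source_db> path layout, the column order of the reply (outside identifier first, KEGG gene identifier second), and the helper names conv_url and parse_conv. The identifiers in the sample text are made-up placeholders, not real entries.

```python
from typing import Dict, List

# Assumption: base URL of the public KEGG REST service (not given on this page).
KEGG_REST = "https://rest.kegg.jp"


def conv_url(target_db: str, source_db: str) -> str:
    """Build a 'conv' request URL, e.g. organism code vs. outside database."""
    return f"{KEGG_REST}/conv/{target_db}/{source_db}"


def parse_conv(text: str) -> Dict[str, List[str]]:
    """Parse tab-separated 'outside_id<TAB>kegg_id' lines into a mapping.

    One outside identifier may map to several KEGG gene entries, so the
    values are lists. Blank or malformed lines are skipped.
    """
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        outside_id, kegg_id = parts
        mapping.setdefault(outside_id, []).append(kegg_id)
    return mapping


# Placeholder reply text; these identifiers are invented for illustration.
sample = (
    "ncbi-geneid:0000001\tabc:EXAMPLE_0001\n"
    "ncbi-geneid:0000001\tabc:EXAMPLE_0002\n"
    "\n"
    "uniprot:X00000\tabc:EXAMPLE_0003\n"
)
table = parse_conv(sample)
```

The reply text would normally be fetched from conv_url(...) with any HTTP client; the network call is left out so that the parsing step can be checked offline.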


Last updated: May 21, 2015
