How to Use DBGET


DBGET and LinkDB

DBGET is a simple database retrieval system for a diverse range of molecular biology databases. Here a database is considered simply as a set of entries, which may be stored in a single file or multiple files. Most of the existing molecular biology databases can be treated in this simplified manner, or as so-called flat-file databases. Our definition of flat-file is not limited to text data; it also includes other types of data such as GIF images for KEGG pathways, Java graphics for genome maps and expression profiles, and 3D graphics for protein structures. This is accomplished by treating a collection of HTML files as a database.

Because each entry of a database is given a unique identifier, i.e., an entry name or an accession number, entries in molecular biology databases around the world can be retrieved uniformly by specifying the combination of the database name and the identifier, dbname:identifier.

The KEGG gene catalogs are also treated as flat-file databases, in which the combination of the organism name and the gene name is used for identification.
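As a minimal sketch of this addressing scheme, the following Python splits a dbname:identifier string into its two parts. The names EntryRef and parse_entry_ref are hypothetical and not part of DBGET, and splitting at the first colon is an assumption about how identifiers are written.

```python
from typing import NamedTuple


class EntryRef(NamedTuple):
    """A database entry addressed as dbname:identifier (hypothetical helper type)."""
    dbname: str
    identifier: str


def parse_entry_ref(text: str) -> EntryRef:
    """Split 'dbname:identifier' at the first colon (an assumption)."""
    dbname, sep, identifier = text.strip().partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"expected dbname:identifier, got {text!r}")
    return EntryRef(dbname, identifier)


if __name__ == "__main__":
    # 'dbA' and 'X123' are placeholder names, not real database entries.
    print(parse_entry_ref("dbA:X123"))
```

The same shape would cover the organism and gene name pair used for the KEGG gene catalogs, with the organism name taking the place of the database name.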

It is a common practice now to cross-reference related data among different databases. Thus, the molecular biology databases in the world form a web of data and data links, which is a huge graph object like the WWW. DBGET has powerful capabilities to search against this graph object, which is stored in the LinkDB database.

LinkDB is a database of links, each of which is represented as a binary relation in the form of:

LinkDB contains all cross-reference links, called original links, extracted from all the databases in DBGET. Furthermore, LinkDB dynamically generates additional links by computation, i.e., by combining multiple links and/or using links in reverse directions. Thus, LinkDB is a deductive database, and the links in LinkDB are of three types: original links, reverse links, and indirect links.
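The link types, namely original links plus the reverse and indirect links derived from them by computation, can be sketched in Python as operations on a set of binary relations. This is an illustration of the idea only, not LinkDB's implementation; the function names and the entry names such as dbA:a1 are made up.

```python
from collections import defaultdict


def reverse_links(links):
    """Use each link in the reverse direction."""
    return {(dst, src) for src, dst in links}


def indirect_links(links):
    """Combine two links that share a middle entry: a->b and b->c give a->c."""
    by_source = defaultdict(set)
    for src, dst in links:
        by_source[src].add(dst)
    combined = {
        (src, end)
        for src, mid in links
        for end in by_source.get(mid, ())
        if end != src
    }
    # Keep only links that are not already present as given links.
    return combined - set(links)


if __name__ == "__main__":
    # Placeholder original links between made-up entries.
    original = {("dbA:a1", "dbB:b1"), ("dbB:b1", "dbC:c1")}
    print(sorted(reverse_links(original)))
    print(sorted(indirect_links(original)))
```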

The architecture of the DBGET system is illustrated in the figure. DBGET has three basic commands (or three basic modes in the Web version), bfind, bget, and blink, to search and extract database entries. bget performs the retrieval of database entries specified by the combination of dbname:identifier. bfind is used for searching entries by keywords. One notable feature of DBGET, which is different from Entrez, SRS, and other text search systems, is that no keyword indexing is performed when a database is installed or updated. Instead, selected fields are extracted and stored in separate files for bfind searches. This is an advantage for rapid database updates, but sometimes a disadvantage for elaborate searching. To supplement bfind, the full text search STAG is provided by GenomeNet. The primary purpose of molecular biology databases is to collect and store factual data, and associated text information is not always complete or appropriate. As illustrated in the figure, sequence similarity searches by BLAST and FASTA, sequence motif searches by MOTIF, and biological searches in KEGG are all linked to the DBGET system.
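The idea of extracting selected fields into a separate search file, instead of building a keyword index, can be sketched as follows. The sketch assumes a flat-file layout in which each line starts with a field name and each entry ends with a line beginning with ///; the field names ENTRY and DEFINITION and the function name extract_titles are illustrative assumptions, not a description of DBGET's actual files.

```python
def extract_titles(flat_text, fields=("ENTRY", "DEFINITION")):
    """Collect the selected fields of each entry into one title line per entry."""
    titles = []
    current = []
    for line in flat_text.splitlines():
        if line.startswith("///"):  # assumed end-of-entry marker
            if current:
                titles.append(" ".join(current))
            current = []
            continue
        parts = line.split(None, 1)
        # Only lines that start with a selected field name contribute.
        if parts and not line[0].isspace() and parts[0] in fields and len(parts) == 2:
            current.append(parts[1].strip())
    if current:  # last entry without a terminator
        titles.append(" ".join(current))
    return titles


if __name__ == "__main__":
    # Made-up entries for illustration only.
    sample = (
        "ENTRY       x1\n"
        "DEFINITION  example protein\n"
        "COMMENT     not extracted\n"
        "///\n"
        "ENTRY       x2\n"
        "DEFINITION  another example\n"
        "///\n"
    )
    for title in extract_titles(sample):
        print(title)
```

Because the search file is just a flat extract, updating a database only requires rerunning the extraction, which matches the trade-off described above: fast updates, but less elaborate searching than an indexed system.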

Once entries of interest are found, blink, which is the LinkDB search, can be used to retrieve related entries in a given database or in all databases in GenomeNet. Related entries are found not only by the original links but also by the reverse links and indirect links. How to compute indirect links is defined in the link table for each database, which may be edited if necessary. In previous versions of LinkDB, indirect links were precomputed and stored in the database. Currently, indirect links are computed on the fly using a data structure called a suffix array.
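For readers unfamiliar with suffix arrays, here is a generic Python sketch of the data structure: the start positions of all suffixes of a text, sorted lexicographically, so that every occurrence of a pattern can be found by binary search. How LinkDB actually applies suffix arrays to link data is not described here, so the link text, the entry names, and both function names are assumptions made for illustration. The simple construction below is quadratic or worse and is suitable only for small inputs.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every position where pattern occurs, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    # Made-up link records, one "source target" pair per line.
    links = "dbA:a1 dbB:b1\ndbA:a2 dbB:b2\ndbB:b1 dbC:c1\n"
    sa = build_suffix_array(links)
    # Every position where the entry dbB:b1 is mentioned, as source or target.
    print(find_occurrences(links, sa, "dbB:b1"))
```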

In addition to finding links for a single entry or multiple entries, blink can find links for all entries in a database. This database-to-database link capability is especially useful when a local database is to be included in the web of molecular biology databases. Suppose, for example, that a new genome is completely sequenced and each ORF is linked to an entry of the existing databases by sequence similarity search. By uploading the binary relation file containing ORF to dbname:identifier links, and defining paths for indirect links, it is possible, for example, to assign EC numbers to ORFs or to map ORFs onto KEGG pathways.
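The ORF example can be sketched as following a defined path through a series of link tables, one database-to-database step at a time. The table layout, the database names orf, prot, and ec, the identifiers, and the function name follow_path are all hypothetical; the sketch only illustrates how a path of links could carry an ORF to an EC number.

```python
def follow_path(start_ids, path, relations):
    """Follow links along a path of database names.

    start_ids: identifiers in the first database of the path
    path:      list of database names, e.g. ["orf", "prot", "ec"]
    relations: {(src_db, dst_db): {src_id: set_of_dst_ids}}
    """
    current = set(start_ids)
    for src_db, dst_db in zip(path, path[1:]):
        table = relations.get((src_db, dst_db), {})
        current = {dst for src in current for dst in table.get(src, ())}
    return current


if __name__ == "__main__":
    # Made-up data: ORFs linked to proteins by similarity, proteins linked to EC numbers.
    relations = {
        ("orf", "prot"): {"orf1": {"P1"}, "orf2": {"P2"}},
        ("prot", "ec"): {"P1": {"1.1.1.1"}},
    }
    print(follow_path({"orf1"}, ["orf", "prot", "ec"], relations))
    print(follow_path({"orf2"}, ["orf", "prot", "ec"], relations))
```

An ORF whose protein hit carries no EC link, such as orf2 in the made-up data, simply yields an empty result.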


Database names

Each database (or organism) has a full name and an abbreviation as shown in:

In addition, generic names (compound database names) are predefined to facilitate searches against similar databases, such as:

For the GenBank, EMBL, and SWISS-PROT databases, a distinction is made between fixed releases and daily/weekly updates. However, since many other databases are also updated daily or weekly, genbank, embl, and swissprot are often used to actually mean genbank-today, embl-today, and swissprot-today, respectively, in the Web version of DBGET.


Basic search

The Web version of DBGET provides the choice of the bfind mode (default) and the bget mode. In the bfind mode, keywords and optional characters can be entered in the search box. You then get a list of entries that contain matching keywords. By selecting an entry name in the list, you obtain the database entry. When you know beforehand the entry name (or the primary accession number or the primary gene name) of interest, it is much faster to retrieve the entry by switching to the bget mode. Just enter the entry name (or the accession number or the gene name) in the search box. Once you get an entry in either mode, you can further retrieve related entries in different databases by clicking on marked items. To obtain all related entries at once, click on LinkDB at the top line or on the marked entry name. This invokes the LinkDB search, i.e., the blink mode search.

The default file searched in the bfind mode contains the title description, which is derived from entry name, primary accession, gene names and synonyms (GENES database only), and definition (or title) fields. The field names in the major databases are the following.


Advanced search by bfind

At the Unix command level, the bfind command takes the following form:

Here expression is what is entered in the search box in the Web version, and it may contain search options and Boolean operators. A search option is specified by a one-letter code followed by a colon. The default search is case-insensitive, and it behaves as if wildcard characters were added to both ends of the given keyword. To restrict the query, use the following options:

The other options are used to select search fields:

When two keywords are given, the default search will identify entries that contain both keywords. In other words, the default is an AND search. To modify this condition, use the following Boolean operators:

To search for a block of keywords in sequence, use double quotes:

Without double quotes, the search is made for separate keywords with the AND operator. You can also use parentheses to specify the priority of evaluation. For example:

For more details of the bfind command syntax, please refer to the bfind on-line manual.
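The default matching behavior described in this section (case-insensitive, keywords matched anywhere as if wildcards were added to both ends, multiple keywords combined with AND, and double quotes for a block of keywords in sequence) can be sketched in Python as follows. The function name matches is hypothetical, and the sketch deliberately ignores search options, Boolean operators, and parentheses; it is not the bfind implementation.

```python
import re


def matches(title: str, expression: str) -> bool:
    """Default-style match: every keyword or quoted phrase must occur in title."""
    # A double-quoted block is one term; anything else is split on whitespace.
    pairs = re.findall(r'"([^"]*)"|(\S+)', expression)
    terms = [quoted or bare for quoted, bare in pairs]
    haystack = title.lower()
    # Substring test = implicit wildcards at both ends; all() = AND search.
    return all(term.lower() in haystack for term in terms if term)


if __name__ == "__main__":
    title = "Protein kinase C alpha"  # illustrative title line
    print(matches(title, "kinase protein"))      # both keywords, any order
    print(matches(title, '"kinase protein"'))    # phrase must occur in sequence
    print(matches(title, '"KINASE C"'))          # case-insensitive phrase
```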


Advanced retrieval by bget

The bget command syntax is the following:

where the marked items may be entered in the DBGET search box when the bget mode is selected. Thus, more than one entry may be specified in the search box, and if the second form is used, entries in different databases may be retrieved.

A most useful command option is:

which returns only the sequence data in FASTA format. When an entry contains multiple sequences, such as in PDB and GENES, use

to select the sequence, where # is the sequence number; in the GENES database it can also be a for the amino acid sequence or n for the nucleotide sequence.

For more details of the bget command syntax, please refer to the bget on-line manual.
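For reference, FASTA format consists of a header line beginning with ">" followed by the sequence on one or more lines. The Python sketch below formats a sequence that way; the function name to_fasta, the 60-character line width, and the header text are assumptions for illustration and do not describe the exact output of bget.

```python
def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as FASTA: '>' header line, then wrapped sequence lines."""
    seq = "".join(sequence.split())  # drop any whitespace inside the sequence
    lines = [">" + header]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder header and sequence.
    print(to_fasta("dbA:x1 example entry", "ACGT" * 20), end="")
```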


Brief History of DBGET, LinkDB, and KEGG


Created: 21 August 1995
Updated: 16 August 2002