

Aligns and merges sequence fragments resulting from shotgun sequencing or gene transcripts (EST) fragments in order to reconstruct the original segment or gene

EGassembler Tutorial



     EGassembler is an online service that provides both an automated and a user-customized analysis tool for cleaning, repeat masking, vector trimming, organelle masking, clustering and assembly of ESTs and genomic fragments. EGassembler consists of a pipeline of five components, described below, each using highly reliable open-source tools (see Acknowledgements for details) and a custom-made non-redundant database of vectors and repeats covering almost all publicly available vector and repeat databases. Figure 1 shows a flow chart of the EGassembler process.

Pipeline Description

     The web server accepts any type of DNA sequence in FASTA format (EST, GSS, cDNA, gDNA). The pipeline includes:

    • Sequence Cleaning: automated trimming and screening for various contaminants, low quality and low-complexity sequences.

    • Repeat Masking: masking DNA sequences for repetitive elements including small RNA pseudogenes, LINEs, SINEs, LTR elements, microsatellites and other interspersed repeats. By default it uses our custom-made non-redundant repeats database, which includes RepBase, TREP repeats, TIGR plant repeats and thousands of other publicly available repeat sequences from the Internet. Researchers can also use their own repeat libraries for screening.

    • Vector Masking: screening out vector, adaptor and other contaminating sequences. By default it uses NCBI's UniVec core vector/adaptor library, with EMBL's emvec vector library available as an option. Users can also upload their own database of vector sequences for screening.

    • Organelle Masking: using NCBI's entire current organelle database (762 mitochondria, 42 plastids, 14 plasmids and 3 nucleomorphs), users can screen their sequences against all plastid and mitochondrial genomes (Fungi, Metazoan, Plants and plasmids). Users can also use their own organelle sequences for screening.

    • Sequence Assembly: clustering and assembling the sequences into contigs and singletons using CAP3.
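Because the server expects FASTA input, it can help to check a file locally before uploading it. The following is a minimal sketch, not part of EGassembler: the function names `parse_fasta` and `check_records` are hypothetical, and the accepted alphabet (IUPAC nucleotide codes plus X) is an assumption rather than something the server documents.

```python
from typing import Iterable, List, Tuple

# Assumed alphabet: IUPAC nucleotide codes plus X (used for masked bases).
VALID_BASES = set("ACGTUNRYSWKMBDHVX")


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records: List[Tuple[str, str]]) -> List[str]:
    """List duplicate IDs, empty sequences and non-nucleotide characters."""
    problems: List[str] = []
    seen = set()
    for header, seq in records:
        parts = header.split()
        seq_id = parts[0] if parts else ""
        if seq_id in seen:
            problems.append(f"duplicate id: {seq_id}")
        seen.add(seq_id)
        if not seq:
            problems.append(f"empty sequence: {seq_id}")
        bad = sorted(set(seq.upper()) - VALID_BASES)
        if bad:
            problems.append(f"unexpected characters in {seq_id}: {''.join(bad)}")
    return problems
```

To check a real file, pass an open file handle to `parse_fasta` and print whatever `check_records` returns; an empty list means none of these three problems were found.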

Interface Description

     The EGassembler web interface has three sub-menus, each aimed at a different kind of user.

    1. One-Click Assembly
    2. Step-by-Step Assembly
    3. Stand-Alone Processing

    One-Click Assembly

      This option suits users new to bioinformatics. All the components in the pipeline run consecutively with their default options. Users only select the libraries for masking repeats, vectors and organelles. Each process runs in turn until all processes have finished.

    Step-by-Step Assembly

      Users can run all the components outlined in the pipeline interactively and can run each of them with advanced options. The output of each step is automatically used as input to the next step of the pipeline; users can also jump into any step at any time with the previous results.

    Stand-Alone Processing

      Users can use each of the components alone with all options available. The web interface displays the default parameters of the original programs, any of which users can change for each program. This option is the same as Step-by-Step Assembly, except that here users cannot use the output of one process as input to another process.
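The difference between the three modes can be pictured as a chain of stages in which each output becomes the next input. The sketch below is only a conceptual model and not the server's implementation: `STAGE_ORDER`, `run_pipeline` and the stage functions are hypothetical names, and each stage is assumed to map a dictionary of sequences to a new dictionary of sequences.

```python
from typing import Callable, Dict

Sequences = Dict[str, str]            # sequence id -> sequence
Stage = Callable[[Sequences], Sequences]

# Order of the five components as listed in the pipeline description.
STAGE_ORDER = [
    "cleaning",
    "repeat_masking",
    "vector_masking",
    "organelle_masking",
    "assembly",
]


def run_pipeline(
    data: Sequences,
    stages: Dict[str, Stage],
    start: str = "cleaning",
) -> Dict[str, Sequences]:
    """Run the stages in order from `start`, feeding each output to the next.

    Returns the intermediate result of every stage that was run, so a later
    run can resume from any of them.
    """
    results: Dict[str, Sequences] = {}
    current = data
    for name in STAGE_ORDER[STAGE_ORDER.index(start):]:
        current = stages[name](current)
        results[name] = current
    return results
```

In this picture, One-Click Assembly corresponds to calling `run_pipeline` once with default stages, Step-by-Step Assembly to resuming with `start` set to a later stage and an earlier result as `data`, and Stand-Alone Processing to calling a single stage function directly without chaining.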

Browser Compatibility

    The web server has been tested successfully on a number of browsers, including Internet Explorer, Firefox, Mozilla, Opera, Safari and Maxthon, on three operating systems (Microsoft Windows, Linux and Mac OS).


    Using One-Click Assembly to download and assemble all ESTs of Arabidopsis lyrata deposited in GenBank.

    Downloading ESTs:

    1. Go to the web site: http://www.ncbi.nlm.nih.gov/
    2. Search Nucleotide for "arabidopsis lyrata AND gbdiv_est[PROP]"
    3. Change Display format to FASTA
    4. Send to File
    5. Save on your computer
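    The same download can also be scripted through NCBI's E-utilities instead of the web page. The sketch below only builds the request URLs for ESearch and EFetch and makes no network calls; the function names and the batch size of 500 are assumptions for illustration, and `nuccore` is the E-utilities name of the Nucleotide database.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "nuccore") -> str:
    """URL that runs the search and stores the hits on the history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(web_env: str, query_key: str, retstart: int = 0,
               retmax: int = 500, db: str = "nuccore") -> str:
    """URL that fetches one batch of the stored hits as FASTA text."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": web_env,
        "rettype": "fasta",
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def batch_starts(count: int, size: int = 500) -> range:
    """Start offsets needed to page through `count` records."""
    return range(0, count, size)


if __name__ == "__main__":
    # Same query as in step 2 of the list above.
    print(esearch_url("arabidopsis lyrata AND gbdiv_est[PROP]"))
```

    The ESearch response contains the hit count together with `WebEnv` and `QueryKey` values; passing those to `efetch_url` for each offset from `batch_starts` and appending the responses to one file gives a FASTA file suitable for upload.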

    One-Click Assembly:

    1. Go to the web site: https://www.genome.jp/tools/egassembler/
    2. Choose File Upload and upload your file, or copy and paste your sequences into the text field.
    3. Enable the Sequence Cleaning Process if needed (you can also choose the number of CPUs and the identity threshold).
    4. Enable the Repeat Masking Process and select the library of repeats that you want to screen against (here, choose the RepBase database and the arabidopsis option).
    5. Enable the Vector Masking Process and select the vector library that you want to screen against (here, NCBI's core vector library).
    6. Enable the Organelle Masking Process and select either the plastids or the mitochondria database (here, plastids, arabidopsis).
    7. Enable the Sequence Assembly Process; the overlap percent identity cutoff can be modified.
    8. Click the Submit button.

    The output is a page with hyperlinks to all results from the different processes.

    Viewing Results

      1- Sequence Cleaning Process

        .clean file contains the query sequences with poly-A/poly-T tails and low-complexity regions removed.
        .cln file is the summary of the cleaning

      2- Repeat Masking Process

        .masked file contains the query sequences with the repeat regions masked by X's

        .tbl file is the table of classified repeats that were masked in the query sequences

        .out file is a summary of the repeats found in the query sequences

      3- Vector Masking Process

        .vec_screen file is the query file with matches to the vector database masked with X's

      4- Organelle Masking Process

        .org_screen file is the query file with matches to the organelle database masked with X's

      5- Assembly Process

        .contigs file contains the consensus contigs from the assembly of overlapping ESTs

        .singletons file contains ESTs with no similarity to other ESTs

        .cap3-alignment file is a summary of the alignment for ESTs taking part in assembly
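    Once the result files are downloaded, a quick local summary can show how much of each sequence was masked and how many contigs and singletons were produced. This is a minimal sketch that assumes the masked, contigs and singletons files are plain FASTA text as described above; the function names and the 0.5 threshold are hypothetical choices, not EGassembler settings.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence id (first word of the header) to its sequence."""
    seqs: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line
    return seqs


def masked_fraction(seq: str) -> float:
    """Fraction of positions replaced by X (0.0 for an empty sequence)."""
    return seq.upper().count("X") / len(seq) if seq else 0.0


def heavily_masked(seqs: Dict[str, str], threshold: float = 0.5) -> List[str]:
    """Ids of sequences whose masked fraction is at least `threshold`."""
    return [sid for sid, seq in seqs.items()
            if masked_fraction(seq) >= threshold]


def assembly_counts(contigs: Dict[str, str],
                    singletons: Dict[str, str]) -> Dict[str, int]:
    """Number of contigs and singletons in the assembly output."""
    return {"contigs": len(contigs), "singletons": len(singletons)}
```

    For example, reading a `.vec_screen` file with `read_fasta` and passing the result to `heavily_masked` lists the ESTs that are mostly vector sequence and may be worth excluding before assembly.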


    Using Step-by-Step Assembly to assemble all ESTs of Arabidopsis lyrata.

    1- Sequence Cleaning Process

    2- Repeat Masking Process

    3- Vector Masking Process

    4- Organelle Masking Process

      The organelle database is screened in place of the vector database, using the same procedure as Vector Masking.

    5- Sequence Assembly Process

Tutorial Movie

    Download Tutorial movie (AVI file 274MByte)

Masoudi-Nejad A, Tonomura K, Kawashima S, Moriya Y, Suzuki M, Itoh M, Kanehisa M, Endo T, Goto S (2006)
EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments. Nucleic Acids Res. 34:W459-462.