EGassembler is an online service that provides automated and user-customizable analysis tools for cleaning, repeat masking, vector trimming, organelle masking, clustering and assembly of ESTs and genomic fragments. It consists of a pipeline of five components, each built on highly reliable open-source tools (see Acknowledgements for details), together with a custom-made non-redundant database of vectors and repeats that covers almost all publicly available vector and repeat databases. Figure 1 shows a flow chart of the EGassembler process.
The web server accepts any type of DNA sequence in FASTA format (EST, GSS, cDNA, gDNA). The pipeline includes:
- Sequence Cleaning: automated trimming and screening for various contaminants, low-quality and low-complexity sequences.
- Repeat Masking: masking DNA sequences for repetitive elements including small RNA pseudogenes, LINEs, SINEs, LTR elements, microsatellites and other interspersed repeats. By default it uses our custom-made non-redundant repeats database, which includes RepBase, TREP repeats, TIGR plant repeats and thousands of other repeat sequences publicly available on the Internet. Researchers can also use their own repeat libraries for screening.
- Vector Masking: screening out vector, adaptor and other contaminating sequences. By default it uses NCBI's UniVec core vector/adaptor library, with EMBL's emvec vector library as an option. Users can also upload their own database of vector sequences for screening.
- Organelle Masking: using NCBI's entire current organelle database (762 mitochondria, 42 plastids, 14 plasmids and 3 nucleomorphs), users can screen their sequences against all plastid and mitochondrial genomes (Fungi, Metazoa, Plants and plasmids). Users can also use their own organelle sequences for screening.
- Sequence Assembly: clustering and assembling the sequences into contigs and singletons using CAP3.
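To make the idea of "masking" in the steps above concrete: the vector and organelle screening outputs are described later on this page as the query file with matches masked with X's. The following is only a minimal sketch of that idea, not the code the service runs. The function name `mask_regions`, the coordinate convention (0-based, end-exclusive) and the example hit positions are all assumptions made for illustration.

```python
def mask_regions(sequence, hits, mask_char="X"):
    """Return `sequence` with every (start, end) hit replaced by `mask_char`.

    Assumption for this sketch: hits are 0-based, end-exclusive intervals.
    """
    masked = list(sequence)
    for start, end in hits:
        # Clamp to the sequence bounds so an out-of-range hit cannot fail.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    est = "ACGTACGTACGTACGT"
    vector_hits = [(0, 4), (10, 13)]  # hypothetical match coordinates
    print(mask_regions(est, vector_hits))  # XXXXACGTACXXXCGT
```

Masking rather than deleting keeps the read length and coordinates unchanged, while the masked stretches are no longer ordinary A/C/G/T bases.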
The EGassembler web interface has three sub-menus, each aimed at a different kind of user:
- One-Click Assembly
- Step-by-Step Assembly
- Stand-Alone Processing
One-Click Assembly: this option suits users new to bioinformatics. All components of the pipeline run consecutively with their default options; users only select the libraries for masking repeats, vectors and organelles. Each process runs in turn until all processes have finished.
Step-by-Step Assembly: users can run all the components of the pipeline interactively, each with its advanced options. The output of each step is automatically used as the input to the next step; users can also jump to any step at any time using the previous results.
Stand-Alone Processing: users can run each component on its own with all options available. The web interface displays the default parameters of the original programs, any of which users can change. This option is the same as Step-by-Step Assembly, except that the output of one process cannot be used as the input to another.
The web server has been tested successfully on a number of browsers, including Internet Explorer, Firefox, Mozilla, Opera, Safari and Maxthon, on three operating systems (Microsoft Windows, Linux and Mac OS).
Using One-Click Assembly to download and assemble all ESTs of Arabidopsis lyrata deposited in GenBank.
- Go to the web site: http://www.ncbi.nlm.nih.gov/
- Search Nucleotide for "arabidopsis lyrata AND gbdiv_est[PROP]"
- Change Display format to FASTA
- Send to File
- Save on your computer
- Go to the web site: https://www.genome.jp/tools/egassembler/
- Choose File Upload and upload your file, or copy and paste your sequences into the text field.
- Enable Sequence Cleaning Process if needed (you can also choose the number of CPUs and the identity threshold)
- Enable Repeat Masking Process: select the library of repeats that you want to screen against (here the RepBase database with the arabidopsis option)
- Enable Vector Masking Process: select the vector library that you want to screen against (here NCBI's core vector library)
- Enable Organelle Masking Process: select either the plastid or the mitochondria database (here plastids, arabidopsis)
- Enable Sequence Assembly Process; the overlap percent identity cutoff can be modified.
- Click on Submit button.
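The NCBI part of the steps above (search Nucleotide, display as FASTA, send to file) can also be scripted. The sketch below uses NCBI's public E-utilities (`esearch` with the history server, then `efetch` with `rettype=fasta`) with the same query string. It is an illustration under assumptions, not part of EGassembler: the function names, the batch size and the use of `nuccore` as the E-utilities name for the Nucleotide database are choices made for this example.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = "arabidopsis lyrata AND gbdiv_est[PROP]"  # same query as in the steps


def esearch_url(term, db="nuccore"):
    """Build an esearch URL that stores the result set on the history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(webenv, query_key, retstart=0, retmax=500, db="nuccore"):
    """Build an efetch URL returning one batch of records as FASTA text."""
    params = {
        "db": db,
        "WebEnv": webenv,
        "query_key": query_key,
        "retstart": retstart,
        "retmax": retmax,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_fasta(term, out_path, batch=500):
    """Run the search and write all matching records to out_path as FASTA."""
    with urlopen(esearch_url(term)) as resp:
        root = ET.parse(resp).getroot()
    count = int(root.findtext("Count"))
    webenv = root.findtext("WebEnv")
    query_key = root.findtext("QueryKey")
    with open(out_path, "w") as out:
        for start in range(0, count, batch):
            with urlopen(efetch_url(webenv, query_key, start, batch)) as resp:
                out.write(resp.read().decode("utf-8"))
            time.sleep(0.4)  # stay under NCBI's 3 requests/second guideline
    return count
```

Calling `download_fasta(QUERY, "a_lyrata_ests.fasta")` (the file name is hypothetical) would write a FASTA file that can then be uploaded through the File Upload option. Fetching in batches through the history server avoids passing a long list of identifiers in the URL.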
The output is a page with hyperlinks to all results from the different processes.
3- Vector Masking Process
.vec_screen file is the query file with matches to the vector database masked with X's
4- Organelle Masking Process
.org_screen file is the query file with matches to the organelle database masked with X's
5- Assembly Process
.contigs file contains the consensus contigs from the assembly of overlapping ESTs
.singletons file contains the ESTs with no similarity to other ESTs
.cap3-alignment file is a summary of the alignment of the ESTs taking part in the assembly
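Once the result files listed above have been downloaded, a quick summary can help judge the run: how many contigs and singletons were produced, and what fraction of the screened sequences was masked with X's. The following is a minimal sketch that assumes the .contigs, .singletons, .vec_screen and .org_screen files are plain FASTA; the function names and the example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines):
    """Count sequences, total bases and the fraction of X-masked bases."""
    records = list(read_fasta(lines))
    total = sum(len(seq) for _, seq in records)
    masked = sum(seq.upper().count("X") for _, seq in records)
    return {
        "sequences": len(records),
        "bases": total,
        "masked_fraction": masked / total if total else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical records standing in for a downloaded .vec_screen file.
    demo = [">est1", "ACGTXXXX", ">est2", "ACGT", "ACGT"]
    print(summarize(demo))
```

For a real file, pass an open file handle instead of the demo list, for example `summarize(open("myjob.contigs"))`, where `myjob` is a placeholder for the actual job name.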
Using Step-by-Step Assembly to assemble all ESTs of Arabidopsis lyrata.
1- Sequence Cleaning Process
2- Repeat Masking Process
3- Vector Masking Process
4- Organelle Masking Process
The organelle database can be screened instead of the vector database, using the same procedure as for Vector Masking.
5- Sequence Assembly Process
Download Tutorial movie (AVI file 274MByte)
Masoudi-Nejad A, Tonomura K, Kawashima S, Moriya Y, Suzuki M, Itoh M, Kanehisa M, Endo T, Goto S (2006)
EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments.
Nucleic Acids Res. 34:W459-462.