Parallel PRRN : Multiple sequence alignment by the best-first iterative
refinement strategy with tree-dependent partitioning
(modified from the program prrp/prrn)
------------------------------------------------------------------------
Method :

The program Parallel PRRN is an implementation of the best-first search
iterative refinement strategy with tree-dependent partitioning for
multiple sequence alignment [5]. The basic alignment algorithms [2-4]
are shared with the programs prrp/prrn (now unified into "ordinary" or
"serial" prrn [8],
ftp://ftp.genome.ad.jp/pub/db/hgc/software/saitama-cc), which adopt the
randomized iterative refinement strategy [1]. These strategies perform a
large number of pairwise group-to-group alignments to gradually improve
the overall weighted sum-of-pairs score, where the pair weights are
introduced to correct for uneven representation of the sequences to be
aligned [6]. These hill-climbing strategies do not guarantee a true
optimum, but have proven powerful in solving practical alignment
problems.

Parallel PRRN, like its serial counterpart, uses the doubly nested
randomized iterative (DNR) method [7] to make the alignment, the
phylogenetic tree, and the pair weights mutually consistent. In Parallel
PRRN, individual alignment processes are distributed across a number of
processors (presently 32) running in parallel, which greatly reduces the
overall execution time.

The strategy works most effectively for refinement of a crude alignment
obtained by other, more rapid methods, e.g. a progressive alignment
method. The default settings now generate such a provisional alignment
from a set of totally unaligned sequences given by the user.
------------------------------------------------------------------------
References :

[1] Berger, M.P., and Munson, P.J. (1991) "A novel randomized iterative
    strategy for aligning multiple protein sequences." CABIOS 7,
    479-484.
[2] Gotoh, O.
    (1993) "Optimal alignment between groups of sequences and its
    application to multiple sequence alignment." CABIOS 9, 361-370.
[3] Gotoh, O. (1993) "Extraction of conserved or variable regions from a
    multiple sequence alignment." Proceedings of Genome Informatics
    Workshop IV, pp. 109-113.
[4] Gotoh, O. (1994) "Further improvement in group-to-group sequence
    alignment with generalized profile operations." CABIOS 10, 379-387.
[5] Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M. (1995)
    "Comprehensive study on iterative algorithms of multiple sequence
    alignment." CABIOS 11, 13-18.
[6] Gotoh, O. (1995) "A weighting system and algorithm for aligning many
    phylogenetically related sequences." CABIOS 11, 543-551.
[7] Gotoh, O. (1996) "Significant improvement in accuracy of multiple
    protein sequence alignments by iterative refinement as assessed by
    reference to structural alignments." J. Mol. Biol. 264, 823-838.
[8] Gotoh, O. (1999) "Multiple sequence alignment: algorithms and
    applications." Adv. Biophys. 36, 159-206.
------------------------------------------------------------------------
Restrictions on the data size:

1. [The number of sequences] <= 200
2. [The maximum length] <= 2000
3. [The number of sequences] * [The maximum length] <= 50000
------------------------------------------------------------------------
How to use:

 1. Select 'Sequence Type'.
 2. Select 'Score Matrix'.
 3. Set 'Gap Penalty'. Fill in each field for 'Gap Penalty'. Each value
    must satisfy 0 <= Gap_Penalty <= 100.
 4. Select 'Output Format'.
 5. Select 'Method'.
 6. If you want to change other parameters in detail, you can set
    'Window Size', 'Score Matrix (Conserved Regions)' and 'Threshold
    Value (Conserved Regions)'.
 7. To enter your sequences, select "File Upload" or "Copy & Paste".
    7.1 If you select "File Upload", push the 'Browse' button and upload
        your sequence file.
    7.2 If you select "Copy & Paste", copy and paste your sequences into
        the text area.
 8. Enter your E-mail address.
 9.
    If you want to reset the input form, push 'Reset'.
10. Then push the 'Submit' button to submit your query to the server.
------------------------------------------------------------------------
Input sequence format:

Only the "concatenated Fasta" format is acceptable. A greater-than
symbol (>) in the first column indicates that a new sequence begins on
the next line. The name of the sequence (at most 10 characters) must
immediately follow the (>). A double slash (//) indicates the end of a
sequence, but it can usually be omitted. A line starting with a
semicolon (;) is a comment line. Each line should not exceed 255
characters in length. All IUPAC-IUB codes for both amino acids and
nucleotides are recognized, except for the amino-acid code 'U' for
selenocysteine. In addition to the dash (-), the asterisk (*) is also
regarded as a deletion character in an amino-acid sequence. Other
characters, including dot (.) and space ( ), are simply disregarded.

An example of the format is shown below.

>Seq1
aaatt-cccggg
>Seq2
; This is a comment line
atcgatc
gatcgat
//
>NCODE
ACMGRSVTWYHKDBN
//
------------------------------------------------------------------------
Output sequence format:

Examples of the format are shown below.

* Native

>newdata [3:87] ( 1 - 87 )
% 5.0000000e-01 5.0000000e-01 1.0000000e+00

   1 KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL| CSRC(HUMAN
   1 KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL| CABL(HUMAN
   1 -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL| EPH(HUMAN)
     +@G G $GEV$ G oj        VA@KTLK  .   _   FLjEA @M jo H j@@jL

  55 YAVVSEE-PIYIVTEYMSKGSLLDFLK                                 | CSRC(HUMAN
  56 LGVCTREPPFYIITEFMTYGNLLDY--                                 | CABL(HUMAN
  60 EGVVTKRKPIMIITEFMENGA------                                 | EPH(HUMAN)
      .Vo.jj PooI@TE$M  G @@_

Special characters in the last line of each block represent features
conserved among the amino-acid residues aligned at each site (column).
Completely conserved amino acids are indicated by the standard
single-letter codes.
Hydrophobic (o), hydrophilic (j), small aliphatic (.), large aliphatic
(@), aromatic ($), positive (+), and negative (_) sites are also
indicated if one of these features is conserved among the residues. If
none of the above features is conserved, the position is left blank.

* PHYLIP

CSRC(HUMAN KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN) -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL

           YAVVSEE-PIYIVTEYMSKGSLLDFLK
           LGVCTREPPFYIITEFMTYGNLLDY--
           EGVVTKRKPIMIITEFMENGA------

* GCG

Name: CSRC(HUMAN oo Len: 87 Check: 6501 Weight: 0.5000
Name: CABL(HUMAN oo Len: 87 Check: 4840 Weight: 0.5000
Name: EPH(HUMAN) oo Len: 87 Check: 6193 Weight: 1.0000

//

CSRC(HUMAN KLGQGCFGEV WMGTWNG... .TTRVAIKTL KPGTMSPE.. AFLQEAQVMK
CABL(HUMAN KLGGGQYGEV YEGVWKK... YSLTVAVKTL KEDTMEVE.. EFLKEAAVMK
EPH(HUMAN) .IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG

CSRC(HUMAN KLRHEKLVQL YAVVSEE.PI YIVTEYMSKG SLLDFLK
CABL(HUMAN EIKHPNLVQL LGVCTREPPF YIITEFMTYG NLLDY..
EPH(HUMAN) QFSHPHILHL EGVVTKRKPI MIITEFMENG A......

* CLUSTAL

CSRC(HUMAN KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN) -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL

CSRC(HUMAN YAVVSEE-PIYIVTEYMSKGSLLDFLK
CABL(HUMAN LGVCTREPPFYIITEFMTYGNLLDY--
EPH(HUMAN) EGVVTKRKPIMIITEFMENGA------

* GDE

LOCUS CSRC(HUMAN 87 bp PROTEIN
ORIGIN
 1 KLGQGCFGEV WMGTWNG--- -TTRVAIKTL KPGTMSPE-- AFLQEAQVMK KLRHEKLVQL
61 YAVVSEE-PI YIVTEYMSKG SLLDFLK
//
LOCUS CABL(HUMAN 87 bp PROTEIN
ORIGIN
 1 KLGGGQYGEV YEGVWKK--- YSLTVAVKTL KEDTMEVE-- EFLKEAAVMK EIKHPNLVQL
61 LGVCTREPPF YIITEFMTYG NLLDY--
//
LOCUS EPH(HUMAN) 87 bp PROTEIN
ORIGIN
 1 -IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG QFSHPHILHL
61 EGVVTKRKPI MIITEFMENG A------
//
------------------------------------------------------------------------
Parameters:

* Score Matrix
    Score matrix of amino-acid substitutions. PET91 matrices (Jones et
    al. (1992) CABIOS 8, 275-282) are used. PAM levels from 0 to 300 in
    steps of 50 are selectable (the default is PAM 250).

* Gap Opening Penalty
    The constant term of the affine gap weighting function.

* Gap Extension Penalty
    The proportional coefficient of the affine gap weighting function.

* Gap Background Penalty
    An additional penalty given to each deletion character, with a
    sequence-specific weighting factor.

* Multiple Alignment Methods
    The following methods differ in how the provisional alignment is
    provided. The refinement process is common to all selections.

    - Iterative refinement
        The alignment given by the user is refined without
        pre-processing.

    - Sequential (input order) + Iterative refinement
        A provisional alignment is generated by sequentially merging
        individual sequences in the input order.

    - Sequential (from shorter) + Iterative refinement
        A provisional alignment is generated by sequentially merging
        individual sequences in order from shorter to longer.

    - Sequential (from longer) + Iterative refinement
        A provisional alignment is generated by sequentially merging
        individual sequences in order from longer to shorter.

    - Sequential (random) + Iterative refinement
        A provisional alignment is generated by sequentially merging
        individual sequences in a random order.

    - Progressive (input alignment) + Iterative refinement
        A provisional alignment is generated by a progressive method.
        The distance between each pair of sequences is calculated from
        the numbers of substitutions and gaps found in the input
        alignment.

    - Progressive (amino acid content) + Iterative refinement
        A provisional alignment is generated by a progressive method.
        Distances are calculated from differences in amino acid / base
        compositions.

    - Progressive (pairwise alignment) + Iterative refinement
        A provisional alignment is generated by a progressive method.
        Distances are calculated from the numbers of substitutions and
        gaps in pairwise alignments conducted within the session.
        Parallel processors are used for this process.

* Window Size
    The program sets a window of size (|M - N| + 2 * Input_Value) along
    the center of the main diagonal, where M and N are the lengths of
    the sequences.

* Score Matrix (Conserved Regions)
    Score matrix used in finding conserved regions, which are fixed
    during the multiple alignment calculation.

* Threshold Value (Conserved Regions)
    Threshold value used in finding conserved regions, which are fixed
    during the multiple alignment calculation.
------------------------------------------------------------------------
Caveat:

While the default parameter values have been carefully examined for
amino-acid sequences, they have not been sufficiently examined for
nucleotide sequences. Since no information about secondary structures is
taken into account, the quality of alignment of RNA sequences obtained
by Parallel PRRN or prrn may be inferior to that obtained by other
methods.
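
The affine gap weighting function and the window size described under
Parameters can be sketched in Python as follows. This is only an
illustration and is not code from Parallel PRRN: the function and
argument names are hypothetical, the numeric values are arbitrary rather
than the server defaults, and the gap function assumes the common
convention that a run of k deletion characters costs
(opening penalty) + k * (extension penalty). The program's exact
convention may differ, and the Gap Background Penalty is not modeled.

```python
def affine_gap_penalty(gap_length, gap_open, gap_extension):
    """Penalty for a run of `gap_length` deletion characters.

    Assumed convention (not confirmed by this document):
    penalty = gap_open + gap_extension * gap_length,
    i.e. gap_open is the constant term and gap_extension the
    proportional coefficient of the affine function.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extension * gap_length


def window_width(m, n, window_size):
    """Window size along the main diagonal: |M - N| + 2 * Input_Value."""
    if window_size < 0:
        raise ValueError("window_size must be non-negative")
    return abs(m - n) + 2 * window_size


if __name__ == "__main__":
    # Arbitrary illustrative values, not the server defaults.
    print(affine_gap_penalty(3, gap_open=9, gap_extension=2))  # 15
    print(window_width(120, 100, window_size=10))              # 40
```

For example, two sequences of lengths 120 and 100 with a Window Size of
10 give a window of |120 - 100| + 2 * 10 = 40.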
------------------------------------------------------------------------
If you have any questions, comments or bug reports, please contact:

Yasushi Totoki
E-mail: totoki@gsc.riken.jp