Parallel PRRN : Multiple sequence alignment
by the best-first iterative refinement strategy
with tree-dependent partitioning
(modified from the program prrp/prrn)
------------------------------------------------------------------------
Method :
The program Parallel PRRN is an implementation of the best-first search
iterative refinement strategy with tree-dependent partitioning for multiple
sequence alignment [5]. The basic alignment algorithms [2-4] are shared with
the programs prrp/prrn (now unified as "ordinary" or "serial" prrn [8],
ftp://ftp.genome.ad.jp/pub/db/hgc/software/saitama-cc), which adopt the
randomized iterative refinement strategy [1]. These strategies perform a
large number of pairwise group-to-group alignments to gradually improve the
overall weighted sum-of-pairs score, where the pair weights are introduced
to correct for uneven representation of the sequences to be aligned [6].
These hill-climbing strategies do not guarantee a true optimum, but have
proven powerful in solving practical alignment problems.
Parallel PRRN, like its serial counterpart, uses the doubly nested
randomized iterative (DNR) method [7] to make the alignment, phylogenetic
tree and pair weights mutually consistent. In Parallel PRRN, individual
alignment processes are distributed to a number of (presently 32) processors
running in parallel, which greatly reduces the overall execution time. The
strategy works most effectively for refinement of a crude alignment obtained
by other more rapid methods, e.g. a progressive alignment method. The
default settings now generate such a provisional alignment from a set of
totally unaligned sequences given by the user.
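As a rough illustration of the objective described above, the following
Python sketch computes a weighted sum-of-pairs score for a small alignment.
The match, mismatch and gap values and the pair weights are placeholder
assumptions chosen for the example. They are not the scoring used by
Parallel PRRN, which relies on a substitution matrix and affine gap
penalties.

```python
from itertools import combinations

def pair_score(a, b, match=2.0, mismatch=-1.0, gap=-2.0):
    """Score two aligned rows column by column (toy scoring, assumed values)."""
    total = 0.0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # a column with two gaps contributes nothing
        if x == "-" or y == "-":
            total += gap          # simplified per-column gap cost
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def weighted_sp_score(rows, pair_weights):
    """Sum the pair scores over all row pairs, each scaled by its pair weight."""
    return sum(pair_weights[(i, j)] * pair_score(rows[i], rows[j])
               for i, j in combinations(range(len(rows)), 2))

rows = ["AC-GT", "ACAGT", "A--GT"]
weights = {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 1.0}   # hypothetical pair weights
print(weighted_sp_score(rows, weights))
```

An iterative refinement step would keep a re-aligned pair of groups only
when this kind of score does not decrease.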
------------------------------------------------------------------------
References :
[1] Berger, M.P., and Munson, P.J. (1991)
"A novel randomized iterative strategy for aligning multiple protein
sequences." CABIOS 7, 479-484.
[2] Gotoh, O. (1993)
"Optimal alignment between groups of sequences and its application to
multiple sequence alignment." CABIOS 9, 361-370.
[3] Gotoh, O. (1993)
"Extraction of conserved or variable regions from a multiple sequence
alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.
[4] Gotoh, O. (1994)
"Further improvement in group-to-group sequence alignment with generalized
profile operations." CABIOS 10, 379-387.
[5] Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M. (1995)
"Comprehensive study on iterative algorithms of multiple sequence
alignment." CABIOS 11, 13-18.
[6] Gotoh, O. (1995)
"A weighting system and algorithm for aligning many phylogenetically
related sequences." CABIOS 11, 543-551.
[7] Gotoh, O. (1996)
"Significant improvement in accuracy of multiple protein sequence
alignments by iterative refinement as assessed by reference to structural
alignments." J. Mol. Biol. 264, 823-838.
[8] Gotoh, O. (1999)
"Multiple sequence alignment: algorithms and applications."
Adv. Biophys. 36, 159-206.
------------------------------------------------------------------------
Restrictions of the data size:
1. [The number of sequences] <= 200
2. [The maximum length] <= 2000
3. [The number of sequences] * [The maximum length] <= 50000
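The three restrictions can be checked before submission with a short Python
sketch such as the one below. The function name is hypothetical, and the
sketch assumes that the "length" of a sequence is simply the number of
characters supplied for it.

```python
MAX_SEQUENCES = 200      # restriction 1
MAX_LENGTH = 2000        # restriction 2
MAX_PRODUCT = 50000      # restriction 3

def check_data_size(sequences):
    """Return a list of violated restrictions (empty if the input fits)."""
    n = len(sequences)
    longest = max((len(s) for s in sequences), default=0)
    problems = []
    if n > MAX_SEQUENCES:
        problems.append(f"{n} sequences (limit {MAX_SEQUENCES})")
    if longest > MAX_LENGTH:
        problems.append(f"maximum length {longest} (limit {MAX_LENGTH})")
    if n * longest > MAX_PRODUCT:
        problems.append(f"sequences * length = {n * longest} (limit {MAX_PRODUCT})")
    return problems

# 100 sequences of length 1000 pass restrictions 1 and 2 but fail restriction 3.
print(check_data_size(["A" * 1000] * 100))
```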
------------------------------------------------------------------------
How to use:
1. Select 'Sequence Type'.
2. Select 'Score Matrix'.
3. Set 'Gap Penalty'.
Fill in each field for 'Gap Penalty'.
Each value must satisfy 0 <= Gap_Penalty <= 100.
4. Select 'Output Format'.
5. Select 'Method'.
6. If you want to change other parameters in detail,
you can set 'Window Size', 'Score Matrix (Conserved Regions)'
and 'Threshold Value (Conserved Regions)'.
7. To enter your sequences, select "File Upload" or "Copy & Paste".
7.1 If you select "File Upload",
push the 'Browse' button and upload your sequence file.
7.2 If you select "Copy & Paste",
copy and paste your sequences into the text area.
8. Enter your E-mail address.
9. If you want to reset the input form, push 'Reset'.
10. Then, push the 'Submit' button to submit your query to the server.
------------------------------------------------------------------------
Input sequence format:
Only the "concatenated Fasta" format is accepted.
A greater-than symbol (>) in the first column indicates that a new sequence
begins on the next line. The name of the sequence (at most 10 characters)
must immediately follow the (>). A double slash (//) indicates
the end of a sequence, but it can usually be omitted. A line starting with a
semicolon (;) is a comment line. No line should exceed 255 characters
in length.
All IUPAC-IUB codes for both amino acids and nucleotides are recognized,
except for the amino-acid code 'U' for selenocysteine. In addition to dash
(-), asterisk (*) is also regarded as a deletion character in an amino-acid
sequence. Other characters, including dot (.) and space ( ), are simply
disregarded.
An example of the format is shown below.
>Seq1
aaatt-cccggg
>Seq2
; This is a comment line
atcgatc
gatcgat
//
>NCODE
ACMGRSVTWYHKDBN
//
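The rules above can be summarized by the following Python sketch of a
reader for this format. It is an illustration only, not the parser used by
the server. As assumptions, the name is taken to be the first
whitespace-delimited token after (>), over-long names and lines raise an
error, and every letter is kept rather than only the IUPAC-IUB codes.

```python
def parse_concatenated_fasta(text):
    """Parse the format described above into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []

    def flush():
        # Store the record collected so far, if any, and start afresh.
        nonlocal name, chunks
        if name is not None:
            records.append((name, "".join(chunks)))
        name, chunks = None, []

    for line in text.splitlines():
        if len(line) > 255:
            raise ValueError("line exceeds 255 characters")
        if line.startswith(";"):
            continue                      # comment line
        if line.startswith(">"):
            flush()                       # a new header closes the previous record
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if len(name) > 10:
                raise ValueError("sequence name exceeds 10 characters")
        elif line.startswith("//"):
            flush()                       # explicit end of sequence
        elif name is not None:
            # Keep letters and the deletion characters '-' and '*';
            # dots, spaces and all other characters are dropped.
            chunks.append("".join(c for c in line if c.isalpha() or c in "-*"))
    flush()                               # the final '//' may be omitted
    return records

example = ">Seq1\naaatt-cccggg\n>Seq2\n; This is a comment line\natcgatc\ngatcgat\n//\n"
print(parse_concatenated_fasta(example))
```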
------------------------------------------------------------------------
Output sequence format:
Examples of the format are shown below.
* Native
>newdata [3:87] ( 1 - 87 )
% 5.0000000e-01 5.0000000e-01 1.0000000e+00
1 KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL| CSRC(HUMAN
1 KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL| CABL(HUMAN
1 -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL| EPH(HUMAN)
+@G G $GEV$ G oj VA@KTLK . _ FLjEA @M jo H j@@jL
55 YAVVSEE-PIYIVTEYMSKGSLLDFLK | CSRC(HUMAN
56 LGVCTREPPFYIITEFMTYGNLLDY-- | CABL(HUMAN
60 EGVVTKRKPIMIITEFMENGA------ | EPH(HUMAN)
.Vo.jj PooI@TE$M G @@_
Special characters in the last line in each block represent some features
conserved among the amino acid residues aligned at the sites (columns).
Completely conserved amino acids are indicated by the standard single-letter
codes. Hydrophobic (o), hydrophilic (j), small aliphatic (.), large
aliphatic (@), aromatic ($), positive (+), and negative (_) sites are also
indicated, if one of these features is conserved among the residues. If none
of the above features is conserved, the position is left blank.
* PHYLIP
CSRC(HUMAN KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN) -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL
YAVVSEE-PIYIVTEYMSKGSLLDFLK
LGVCTREPPFYIITEFMTYGNLLDY--
EGVVTKRKPIMIITEFMENGA------
* GCG
Name: CSRC(HUMAN oo Len: 87 Check: 6501 Weight: 0.5000
Name: CABL(HUMAN oo Len: 87 Check: 4840 Weight: 0.5000
Name: EPH(HUMAN) oo Len: 87 Check: 6193 Weight: 1.0000
//
CSRC(HUMAN KLGQGCFGEV WMGTWNG... .TTRVAIKTL KPGTMSPE.. AFLQEAQVMK
CABL(HUMAN KLGGGQYGEV YEGVWKK... YSLTVAVKTL KEDTMEVE.. EFLKEAAVMK
EPH(HUMAN) .IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG
CSRC(HUMAN KLRHEKLVQL YAVVSEE.PI YIVTEYMSKG SLLDFLK
CABL(HUMAN EIKHPNLVQL LGVCTREPPF YIITEFMTYG NLLDY..
EPH(HUMAN) QFSHPHILHL EGVVTKRKPI MIITEFMENG A......
* CLUSTAL
CSRC(HUMAN KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN) -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL
CSRC(HUMAN YAVVSEE-PIYIVTEYMSKGSLLDFLK
CABL(HUMAN LGVCTREPPFYIITEFMTYGNLLDY--
EPH(HUMAN) EGVVTKRKPIMIITEFMENGA------
* GDE
LOCUS CSRC(HUMAN 87 bp PROTEIN
ORIGIN
1 KLGQGCFGEV WMGTWNG--- -TTRVAIKTL KPGTMSPE-- AFLQEAQVMK KLRHEKLVQL
61 YAVVSEE-PI YIVTEYMSKG SLLDFLK
//
LOCUS CABL(HUMAN 87 bp PROTEIN
ORIGIN
1 KLGGGQYGEV YEGVWKK--- YSLTVAVKTL KEDTMEVE-- EFLKEAAVMK EIKHPNLVQL
61 LGVCTREPPF YIITEFMTYG NLLDY--
//
LOCUS EPH(HUMAN) 87 bp PROTEIN
ORIGIN
1 -IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG QFSHPHILHL
61 EGVVTKRKPI MIITEFMENG A------
//
------------------------------------------------------------------------
Parameters:
* Score Matrix
Score matrix of amino-acid substitutions.
PET91 matrices (Jones et al. (1992) CABIOS 8, 275-282) are used.
PAM levels from 0 to 300 in steps of 50 are selectable (the default is PAM 250).
* Gap Opening Penalty
The constant term of the affine gap weighting function.
* Gap Extension Penalty
The proportional coefficient of the affine gap weighting function.
* Gap Background Penalty
An additional penalty given to each deletion character, scaled by a
sequence-specific weighting factor.
* Multiple Alignment Methods
The following methods differ in providing the provisional alignment.
The refinement process is common to all selections.
- Iterative refinement
The alignment given by the user is refined without pre-processing.
- Sequential (input order) + Iterative refinement
A provisional alignment is generated by sequential merger of
individual sequences in the input order.
- Sequential (from shorter) + Iterative refinement
A provisional alignment is generated by sequential merger of
individual sequences in the order from shorter to longer ones.
- Sequential (from longer) + Iterative refinement
A provisional alignment is generated by sequential merger of
individual sequences in the order from longer to shorter ones.
- Sequential (random) + Iterative refinement
A provisional alignment is generated by sequentially merging
individual sequences in a random order.
- Progressive (input alignment) + Iterative refinement
A provisional alignment is generated by a progressive method.
Distance between each pair of sequences is calculated according to
the numbers of substitutions and gaps found in the input alignment.
- Progressive (amino acid content) + Iterative refinement
A provisional alignment is generated by a progressive method.
Distances are calculated based on differences in amino acid / base
compositions.
- Progressive (pairwise alignment) + Iterative refinement
A provisional alignment is generated by a progressive method.
Distances are calculated based on the numbers of substitutions and
gaps in pairwise alignments conducted within the session. Parallel
processors are used for this process.
* Window Size
The program sets a window of the size of (|M - N| + 2 * Input_Value)
along the center of the main diagonal, where M and N are the lengths of
the sequences.
* Score Matrix (Conserved Regions)
Score matrix used in finding conserved regions which are fixed
during multiple alignment calculation.
* Threshold Value (Conserved Regions)
Threshold value used in finding conserved regions which are fixed
during multiple alignment calculation.
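Two of the parameters above are defined by simple formulas. The Python
sketch below restates the affine gap weighting function (Gap Opening Penalty
as the constant term, Gap Extension Penalty as the proportional coefficient)
and the window size expression. The function names and the numeric values in
the example are illustrative assumptions, not server defaults.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for a gap of `length` residues: constant term plus proportional term."""
    if length <= 0:
        return 0                  # no gap, no penalty
    return gap_open + gap_extend * length

def window_width(m, n, input_value):
    """Window size |M - N| + 2 * Input_Value for sequences of lengths M and N."""
    return abs(m - n) + 2 * input_value

# Hypothetical values, for illustration only.
print(affine_gap_penalty(3, gap_open=10, gap_extend=2))
print(window_width(120, 100, input_value=15))
```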
------------------------------------------------------------------------
Caveat:
While the default parameter values have been carefully examined for
amino-acid sequences, they have not been sufficiently examined for nucleotide
sequences. Since no information about secondary structure is taken into
account, the quality of alignments of RNA sequences obtained by Parallel PRRN
or prrn may be inferior to that obtained by other methods.
------------------------------------------------------------------------
If you have any questions, comments or bug reports, please contact:
Yasushi Totoki
E-mail: totoki@gsc.riken.jp