Parallel PRRN : Multiple sequence alignment
                   by the best-first iterative refinement strategy
                   with tree-dependent partitioning
                   (modified from the program prrp/prrn)

  ------------------------------------------------------------------------

Method :

  The program Parallel PRRN is an implementation of the best-first search
iterative refinement strategy with tree-dependent partitioning for multiple
sequence alignment [5]. The basic alignment algorithms [2-4] are shared with
the programs prrp/prrn (now unified as "ordinary" or "serial" prrn [8],
ftp://ftp.genome.ad.jp/pub/db/hgc/software/saitama-cc), which adopt the
randomized iterative refinement strategy [1]. Both strategies perform a
large number of pairwise group-to-group alignments to gradually improve the
overall weighted sum-of-pairs score, where the pair weights are introduced
to correct for uneven representation of the sequences to be aligned [6].
These hill-climbing strategies do not guarantee a true optimum, but have
proven powerful in solving practical alignment problems.
Parallel PRRN, like its serial counterpart, uses the doubly nested
randomized iterative (DNR) method [7] to make the alignment, the
phylogenetic tree, and the pair weights mutually consistent. In Parallel
PRRN, individual alignment processes are distributed to a number of
(presently 32) processors running in parallel, which greatly reduces the
overall execution time. The strategy works most effectively for refining a
crude alignment obtained by other, more rapid methods, e.g. a progressive
alignment method. The default settings now generate such a provisional
alignment from a set of totally unaligned sequences given by the user.
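
  As an illustration of the weighted sum-of-pairs score mentioned above, the
following Python sketch sums a pairwise score over all pairs of aligned rows,
each multiplied by a pair weight. The match, mismatch and gap values and the
function names are hypothetical placeholders chosen for this example; the
actual program uses substitution matrices, affine gap penalties and the
weights of [6], none of which are reproduced here.

```python
from itertools import combinations


def toy_pair_score(row_a, row_b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy column-by-column score of two aligned rows (not the PRRN scoring)."""
    total = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both rows contributes nothing
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def weighted_sp_score(rows, pair_weight):
    """Sum of pair_weight(i, j) * pair score over all row pairs i < j."""
    return sum(
        pair_weight(i, j) * toy_pair_score(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    )


rows = ["AC-G", "ACTG", "A-TG"]
uniform_score = weighted_sp_score(rows, lambda i, j: 1.0)
```

An iterative refinement step would accept a re-alignment of two groups only
if this kind of score does not decrease.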

  ------------------------------------------------------------------------

References :

[1] Berger, M.P., and Munson, P.J. (1991) 
    "A novel randomized iterative strategy for aligning multiple protein 
    sequences." CABIOS 7, 479-484.

[2] Gotoh, O. (1993) 
    "Optimal alignment between groups of sequences and its application to 
    multiple sequence alignment." CABIOS 9, 361-370.

[3] Gotoh, O. (1993) 
    "Extraction of conserved or variable regions from a multiple sequence 
    alignment." Proceedings of Genome Informatics Workshop IV, pp. 109-113.

[4] Gotoh, O. (1994) 
    "Further improvement in group-to-group sequence alignment with generalized 
    profile operations." CABIOS 10, 379-387.

[5] Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M. (1995) 
    "Comprehensive study on iterative algorithms of multiple sequence 
    alignment." CABIOS 11, 13-18.

[6] Gotoh, O. (1995) 
    "A weighting system and algorithm for aligning many phylogenetically 
    related sequences." CABIOS 11, 543-551.

[7] Gotoh, O. (1996) 
    "Significant improvement in accuracy of multiple protein sequence 
    alignments by iterative refinement as assessed by reference to structural 
    alignments." J. Mol. Biol. 264, 823-838.

[8] Gotoh, O. (1999) 
    "Multiple sequence alignment: algorithms and applications."
    Adv. Biophys. 36, 159-206.

  ------------------------------------------------------------------------

Restrictions of the data size:

 1. [The number of sequences] <= 200

 2. [The maximum length] <= 2000

 3. [The number of sequences] * [The maximum length] <= 50000
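
The three limits can be checked before submission with a short Python sketch
such as the one below. It assumes that "the maximum length" means the length
of the longest input sequence; the function and constant names are made up
for this example and are not part of the server.

```python
MAX_SEQUENCES = 200   # limit 1
MAX_LENGTH = 2000     # limit 2
MAX_PRODUCT = 50000   # limit 3


def check_data_size(lengths):
    """Return a list of messages, one per violated limit (empty if all pass)."""
    problems = []
    count = len(lengths)
    longest = max(lengths, default=0)
    if count > MAX_SEQUENCES:
        problems.append(f"{count} sequences exceed the limit of {MAX_SEQUENCES}")
    if longest > MAX_LENGTH:
        problems.append(f"maximum length {longest} exceeds {MAX_LENGTH}")
    if count * longest > MAX_PRODUCT:
        problems.append(f"{count} * {longest} exceeds {MAX_PRODUCT}")
    return problems
```

Note that limit 3 is the binding one for many inputs: 100 sequences of length
300 pass, while 30 sequences of length 2000 do not.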

  ------------------------------------------------------------------------

How to use:

 1. Select 'Sequence Type'.

 2. Select 'Score Matrix'.

 3. Set 'Gap Penalty'.
    Fill in each field for 'Gap Penalty'.
    Each value must satisfy 0 <= Gap_Penalty <= 100.

 4. Select 'Output Format'.

 5. Select 'Method'.

 6. If you want to change other parameters in detail,
    you can set 'Window Size', 'Score Matrix (Conserved Regions)' 
    and 'Threshold Value (Conserved Regions)'.

 7. To enter your sequences, select "File Upload" or "Copy & Paste".

    7.1 If you select "File Upload",
        push the 'Browse' button and upload your sequence file.

    7.2 If you select "Copy & Paste", 
        copy and paste your sequences into the text area.

 8. Enter your E-mail address.

 9. If you want to reset the input form, push 'Reset'.

10. Then push the 'Submit' button to submit your query to the server.

  ------------------------------------------------------------------------

Input sequence format:

  Only the "concatenated Fasta" format is accepted.
  A greater-than symbol (>) in the first column indicates that a new sequence
begins on the next line. The name of the sequence (at most 10 characters)
must immediately follow the (>). A double slash (//) indicates the end of a
sequence, but it can usually be omitted. A line starting with a semicolon (;)
is a comment line. No line should exceed 255 characters in length.
  All IUPAC-IUB codes for both amino acids and nucleotides are recognized,
except for the amino-acid code 'U' for selenocysteine. In addition to dash
(-), asterisk (*) is also regarded as a deletion character in an amino-acid
sequence. Other characters, including dot (.) and space ( ), are simply
disregarded.

An example of the format is shown below.

>Seq1
aaatt-cccggg
>Seq2
; This is a comment line
atcgatc
gatcgat
//
>NCODE
ACMGRSVTWYHKDBN
//
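
For readers who want to pre-check a file, the rules above can be sketched as
a small Python parser. This is an illustration only, not the server's own
reader: the function name is hypothetical, the name is taken to be the first
whitespace-delimited token after (>), every letter is kept without checking
it against the IUPAC-IUB codes, and the 10-character and 255-character limits
are not enforced.

```python
def parse_concatenated_fasta(text):
    """Parse the format described above into a list of (name, residues)."""
    records = []
    name, chunks = None, []

    def flush():
        # Store the sequence collected so far, if any, and reset the state.
        nonlocal name, chunks
        if name is not None:
            records.append((name, "".join(chunks)))
        name, chunks = None, []

    for line in text.splitlines():
        if line.startswith(";"):
            continue                      # comment line
        if line.startswith("//"):
            flush()                       # explicit end of a sequence
        elif line.startswith(">"):
            flush()                       # a new sequence also ends the old one
            fields = line[1:].split()
            name = fields[0] if fields else ""
        elif name is not None:
            # Keep letters and the deletion characters '-' and '*';
            # dots, spaces and other characters are disregarded.
            chunks.append("".join(c for c in line if c.isalpha() or c in "-*"))
    flush()
    return records


example = ">Seq1\naaatt-cccggg\n>Seq2\n; This is a comment line\natcgatc\ngatcgat\n//\n"
parsed = parse_concatenated_fasta(example)
# parsed == [("Seq1", "aaatt-cccggg"), ("Seq2", "atcgatcgatcgat")]
```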

  ------------------------------------------------------------------------

Output sequence format:

Examples of the format are shown below.

* Native

>newdata [3:87]  ( 1 - 87 )
%  5.0000000e-01  5.0000000e-01  1.0000000e+00

     1 KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL| CSRC(HUMAN
     1 KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL| CABL(HUMAN
     1 -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL| EPH(HUMAN)
       +@G G $GEV$ G oj        VA@KTLK  .   _   FLjEA @M jo H j@@jL

    55 YAVVSEE-PIYIVTEYMSKGSLLDFLK                                 | CSRC(HUMAN
    56 LGVCTREPPFYIITEFMTYGNLLDY--                                 | CABL(HUMAN
    60 EGVVTKRKPIMIITEFMENGA------                                 | EPH(HUMAN)
        .Vo.jj PooI@TE$M  G @@_


Special characters in the last line of each block represent features
conserved among the amino acid residues aligned at each site (column).
Completely conserved amino acids are indicated by the standard single-letter
codes.  Hydrophobic (o), hydrophilic (j), small aliphatic (.), large
aliphatic (@), aromatic ($), positive (+), and negative (_) sites are also
indicated if one of these features is conserved among the residues.  If none
of the above features is conserved, the position is left blank.


* PHYLIP

CSRC(HUMAN  KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN  KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN)  -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL

            YAVVSEE-PIYIVTEYMSKGSLLDFLK
            LGVCTREPPFYIITEFMTYGNLLDY--
            EGVVTKRKPIMIITEFMENGA------

* GCG

Name: CSRC(HUMAN      oo  Len:   87  Check:  6501  Weight:  0.5000
Name: CABL(HUMAN      oo  Len:   87  Check:  4840  Weight:  0.5000
Name: EPH(HUMAN)      oo  Len:   87  Check:  6193  Weight:  1.0000

//

CSRC(HUMAN      KLGQGCFGEV WMGTWNG... .TTRVAIKTL KPGTMSPE.. AFLQEAQVMK
CABL(HUMAN      KLGGGQYGEV YEGVWKK... YSLTVAVKTL KEDTMEVE.. EFLKEAAVMK
EPH(HUMAN)      .IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG

CSRC(HUMAN      KLRHEKLVQL YAVVSEE.PI YIVTEYMSKG SLLDFLK
CABL(HUMAN      EIKHPNLVQL LGVCTREPPF YIITEFMTYG NLLDY..
EPH(HUMAN)      QFSHPHILHL EGVVTKRKPI MIITEFMENG A......

* CLUSTAL

CSRC(HUMAN     KLGQGCFGEVWMGTWNG----TTRVAIKTLKPGTMSPE--AFLQEAQVMKKLRHEKLVQL
CABL(HUMAN     KLGGGQYGEVYEGVWKK---YSLTVAVKTLKEDTMEVE--EFLKEAAVMKEIKHPNLVQL
EPH(HUMAN)     -IGEGEFGEVYRGTLRLPSQDCKTVAIKTLKDTSPGGQWWNFLREATIMGQFSHPHILHL

CSRC(HUMAN     YAVVSEE-PIYIVTEYMSKGSLLDFLK                                 
CABL(HUMAN     LGVCTREPPFYIITEFMTYGNLLDY--                                 
EPH(HUMAN)     EGVVTKRKPIMIITEFMENGA------       

* GDE

LOCUS       CSRC(HUMAN         87 bp    PROTEIN
ORIGIN
        1 KLGQGCFGEV WMGTWNG--- -TTRVAIKTL KPGTMSPE-- AFLQEAQVMK KLRHEKLVQL
       61 YAVVSEE-PI YIVTEYMSKG SLLDFLK
//
LOCUS       CABL(HUMAN         87 bp    PROTEIN
ORIGIN
        1 KLGGGQYGEV YEGVWKK--- YSLTVAVKTL KEDTMEVE-- EFLKEAAVMK EIKHPNLVQL
       61 LGVCTREPPF YIITEFMTYG NLLDY--
//
LOCUS       EPH(HUMAN)         87 bp    PROTEIN
ORIGIN
        1 -IGEGEFGEV YRGTLRLPSQ DCKTVAIKTL KDTSPGGQWW NFLREATIMG QFSHPHILHL
       61 EGVVTKRKPI MIITEFMENG A------
//

  ------------------------------------------------------------------------

Parameters:

* Score Matrix
    Score matrix of amino-acid substitutions.
    PET91 matrices (Jones et al. (1992) CABIOS 8, 275-282) are used.
    PAM levels from 0 to 300 in steps of 50 are selectable (the default is
    PAM 250).

* Gap Opening Penalty
    The constant term of the affine gap weighting function.

* Gap Extension Penalty
    The proportional coefficient of the affine gap weighting function.

* Gap Background Penalty
    An additional penalty given to each deletion character with a
    sequence-specific weighting factor.
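
    The descriptions of the opening and extension penalties correspond to an
    affine function of the gap length. A minimal Python sketch is given
    below; it assumes the form cost = opening + extension * length, which is
    one common convention (other programs charge the extension only from the
    second position onward), and it leaves out the background penalty. The
    function name is hypothetical.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of the given length under an affine weighting.

    gap_open is the constant term and gap_extend the proportional
    coefficient; a gap of length 0 (no gap) costs nothing.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length
```

    With an opening penalty of 10 and an extension penalty of 2, a single gap
    of length 5 costs 20, whereas five separate gaps of length 1 cost 60, so
    the function favors a few long gaps over many short ones.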

* Multiple Alignment Methods
    The following methods differ in providing the provisional alignment.  
    The refinement process is common to all selections.

  - Iterative refinement
      The alignment given by the user is refined without pre-processing.

  - Sequential (input order) + Iterative refinement
      A provisional alignment is generated by sequential merger of
      individual sequences in the input order.

  - Sequential (from shorter) + Iterative refinement
      A provisional alignment is generated by sequential merger of
      individual sequences in the order from shorter to longer ones.

  - Sequential (from longer) + Iterative refinement
      A provisional alignment is generated by sequential merger of
      individual sequences in the order from longer to shorter ones.

  - Sequential (random) + Iterative refinement
      A provisional alignment is generated by sequentially merging
      individual sequences in a random order.

  - Progressive (input alignment) + Iterative refinement
      A provisional alignment is generated by a progressive method.
      Distance between each pair of sequences is calculated according to 
      the numbers of substitutions and gaps found in the input alignment.

  - Progressive (amino acid content) + Iterative refinement
      A provisional alignment is generated by a progressive method.
      Distances are calculated based on differences in amino acid / base
      compositions.

  - Progressive (pairwise alignment) + Iterative refinement
      A provisional alignment is generated by a progressive method.
      Distances are calculated based on the numbers of substitutions and
      gaps in pairwise alignments conducted within the session.  Parallel
      processors are used for this process.
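
    The four "Sequential" selections differ only in the order in which the
    sequences are merged. The Python sketch below shows one way such an order
    could be derived from the sequence lengths; the mode names, the function
    name and the tie-breaking rule (input order) are assumptions made for
    this example and are not taken from the program.

```python
import random


def merge_order(lengths, mode, seed=None):
    """Return sequence indices in the order they would be merged."""
    indices = list(range(len(lengths)))
    if mode == "input":
        return indices
    if mode == "shorter":
        # sorted() is stable, so equal lengths keep their input order
        return sorted(indices, key=lambda i: lengths[i])
    if mode == "longer":
        return sorted(indices, key=lambda i: -lengths[i])
    if mode == "random":
        rng = random.Random(seed)
        rng.shuffle(indices)
        return indices
    raise ValueError(f"unknown mode: {mode}")
```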


* Window Size
    The program sets a window of the size of (|M - N| + 2 * Input_Value)
    along the center of the main diagonal, where M and N are the lengths of 
    the sequences.
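
    The width formula can be evaluated directly, as in this short Python
    sketch (the function name is hypothetical):

```python
def window_width(m, n, input_value):
    """Window size |M - N| + 2 * Input_Value for sequence lengths M and N."""
    return abs(m - n) + 2 * input_value
```

    For sequences of lengths 120 and 100 and an input value of 10, the window
    is 40 wide. A larger input value widens the searched band at the cost of
    more computation.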

* Score Matrix (Conserved Regions)
    Score matrix used in finding conserved regions which are fixed
    during multiple alignment calculation.

* Threshold Value (Conserved Regions)
    Threshold value used in finding conserved regions which are fixed
    during multiple alignment calculation.

  ------------------------------------------------------------------------

Caveat:

  While the default parameter values have been carefully examined for
amino-acid sequences, they have not been sufficiently tested for nucleotide
sequences. Since no information about secondary structures is taken into
account, the quality of alignment of RNA sequences obtained by Parallel PRRN
or prrn may be inferior to that obtained by other methods.

  ------------------------------------------------------------------------

If you have any questions, comments or bug reports, please contact:

Yasushi Totoki
E-mail: totoki@gsc.riken.jp