What is TSEG?


Contents

How to Analyze Membrane Proteins Using TSEG system?
Algorithm of Detection Tool of Membrane Proteins in Genome Sequence
Algorithm of TSEG
Accuracy of TSEG
References

How to Analyze Membrane Proteins Using TSEG system?

To predict transmembrane segments in query protein sequences, use the page named "Executing TSEG". At the top of the page, select the number of discriminant functions to use: 5 or 1. Using 5 discriminant functions gives better prediction results, especially for 5 TM and 7 TM proteins, but prediction with 5 discriminant functions is rather slow for long sequences (longer than about 1500 amino acids). (See also "Algorithm of TSEG".)
Enter the sequences in the next window. FASTA format is accepted. Alternatively, you can specify a file of FASTA sequences on your local disk.
If you want to know which of many sequences are transmembrane proteins, for example among all ORFs in a genome, I strongly recommend using the page named "Tool for Detecting Membrane Proteins in Genome Sequence". On that page, transmembrane proteins are first identified by a much faster method, which does not report the positions of the transmembrane segments in the sequences. If you then want the predicted positions of the transmembrane segments, click the ID of an identified transmembrane protein to run TSEG on it.
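Before submitting, it can help to check the input locally. The following is a minimal sketch, not part of TSEG itself: it parses FASTA text and flags sequences longer than about 1500 residues, for which the 5-function mode is described above as slow. The names read_fasta, suggest_function_count and LONG_SEQUENCE are hypothetical, and suggesting the 1-function mode for long sequences is only an illustrative rule of thumb.

```python
LONG_SEQUENCE = 1500  # approximate length above which the 5-function mode is slow


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def suggest_function_count(sequence):
    """Illustrative rule of thumb only: 1 function for very long sequences, else 5."""
    return 1 if len(sequence) > LONG_SEQUENCE else 5


example = """>short_protein
MKTLLVAGAVLLS
AAGLLA
>long_protein
""" + "A" * 1600

records = read_fasta(example)
for name, seq in records:
    print(name, len(seq), suggest_function_count(seq))
```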
[Top]
Algorithm of Detection Tool of Membrane Proteins in Genome Sequence

The tool uses a discriminant function constructed to separate the two groups of the training set, membrane proteins and globular proteins, based on the most hydrophobic 17-residue segment in each sequence. The training set of membrane proteins consists of 3250 sequences, every pair of which has less than 30% sequence identity. The training set of globular proteins consists of 928 sequences from the May 1997 version of PDBSELECT. The discriminant function correctly predicts 94.3% of the membrane proteins and 95.2% of the globular proteins.
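The idea of scoring the most hydrophobic 17-residue segment can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the text above does not name a hydrophobicity scale, so the Kyte-Doolittle scale is used as a stand-in, and WEIGHT and BIAS are placeholder coefficients rather than the trained discriminant function.

```python
# Kyte-Doolittle hydropathy values (an assumed stand-in scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
WINDOW = 17


def most_hydrophobic_window(sequence, window=WINDOW):
    """Return (start, mean hydrophobicity) of the most hydrophobic window."""
    values = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in sequence]
    if len(values) < window:
        raise ValueError("sequence is shorter than the window")
    total = sum(values[:window])
    best_start, best_total = 0, total
    for start in range(1, len(values) - window + 1):
        # Slide by one residue: add the new residue, drop the old one.
        total += values[start + window - 1] - values[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total / window


# Placeholder linear discriminant: score = WEIGHT * h + BIAS (hypothetical values).
WEIGHT, BIAS = 1.0, -1.6


def looks_like_membrane_protein(sequence):
    """Classify by the sign of the placeholder discriminant score."""
    _, hydrophobicity = most_hydrophobic_window(sequence)
    return WEIGHT * hydrophobicity + BIAS > 0


membrane_like = "D" * 10 + "L" * 17 + "K" * 10
globular_like = "DEKR" * 10
print(most_hydrophobic_window(membrane_like))
print(looks_like_membrane_protein(membrane_like), looks_like_membrane_protein(globular_like))
```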
[Top]
Algorithm of TSEG

TSEG is based on a classification of transmembrane (TM) segments. TM segments in the SWISS-PROT database are first divided into subgroups by the number of TM segments in the protein and the order in which each segment spans the membrane. Similar subgroups are then merged to form 5 groups. Mahalanobis distance was used as the measure of similarity between subgroups, with average hydrophobicity and AP value (periodicity of alpha helix) as the parameters that characterize each subgroup. The locations of the 5 groups of TM segments in 1 TM to 14 TM proteins are shown in Figure 1 (the right figure). The 15 protein types (from 0 TM to 14 TM proteins) are called models of membrane proteins below.
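To make the subgroup comparison concrete, here is a small sketch of a Mahalanobis distance between two subgroups described by two parameters (average hydrophobicity, AP value). It assumes a pooled sample covariance of the two subgroups, which may differ from the procedure in reference [3], and the data points are invented purely for illustration.

```python
def mean_vector(points):
    """Mean of a list of (hydrophobicity, ap_value) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def pooled_covariance(group_a, group_b):
    """Pooled 2x2 sample covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for group in (group_a, group_b):
        mx, my = mean_vector(group)
        for x, y in group:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = len(group_a) + len(group_b) - 2
    return sxx / dof, sxy / dof, syy / dof


def mahalanobis(group_a, group_b):
    """Mahalanobis distance between the means of two subgroups."""
    sxx, sxy, syy = pooled_covariance(group_a, group_b)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    ax, ay = mean_vector(group_a)
    bx, by = mean_vector(group_b)
    dx, dy = ax - bx, ay - by
    # d^T S^-1 d, using the closed-form inverse of a 2x2 matrix.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return squared ** 0.5


# Invented (hydrophobicity, AP value) points for two subgroups.
subgroup_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
subgroup_b = [(3.0, 0.0), (5.0, 0.0), (3.0, 2.0), (5.0, 2.0)]
print(mahalanobis(subgroup_a, subgroup_b))
```

Subgroups separated by a small distance would be candidates for merging into the same group.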
A linear discriminant function was constructed for each of the groups of TM segments against the group of loop segments. A loop is defined as a segment that is not transmembrane. Thus, the group of loop regions consisted of all non-transmembrane segments in the membrane protein sequences of the training data set that were longer than four residues, which is long enough to calculate the AP value. Each model (Figure 1) is a series of TM segments drawn from the 5 groups. Therefore, a model is defined by a combination of discriminant functions.
The prediction procedure is illustrated in Figure 2 (the left figure). In the first stage, a query sequence is applied to each model to see if it is compatible. When the model is represented by a single discriminant function, TM segments are selected in order of their scores. When the model contains multiple discriminant functions, each function first selects candidate TM segments independently, up to the maximum number allowed by the model; the combination of non-overlapping segments that gives the highest score is then adopted for the model. TM segments are selected by a window search, where the score for a window is obtained by computing the parameter values and applying the discriminant function. The window length is set to 17. Windows determined to be transmembrane are called "cores" of transmembrane segments.
The length of each core is then adjusted as follows. The 17-residue window is moved from both ends of the core toward the N- and C-termini until the discriminant function gives a negative score, which determines the outer boundaries of the transmembrane segment. The final predicted boundary is taken to be halfway between the outer boundary and the core boundary. If the outer boundaries of neighboring transmembrane segments overlap, both boundaries are shortened until the overlap is eliminated.
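The window search for cores, in the simple case of a single discriminant function, might look like the sketch below. It is an interpretation, not the published code: windows are picked greedily in order of score, skipping any that overlap an already chosen core, and toy_score is a hypothetical stand-in for the real discriminant function.

```python
WINDOW = 17
HYDROPHOBIC = set("AILMFVW")


def toy_score(segment):
    """Hypothetical stand-in scorer: hydrophobic fraction minus 0.5."""
    return sum(aa in HYDROPHOBIC for aa in segment) / len(segment) - 0.5


def window_scores(sequence, score_window, window=WINDOW):
    """Score every window; returns a list of (score, start) pairs."""
    return [
        (score_window(sequence[i:i + window]), i)
        for i in range(len(sequence) - window + 1)
    ]


def select_cores(sequence, score_window, max_segments, window=WINDOW):
    """Greedily pick up to max_segments non-overlapping positive windows."""
    ranked = sorted(
        window_scores(sequence, score_window, window),
        key=lambda item: (-item[0], item[1]),  # best score first, then leftmost
    )
    cores = []
    for score, start in ranked:
        if score <= 0 or len(cores) == max_segments:
            break
        # Two windows overlap when their starts are closer than one window length.
        if all(abs(start - chosen) >= window for chosen, _ in cores):
            cores.append((start, score))
    return sorted(cores)


query = "D" * 10 + "L" * 17 + "K" * 10 + "V" * 17 + "E" * 10
print(select_cores(query, toy_score, max_segments=2))
```

For a model with several discriminant functions, the candidate lists from each function would have to be combined; that step is not shown here.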
If the resulting transmembrane segments are longer than 35 residues, they are shortened so that they do not exceed 35 residues.
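The boundary adjustment can be sketched as below, under several assumptions that the text leaves open: the outer boundary is taken as the edge of the last window that still scores non-negative, segments use half-open (start, end) coordinates, and segments longer than 35 residues are trimmed symmetrically. The scorer toy_score is hypothetical, and resolving overlaps between neighboring segments is not shown.

```python
WINDOW = 17
MAX_TM_LENGTH = 35
HYDROPHOBIC = set("AILMFVW")


def toy_score(segment):
    """Hypothetical stand-in scorer: hydrophobic fraction minus 0.5."""
    return sum(aa in HYDROPHOBIC for aa in segment) / len(segment) - 0.5


def extend(sequence, score_window, core_start, step, window=WINDOW):
    """Slide the window from the core in direction step (-1 or +1).

    Returns the start of the last window whose score is non-negative.
    """
    start = core_start
    while True:
        nxt = start + step
        if nxt < 0 or nxt + window > len(sequence):
            break
        if score_window(sequence[nxt:nxt + window]) < 0:
            break
        start = nxt
    return start


def refine_segment(sequence, score_window, core_start, window=WINDOW):
    """Return the adjusted (start, end) of a TM segment, end exclusive."""
    core_end = core_start + window
    outer_start = extend(sequence, score_window, core_start, -1, window)
    outer_end = extend(sequence, score_window, core_start, +1, window) + window
    # Final boundary: halfway between the outer boundary and the core boundary.
    seg_start = (outer_start + core_start) // 2
    seg_end = (outer_end + core_end + 1) // 2
    # Assumed trimming rule: cut the excess symmetrically from both ends.
    excess = (seg_end - seg_start) - MAX_TM_LENGTH
    if excess > 0:
        seg_start += excess // 2
        seg_end -= excess - excess // 2
    return seg_start, seg_end


short_query = "D" * 20 + "L" * 17 + "K" * 20
long_query = "D" * 5 + "L" * 60 + "K" * 5
print(refine_segment(short_query, toy_score, core_start=20))
print(refine_segment(long_query, toy_score, core_start=25))
```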
In the second stage, the models are compared by their scores. The score for each model is the sum of scores for all the windows in the sequence.
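As a small illustration of the second stage, the sketch below ranks models by score and keeps the top few. The scores are invented, rank_models is a hypothetical helper, and the 0 TM entry stands for the globular-protein model.

```python
def rank_models(model_scores, top=3):
    """Return the best `top` (number_of_tm_segments, score) pairs, best first."""
    ranked = sorted(model_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Invented scores, keyed by the number of TM segments in the model.
scores = {0: -4.2, 1: 1.3, 6: 7.9, 7: 8.4, 8: 5.0}

for n_tm, score in rank_models(scores):
    label = "globular (0 TM)" if n_tm == 0 else f"{n_tm} TM"
    print(label, score)
```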
Our method is characterized by the following three features.
The main feature is that different properties of different TM segments are incorporated, based on a classification of TM segments in a database. In fact, not all TM segments are equally hydrophobic, and some of them have distinctive features. For example, TM segments of single-spanning membrane proteins are known to be highly hydrophobic and to have small hydrophobic moments, whereas the last TM segments in seven-transmembrane proteins are relatively less hydrophobic and are often difficult to detect by prediction methods. Thus, we have classified TM segments first by the total number of TM segments in a protein and the order in which they appear in the protein sequence, and then by merging similar ones into groups.
Second, our method enumerates possible models ranked by their scores, where a model is distinguished by the number of TM segments in a membrane protein. Even for a membrane protein whose topology has been derived from experimental evidence, other experiments often suggest contradictory results. Therefore, it is desirable for a predictive method to output not a single prediction but a list of possibilities with certainty measures, so that further experiments can be designed to distinguish between several topology models.
Third, the possibility that the query sequence is a globular protein is explicitly taken into consideration, a feature that existing methods do not necessarily incorporate.
(See reference [3] for details.)
[Top]
Accuracy of TSEG

The prediction results for the 89 test sequences, using the best parameter set (average hydrophobicity, AP value, polarity), are shown below.
Rank    Protein-based    Segment-based
                         Obs-sov(%)    Prd-sov(%)    Nseg under    Nseg over
top1    61.8             85.1          91.5          28            53
top2    73.0             89.9          93.8          21            36
top3    74.2             92.1          95.3          16            28
Rank: up to 1, 2 or 3 predictions were considered.
Obs/Prd-sov(%): observed/predicted TM segment overlaps.
Nseg under/over: number of false positive/negative TM segments.
(See reference [3] for details.)
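Segment-based overlap measures of this kind can be computed along the following lines. This is only a sketch: the exact overlap criterion is given in reference [3], whereas here a pair of segments is assumed to match when they share at least min_overlap residues (a hypothetical parameter), and the example segments are invented.

```python
def overlap(a, b):
    """Number of residues shared by two half-open segments (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def segment_overlap_stats(observed, predicted, min_overlap=1):
    """Percentage of observed/predicted segments that have a matching partner."""
    obs_hit = sum(
        any(overlap(o, p) >= min_overlap for p in predicted) for o in observed
    )
    prd_hit = sum(
        any(overlap(o, p) >= min_overlap for o in observed) for p in predicted
    )
    return {
        "obs_overlap_pct": 100.0 * obs_hit / len(observed) if observed else 0.0,
        "prd_overlap_pct": 100.0 * prd_hit / len(predicted) if predicted else 0.0,
        "observed_missed": len(observed) - obs_hit,
        "predicted_unmatched": len(predicted) - prd_hit,
    }


# Invented segments for illustration.
observed = [(10, 30), (50, 70), (90, 110)]
predicted = [(12, 31), (55, 72), (130, 150), (160, 180)]
stats = segment_overlap_stats(observed, predicted)
print(stats)
```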
[Top]
References

[Detection of Membrane Proteins]
[1] Kihara, D., Kanehisa, M.
Detection of Membrane Proteins in the Whole Genome Sequences.
Genome Informatics 1997, pp. 300-301, Universal Academy Press, Tokyo (1997)

[2] Kihara, D., Kanehisa, M.
Tandem clusters of membrane proteins in complete genome sequences. (1999) (in preparation)

[TSEG]
[3] Kihara, D., Shimizu, T., Kanehisa, M.
Prediction of Membrane Proteins Based on Classification of Transmembrane Segments.
Protein Engineering 11: 961-970 (1998)
Last updated: March 11, 1999
dkihara@purdue.edu