PATOSEQ Documentation

Up-to-date documentation and standalone version

What is PATOSEQ?

PATOSEQ (Pattern TO SEQuence alignments) is a tool that aligns motifs, which describe a sequence in terms of secondary characteristics such as charge, hydrophobicity, volume and frequency, to amino acid sequences. PATOSEQ has three modes of operation:

What are motifs?

A motif is a linear sequence of tokens, each representing one or more secondary characteristics of a sequence. The basic token types are:

A A fixed amino acid represented by its one-letter code in uppercase.
a A variable amino acid represented by its one-letter code in lowercase. For alignment, the amino acid is assumed to mutate according to the 250-PAM Dayhoff mutation matrix.
_ Any single amino acid.
* Any sequence of amino acids.
{a=0.5,g=0.5} An amino acid frequency vector. Missing entries will be assigned probability 0. An empty frequency vector will be initialized to contain the natural frequency of the amino acids.
[x,mu:sigma] A sequence of amino acids containing the characteristics described by x which can be one or more of:
  • p, n: a positively or negatively charged sequence
  • o, y: a hydrophobic or hydrophilic sequence
  • s, l: a sequence of small or large amino acid residues
  • {}: a frequency vector
  • *: any sequence of amino acids
  • a: an amphipathic alpha helix
  • b: an amphipathic beta sheet
  • v: a volume helix
  • h: a hydrophobic helix (transmembrane)
mu and sigma are numeric values that represent the distribution of the sequence length. The sequence length can be restricted to a range [a..b] by appending (a,b) to the token.
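
As an illustration of the frequency vector token described in the list above, the following Python sketch parses a token such as {a=0.5,g=0.5}. It is not PATOSEQ's own code: the function name parse_frequency_vector, the background argument and the uniform fallback are assumptions made for this example. In particular, the natural amino acid frequencies are not listed in this document, so a uniform placeholder stands in for them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_frequency_vector(token, background=None):
    """Parse a token such as '{a=0.5,g=0.5}' into a dict over the 20 amino acids.

    `background` is a hypothetical argument: the frequencies used when the
    vector is empty. If it is omitted, a uniform placeholder is used, which
    is NOT the real natural frequency of the amino acids.
    """
    if not (token.startswith("{") and token.endswith("}")):
        raise ValueError(f"not a frequency vector: {token!r}")
    body = token[1:-1].strip()
    if not body:
        if background is None:
            background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
        return dict(background)
    # Missing entries are assigned probability 0, as stated in the token list.
    freqs = {aa: 0.0 for aa in AMINO_ACIDS}
    for entry in body.split(","):
        name, _, value = entry.partition("=")
        aa = name.strip().upper()
        if aa not in freqs:
            raise ValueError(f"unknown amino acid: {name!r}")
        freqs[aa] = float(value)
    return freqs


vector = parse_frequency_vector("{a=0.5,g=0.5}")
empty = parse_frequency_vector("{}")
```

Passing a measured background distribution as the second argument replaces the uniform placeholder.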

An example motif could then be:

M[p,4:2][h,12:4]{}_{}C{}*

which can be interpreted as an N-terminal Methionine residue, followed by a positively charged sequence about 4 amino acids long, a hydrophobic helix of length 12, a frequency vector, any amino acid, another frequency vector, a Cysteine residue, a final frequency vector and a sequence of any amino acids.

Another example could be:

*[h,15:5](10,20)*

which would align to a hydrophobic helix of about 15 residues, restricted to a length between 10 and 20 residues, anywhere within the sequence.
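
To make the roles of mu, sigma and the (a,b) range concrete, here is a small Python sketch. The document only says that mu and sigma describe the distribution of the sequence length; treating that distribution as a discretised Gaussian is an assumption of this example, as are the function name length_probabilities and the max_len parameter.

```python
import math


def length_probabilities(mu, sigma, bounds=None, max_len=100):
    """Probability of each segment length under an ASSUMED Gaussian shape.

    `bounds` is the optional (a, b) range appended to a token. Lengths
    outside the range get no probability and the rest is renormalised.
    """
    lo, hi = bounds if bounds is not None else (1, max_len)
    weights = {
        n: math.exp(-0.5 * ((n - mu) / sigma) ** 2)
        for n in range(lo, hi + 1)
    }
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


# Length prior for the token [h,15:5](10,20)
probs = length_probabilities(15, 5, bounds=(10, 20))
most_likely = max(probs, key=probs.get)
```

Here most_likely is the length with the highest probability, and lengths below 10 or above 20 do not appear in probs at all.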

Characteristics can also be combined, e.g.:

*[sy,10:2]*

which will align to a sequence of relatively small, hydrophilic amino acids with a length of about 10 residues.

Each motif token -- and the characteristics in sequence tokens -- can be prefixed with a weight. Weights are used to make one characteristic more important than another for classification. An example with weights could be:

M [5p,4:2] 2[h,12:4] {}_{}C{}*

which would be interpreted as the positive characteristic of the first subsequence being five times more important, and the length characteristic of the second subsequence being two times more important than the other tokens. If no weight is specified, a weight of 1 is assumed.
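
The token syntax described so far can be sketched as a small tokenizer. The grammar below is reconstructed from the descriptions and examples in this section and may not match the real PATOSEQ parser in every detail. The names TOKEN_RE, TRAIT_RE, parse_traits and parse_motif, and the dictionary layout of the result, are hypothetical.

```python
import re

# Assumed grammar, reconstructed from the token descriptions above.
TOKEN_RE = re.compile(r"""
    \s*
    (?P<weight>\d+(?:\.\d+)?)?              # optional weight prefix
    (?:
        \[(?P<seq>[^\]]*)\]                 # sequence token in square brackets
        (?:\((?P<lo>\d+),(?P<hi>\d+)\))?    # optional length range
      | (?P<freq>\{[^}]*\})                 # frequency vector
      | (?P<single>[A-Za-z_*])              # residue, any residue, any sequence
    )
""", re.VERBOSE)

# One characteristic inside a sequence token, with an optional weight.
TRAIT_RE = re.compile(r"(\d+(?:\.\d+)?)?(\{[^}]*\}|[a-z*])")


def parse_traits(text):
    """Split e.g. '5p' or 'sy' into (characteristic, weight) pairs."""
    return [(t, float(w) if w else 1.0) for w, t in TRAIT_RE.findall(text)]


def parse_motif(motif):
    """Turn a motif string into a list of token dictionaries."""
    motif = motif.strip()
    tokens, pos = [], 0
    while pos < len(motif):
        m = TOKEN_RE.match(motif, pos)
        if m is None:
            raise ValueError(f"cannot parse motif at position {pos}")
        if m["seq"] is not None:
            # The length distribution follows the last comma.
            traits, _, dist = m["seq"].rpartition(",")
            mu, _, sigma = dist.partition(":")
            token = {
                "kind": "sequence",
                "traits": parse_traits(traits),
                "mu": float(mu),
                "sigma": float(sigma),
                "range": (int(m["lo"]), int(m["hi"])) if m["lo"] else None,
            }
        elif m["freq"] is not None:
            token = {"kind": "frequency", "text": m["freq"]}
        elif m["single"] == "_":
            token = {"kind": "any_residue"}
        elif m["single"] == "*":
            token = {"kind": "any_sequence"}
        else:
            kind = "fixed" if m["single"].isupper() else "variable"
            token = {"kind": kind, "residue": m["single"].upper()}
        # A missing weight defaults to 1, as stated above.
        token["weight"] = float(m["weight"]) if m["weight"] else 1.0
        tokens.append(token)
        pos = m.end()
    return tokens


weighted = parse_motif("M [5p,4:2] 2[h,12:4] {}_{}C{}*")
ranged = parse_motif("*[h,15:5](10,20)*")
```

For the weighted example, the second token carries the characteristic p with weight 5 and the third token has a token weight of 2. All other weights default to 1.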

How are the motifs aligned?

Motif alignment is done by dynamic programming, in much the same way as normal sequence-to-sequence alignment. Each token is assigned a residue or subsequence of the aligned sequence, against which an alignment score is calculated.

The partial score for each token, like the total alignment score, is the probability of the token or motif matching the sequence.
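
The following Python sketch shows the general shape of such a dynamic programming alignment. It is a simplified illustration and not the PATOSEQ algorithm: each token is reduced to a hypothetical triple (min_len, max_len, score_fn), scores are log-probabilities, and the hydrophobic residue set and the 0.9/0.1 probabilities are toy values chosen for the example. Length priors and weights are left out.

```python
import math

NEG_INF = float("-inf")
HYDROPHOBIC = set("AILMFVW")  # toy residue set, an assumption for this example


def fixed(residue):
    """Token that matches exactly one given residue."""
    return (1, 1, lambda s: 0.0 if s == residue else NEG_INF)


def any_sequence():
    """Token that matches any subsequence, including the empty one."""
    return (0, None, lambda s: 0.0)


def hydrophobic(min_len, max_len):
    """Toy segment token: log 0.9 per hydrophobic residue, log 0.1 otherwise."""
    def score(s):
        return sum(math.log(0.9 if r in HYDROPHOBIC else 0.1) for r in s)
    return (min_len, max_len, score)


def align(tokens, seq):
    """Assign consecutive subsequences of seq to tokens, maximising the score.

    best[i][j] is the best log-score of the first i tokens consuming the
    first j residues. back[i][j] is the length given to token i there.
    """
    n, m = len(tokens), len(seq)
    best = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i, (min_len, max_len, score_fn) in enumerate(tokens, start=1):
        for j in range(m + 1):
            longest = j if max_len is None else min(max_len, j)
            for length in range(min_len, longest + 1):
                prev = best[i - 1][j - length]
                if prev == NEG_INF:
                    continue
                s = prev + score_fn(seq[j - length:j])
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, length
    if best[n][m] == NEG_INF:
        return NEG_INF, []
    # Trace back the subsequence assigned to each token.
    segments, j = [], m
    for i in range(n, 0, -1):
        length = back[i][j]
        segments.append(seq[j - length:j])
        j -= length
    segments.reverse()
    return best[n][m], segments


motif = [fixed("M"), hydrophobic(3, 5), any_sequence()]
score, segments = align(motif, "MLLVAKKD")
```

The returned segments list holds one subsequence per token, in motif order. A sequence that cannot satisfy a fixed token yields a score of minus infinity and an empty list.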

What is motif refinement?

Since PATOSEQ is also a motif creation tool, it is possible to refine an initial motif to better discriminate between a positive and negative training set. The motif is refined by adjusting parameters such as:

The user can choose which parameters can be adjusted by selecting the respective checkboxes in the online interface.

Refinement is performed iteratively. This means that after the parameters have been adjusted, the sequences must be realigned. This cycle is repeated until no further improvement is achieved or the maximum number of rounds has been reached.
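
The adjust-and-realign cycle can be sketched as a generic loop. The callbacks realign, adjust and objective are hypothetical placeholders for whatever PATOSEQ does internally. The toy usage at the end refines a single length parameter and is only meant to show the control flow.

```python
def refine(params, realign, adjust, objective, max_rounds=20, tol=1e-9):
    """Repeat adjust -> realign until the objective stops improving.

    Returns the final parameters, their objective value and the number of
    rounds that produced an improvement.
    """
    alignments = realign(params)
    best = objective(params, alignments)
    for round_no in range(1, max_rounds + 1):
        candidate = adjust(params, alignments)
        candidate_alignments = realign(candidate)
        score = objective(candidate, candidate_alignments)
        if score <= best + tol:
            # No further improvement: keep the previous parameters.
            return params, best, round_no - 1
        params, alignments, best = candidate, candidate_alignments, score
    return params, best, max_rounds


# Toy usage: the only parameter is a segment length mu, the "alignment" is a
# fixed list of observed segment lengths, and the objective rewards a mu that
# lies close to those lengths.
observed = [10, 12, 14]
mu, value, rounds = refine(
    4.0,
    realign=lambda mu: observed,
    adjust=lambda mu, lengths: sum(lengths) / len(lengths),
    objective=lambda mu, lengths: -sum((n - mu) ** 2 for n in lengths),
)
```

In this toy run the first round moves mu to the mean of the observed lengths and the second round finds nothing left to improve, so the loop stops well before max_rounds.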

How do I choose my training sets?

Very, very carefully. The best way to create a positive training set is to go look into SwissProt and hope there are enough sequences annotated with the feature you wish to detect.

Be very careful with annotations marked as "probable", "putative" or "by similarity" and try to stay away from putative proteins. Even a small percentage of false data in the training set is enough to ruin results completely.

For the negative training set, do not rely on the absence of an annotation to conclude that the feature is absent.

Since it is not assumed that anybody will get their training sets right on the first go, it is usually a good idea to take a closer look at the false positives and negatives produced after a refinement run.

The distribution of the alignment scores is also a very good indicator: if the false positives or negatives are outliers, there is probably a problem with the selection of your training set. If the distributions are smooth, however, there is probably something wrong with the motif.

How are the sequences classified?

The scores of the sequences from the positive and negative training sets are assumed to follow Beta distributions. During refinement, the parameters are adjusted to minimize the overlap of these two distributions. The cutoff proposed after refinement is the point at which the overlap of the two distributions is minimal.

After a refined motif and a cutoff value are obtained, test sequences can be classified as being positive or negative according to their alignment score relative to the cutoff value.
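
A minimal Python sketch of this classification step follows. It makes several assumptions that this document does not confirm: the Beta parameters are estimated by the method of moments, "minimal overlap" is read as minimising the misclassified probability mass on either side of the cutoff, and the cutoff is found by a simple grid search. The function names and the example scores are invented for illustration.

```python
import math


def fit_beta(scores):
    """Method-of-moments estimate of (alpha, beta) for scores in (0, 1)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    common = mean * (1 - mean) / var - 1
    if common <= 0:
        raise ValueError("variance too large for a Beta distribution")
    return mean * common, (1 - mean) * common


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def propose_cutoff(pos_params, neg_params, steps=1000):
    """Grid search for the cutoff with the least misclassified mass."""
    xs = [(k + 0.5) / steps for k in range(steps)]
    p = [beta_pdf(x, *pos_params) / steps for x in xs]
    q = [beta_pdf(x, *neg_params) / steps for x in xs]
    missed = 0.0          # positive mass below the cutoff
    false_hits = sum(q)   # negative mass at or above the cutoff
    best_k, best_err = 0, missed + false_hits
    for k in range(steps):
        missed += p[k]
        false_hits -= q[k]
        if missed + false_hits < best_err:
            best_k, best_err = k + 1, missed + false_hits
    return best_k / steps, best_err


def classify(score, cutoff):
    """A test sequence is positive if its score reaches the cutoff."""
    return score >= cutoff


# Invented alignment scores for two small training sets.
positive_scores = [0.80, 0.85, 0.90, 0.75, 0.95]
negative_scores = [0.10, 0.20, 0.15, 0.25, 0.05]
cutoff, error = propose_cutoff(fit_beta(positive_scores), fit_beta(negative_scores))
```

The second return value of propose_cutoff is the remaining overlap at the chosen cutoff, which gives a rough idea of how well the two score distributions separate.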

How do I create a motif?

Since motifs (usually protein signals) are often not well described, a little exploration is needed. For example, for the detection of lipoprotein signals in B. subtilis, the following initial motif, assuming only two characteristic subsequences and a preserved site near the Cysteine residue, was used:

M [{},10:1] [{},20:1] {} {} {} C {} *

After refinement, the composition of the frequency vectors in the subsequences was analysed and shown to contain mostly positively charged residues in the first case and hydrophobic residues in the second. The subsequence lengths were also adjusted according to the observations in the positive training set.

The second frequency vector before the Cysteine residue closely resembled the natural frequency of amino acids and had received a low weight, indicating that it was not helpful for classification. It was therefore replaced by a _.

Great care must be taken not to overuse frequency vectors, as they may overfit a motif, which will then detect only the sequences from the training set and no others.

Once a "good" motif has been extracted, it is usually a good idea to try to eliminate characteristics that do not contribute greatly to the classification. The idea is to get a minimal set of characteristics to describe the classification and therefore avoid overfitting.

What is bootstrapping good for?

If only a small positive training set is available for refining, the risk of overfitting is not negligible. Therefore, after a first round of refinement, it is advisable to apply the resulting motif to a larger set of sequences -- e.g. an entire genome -- and reuse the classification results as positive and negative training sets for a second round of refinement.

This can be done iteratively until either the motif converges (re-refinement confirms the initial classification) or too many false positives or false negatives appear. In the second case, bootstrapping has failed.

The idea behind bootstrapping is as follows: consider a classification based on two independent characteristics A and B. If we refine over a training set consisting of sequences containing only the characteristic A, application of the refined motif to an entire genome will then identify all sequences containing A alone or both A and B. The sequences containing only B, however, will not be identified.

If bootstrapping is performed on the results of the initial classification (that is, with the sequences containing A and A & B as the positive training set), and B is a good discriminator, the classification will accept the sequences containing B as false positives (provided this tightens the distributions of the other scores). After a second round of bootstrapping, all sequences containing either A or B will be in the positive set.
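
The A and B scenario can be played through with a toy Python sketch. Nothing here reflects PATOSEQ's real refinement: sequences are short strings in which an uppercase letter stands for a characteristic, the hypothetical toy_refine simply keeps every characteristic found in at least half of the positive set and ignores the negative set, and toy_classify accepts any sequence carrying one of the kept characteristics. Only the outer bootstrap loop mirrors the procedure described above.

```python
def bootstrap(positives, negatives, genome, refine, classify, max_rounds=10):
    """Re-refine on the classification results until they stop changing."""
    pos, neg = set(positives), set(negatives)
    model = None
    for round_no in range(1, max_rounds + 1):
        model = refine(pos, neg)
        new_pos = {seq for seq in genome if classify(model, seq)}
        if new_pos == pos:
            # Re-refinement confirms the previous classification.
            return model, pos, round_no
        pos, neg = new_pos, set(genome) - new_pos
    return model, pos, max_rounds


def toy_refine(pos, neg):
    """Toy stand-in: keep characteristics seen in at least half of pos."""
    counts = {}
    for seq in pos:
        for ch in set(seq):
            if ch.isupper():
                counts[ch] = counts.get(ch, 0) + 1
    return {ch for ch, c in counts.items() if 2 * c >= len(pos)}


def toy_classify(model, seq):
    """Toy stand-in: positive if the sequence has any kept characteristic."""
    return any(ch in model for ch in seq)


genome = ["A1", "AB2", "B3", "n1", "n2"]
model, final_pos, rounds = bootstrap(["A1"], ["n1"], genome, toy_refine, toy_classify)
```

In this toy run the first round finds the sequences with A, the second round picks up B through the sequence that carries both, and the third round confirms the result. The failure case, in which too many false positives or negatives appear, needs a manual check of the results and is not modelled here.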

Where are the References?

Pedro Gonnet and Frédérique Lisacek. Probabilistic alignment of motifs with sequences. Bioinformatics 2002 18: 1091-1101

Contact

for all inquiries regarding PATOSEQ.

Acknowledgments

Grégoire Rossier for the program name.
Alexandre Gattiker for the online implementation.


Last modified 23/Jan/2002 by AGA