BIOINFORMATICS<-->STRUCTURE
Jerusalem, Israel, November 17-21, 1996

Abstract


The bank of patterns PROF_PAT 1.0: computer programs for rapid search for patterns in amino acid sequences

Bachinsky A.G., Yarigin A.A., Gusev V.D., Nemitikova L.A. and Kulichkov V.A.

Theoretical Department, Research Institute of Molecular Biology, SRC VB 'Vector', Koltsovo, Novosibirsk region, 633159, Russia
Institute of Mathematics SB RAS, Novosibirsk, 630090, Russia

bachin@vector.nsk.su


A method and software tool to develop patterns of protein families have been designed. These patterns are intended for the identification of local similarities in arbitrary amino acid sequences with proteins of the SWISS-PROT bank.

The main algorithm of comparing AAS with the pattern database uses the finite automaton of Aho-Corasic, constructed on the basis of a set of samples, which are to be searched for in the input text.

The automaton is presented as an oriented tree-like graph,where nodes are states of the automaton and arcs are admissible transitions from some states to the others, marked with symbols from the alphabet S of amino acids designation. Automaton works in tacts. In every tact one more symbol of a text is read, which determines automaton's transition from the current state into a new one.

Automaton's behavior is characterized by three functions: function of transitions G(s,a); rejections' function F(s) and output function O(s). The values of these functions are calculated one time when constructing an automaton on the base of a given set of samples.

When there is no admissible transition, a situation called "rejection" arises which indicates that comparison has failed. In this case there does not occure a backward move through the text to the beginning of another fragment of the sequence (exit to the initial state of the automaton). A new sample is being examined from the break site of the previous one. It ensures linear dependence of the search time on the length of a query sequence.

The value of the rejections' function F(s) is calculated in the situations, when G(s,a) = "rejection" and it indicates into what state the automaton passes from a current state s, if next symbol of the text does not coincide with a label of any of the arcs, which go out from s. It is transitions "upon rejections" that guarantee returnless manner of text scanning.

Output function O(s) indicates the list of elements, represented by a sample (as a sequence of arc labels on the way from initial state "0" into "s") successful search for which is realized as automaton passes into state s.

To reveal a distant similarity the algorithm of comparison is modified. The user specifies the matrix of likeness of amino acid residues (e.g., the one from families PAM, BLOSUM, etc.) and D - the grade of likeness. For all states of the automaton function of rejection is set equal to zero. Besides, not a sequence as a whole is input to the automaton, but specially processed words formed according to D and matrix of likeness.

The construction of automation and its recording on the hard disk is perfomed only once and takes IBM PC 360/20 about three minutes for 2384 families. Total number of elements in patterns is over 37000. 3-6 seconds are necessary for the search for exact matching for AAS of length 500-1500 amino acids through all the patterns of the bank. Reducing the level of similarity for 40% doubles the time of sequence processing. So AAS comparison with the pattern bank may be performed in interactive mode on computers like IBM PC 486 and higher.


Back to the Abstract Index.