CN101826132B

CN101826132B - Visual extraction method for protein sequence characteristics

Info

Publication number: CN101826132B
Application number: CN201010100242A
Authority: CN
Inventors: 肖绚; 王普
Original assignee: Jingdezhen Ceramic Institute
Current assignee: Jingdezhen Ceramic Institute
Priority date: 2010-01-22
Filing date: 2010-01-22
Publication date: 2012-10-10
Anticipated expiration: 2030-01-22
Also published as: CN101826132A

Abstract

The invention relates to a visual extraction method for protein sequence characteristics, which mainly comprises the following steps: first, every amino acid in the protein sequence is numerically encoded, so that an encoding model converts the protein character sequence into three numerical sequences reflecting physicochemical properties of the protein sequence; three Hasse matrices are then constructed on the basis of partial order theory; the three Hasse matrices are combined into one improved Hasse matrix whose elements are 0, 1, 2, 3, 4, 5, 6 and 7; finally, the improved Hasse matrix is converted into an 8-color picture, giving a visual picture of the characteristics of the complete protein sequence. The method offers complete-sequence analysis, intuitiveness and universality, and the characteristics of different protein sequences can be obtained from the generated picture.

Description

Visual extraction method for protein sequence characteristics

Technical field

The present invention is a visual extraction method for the features of a complete protein sequence. It draws on image processing, pattern recognition, and conventional protein sequence analysis. Unlike traditional protein sequence alignment methods, it reflects the features of a protein sequence in a comparatively visual way.

Background technology

As more and more gene sequences are deposited in biological databases, analyzing them has become an urgent need, and a challenge for biologists and computer scientists alike. A considerable part of traditional gene sequence analysis is done by sequence comparison. Traditional comparison relies mainly on gene alignment, in which bases are compared one by one, typically with mature software such as BLAST (http://www.ncbi.nlm.nih.gov/BLAST). Such software readily reveals deletions, insertions, and substitutions of bases. Although this approach finds genetic mutations very simply, its results are not intuitive.

These gene sequences can be analyzed at many levels, such as base sequence, protein, and genome. Because many biological phenotypic traits and much of gene regulation are determined by the amino acid sequences of proteins, analyzing amino acid sequences has certain advantages. A protein sequence is a one-dimensional string composed of 20 kinds of amino acids, and it is very difficult to infer from it the biological properties it implicitly carries. Many methods have therefore been designed to convert gene sequences into digital signals, curves, and the like, which are then studied with signal processing methods, fractal theory, and so on. Among these, visualization techniques turn symbol manipulation into geometric processing; they make the invisible visible, give researchers an intuitive understanding of their work, offer new insights, and have advanced research on gene sequences.

In 2006 Randic designed a method that converts a protein sequence into a polyline in two-dimensional space. He mapped the 20 amino acids to 20 different vectors (x_{i,0}, y_{i,0}), with the 20 points evenly distributed on a circle of radius 1. To convert a protein sequence, each amino acid in the sequence, taken in order, is mapped to a point in the plane, and joining these points with straight lines gives the two-dimensional polyline that represents the protein sequence (Randic, M., Butina, D., Zupan, J. (2006). Novel 2-D graphical representation of proteins. Chemical Physics Letters 419, 528-532.). The points (x_n, y_n) are computed as x_n = (x_{n-1} + x_{i,0})/2 and y_n = (y_{n-1} + y_{i,0})/2. When the protein sequence is long, the polyline produced by Randic's method becomes a tangle that is hard to read. In 2008 Yao Yuhua of China improved Randic's design. The 20 amino acids still correspond to 20 two-dimensional vectors, but the vectors are assigned with amino acid physicochemical properties in mind: amino acids with similar physicochemical properties get vectors that are closer together, and the vectors differ in length. However, the resulting polyline can overlap itself, so some of the information contained in the original protein sequence is lost (Yao, Y.H., Dai, Q., Li, C., He, P.A., Nan, Y.Y., Zhang, Y.Z. (2008). Analysis of similarity/dissimilarity of protein sequences. Proteins 73, 864-871.).
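The recurrence used in Randic's method can be sketched as follows. This is a minimal illustration, not the published implementation: the order in which the 20 amino acids are placed around the circle (`AMINO_ACIDS`) and the starting point `(0, 0)` are assumptions, since the text above specifies neither.

```python
import math

# Hypothetical ordering of the 20 amino acids around the unit circle;
# the text does not say which residue is assigned to which vertex.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Vertex (x_{i,0}, y_{i,0}) for each amino acid, evenly spaced on a circle of radius 1.
VERTICES = {
    aa: (math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
    for k, aa in enumerate(AMINO_ACIDS)
}


def polyline(sequence, start=(0.0, 0.0)):
    """Return the points (x_n, y_n), where each point is the midpoint of the
    previous point and the vertex of the current amino acid."""
    x, y = start  # assumed starting point: the centre of the circle
    points = []
    for aa in sequence:
        vx, vy = VERTICES[aa]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


points = polyline("MGAPFV")
```

Because every point is a midpoint between the previous point and a vertex on the circle, all points stay inside the unit disc, which is why long sequences crowd into a tangle.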

The methods above all convert a protein sequence into a two-dimensional polyline. In 2006 Xiao Xuan proposed converting a protein sequence into a two-dimensional image using a cellular automaton. The amino acid sequence is first converted into a sequence of "0"s and "1"s; a specific cellular automaton evolution rule is then applied to the encoded sequence, and after several evolution steps a two-dimensional matrix of "0"s and "1"s is formed. The matrix is converted into a black-and-white image and scaled, giving a visualization model of the protein. Because the evolution that generates this image does not take amino acid properties into account, the resulting image is hard to interpret with bioinformatics knowledge.

Converting a protein sequence into an image offers a way to apply image processing techniques to protein sequence analysis, but the image must reflect the features of the sequence. Existing two-dimensional polyline methods can account for at most two amino acid physicochemical properties. How to design a protein visualization method that reflects several amino acid physicochemical properties remains an open research problem.
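The cellular automaton approach can be sketched roughly as below. The text names neither the binary code nor the evolution rule, so both are placeholders here: a hypothetical 5-bit code (the index of each amino acid in `AMINO_ACIDS`) and elementary rule 30 with wrap-around edges.

```python
# Hypothetical 5-bit code: the index of each amino acid in this alphabet, in binary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
RULE = 30  # assumed elementary rule number; the text does not name one


def encode(sequence):
    """Convert an amino acid string into a flat list of 0/1 values."""
    bits = []
    for aa in sequence:
        bits.extend(int(b) for b in format(AMINO_ACIDS.index(aa), "05b"))
    return bits


def step(row, rule=RULE):
    """Apply one evolution step; the (left, centre, right) neighbourhood
    selects one bit of the rule number, and the edges wrap around."""
    n = len(row)
    return [
        (rule >> (row[(i - 1) % n] * 4 + row[i] * 2 + row[(i + 1) % n])) & 1
        for i in range(n)
    ]


def evolve(sequence, steps):
    """Stack the encoded row and its successive evolutions into a 0/1 matrix."""
    rows = [encode(sequence)]
    for _ in range(steps):
        rows.append(step(rows[-1]))
    return rows


image = evolve("MGAPFV", steps=10)  # 11 rows of 30 black/white pixels
```

Nothing in this construction depends on amino acid properties beyond the arbitrary code, which matches the criticism made above.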

Summary of the invention

The object of the invention is to address shortcomings of traditional gene sequence analysis methods, such as incomplete functional analysis and unintuitive results, by providing a visual extraction method for the features of a complete protein sequence. The features of different sequences can be read from the generated protein sequence images, and those sequence features can then be analyzed and used in medical research.

To achieve this object, the visual extraction method for protein sequence characteristics proposed by the invention is characterized by comprising the following steps in order:

1) Numerically encode the amino acids in the protein sequence. The numerical coding model reflects three physicochemical properties of amino acids, and through it the protein sequence is converted into three different numerical sequences;

2) Based on partial order theory, construct three Hasse matrices, each reflecting a single property of the protein sequence; the elements of these three matrices are only "0" and "1". Then combine the three Hasse matrices into one improved Hasse matrix, whose elements are the eight digits "0", "1", "2", "3", "4", "5", "6" and "7";

3) Let "0" denote black, "1" blue, "2" green, "3" cyan, "4" red, "5" magenta, "6" yellow and "7" white, and use visualization techniques to convert the improved Hasse matrix into an image in eight colors, obtaining a visual image that carries the features of the complete protein sequence.

Depending on the point of view taken, many amino acid numerical coding models already exist. Numerical coding has the following benefits: 1. numbers are simpler than characters; 2. numerical coding can reduce information redundancy and storage space; 3. a good numerical code can represent various amino acid properties, such as hydrophilicity and electrical polarity; 4. numerical codes have a strict magnitude relation and are totally ordered; 5. once numerically encoded, an amino acid sequence can be analyzed with existing digital signal processing techniques.

Compared with traditional sequence alignment, the method of the invention offers complete-sequence analysis, intuitiveness, and universality. First, the method analyzes the complete sequence, so it can take into account long-range interactions within the sequence and the characteristics of its arrangement, and so reveal the nature of the sequence. Traditional sequence analysis, by contrast, only finds the position and content of mutations through alignment and cannot give the compositional features of a sequence. Second, the method converts a protein sequence into a two-dimensional image and exploits the sensitivity of human vision to images to find the features of the generated image, whereas traditional methods analyze the one-dimensional sequence directly, which is clearly an abstract and tedious process. Finally, with parameters obtained from the images generated from different kinds of protein sequences, computing the correlation between sequences can clearly classify proteins, which shows that the method is universal.

Description of drawings

Fig. 1 is the visual image of centractin P42025.

Fig. 2 is the visual image of centractin Q54179.

Fig. 3 is the visual image of acetyltransferase P22763.

Embodiment

The invention is illustrated with protein sequence similarity as an example embodiment. Three different protein sequences were downloaded from UniProt: two centractins, from human and slime mold respectively, and one acetyltransferase. These protein sequences are visualized, and their similarity is analyzed through the images. Table 1 lists information about these sequences.

Table 1: Three different protein sequences

Accession	Protein name	Length
P42025	Beta-centractin	376
Q54179	Centractin	383
P22763	Arylacetamide deacetylase	399

This embodiment proceeds as follows:

1) Amino acid numerical coding

This design adopts the three amino acid numerical coding models shown in Table 2 below, which represent amino acid hydrophobicity, hydrophilicity, and side-chain molecular weight respectively.

Table 2: Amino acid numerical coding model

Through the amino acid numerical coding model above, a protein character string can be converted into three numerical sequences.

2) Constructing the improved Hasse matrix based on partial order theory

A partial order adds a relation of incomparability to the classical hierarchy of "greater than" and "less than or equal to". From Table 2, the order of the hydrophobicity values of the 20 amino acids is:

I > F > V > L > W > M > A > G > C > Y > P > T > S > H > E > N > Q > D > K > R;

The order of the hydrophilicity values of the 20 amino acids is:

R = K = E = D > S > Q = N > G = P > T > A = H > C > M > V > L = I > Y > F > W;

The order of the side-chain molecular weights of the 20 amino acids is:

W > Y > R > F > H > M > E = K > Q > D > N > I = L > C > T > V > P > S > A > G.

Suppose a protein sequence is S = s_1 s_2 ... s_N. Comparing the amino acids of the sequence pairwise with respect to a given amino acid physicochemical property yields a Hasse matrix. When the sequence length is N, the Hasse matrix is N × N. The Hasse matrix is as follows:

H = (h_{ij})_{N \times N}

where:

h_{ij} = \begin{cases} 1, & s_{i} \geq s_{j} \\ 0, & \text{otherwise} \end{cases}

(i = 1, 2, ..., N; j = 1, 2, ..., N)
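The element-wise definition can be sketched in a few lines. The function name `hasse_matrix` and the four property values are made up for illustration; the comparison itself follows the definition of h_{ij} given here.

```python
def hasse_matrix(values):
    """Build the N x N 0/1 matrix with h[i][j] = 1 if values[i] >= values[j], else 0.

    `values` holds one numeric property value per residue of the sequence.
    """
    return [[1 if vi >= vj else 0 for vj in values] for vi in values]


# Made-up property values for a four-residue sequence, for illustration only.
H = hasse_matrix([3, 1, 2, 3])
for row in H:
    print(*row)
```

The diagonal is always 1, and two residues with equal values produce 1 in both h[i][j] and h[j][i].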

If P amino acid physicochemical properties are compared, one protein sequence yields P Hasse matrices. Using the amino acid hydrophobicity, hydrophilicity, and side-chain molecular weight described above (P = 3), the invention forms three Hasse matrices, written as:

H^{i} \quad (i = 1, 2, 3) \qquad (2)

The three Hasse matrices, each representing a single physicochemical property of the protein sequence, are combined into one N × N matrix, called the improved Hasse matrix H'. The elements of H' are computed as follows:

H'_{i,j} = H^{1}_{i,j} \times 2^{P-1} + H^{2}_{i,j} \times 2^{P-2} + \cdots + H^{P}_{i,j} \qquad (3)

Because a Hasse matrix representing a single physicochemical property contains only the digits "0" and "1", when P equals 3 the elements of the improved Hasse matrix take the eight values "0", "1", "2", "3", "4", "5", "6" and "7". Take a protein sequence of length 6 as an example, MGAPFV; its improved Hasse matrix is:

H' = \begin{pmatrix} 7 & 6 & 6 & 6 & 1 & 5 \\ 1 & 7 & 1 & 3 & 1 & 1 \\ 1 & 6 & 7 & 2 & 1 & 1 \\ 1 & 5 & 5 & 7 & 1 & 1 \\ 6 & 6 & 6 & 6 & 7 & 6 \\ 2 & 6 & 6 & 6 & 1 & 7 \end{pmatrix} \qquad (4)
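The MGAPFV example can be checked with a short sketch. Two assumptions are made. First, the numeric values of Table 2 are not available here, so integer ranks derived from the three orderings quoted earlier stand in for them; only the order matters for the comparison. Second, the bit order used is side-chain molecular weight, then hydrophobicity, then hydrophilicity. That order is chosen because it reproduces matrix (4); it differs from the order in which the text lists the three properties. All function names are illustrative.

```python
# Orderings quoted in the text, highest value first; "=" marks a tie.
HYDROPHOBICITY = "I>F>V>L>W>M>A>G>C>Y>P>T>S>H>E>N>Q>D>K>R"
HYDROPHILICITY = "R=K=E=D>S>Q=N>G=P>T>A=H>C>M>V>L=I>Y>F>W"
SIDE_CHAIN_MASS = "W>Y>R>F>H>M>E=K>Q>D>N>I=L>C>T>V>P>S>A>G"


def ranks(ordering):
    """Map each amino acid to an integer rank (larger = higher value, ties share a rank).
    These ranks are stand-ins for the real property values."""
    groups = ordering.split(">")
    return {aa: len(groups) - k for k, group in enumerate(groups) for aa in group.split("=")}


def hasse_matrix(values):
    """h[i][j] = 1 if values[i] >= values[j], else 0."""
    return [[1 if vi >= vj else 0 for vj in values] for vi in values]


def improved_hasse(sequence, orderings):
    """Combine one Hasse matrix per property using equation (3)."""
    P = len(orderings)
    mats = []
    for ordering in orderings:
        rank = ranks(ordering)
        mats.append(hasse_matrix([rank[aa] for aa in sequence]))
    n = len(sequence)
    # The p-th matrix (0-based) contributes its 0/1 entry times 2^(P-1-p).
    return [
        [sum(m[i][j] << (P - 1 - p) for p, m in enumerate(mats)) for j in range(n)]
        for i in range(n)
    ]


# Assumed bit order: mass (x4), hydrophobicity (x2), hydrophilicity (x1).
H_prime = improved_hasse("MGAPFV", [SIDE_CHAIN_MASS, HYDROPHOBICITY, HYDROPHILICITY])
for row in H_prime:
    print(*row)
```

As a worked entry, take row M, column V: M is heavier than V (1), less hydrophobic than V (0), and more hydrophilic than V (1), giving 1×4 + 0×2 + 1×1 = 5, the last element of the first row of (4).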

3) Generating the sequence image

In the improved Hasse matrix, "0" is defined as black, "1" as blue, "2" as green, "3" as cyan, "4" as red, "5" as magenta, "6" as yellow and "7" as white. Using visualization techniques, the two-dimensional matrix is converted into a color image with 8 color levels.
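The color mapping can be sketched as below. The text gives color names only, so the exact RGB triples (pure primaries at 0 or 255) are an assumption, as are the plain-text PPM output format and the function name `to_ppm`. The matrix is the MGAPFV example from equation (4).

```python
# Colour table from the text: matrix value -> (R, G, B); the RGB triples are assumed.
PALETTE = {
    0: (0, 0, 0),        # black
    1: (0, 0, 255),      # blue
    2: (0, 255, 0),      # green
    3: (0, 255, 255),    # cyan (blue-green)
    4: (255, 0, 0),      # red
    5: (255, 0, 255),    # magenta
    6: (255, 255, 0),    # yellow
    7: (255, 255, 255),  # white
}


def to_ppm(matrix):
    """Render a matrix of values 0..7 as a plain-text PPM (P3) image, one pixel per element."""
    height, width = len(matrix), len(matrix[0])
    lines = ["P3", f"{width} {height}", "255"]
    for row in matrix:
        lines.append(" ".join(f"{r} {g} {b}" for r, g, b in (PALETTE[v] for v in row)))
    return "\n".join(lines) + "\n"


# Improved Hasse matrix of MGAPFV, copied from equation (4).
H_PRIME = [
    [7, 6, 6, 6, 1, 5],
    [1, 7, 1, 3, 1, 1],
    [1, 6, 7, 2, 1, 1],
    [1, 5, 5, 7, 1, 1],
    [6, 6, 6, 6, 7, 6],
    [2, 6, 6, 6, 1, 7],
]

ppm_text = to_ppm(H_PRIME)
# ppm_text can be written to a .ppm file and opened in most image viewers.
```

With this palette the red, green and blue channels are simply the three binary digits of each matrix element, so each channel of the image shows one of the three single-property Hasse matrices.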

Figs. 1, 2 and 3 show clearly that images generated from similar proteins look very much alike, while images generated from dissimilar proteins do not.
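The visual comparison could be complemented by a numerical one. The text does not say which image parameter is used, so the sketch below is one hypothetical choice only: the fraction of pixels of each of the eight colors, compared by Pearson correlation. It has the convenience of working for images of different sizes, but it is not claimed to be the measure used in the invention. Both matrices are toy inputs.

```python
import math


def colour_histogram(matrix):
    """Fraction of matrix elements taking each value 0..7."""
    counts = [0] * 8
    for row in matrix:
        for v in row:
            counts[v] += 1
    total = sum(counts)
    return [c / total for c in counts]


def pearson(a, b):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


# Toy inputs: the MGAPFV matrix from equation (4) and a made-up 2 x 2 matrix.
A = [
    [7, 6, 6, 6, 1, 5],
    [1, 7, 1, 3, 1, 1],
    [1, 6, 7, 2, 1, 1],
    [1, 5, 5, 7, 1, 1],
    [6, 6, 6, 6, 7, 6],
    [2, 6, 6, 6, 1, 7],
]
B = [[7, 6], [1, 7]]

hist_a, hist_b = colour_histogram(A), colour_histogram(B)
similarity = pearson(hist_a, hist_b)
```

A histogram discards the spatial layout of the image, so in practice it would be at most a first, coarse parameter.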

Claims

1. A visual extraction method for protein sequence characteristics, characterized by comprising the following steps in order:

1) numerically encoding the amino acids in the protein sequence, the protein character string being converted through an encoding model into three different numerical sequences reflecting physicochemical properties of the protein sequence, the encoding model being as shown in the following table:

the amino acid numerical coding model shown;