CN101826132B - Visual extraction method for protein sequence characteristics - Google Patents


Info

Publication number
CN101826132B
Authority
CN
China
Prior art keywords
expression
protein
sequence
haas
protein sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010100242A
Other languages
Chinese (zh)
Other versions
CN101826132A (en)
Inventor
肖绚
王普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdezhen Ceramic Institute
Original Assignee
Jingdezhen Ceramic Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdezhen Ceramic Institute filed Critical Jingdezhen Ceramic Institute
Priority to CN201010100242A priority Critical patent/CN101826132B/en
Publication of CN101826132A publication Critical patent/CN101826132A/en
Application granted granted Critical
Publication of CN101826132B publication Critical patent/CN101826132B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a visual extraction method for protein sequence characteristics, which mainly comprises the following steps: firstly, numerically encoding every amino acid in the protein sequence; converting a protein character sequence into three digit sequences reflecting protein sequence physicochemical property through an encoding model; constructing three Hasse matrices on the basis of the partial ordering theory; transforming the three Hasse matrices into an improved Hasse matrix, wherein the elements in the Hasse matrix comprise 0, 1, 2, 3, 4, 5, 6 and 7; and finally, transforming the improved Hasse matrix into an 8-color picture to obtain a visual picture of protein long-sequence characteristics. The method has the characteristics of long sequence analysis, visuality and universality, and can obtain the characteristics of different protein sequences from the generated visual sequence picture.

Description

Visual extraction method for protein sequence characteristics
Technical field
The present invention is a method for visually extracting the features of complete protein sequences. It draws on image processing, pattern recognition and conventional protein sequence analysis. Unlike traditional protein sequence alignment methods, it reflects the characteristics of a protein sequence in a visual, intuitive form.
Background technology
As more and more gene sequences are deposited in biological databases, analyzing these sequences has become an urgent requirement, one that challenges both biologists and computer scientists. In traditional gene sequence analysis, a considerable part of the work is done by sequence comparison. Traditional gene sequence comparison relies mainly on alignment, in which bases are compared one by one; a typical approach uses mature software such as BLAST (http://www.ncbi.nlm.nih.gov/BLAST). Such software readily reveals deletions, insertions and substitutions of bases. Although this approach identifies genetic variation very simply, its results are not intuitive.
These gene sequences can be analyzed at several levels, such as base sequence, protein and genome. Because many biological phenotypic traits and much of gene regulation are determined by the amino acid sequences of proteins, analyzing amino acid sequences has certain advantages. A protein sequence is a one-dimensional character string composed of 20 kinds of amino acids, and it is very difficult to infer the biological properties hidden in it. For this reason many methods have been designed to convert gene sequences into digital signals, curves and the like, which are then studied with signal processing methods, fractal theory and so on. Among these, visualization techniques turn symbols into geometric objects, making the invisible visible; they give researchers an intuitive understanding of their work, offer new insights and have advanced research on gene sequences.
In 2006 Randic designed a method that converts a protein sequence into a polyline in two-dimensional space. The 20 amino acids are mapped to 20 different vectors $(x_{i,0}, y_{i,0})$, whose endpoints are evenly distributed on a circle of radius 1. To convert a protein sequence, each amino acid, taken in sequence order, is mapped to a point in the plane, and consecutive points are joined by straight lines; the resulting polyline represents the protein sequence (Randic, M., Butina, D., Zupan, J. (2006). Novel 2-D graphical representation of proteins. Chemical Physics Letters 419, 528-532). The points $(x_n, y_n)$ are computed as $x_n = (x_{n-1} + x_{i,0})/2$ and $y_n = (y_{n-1} + y_{i,0})/2$. When the protein sequence is long, the polyline produced by this method becomes a tangle that is hard to read. In 2008 Yao Yuhua of China improved on Randic's design: the 20 amino acids again correspond to 20 two-dimensional vectors, but the vectors are assigned with the physicochemical properties of the amino acids in mind, so that amino acids with similar properties have nearby vectors, and the vectors differ in length. The polyline obtained by this method, however, can overlap itself and so lose some of the information contained in the original protein sequence (Yao, Y.H., Dai, Q., Li, C., He, P.A., Nan, Y.Y., Zhang, Y.Z. (2008). Analysis of similarity/dissimilarity of protein sequences. Proteins 73, 864-871).
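The midpoint recursion quoted for Randic's polyline can be sketched in a few lines of Python. This is only an illustration of the two formulas given in the paragraph: the order of the amino acids around the circle (alphabetical by one-letter code), the starting point (the origin) and the function names are assumptions and are not taken from the cited paper.

```python
import math

# Assumption: amino acids are placed around the circle in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unit_circle_vectors(alphabet=AMINO_ACIDS):
    """Place each amino acid at an evenly spaced point on the unit circle."""
    n = len(alphabet)
    return {
        aa: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k, aa in enumerate(alphabet)
    }


def polyline(sequence, start=(0.0, 0.0)):
    """Return the points (x_n, y_n), each the midpoint of the previous
    point and the circle point of the current amino acid."""
    vectors = unit_circle_vectors()
    x, y = start  # assumption: the walk starts at the origin
    points = []
    for aa in sequence:
        vx, vy = vectors[aa]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


points = polyline("MGAPFV")
```

Because every step halves the distance to a point on the unit circle, all points stay inside the unit disk, which is consistent with the remark that long sequences crowd into a tangle.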
The methods above all convert a protein sequence into a two-dimensional polyline. In 2006 Xiao Xuan proposed converting a protein sequence into a two-dimensional image using a cellular automaton. The amino acid sequence is first converted into a sequence of "0" and "1"; a specific cellular automaton evolution rule is then applied to the encoded sequence, and after several iterations a two-dimensional matrix of "0" and "1" is formed; this matrix is converted into a black-and-white image and scaled to give a visual model of the protein. Because the evolution that generates the image does not take amino acid properties into account, the resulting image is difficult to interpret in terms of bioinformatics knowledge.
Converting a protein sequence into an image opens a way to apply image processing techniques to protein sequence analysis, but the image must reflect the characteristics of the protein sequence. Existing two-dimensional polyline methods can consider at most two physicochemical properties of the amino acids; designing a protein visualization method that reflects several physicochemical properties remains an open research topic.
Summary of the invention
The object of the invention is to address shortcomings of traditional gene sequence analysis, such as incomplete functional analysis and unintuitive results, by providing a visual extraction method for the features of complete protein sequences. The features of different sequences can be obtained from the generated protein sequence images, and these sequence features can then be analyzed and used in medical research.
To this end, the visual extraction method for protein sequence characteristics proposed by the invention is characterized by the following steps, performed in order:
1) Numerically encode the amino acids of the protein sequence. The numerical encoding model reflects three physicochemical properties of the amino acids and converts the protein sequence into three different digital sequences;
2) Based on partial order theory, construct three Hasse matrices, each reflecting a single property of the protein sequence; the elements of these three matrices take only the two values "0" and "1". The three Hasse matrices are then transformed into one improved Hasse matrix, whose elements take the eight values "0", "1", "2", "3", "4", "5", "6" and "7";
3) Let "0" represent black, "1" blue, "2" green, "3" cyan, "4" red, "5" magenta, "6" yellow and "7" white. Using visualization techniques, convert the improved Hasse matrix into an image of eight colors, giving a visual image that carries the features of the complete protein sequence.
Many amino acid numerical encoding models exist, reflecting different points of view. Numerical encoding has the following advantages: 1. numbers are simpler than characters; 2. numerical encoding can reduce information redundancy and storage space; 3. a good encoding can represent various amino acid properties, such as hydrophilicity and electrical polarity; 4. numerical codes have a strict magnitude relation and are totally ordered; 5. once encoded, an amino acid sequence can be analyzed with existing digital signal processing techniques.
Compared with traditional sequence alignment, the method offers complete-sequence analysis, intuitiveness and universality. First, it analyzes the complete sequence, so it can take long-range interactions within the sequence into account and expose the arrangement and composition features of the sequence. Traditional sequence analysis, by contrast, can only find the position and content of mutations through alignment and cannot give the compositional features of a sequence. Second, the method converts a protein sequence into a two-dimensional image and exploits the sensitivity of human vision to images to find the features of the generated image, whereas traditional methods analyze the one-dimensional sequence directly, which is clearly an abstract and tedious process. Finally, parameters obtained from the images generated from different kinds of protein sequences, together with the correlations computed between sequences, allow proteins to be clearly classified, which shows that the method is universal.
Description of drawings
Fig. 1 is the visual image of centractin P42025.
Fig. 2 is the visual image of centractin Q54179.
Fig. 3 is the visual image of acetyltransferase P22763.
Embodiment
The invention is illustrated by an embodiment on protein sequence similarity. Three different protein sequences were downloaded from UniProt: two centractins, from human and slime mold respectively, and one acetyltransferase. These sequences were visualized and their similarity was analyzed from the images. Table 1 lists the relevant information for these sequences.
Table 1: Three different protein sequences
Accession   Protein name                Length
P42025      Beta-centractin             376
Q54179      Centractin                  383
P22763      Arylacetamide deacetylase   399
This embodiment proceeds as follows:
1) Numerical encoding of amino acids
The design uses the three amino acid numerical encoding models shown in Table 2 below, which represent the hydrophobicity, hydrophilicity and side-chain molecular weight of the amino acids, respectively.
Table 2: Amino acid numerical encoding model
[Table 2 appears as images in the original publication (GSB00000820734200041, GSB00000820734200051); its values are not reproduced here.]
With the above encoding model, a protein character string can be converted into three digital sequences.
2) Constructing the improved Hasse matrix based on partial order theory
A partial order extends the classical hierarchy of "greater than" and "less than or equal to" with a relation of incomparability. From Table 2, the order of the hydrophobicity values of the 20 amino acids is:
I>F>V>L>W>M>A>G>C>Y>P>T>S>H>E>N>Q>D>K>R;
The order of the hydrophilicity values of the 20 amino acids is:
R=K=E=D>S>Q=N>G=P>T>A=H>C>M>V>L=I>Y>F>W;
The order of the side-chain molecular weights of the 20 amino acids is:
W>Y>R>F>H>M>E=K>Q>D>N>I=L>C>T>V>P>S>A>G.
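The encoding step can be sketched from the three orderings alone. The numeric values of Table 2 are not available in the text, so the sketch below substitutes integer ranks derived from the orderings; this is an assumption, although it preserves every comparison the later steps rely on. The helper names `ranks_from_order` and `encode` are hypothetical.

```python
def ranks_from_order(order):
    """Turn a string such as 'A>B=C>D' into integer ranks.

    A larger rank means a larger property value; amino acids joined
    by '=' share the same rank.
    """
    groups = order.split(">")
    ranks = {}
    for position, group in enumerate(groups):
        for aa in group.split("="):
            ranks[aa] = len(groups) - position
    return ranks


# The three orderings exactly as listed in the description.
ORDERS = {
    "hydrophobicity": "I>F>V>L>W>M>A>G>C>Y>P>T>S>H>E>N>Q>D>K>R",
    "hydrophilicity": "R=K=E=D>S>Q=N>G=P>T>A=H>C>M>V>L=I>Y>F>W",
    "side_chain_mass": "W>Y>R>F>H>M>E=K>Q>D>N>I=L>C>T>V>P>S>A>G",
}
RANKS = {name: ranks_from_order(order) for name, order in ORDERS.items()}


def encode(sequence):
    """Convert a protein string into three numeric sequences (rank stand-ins)."""
    return {name: [ranks[aa] for aa in sequence] for name, ranks in RANKS.items()}


encoded = encode("MGAPFV")
```

Any strictly increasing re-scaling of these ranks, including the actual Table 2 values, would lead to the same pairwise comparisons.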
Suppose a protein sequence is $S = s_1 s_2 \cdots s_N$. Comparing the amino acids of the sequence pairwise according to a given physicochemical property yields a Hasse matrix. For a protein sequence of length N, the Hasse matrix is N × N:
$$H = \begin{pmatrix} h_{11} & h_{12} & \cdots & h_{1N} \\ h_{21} & h_{22} & \cdots & h_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ h_{N1} & h_{N2} & \cdots & h_{NN} \end{pmatrix} \qquad (1)$$
where
$$h_{ij} = \begin{cases} 1, & s_i \ge s_j \\ 0, & \text{otherwise} \end{cases} \qquad (i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, N)$$
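The element rule for a single-property Hasse matrix translates directly into code. The following minimal sketch assumes the sequence has already been encoded as numbers; the function name `hasse_matrix` and the toy input are illustrative only.

```python
def hasse_matrix(values):
    """Build the N x N matrix with h[i][j] = 1 if values[i] >= values[j], else 0."""
    return [[1 if vi >= vj else 0 for vj in values] for vi in values]


# Toy numeric sequence standing in for one encoded property.
H = hasse_matrix([3, 1, 2])
for row in H:
    print(row)
```

Since every value is greater than or equal to itself, the diagonal is always 1, and for two different values exactly one of h[i][j] and h[j][i] is 1.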
If P physicochemical properties of the amino acids are compared, a protein sequence yields P Hasse matrices. The invention builds three Hasse matrices from the hydrophobicity, hydrophilicity and side-chain molecular weight described above (P = 3), denoted:
$$H^{i} \quad (i = 1, 2, 3) \qquad (2)$$
The three Hasse matrices, each expressing a single physicochemical property of the protein sequence, are combined into one N × N matrix, called the improved Hasse matrix H′, whose elements are computed as follows:
$$H'_{i,j} = H^{1}_{i,j} \times 2^{P-1} + H^{2}_{i,j} \times 2^{P-2} + \cdots + H^{P}_{i,j} \qquad (3)$$
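For the case used in this embodiment (P = 3), the formula can be written out explicitly. The block below is only a worked instance of the formula, treating the three single-property entries as binary digits:

```latex
H'_{i,j} = 4\,H^{1}_{i,j} + 2\,H^{2}_{i,j} + H^{3}_{i,j},
\qquad H^{k}_{i,j} \in \{0, 1\}
\;\Longrightarrow\; 0 \le H'_{i,j} \le 7
```

For example, $(H^{1}_{i,j}, H^{2}_{i,j}, H^{3}_{i,j}) = (1, 1, 0)$ gives $4 + 2 + 0 = 6$, and $(1, 1, 1)$, which always occurs on the diagonal, gives 7.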
Because a Hasse matrix expressing a single physicochemical property contains only the two values "0" and "1", when P equals 3 the elements of the improved Hasse matrix take the eight values "0", "1", "2", "3", "4", "5", "6" and "7". Taking the protein sequence MGAPFV of length 6 as an example, the corresponding improved Hasse matrix is:
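$$H' = \begin{pmatrix} 7 & 6 & 6 & 6 & 1 & 5 \\ 1 & 7 & 1 & 3 & 1 & 1 \\ 1 & 6 & 7 & 2 & 1 & 1 \\ 1 & 5 & 5 & 7 & 1 & 1 \\ 6 & 6 & 6 & 6 & 7 & 6 \\ 2 & 6 & 6 & 6 & 1 & 7 \end{pmatrix} \qquad (4)$$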
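The MGAPFV example can be checked with a short sketch. The relative ranks below are read off the three orderings for the six residues involved and stand in for the Table 2 values, which is an assumption. Note also that the example matrix is reproduced when the weights 4, 2 and 1 are given to side-chain molecular weight, hydrophobicity and hydrophilicity in that order; this assignment is inferred from the example rather than stated in the text, and the function names are hypothetical.

```python
def hasse_matrix(values):
    """h[i][j] = 1 if values[i] >= values[j], else 0."""
    return [[1 if vi >= vj else 0 for vj in values] for vi in values]


def improved_hasse(matrices):
    """Combine P binary matrices; the first matrix gets weight 2**(P-1)."""
    p = len(matrices)
    n = len(matrices[0])
    return [
        [sum(m[i][j] << (p - 1 - k) for k, m in enumerate(matrices)) for j in range(n)]
        for i in range(n)
    ]


sequence = "MGAPFV"
# Relative ranks of the six residues under each ordering (larger = greater).
hydrophobicity = {"F": 6, "V": 5, "M": 4, "A": 3, "G": 2, "P": 1}
hydrophilicity = {"G": 5, "P": 5, "A": 4, "M": 3, "V": 2, "F": 1}
side_chain_mass = {"F": 6, "M": 5, "V": 4, "P": 3, "A": 2, "G": 1}

# Assumed weight order (inferred from the example): mass, hydrophobicity, hydrophilicity.
layers = [side_chain_mass, hydrophobicity, hydrophilicity]
matrices = [hasse_matrix([ranks[aa] for aa in sequence]) for ranks in layers]
H_prime = improved_hasse(matrices)

for row in H_prime:
    print(row)
```

Each entry can be read as a three-bit code: for instance, the entry 5 in the first row (M against V) means that M is at least as large as V in side-chain molecular weight and hydrophilicity, but not in hydrophobicity.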
3) Generating the sequence image
In the improved Hasse matrix, let "0" represent black, "1" blue, "2" green, "3" cyan, "4" red, "5" magenta, "6" yellow and "7" white. Using visualization techniques, the two-dimensional matrix is converted into a color image with eight color levels.
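The color assignment can be turned into an image with nothing more than the standard library. The sketch below writes the matrix as a binary PPM (P6) image; the choice of PPM, the `scale` parameter and the function name are assumptions made for illustration, and any image library could be used instead.

```python
# Color table from the description: value -> (R, G, B).
PALETTE = {
    0: (0, 0, 0),        # black
    1: (0, 0, 255),      # blue
    2: (0, 255, 0),      # green
    3: (0, 255, 255),    # cyan
    4: (255, 0, 0),      # red
    5: (255, 0, 255),    # magenta
    6: (255, 255, 0),    # yellow
    7: (255, 255, 255),  # white
}


def matrix_to_ppm(matrix, scale=1):
    """Return the bytes of a binary PPM image; each element becomes a
    scale x scale block of one color."""
    height = len(matrix) * scale
    width = len(matrix[0]) * scale
    pixels = bytearray()
    for row in matrix:
        line = bytearray()
        for value in row:
            line.extend(bytes(PALETTE[value]) * scale)
        pixels.extend(line * scale)
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixels)


# Improved Hasse matrix of the MGAPFV example.
H_PRIME = [
    [7, 6, 6, 6, 1, 5],
    [1, 7, 1, 3, 1, 1],
    [1, 6, 7, 2, 1, 1],
    [1, 5, 5, 7, 1, 1],
    [6, 6, 6, 6, 7, 6],
    [2, 6, 6, 6, 1, 7],
]
image = matrix_to_ppm(H_PRIME, scale=2)
# To view the result, write the bytes to a file opened in binary mode.
```

With this color table, the red, green and blue channels correspond to the three bits of each matrix element (weights 4, 2 and 1), so each channel on its own displays one of the single-property Hasse matrices.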
Figs. 1, 2 and 3 show clearly that images generated from similar proteins look very much alike, while images generated from dissimilar proteins do not.

Claims (1)

1. A visual extraction method for protein sequence characteristics, characterized by comprising the following steps in order:
1) numerically encoding the amino acids of the protein sequence, and converting the protein character string through an encoding model into three different digital sequences reflecting the physicochemical properties of the protein sequence, the encoding model being the amino acid numerical encoding model shown in the following table:
[The encoding model table appears as image FSB00000820734100011 in the original publication; its values are not reproduced here.]
2) based on partial order theory, constructing three Hasse matrices, each reflecting a single property of the protein sequence, the elements of these three Hasse matrices taking only the two values "0" and "1"; and then transforming the three Hasse matrices into one improved Hasse matrix, whose elements take the eight values "0", "1", "2", "3", "4", "5", "6" and "7";
3) with "0" representing black, "1" blue, "2" green, "3" cyan, "4" red, "5" magenta, "6" yellow and "7" white, converting the improved Hasse matrix by visualization techniques into an image of eight colors, thereby obtaining a visual image carrying the features of the complete protein sequence.
CN201010100242A 2010-01-22 2010-01-22 Visual extraction method for protein sequence characteristics Expired - Fee Related CN101826132B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010100242A CN101826132B (en) 2010-01-22 2010-01-22 Visual extraction method for protein sequence characteristics


Publications (2)

Publication Number Publication Date
CN101826132A CN101826132A (en) 2010-09-08
CN101826132B true CN101826132B (en) 2012-10-10

Family

ID=42690048


Country Status (1)

Country Link
CN (1) CN101826132B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324002B (en) * 2011-06-03 2013-10-30 哈尔滨工程大学 Two-dimensional image representation method of digital image processing-based DNA sequence
CN105224825B (en) * 2015-10-30 2018-03-06 景德镇陶瓷大学 A kind of fusion is local and the RNA sequences of global characteristics describe method
CN108052800A (en) * 2017-12-19 2018-05-18 石家庄铁道大学 The visualization method for reconstructing and terminal of a kind of infective virus communication process
CN112270727B (en) * 2020-10-23 2022-09-23 内蒙古民族大学 Method for drawing strain protein image based on AI technology
WO2023004699A1 (en) * 2021-07-29 2023-02-02 西门子股份公司 Method and apparatus for presenting data integrity of transformer, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiao Xuan (肖绚) et al. A novel protein sequence visualization model (一种新颖的蛋白质序列可视化模型). Computer Engineering (《计算机工程》). 2008, Vol. 34, No. 03. *

Also Published As

Publication number Publication date
CN101826132A (en) 2010-09-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121010

Termination date: 20150122

EXPY Termination of patent right or utility model