CN1257974C

CN1257974C - RNA sequential characteristic visual extracting method

Info

Publication number: CN1257974C
Application number: CN 200410025035
Authority: CN
Inventors: 王猛; 黄振德; 杨杰; 刘国平; 徐志节; 姚莉秀
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2004-06-10
Filing date: 2004-06-10
Publication date: 2006-05-31
Anticipated expiration: 2024-06-10
Also published as: CN1584027A

Abstract

The present invention relates to a method for visually extracting the features of complete RNA sequences. It comprises four main steps: data preprocessing, rule selection, application of the cellular automaton (CA) method, and sequence image generation. Each nucleotide 'A', 'T', 'G', 'U' in the obtained RNA sequence is encoded; a one-dimensional cellular automaton model is introduced; and a specific cellular automaton rule is used to evolve the encoded '0'/'1' gene sequence, that is, the state of a cell at the next time step is determined by the evolution rule from the current states of the cell and of its two neighbors on the left and right. After a number of evolution steps a two-dimensional '0'/'1' matrix is formed, which is converted into a black-and-white image and scaled down to obtain a visual image carrying the features of the complete RNA sequence. The method offers complete-sequence analysis, intuitiveness, sensitivity and universality, and the features of different gene sequences can be read from the generated sequence images.

Description

Method for visually extracting features of complete RNA sequences

Technical field

The present invention is a method for visually extracting the features of complete RNA sequences. It draws on image processing, pattern recognition and conventional gene sequence analysis. Unlike conventional RNA sequence alignment analysis, it reflects the characteristics of a gene sequence in a comparatively visual way.

Background technology

The 21st century is the century of biology. After the Human Genome Project was completed, biologists turned more of their attention to gene sequence analysis. In conventional gene sequence analysis, a considerable part of the work is done by comparing gene sequences. Conventional comparison mainly aligns the genes and compares them base by base; a typical approach uses mature software such as BLAST (http://www.ncbi.nlm.nih.gov/BLAST). Such software readily reveals deletions, insertions and mutations of bases. Although this approach identifies gene variation very simply, its results are not intuitive. Other gene sequence analysis methods, such as protein secondary structure prediction [Kuo-Chen Chou, 2000, Prediction of Protein Structural Classes and Subcellular Locations, Current Protein and Peptide Science, 2000], analyze the possible function of part of a gene through specific structures. These methods, in turn, lean too heavily toward local function.

In the 1950s, the computer pioneer and famous mathematician John von Neumann hoped to use specific programs to realize on a computer something similar to the self-replication of cells during the growth of an organism [Wolfram, S. 2002. A New Kind of Science. Wolfram Media Inc., Champaign, IL]. He proposed a simple model: a rectangular plane is divided into a grid, each grid point represents a cell or a primitive of the system, and its state is assigned 0 or 1, drawn in the grid as an empty or a filled square. Under rules set in advance, the evolution of the cells or primitives is described by the changes of the filled and empty squares. Such a model is a cellular automaton. The in-depth studies of S. Wolfram then fully demonstrated the great power of cellular automata (CA) to simulate complex systems with simple rules [Wolfram, S. 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Cellular automata provide simple models for physics, biology and computer science, and it is the repeated computation of these simple models that yields discrete models of complex systems. The approach is very effective for simulating complex systems with simple rules, but it has not been applied to biological sequence analysis. For a system as complex as a gene sequence, using the CA method to visualize it and then analyzing the generated image to obtain the characteristics of different gene sequences is a new research topic.

Summary of the invention

The object of the present invention is to address the shortcomings of conventional gene sequence analysis methods, namely that gene variation results are not intuitive and that functional analysis is not comprehensive, by providing a method for visually extracting the features of complete RNA sequences. The features of different gene sequences can be obtained from the generated gene visualization images, and these sequence features can then be analyzed and used in medical research.

To achieve this object, the cellular-automaton-based method for visually extracting complete RNA sequence features comprises four main steps: data preprocessing, rule selection, application of the cellular automaton (CA) method, and sequence image generation. First, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded. A one-dimensional cellular automaton (CA) model is then introduced, and a specific cellular automaton rule is used to evolve the encoded "0"/"1" gene sequence: the state of a cell at the next time step is determined by the evolution rule from the current states of the cell and of its two neighbors on the left and right. After a number of evolution steps a two-dimensional "0"/"1" matrix is formed, which is converted into a black-and-white image and scaled to obtain a visualization carrying the features of the complete RNA sequence.

The method of the present invention is carried out in the following steps:

1. Data preprocessing

First, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded so that the RNA sequence is converted into a "0"/"1" sequence, specifically: A=00, U=01, G=10, T=11, and one 0 is padded at each end of the sequence.

Processing the RNA sequence directly in its original form, as a string of A, T, G, U characters, would require a large amount of computation. Encoding the RNA sequence as a 0/1 sequence reduces the computation considerably. After the nucleotide sequence is encoded in this way, the new sequence is exactly twice as long as the original. So that the cells at the two ends of the sequence also take part in the computation, one 0 is padded at each end.
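The encoding step can be illustrated with a short Python sketch. It uses the code table given in the text (A=00, U=01, G=10, T=11); the function name `encode_rna` is a hypothetical helper chosen for illustration and does not come from the patent.

```python
# Code table as stated in the text: A=00, U=01, G=10, T=11.
CODES = {"A": "00", "U": "01", "G": "10", "T": "11"}

def encode_rna(seq):
    """Return the 0/1 list for a nucleotide string, padded with one 0 at each end.

    Hypothetical helper; a letter outside the four codes raises KeyError.
    """
    bits = [int(b) for base in seq.upper() for b in CODES[base]]
    return [0] + bits + [0]

encoded = encode_rna("AUG")
print(encoded)  # [0, 0, 0, 0, 1, 1, 0, 0]
```

The three bases give six bits (twice the original length), and the two padding zeros bring the row to eight cells.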

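2. Selection of the cellular automaton (CA) rule

For the encoded "0"/"1" sequence, the rule with the best discriminating power is selected as the evolution rule from among the "three determine one" rules of the CA method, in which three cells determine the next state of one cell.

The CA method has 256 "three determine one" rules in total, so the rules must be compared for the case at hand and the one with the best discriminating power chosen for the evolution. The present invention mainly uses rule No. 184 as the evolution rule. Rule No. 184 is defined as follows, where the upper row gives the current states of the left, center and right cells and the lower row gives the next state of the center cell:

\begin{matrix} 111 & 110 & 101 & 100 & 011 & 010 & 001 & 000 \\ 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \end{matrix}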
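As an illustration, the lookup table of any of the 256 rules can be unpacked from its rule number. The sketch below assumes the usual Wolfram numbering, in which bit k of the rule number is the output for the neighborhood whose three cells, read as a binary number, equal k. The name `rule_table` is hypothetical.

```python
def rule_table(number):
    """Map each (left, center, right) triple to the next state of the center cell.

    Assumes Wolfram numbering: bit k of `number` is the output for the
    neighborhood whose cells, read as a 3-bit binary number, equal k.
    """
    table = {}
    for value in range(8):
        triple = ((value >> 2) & 1, (value >> 1) & 1, value & 1)
        table[triple] = (number >> value) & 1
    return table

RULE_184 = rule_table(184)
for triple in sorted(RULE_184, reverse=True):
    print(triple, "->", RULE_184[triple])
```

Because 184 is 10111000 in binary, the outputs for 111, 110, ..., 000 read 1, 0, 1, 1, 1, 0, 0, 0. Passing a different number to `rule_table` gives the table of another candidate rule for comparison.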

3. Application of the cellular automaton (CA)

The present invention introduces a one-dimensional cellular automaton model for gene sequences. In this model all cells are arranged along a one-dimensional line. The encoded "0"/"1" gene sequence is evolved according to the selected cellular automaton rule; that is, the state of a cell at the next time step is determined by the evolution rule from the current states of the cell itself and of its two neighbors on the left and right.

The encoded original gene sequence is taken as the first row, the result of evolving the first row as the second row, the result of evolving the second row as the third row, and so on. Note that zero padding is needed not only at the two ends of the first row: after each evolution the new sequence is likewise padded with one 0 at each end for the next evolution step. After a number of evolution steps a two-dimensional "0"/"1" matrix is formed.
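A minimal Python sketch of this evolution follows. It is an illustration under stated assumptions, not the patent's own implementation: rows are stored without their padding cells, the two zeros are supplied anew at every step, and `steps` counts evolutions, so the matrix has `steps + 1` rows. The names `evolve` and `evolve_matrix` are hypothetical.

```python
# Rule 184 as a lookup table: (left, center, right) -> next state of center.
RULE_184 = {(1, 1, 1): 1, (1, 1, 0): 0, (1, 0, 1): 1, (1, 0, 0): 1,
            (0, 1, 1): 1, (0, 1, 0): 0, (0, 0, 1): 0, (0, 0, 0): 0}

def evolve(row, rule=RULE_184):
    """Apply one evolution step; `row` holds the cells without padding."""
    padded = [0] + row + [0]          # one 0 at each end, re-added every step
    return [rule[(padded[i - 1], padded[i], padded[i + 1])]
            for i in range(1, len(padded) - 1)]

def evolve_matrix(first_row, steps, rule=RULE_184):
    """Stack the first row and its successive evolutions into a matrix."""
    rows = [first_row]
    for _ in range(steps):
        rows.append(evolve(rows[-1], rule))
    return rows

# "AUG" encoded as A=00, U=01, G=10 gives 000110.
for row in evolve_matrix([0, 0, 0, 1, 1, 0], steps=2):
    print(row)
# [0, 0, 0, 1, 1, 0]
# [0, 0, 0, 1, 0, 1]
# [0, 0, 0, 0, 1, 0]
```

With the fixed zero boundary, a 1 that reaches the right edge disappears on the next step, which is why the third row holds a single 1.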

4. Generation of the sequence image

In the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white, and visualization techniques convert the matrix into a black-and-white binary image. This image is too large for its features to be analyzed directly, so the present invention scales it down in the horizontal and vertical directions to obtain a visualization carrying the features of the complete RNA sequence.

For the gene sequences collected, it is preferable to obtain complete gene sequences that can be compared with one another, which is generally achievable. When selecting the rule, a manageable segment of the complete gene sequence, for example 3000 bases, can be used. The selected rule is then applied to evolve the gene sequences, and the features of different gene sequences and their regularities can be found in the generated sequence images.
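Step 4, converting the matrix into a black-and-white image and scaling it down, can be sketched in plain Python as follows. The text does not say which image format or which resampling method is used, so both are assumptions here: the sketch writes a plain-text PGM (P2) image and shrinks by nearest-neighbor sampling. The names `shrink` and `to_pgm` are hypothetical.

```python
def shrink(matrix, fx, fy):
    """Reduce a 0/1 matrix by factors fx (horizontal) and fy (vertical).

    Nearest-neighbor sampling is an assumption; the text only says the
    image is reduced in both directions.
    """
    height, width = len(matrix), len(matrix[0])
    new_h = max(1, int(height / fy))
    new_w = max(1, int(width / fx))
    return [[matrix[int(r * fy)][int(c * fx)] for c in range(new_w)]
            for r in range(new_h)]

def to_pgm(matrix):
    """Serialize as plain PGM (P2): cell 0 -> black (0), cell 1 -> white (255)."""
    height, width = len(matrix), len(matrix[0])
    lines = ["P2", f"{width} {height}", "255"]
    for row in matrix:
        lines.append(" ".join("255" if cell else "0" for cell in row))
    return "\n".join(lines) + "\n"

small = shrink([[0, 1, 0, 1], [1, 1, 0, 0]], fx=2, fy=1)
print(to_pgm(small))
```

The returned string can be written to a `.pgm` file and opened in most image viewers; non-integer factors such as 3.5 are accepted because source indices are truncated to integers.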

Compared with conventional sequence alignment, the method of the invention offers complete-sequence analysis, intuitiveness, sensitivity and universality. First, the method analyzes the complete sequence, so it can take into account long-range interactions in the sequence and reveal the arrangement and combination features intrinsic to it, whereas conventional sequence analysis can only report the position and content of mutation points through alignment and cannot describe the compositional features of the sequence. Second, the method converts the sequence into a two-dimensional image and exploits the sensitivity of human vision to images to find features in the generated image, whereas conventional methods analyze the one-dimensional sequence directly, which is clearly an abstract and tedious process. Third, the method is sensitive to a small number of mutation points in a sequence and can therefore amplify the differences between sequences. Finally, analysis of a large number of virus sequences shows that the method can distinguish different classes of virus by choosing different rules; in other words, the method is universal.
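One way to probe the sensitivity described above is to evolve two sequences that differ in a single base and count the matrix cells that differ. The sketch below is an illustrative experiment, not part of the patent: the toy sequences are made up, and rule 184 is written in its Boolean form, next = (left AND NOT center) OR (center AND right), which agrees with the rule table on all eight neighborhoods.

```python
CODES = {"A": "00", "U": "01", "G": "10", "T": "11"}  # code table from the text

def encode(seq):
    """Encode a nucleotide string as a 0/1 list (padding is added per step)."""
    return [int(b) for base in seq for b in CODES[base]]

def step(row):
    """One rule-184 step with a 0 padded at each end."""
    p = [0] + row + [0]
    return [(p[i - 1] & (1 - p[i])) | (p[i] & p[i + 1])
            for i in range(1, len(p) - 1)]

def matrix(seq, steps):
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1]))
    return rows

def count_differences(seq_a, seq_b, steps):
    """Number of matrix cells that differ between two equal-length sequences."""
    a, b = matrix(seq_a, steps), matrix(seq_b, steps)
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Made-up toy sequences differing in one base (U -> A at position 4).
print(count_differences("AUGUAGGU", "AUGAAGGU", steps=10))
```

A count larger than the number of bits changed in the first row indicates that the single substitution has spread to later rows; how far it spreads depends on the sequence and on the rule chosen.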

Description of drawings

Fig. 1 is a schematic diagram of rule 184 among the cellular automaton (CA) rules.

From left to right, Fig. 1 shows the 8 possible combinations of three adjacent cells in the upper row of a "0"/"1" sequence, together with the value taken at the corresponding position in the next row.

Fig. 2 is the numerical representation of rule 184 corresponding to Fig. 1.

Its meaning is the same as that of Fig. 1, except that white and black are represented by the numerals 1 and 0, respectively.

Fig. 3 is the original image of coronavirus 229E (non-SARS).

Fig. 4 is the original image of coronavirus Sin2774 (SARS).

Embodiment

The technical solution of the present invention is further described below with reference to the drawings and an embodiment.

The analysis of SARS virus sequences is taken as an example to illustrate a specific embodiment of the invention. SARS, in full Severe Acute Respiratory Syndrome, is an acute infectious disease of the respiratory tract caused by a coronavirus. The RNA sequences of 66 different SARS viruses were downloaded from the NCBI website; each sequence is about 29,700 nucleotides long. These SARS virus sequences are visualized and analyzed to find the essential features of the SARS sequence and to determine how they differ from non-SARS coronavirus sequences, so that the sequence features of the SARS virus can be put to use. Table 1 lists the RNA sequences of the SARS viruses, and Table 2 lists the RNA sequences of the non-SARS coronaviruses.

Table 1: SARS virus sequences

SARS	Accession	Length	SARS	Accession	Length
BJ01	AY278488	29725	TC1	AY338174	29573
BJ02	AY278487	29745	HSR1	AY323977	29751
BJ03	AY278490	29740	Frankfurt1	AY291315	29727
BJ04	AY279354	29732	AS	AY427439	29711
GZ01	AY278489	29757	CUHK-	AY345986	29736
ZJ01	AY297028	29714	CUHK-	AY345987	29736
HKU39849	AY278491	29742	CUHK-	AY345988	29736
CUHK W1	AY278554	29736	GD69	AY313906	29754
CUHK Su10	AY282752	29736	PUMC01	AY350750	29738
Sin2500	AY283794	29711	PUMC02	AY357075	29738
Sin2677	AY283795	29705	PUMC03	AY357076	29745
Sin2679	AY283796	29711	Sino1-11	AY485277	29741
Sin2748	AY283797	29705	Sino3-11	AY485278	29740
Sin2774	AY283798	29729	SoD	AY461660	29715
TW1	AY291451	29714	GZ02	AY390556	29760
Urbani	AY278741	29727	ZS-C	AY395003	29647
Tor2	NC_004718	29751	LC5	AY395002	29350
GZ50	AY304495	29720	LC4	AY395001	29350
SZ16	AY304488	29731	LC3	AY395000	29350
SZ3	AY304486	29741	LC2	AY394999	29350
FRA	AY310120	29740	LC1	AY394998	29736
GD01	AY278489	29757	ZS-A	AY394997	29683
TWC	AY321118	29725	ZS-B	AY394996	29683
TWC2	AY362698	29727	HSZ-Cc	AY394995	29765
TWC3	AY362699	29727	HSZ-Bc	AY394994	29765
ZMY1	AY351680	29749	HGZ8L2	AY394993	29736
TWY	AP006561	29727	HZS2-C	AY394992	29736
TWS	AP006560	29727	HZS2-Fc	AY394991	29736
TWK	AP006559	29727	HZS2-E	AY394990	29736
TWJ	AP006558	29725	HZS2-D	AY394989	29736
TWH	AP006557	29727	HZS2-Fb	AY394987	29709
TC3	AY348314	29573	HSZ-Cb	AY394986	29729
TC2	AY338175	29573	HSZ-Bb	AY394985	29530

Table 2: Non-SARS coronavirus sequences

Non-SARS genome	Accession	Length	Non-SARS genome	Accession	Length
Avian	D13096	27608	HCoV-	AY391777	30738
Avian 1	AJ311317	27635	HCoV-	NC_005147	30738
Bovine	U00735	31032	229E	AF304460	27317
Bovine1	AF220295	31100	Murine	AF029248	31357
Porcine	NC_003436	28033	Murine	AF208066	31112
Porcine1	AF353511	28033	Bovine	NC_003045	31028
229E	NC_002645	27317	Avian	NC_001451	27608
Murine	NC_001846	31357	Avian	AY319651	27733
Murine1	AF208067	31233	Bovine	AF391542	31028
Murine2	AF207902	31217	Bovine	AF391541	31028
Murine4	AF029248	31357	Murine	AF201929	31276
Tran S	NC_002306	28586	Trans	AJ271965	28586

The method of the invention is carried out as follows:

1. Data preprocessing

The RNA sequence is encoded so that it is converted into a "0"/"1" sequence. The encoding is: A=00, U=01, G=10, T=11. So that the cells at the two ends of the sequence also take part in the computation, one 0 is padded at each end of the sequence.

2. Selection of the CA rule

The present invention selects rule No. 184 from the 256 "three determine one" rules of the CA method for the evolution. Rule No. 184 is shown in Fig. 1, where a white square represents code 1 and a black square represents code 0, so the rule can also be written numerically as shown in Fig. 2. Take the third case from the left as an example: when the three cells in the upper row are white, black and white, the cell in the next row at the position of the middle cell is white.
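Rule 184 can also be written as a Boolean expression, next = (left AND NOT center) OR (center AND right), which reproduces the eight cases of the rule; the third case, 1 0 1, gives 1 (white). The sketch below uses this form to update a whole row at once with integer bit operations. Packing the row into a Python integer is a design choice made here for speed and is not taken from the patent; `step_184` and `evolve_rows` are hypothetical names.

```python
def step_184(x, n):
    """One rule-184 step on an n-cell row packed into an int.

    The most significant of the n bits is the leftmost cell; cells beyond
    both ends count as 0, matching the zero padding in the text.
    """
    mask = (1 << n) - 1
    left = x >> 1               # left neighbor of every cell
    right = (x << 1) & mask     # right neighbor of every cell
    return ((left & ~x) | (x & right)) & mask

def evolve_rows(bits, steps):
    """Return the first row and `steps` evolutions as lists of 0/1."""
    n = len(bits)
    x = int("".join(map(str, bits)), 2)
    rows = [x]
    for _ in range(steps):
        x = step_184(x, n)
        rows.append(x)
    return [[int(c) for c in format(r, f"0{n}b")] for r in rows]

print(evolve_rows([1, 0, 1], steps=1)[1])  # [0, 1, 0]
```

In the printed row the middle cell is 1, as in the third case of Fig. 2; the right-hand 1 has moved off the edge and the left-hand 1 has moved into the middle.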

3. Application of the CA method

First, the encoded original gene sequence is taken as the initial row, and the corresponding cells of the next row are generated from left to right according to rule 184, giving the second row. The second row is then evolved again according to rule 184. Repeating this step 2400 times yields a "0"/"1" matrix. Note that a 0 must be padded at each end of the sequence so that the computation can proceed. Applying the "three determine one" principle of rule 184 to the one-dimensional "0"/"1" sequence 2400 times thus yields a two-dimensional "0"/"1" matrix of size 2400 × N, where N is the length of the "0"/"1" sequence.
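The matrix size follows directly from the encoding: each nucleotide contributes two bits, so N is twice the genome length. The short sketch below works this out for a 29,727-nucleotide genome (the length listed for Urbani in Table 1). It ignores the padding zeros, and the bit-packed storage estimate is an assumption added for illustration.

```python
def matrix_shape(n_bases, n_rows=2400):
    """Rows and columns of the 0/1 matrix, ignoring the two padding cells."""
    return n_rows, 2 * n_bases   # two bits per nucleotide

rows, cols = matrix_shape(29727)
cells = rows * cols
print(rows, cols)    # 2400 59454
print(cells)         # 142689600
print(cells // 8)    # 17836200 bytes if packed one bit per cell
```

A matrix of roughly 2400 × 59,000 cells is far wider than a screen, which is why the image is scaled down before it is inspected.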

4. Generation of the sequence image

In the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white, and visualization techniques convert the matrix into a black-and-white binary image of about 60K × 2.4K pixels. The advantage of doing so is that it exploits human sensitivity to images, making it easy to find patterns and regularities in the image and to study gene sequences from another angle. Because this image is too large for its features to be analyzed directly, it must be scaled to bring out the graphic features. The present invention transforms all images with the following steps: (1) the horizontal direction is reduced to 1/4 and the vertical direction to 1/3.5; (2) the horizontal direction is reduced again to 1/3.5. Owing to limited computational precision, the overall reduction is to 1/14.007 of the original in the horizontal direction and to 1/2 of the original in the vertical direction. The result is a visualization carrying the features of the complete RNA sequence.

Finally, the relevant regularities are extracted from the image features obtained above. Fig. 3 is the original image of the non-SARS coronavirus 229E, and Fig. 4 is the original image of the SARS coronavirus Sin2774. Figs. 3 and 4 show clearly that the image formed from the SARS gene sequence contains very distinct, relatively large V-shaped intersection regions, whereas the pattern formed from the non-SARS gene sequence has few such features and consists mainly of parallel regions. This prominent feature provides a visual criterion for distinguishing SARS-CoV sequences from non-SARS sequences. Comparing the images of the 66 SARS-CoV sequences and the 24 non-SARS sequences shows that every SARS-CoV image contains 6 V-shaped intersection regions whose positions are consistent, at approximately 84-2483 nt, 3040-5439 nt, 5592-7991 nt, 12050-14449 nt, 16412-18811 nt and 19677-22076 nt. These features are specific to SARS; that is, the 6 V-shaped regions can be regarded as a characteristic of the SARS virus.
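To relate the listed nucleotide ranges to image columns, the sketch below converts each range to column indices of the unpadded "0"/"1" row. The indexing convention (nucleotide positions 1-based and inclusive, columns 0-based, two columns per nucleotide) is an assumption, and `nt_to_columns` is a hypothetical helper.

```python
# V-shaped regions as listed in the text (nucleotide positions, inclusive).
V_REGIONS_NT = [(84, 2483), (3040, 5439), (5592, 7991),
                (12050, 14449), (16412, 18811), (19677, 22076)]

def nt_to_columns(start_nt, end_nt):
    """First and last column (0-based, unpadded row) covered by a range.

    Assumes nucleotide i (1-based) occupies columns 2*(i-1) and 2*(i-1)+1.
    """
    return 2 * (start_nt - 1), 2 * end_nt - 1

widths = [end - start + 1 for start, end in V_REGIONS_NT]
print(widths)  # [2400, 2400, 2400, 2400, 2400, 2400]
for start, end in V_REGIONS_NT:
    print((start, end), "->", nt_to_columns(start, end))
```

Each listed range is 2400 nt wide, the same number as the evolution steps used in step 3. The text does not comment on this; the reported width may reflect the height of the evolved matrix.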

Claims

1. A method for visually extracting features of complete RNA sequences, characterized by comprising the following steps:

1) first, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded so that the RNA sequence is converted into a "0"/"1" sequence, specifically: A=00, U=01, G=10, T=11, and one 0 is padded at each end of the sequence;

2) rule No. 184 is selected as the evolution rule from among the "three determine one" rules of the cellular automaton (CA) method, rule No. 184 being defined as:

\begin{matrix} 111 & 110 & 101 & 100 & 011 & 010 & 001 & 000 \\ 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \end{matrix};

3) a one-dimensional cellular automaton model is introduced for the gene sequence, in which all cells are arranged along a one-dimensional line; the encoded "0"/"1" gene sequence is evolved according to the selected cellular automaton rule, that is, the state of a cell at the next time step is determined by the evolution rule from the current states of the cell and of its two neighbors on the left and right; the encoded original gene sequence is taken as the first row, the result of evolving the first row as the second row, the result of evolving the second row as the third row, and so on; after each evolution the new sequence is padded with one 0 at each end; and after a number of evolution steps a two-dimensional "0"/"1" matrix is formed;

4) in the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white; visualization techniques convert the matrix into a black-and-white binary image; and the image is scaled down in the horizontal and vertical directions to obtain a visualization carrying the features of the complete RNA sequence.