CN1257974C - RNA sequential characteristic visual extracting method - Google Patents

RNA sequential characteristic visual extracting method

Publication number
CN1257974C
Authority
CN
China
Prior art keywords
sequence
rule
rna
cellular
evolution
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200410025035
Other languages
Chinese (zh)
Other versions
CN1584027A (en)
Inventor
王猛
黄振德
杨杰
刘国平
徐志节
姚莉秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2004-06-10
Filing date
2004-06-10
Publication date
2006-05-31
Application filed by Shanghai Jiaotong University
Priority to CN 200410025035
Publication of CN1584027A
Application granted
Publication of CN1257974C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for visually extracting the characteristics of a complete RNA sequence. It comprises four main steps: data preprocessing, rule selection, application of the cellular automaton (CA) method, and sequence image generation. Each nucleotide 'A', 'T', 'G' and 'U' in the obtained RNA sequence is encoded; a one-dimensional cellular automaton model is introduced; and a specific cellular automaton rule is used to evolve the encoded '0'/'1' gene sequence, that is, the state of a cell at the next time step is determined, according to the evolution rule, by the current states of the cell and of its left and right neighbours. After a number of evolution steps a two-dimensional '0'/'1' matrix is formed; this matrix is converted into a black-and-white image and scaled down to obtain a visual image carrying the characteristics of the complete RNA sequence. The method offers complete-sequence analysis, visual intuitiveness, sensitivity and universality, and the characteristics of different gene sequences can be read from the generated sequence images.

Description

Method for visually extracting the features of a complete RNA sequence
Technical field
The present invention is a method for visually extracting the features of a complete RNA sequence. It draws on image processing, pattern recognition and conventional gene sequence analysis. Unlike conventional RNA sequence alignment analysis, it reflects the characteristics of a gene sequence in a comparatively graphic way.
Background art
The 21st century is the century of biology. After the Human Genome Project was completed, biologists turned more of their attention to gene sequence analysis. A considerable part of conventional gene sequence analysis is done by comparing gene sequences, and conventional comparison works mainly by aligning genes and comparing bases one by one. A typical approach uses mature software such as BLAST (http://www.ncbi.nlm.nih.gov/BLAST), which readily reveals deletions, insertions and mutations of bases. Although this approach finds gene variation very simply, its results are not intuitive. Other gene sequence analysis methods, such as protein secondary structure prediction [Kuo-Chen Chou, 2000, Prediction of Protein Structural Classes and Subcellular Locations, Current Protein and Peptide Science, 2000], analyze the possible function of part of a gene through specific structures. These methods in turn lean too heavily toward local function.
In the 1950s, the computer pioneer and famous mathematician von Neumann hoped to realize on a computer, by means of a specific program, something resembling the self-replication of cells during the growth of an organism [Wolfram, S. 2002. A New Kind of Science. Wolfram Media Inc., Champaign, IL]. He proposed a simple model: a rectangular plane is divided into a number of grid cells, each grid point represents a cell or a primitive element of the system, and its state is assigned 0 or 1, shown in the grid as an empty or a filled square. Under rules set in advance, the evolution of the cells or primitives is described by the changes between filled and empty squares. Such a model is a cellular automaton. The in-depth studies of S. Wolfram then fully demonstrated the great power of cellular automata (CA) to simulate complex systems with simple rules [Wolfram, S. 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Cellular automata provide a simple model for physics, biology and computer science; it is precisely the "repeated computation" of such simple models that yields discrete models able to simulate complex systems. The method is very effective at simulating complex systems with simple rules, but it has not been applied to biological sequence analysis. For a system as complex as a gene sequence, using the CA method to visualize it, and then analyzing the generated image to obtain the features that different gene sequences have, is a new research topic.
Summary of the invention
The object of the invention is to address the shortcomings of conventional gene sequence analysis, namely that gene variation results are not intuitive and that functional analysis is not comprehensive, by providing a method for visually extracting the features of a complete RNA sequence. The features of different gene sequences can be obtained from the generated visual image of the gene, and these sequence features can then be analyzed and used in medical research.
To this end, the cellular-automaton-based method of the invention comprises four main steps: data preprocessing, rule selection, application of the cellular automaton (CA) method, and sequence image generation. First, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded. A one-dimensional cellular automaton (CA) model is then introduced, and a specific cellular automaton rule is used to evolve the encoded "0"/"1" gene sequence: the state of a cell at the next time step is determined, according to the evolution rule, by the current states of the cell itself and of its left and right neighbours. After a number of evolution steps a two-dimensional "0"/"1" matrix is formed; the matrix is converted into a black-and-white image and scaled, giving a visualization that carries the features of the complete RNA sequence.
The method of the invention is carried out in the following specific steps:
1. Data preprocessing
First, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded, converting the RNA sequence into a "0"/"1" sequence. Specifically: A=00, U=01, G=10, T=11, and one 0 is padded at each end of the sequence.
If the RNA sequence were processed directly as the original string of A, T, G, U characters, the amount of computation would be very large. Encoding the RNA sequence as a 0/1 sequence reduces it considerably. After the nucleotide sequence is encoded in this way, the new sequence is exactly twice as long as the original. So that the cells at the two ends of the sequence also take part in the computation, one 0 is padded at each end.
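The encoding step can be sketched in a few lines of Python. This is a minimal illustration, not the inventors' implementation: the function name `encode` and the choice to reject any symbol outside the stated table are assumptions, while the table itself (A=00, U=01, G=10, T=11) and the single 0 padded at each end are taken literally from the text.

```python
# Encoding table exactly as stated in the text: A=00, U=01, G=10, T=11.
CODE = {"A": (0, 0), "U": (0, 1), "G": (1, 0), "T": (1, 1)}


def encode(seq):
    """Return the 0/1 list for seq, with one 0 padded at each end.

    Hypothetical helper: symbols outside the stated table raise ValueError,
    because the text does not say how they should be handled.
    """
    bits = [0]                      # left padding
    for base in seq.upper():
        if base not in CODE:
            raise ValueError(f"symbol not in the encoding table: {base!r}")
        bits.extend(CODE[base])     # two bits per nucleotide
    bits.append(0)                  # right padding
    return bits


print(encode("AUGT"))  # [0, 0, 0, 0, 1, 1, 0, 1, 1, 0]
```

Four nucleotides give eight bits plus the two padding zeros, which matches the statement that the encoded sequence is twice as long as the original.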
2. Selection of the cellular automaton (CA) rule
For the encoded "0"/"1" sequence, the rule with the best discriminating power is selected as the evolution rule from among the CA rules in which three cells determine one cell (the current states of a cell and its two neighbours fix the cell's next state).
There are 256 such rules in total, so all of them must be compared for the case at hand and the one with the best discriminating power selected for the evolution. The invention mainly uses rule No. 184, which is defined as follows:
Current states (left, centre, right): 111 110 101 100 011 010 001 000
Next state of the centre cell:        1   0   1   1   1   0   0   0
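The rule number itself encodes this table. Under Wolfram's numbering convention for one-dimensional, two-state, nearest-neighbour rules, bit k of the rule number is the next state for the neighbourhood whose three cells read as the binary number k. The sketch below (the name `rule_table` is illustrative) expands 184 into its lookup table, which agrees with the definition of rule No. 184 given in claim 1.

```python
def rule_table(number):
    """Expand a rule number (0-255) into a neighbourhood -> next-state table.

    Keys are (left, centre, right) tuples; values are 0 or 1.
    """
    if not 0 <= number <= 255:
        raise ValueError("rule number must be between 0 and 255")
    table = {}
    for value in range(8):
        # Read the neighbourhood as a 3-bit binary number: left is the high bit.
        neighbourhood = ((value >> 2) & 1, (value >> 1) & 1, value & 1)
        table[neighbourhood] = (number >> value) & 1
    return table


RULE_184 = rule_table(184)
for neighbourhood in sorted(RULE_184, reverse=True):
    print("".join(map(str, neighbourhood)), "->", RULE_184[neighbourhood])
```

Calling `rule_table` with any other number from 0 to 255 gives the table for that rule, which is convenient when comparing candidate rules for discriminating power.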
3. Application of the cellular automaton (CA)
The invention introduces a one-dimensional cellular automaton model for gene sequences. In this model all cells are arranged along a one-dimensional line. The encoded "0"/"1" gene sequence is evolved according to the selected cellular automaton rule: the state of a cell at the next time step is determined, according to the evolution rule, by the current states of the cell itself and of its left and right neighbours.
The encoded original gene sequence is taken as the first row; the result of evolving the first row becomes the second row, the result of evolving the second row becomes the third row, and so on. Note that zero padding is needed not only at the two ends of the first row: each new sequence produced by an evolution step is also padded with one 0 at each end, ready for the next evolution step. After a number of evolution steps a two-dimensional "0"/"1" matrix is formed.
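The row-by-row evolution can be sketched as follows. This is one reading of the description, under stated assumptions: each row is padded with a 0 at both ends, the rule is applied to every original cell, and the result has the same length as the input row. The names `step` and `evolve` are illustrative, and whether the first row counts toward the total number of rows is not stated in the text, so `evolve` simply returns the first row followed by `steps` evolved rows.

```python
# Rule 184 as a lookup table: (left, centre, right) -> next state of the centre.
RULE_184 = {
    (1, 1, 1): 1, (1, 1, 0): 0, (1, 0, 1): 1, (1, 0, 0): 1,
    (0, 1, 1): 1, (0, 1, 0): 0, (0, 0, 1): 0, (0, 0, 0): 0,
}


def step(row, table=RULE_184):
    """Evolve one row: pad one 0 at each end, then update every original cell."""
    padded = [0] + list(row) + [0]
    return [table[(padded[i - 1], padded[i], padded[i + 1])]
            for i in range(1, len(padded) - 1)]


def evolve(first_row, steps, table=RULE_184):
    """Return the matrix whose first row is first_row, followed by `steps` rows."""
    rows = [list(first_row)]
    for _ in range(steps):
        rows.append(step(rows[-1], table))
    return rows


for row in evolve([1, 1, 0, 1, 0, 0], 2):
    print(row)
# [1, 1, 0, 1, 0, 0]
# [1, 0, 1, 0, 1, 0]
# [0, 1, 0, 1, 0, 1]
```

In this small example each 1 moves one cell to the right whenever the cell to its right is 0, which is the familiar behaviour of rule 184.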
4. Generation of the sequence image
In the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white; using visualization techniques, the matrix is converted into a black-and-white binary image. Because this image is too large for its characteristics to be analyzed directly, the invention shrinks it in the horizontal and vertical directions to obtain a visualization carrying the features of the complete RNA sequence.
For the gene sequences collected for the invention, it is preferable to find complete gene sequences that can be compared with one another, which can generally be done. When selecting a rule, a segment of acceptable size, for example 3000 bases, can be taken from the complete gene sequence. The selected rule is then applied to evolve the gene sequence, and the features and regularities of different gene sequences can be sought in the generated visual sequence images.
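Step 4 (matrix to binary image, then shrinking) can be sketched with the standard library alone. Two points here are assumptions rather than anything the text specifies: the shrinking uses nearest-neighbour sampling, since the resampling method is not named, and the image is serialised as plain PBM (the Netpbm "P1" format) purely for illustration. PBM treats 1 as black, so the bits are inverted to keep the text's convention of 0 = black and 1 = white. The names `shrink` and `to_pbm` are hypothetical.

```python
def shrink(matrix, fx, fy):
    """Nearest-neighbour shrink by factor fx horizontally and fy vertically.

    Factors may be fractional (for example 3.5) but must be at least 1.
    """
    if fx < 1 or fy < 1:
        raise ValueError("shrink factors must be at least 1")
    height, width = len(matrix), len(matrix[0])
    new_h = max(1, int(height / fy))
    new_w = max(1, int(width / fx))
    return [[matrix[int(r * fy)][int(c * fx)] for c in range(new_w)]
            for r in range(new_h)]


def to_pbm(matrix):
    """Serialise a 0/1 matrix as plain PBM text.

    PBM uses 1 for black, so each bit is inverted (0 = black, 1 = white here).
    """
    lines = ["P1", f"{len(matrix[0])} {len(matrix)}"]
    for row in matrix:
        lines.append(" ".join(str(1 - bit) for bit in row))
    return "\n".join(lines) + "\n"


matrix = [
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
small = shrink(matrix, 2, 2)
print(small)          # [[0, 0], [0, 1]]
print(to_pbm(small), end="")
```

Writing the returned string to a file with a `.pbm` extension gives an image that common viewers can open. Averaging blocks of cells instead of sampling one cell per block would be an equally plausible way to shrink the image.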
Compared with conventional sequence alignment, the method of the invention offers complete-sequence analysis, intuitiveness, sensitivity and universality. First, the method analyzes the complete sequence: it can take into account long-range interactions within a sequence and reveal the essential arrangement and combination features of the sequence, whereas conventional sequence analysis can only give the position and content of mutation points by alignment and cannot give the compositional features of a sequence. Second, the method converts a sequence into a two-dimensional image and exploits the sensitivity of human vision to images in order to find features in the generated image, whereas conventional methods analyze the one-dimensional sequence directly, which is plainly an abstract and tedious process. Third, the method is sensitive to a small number of mutation points in a sequence and can amplify the differences between sequences. Finally, analysis of a large number of virus sequences shows that the method can distinguish different classes of virus by choosing different rules; that is, the method has universality.
Description of drawings
Fig. 1 is a schematic diagram of rule 184 among the cellular automaton (CA) rules.
From left to right, Fig. 1 shows the eight possible combinations of three adjacent cells in the upper row of a "0"/"1" sequence, together with the value taken at the corresponding position in the next row.
Fig. 2 is the numerical representation of rule 184 corresponding to Fig. 1.
Its meaning is the same as that of Fig. 1, except that the digits 1 and 0 represent white and black respectively.
Fig. 3 is the original image for coronavirus 229E (non-SARS).
Fig. 4 is the original image for coronavirus Sin2774 (SARS).
Embodiment
The technical solution of the invention is further described below with reference to the drawings and an embodiment.
The invention takes the analysis of SARS virus sequences as an example to illustrate a specific embodiment. SARS, in full Severe Acute Respiratory Syndrome, is an acute respiratory infectious disease caused by a coronavirus. The RNA sequences of 66 different SARS viruses were downloaded from the NCBI website; each sequence is about 29,700 long. These SARS virus sequences are visualized and analyzed to find the essential features of SARS sequences and to see how they differ from non-SARS coronavirus sequences, so that the sequence features of the SARS virus can be put to use. Table 1 lists the RNA sequences of the SARS viruses and Table 2 lists those of the non-SARS coronaviruses.
Table 1: SARS virus sequences
SARS        Accession  Length   SARS        Accession  Length
BJ01        AY278488   29725    TC1         AY338174   29573
BJ02        AY278487   29745    HSR1        AY323977   29751
BJ03        AY278490   29740    Frankfurt1  AY291315   29727
BJ04        AY279354   29732    AS          AY427439   29711
GZ01        AY278489   29757    CUHK-       AY345986   29736
ZJ01        AY297028   29714    CUHK-       AY345987   29736
HKU39849    AY278491   29742    CUHK-       AY345988   29736
CUHK W1     AY278554   29736    GD69        AY313906   29754
CUHK Su10   AY282752   29736    PUMC01      AY350750   29738
Sin2500     AY283794   29711    PUMC02      AY357075   29738
Sin2677     AY283795   29705    PUMC03      AY357076   29745
Sin2679     AY283796   29711    Sino1-11    AY485277   29741
Sin2748     AY283797   29705    Sino3-11    AY485278   29740
Sin2774     AY283798   29729    SoD         AY461660   29715
TW1         AY291451   29714    GZ02        AY390556   29760
Urbani      AY278741   29727    ZS-C        AY395003   29647
Tor2        NC_004718  29751    LC5         AY395002   29350
GZ50        AY304495   29720    LC4         AY395001   29350
SZ16        AY304488   29731    LC3         AY395000   29350
SZ3         AY304486   29741    LC2         AY394999   29350
FRA         AY310120   29740    LC1         AY394998   29736
GD01        AY278489   29757    ZS-A        AY394997   29683
TWC         AY321118   29725    ZS-B        AY394996   29683
TWC2        AY362698   29727    HSZ-Cc      AY394995   29765
TWC3        AY362699   29727    HSZ-Bc      AY394994   29765
ZMY1        AY351680   29749    HGZ8L2      AY394993   29736
TWY         AP006561   29727    HZS2-C      AY394992   29736
TWS         AP006560   29727    HZS2-Fc     AY394991   29736
TWK         AP006559   29727    HZS2-E      AY394990   29736
TWJ         AP006558   29725    HZS2-D      AY394989   29736
TWH         AP006557   29727    HZS2-Fb     AY394987   29709
TC3         AY348314   29573    HSZ-Cb      AY394986   29729
TC2         AY338175   29573    HSZ-Bb      AY394985   29530
Table 2: Non-SARS coronaviruses
Non-SARS genome     Accession  Length   Non-SARS genome     Accession  Length
D13096 Avian        D13096     27608    AY391777 HCoV-      AY391777   30738
AJ311317 Avian 1    AJ311317   27635    NC_005147 HCoV-     NC_005147  30738
U00735 Bovine       U00735     31032    AF304460 229E       AF304460   27317
AF220295 Bovine1    AF220295   31100    AF029248 Murine     AF029248   31357
NC_003436 Porcine   NC_003436  28033    AF208066 Murine     AF208066   31112
AF353511 Porcine1   AF353511   28033    NC_003045 Bovine    NC_003045  31028
NC_002645 229E      NC_002645  27317    NC_001451 Avian     NC_001451  27608
NC_001846 Murine    NC_001846  31357    AY319651 Avian      AY319651   27733
AF208067 Murine1    AF208067   31233    AF391542 Bovine     AF391542   31028
AF207902 Murine2    AF207902   31217    AF391541 Bovine     AF391541   31028
AF029248 Murine4    AF029248   31357    AF201929 Murine     AF201929   31276
NC_002306 Tran S    NC_002306  28586    AJ271965 Trans      AJ271965   28586
The method of the invention is carried out as follows:
1. Data preprocessing
The RNA sequence is encoded, converting it into a "0"/"1" sequence. The encoding is: A=00, U=01, G=10, T=11. So that the cells at the two ends of the sequence also take part in the computation, one 0 is padded at each end of the sequence.
2. Selection of the CA rule
From the 256 CA rules in which three cells determine one, the invention selects rule No. 184 for the evolution. Rule 184 is shown in Fig. 1, where a white square represents the code 1 and a black square the code 0, so the rule can also be written numerically as shown in Fig. 2. Take the third case from the left as an example: when the three cells in the upper row are white, black, white, the position in the next row corresponding to the middle cell takes white.
3. Application of the CA method
The original gene sequence is first taken as the initial row, and the corresponding points of the next row are generated one by one from left to right according to rule 184, giving the second row. The second row is then evolved again according to rule 184. Repeating this step 2400 times yields a "0"/"1" matrix. Note that a 0 must be padded on each side of the sequence so that the computation can be carried out. Applying the three-cells-determine-one principle of rule 184 to the one-dimensional "0"/"1" sequence 2400 times thus gives a two-dimensional "0"/"1" matrix of size 2400*N, where N is the length of the "0"/"1" sequence.
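To get a feel for the size of this matrix, the sketch below works out its dimensions for a sequence of about 29,700 nucleotides, the approximate length quoted for the SARS sequences. It assumes two bits per nucleotide and ignores the two padding zeros, since the text does not say whether they count toward N; the name `matrix_shape` is illustrative.

```python
def matrix_shape(n_bases, steps=2400, bits_per_base=2):
    """Return (rows, columns) of the 0/1 matrix for a sequence of n_bases."""
    return steps, bits_per_base * n_bases


rows, cols = matrix_shape(29700)
cells = rows * cols
print(rows, cols, cells)   # 2400 59400 142560000
print(cells // 8)          # bytes needed at one bit per cell: 17820000
```

Roughly 60,000 columns by 2,400 rows is far too large to inspect directly on screen, which is why the image has to be scaled down before analysis.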
4. Generation of the sequence image
In the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white; using visualization techniques, the matrix is converted into a black-and-white binary image about 60K × 2.4K in size. The advantage of this is that it exploits people's sensitivity to images, making it easy to find patterns and regularities in the image and to study gene sequences from another angle. Because the image is too large for its characteristics to be analyzed directly, it must also be scaled in order to find graphical features. The invention transforms all images with the following steps: (1) shrink the horizontal direction to 1/4 and the vertical direction to 1/3.5; (2) shrink the horizontal direction again to 1/3.5. Owing to limited computational precision, the overall reduction is to 1/14.007 of the original horizontally and to 1/2 of the original vertically. The result is a visualization carrying the features of the complete RNA sequence.
Finally, rules are extracted from the image features obtained above. Fig. 3 is the original image for the non-SARS coronavirus 229E and Fig. 4 the original image for the SARS coronavirus Sin2774. Figs. 3 and 4 show clearly that the image formed from a SARS gene sequence contains very conspicuous, fairly large V-shaped crossing regions, whereas the pattern formed from a non-SARS gene sequence has few such features and consists mainly of parallel regions. This salient feature gives a visual criterion for distinguishing SARS-CoV sequences from non-SARS sequences. Comparing the images of the 66 SARS-CoV and 24 non-SARS sequences shows that every SARS-CoV image contains six V-shaped crossing regions whose positions are consistent, at approximately 84-2483 nt, 3040-5439 nt, 5592-7991 nt, 12050-14449 nt, 16412-18811 nt and 19677-22076 nt. These features are peculiar to SARS; that is, the six V-shaped regions can be regarded as a signature of the SARS virus.
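The six quoted regions can be recorded as data for later lookups. The sketch below only restates the coordinates given above; the names `V_REGIONS` and `region_of` are hypothetical. One thing that can be checked directly from the coordinates is the width of each region: counted inclusively, every region spans 2400 nt, the same number as the evolution steps used in the embodiment. The text does not comment on this, so it is offered as an arithmetic observation only.

```python
# Approximate V-shaped crossing regions (start, end) in nucleotides, as quoted.
V_REGIONS = [
    (84, 2483), (3040, 5439), (5592, 7991),
    (12050, 14449), (16412, 18811), (19677, 22076),
]


def region_of(position):
    """Return the 1-based index of the region containing position, or None."""
    for index, (start, end) in enumerate(V_REGIONS, start=1):
        if start <= position <= end:
            return index
    return None


print([end - start + 1 for start, end in V_REGIONS])
# [2400, 2400, 2400, 2400, 2400, 2400]
print(region_of(5000), region_of(9000))  # 2 None
```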

Claims (1)

1. A method for visually extracting the features of a complete RNA sequence, characterized by comprising the following specific steps:
1) first, each nucleotide "A", "T", "G", "U" in the obtained RNA sequence is encoded, converting the RNA sequence into a "0"/"1" sequence, specifically: A=00, U=01, G=10, T=11, and one 0 is padded at each end of the sequence;
2) rule No. 184 is selected as the evolution rule from among the cellular automaton (CA) rules in which three cells determine one, rule No. 184 being defined as:
111→1, 110→0, 101→1, 100→1, 011→1, 010→0, 001→0, 000→0;
3) a one-dimensional cellular automaton model is introduced for the gene sequence, in which all cells are arranged along a one-dimensional line; the encoded "0"/"1" gene sequence is evolved according to the selected cellular automaton rule, that is, the state of a cell at the next time step is determined, according to the evolution rule, by the current states of the cell itself and of its left and right neighbours; the encoded original gene sequence is taken as the first row, the result of evolving the first row as the second row, the result of evolving the second row as the third row, and so on; each new sequence produced by an evolution step is padded with one 0 at each end; after a number of evolution steps a two-dimensional "0"/"1" matrix is formed;
4) in the two-dimensional "0"/"1" matrix, "0" is defined as black and "1" as white; using visualization techniques, the matrix is converted into a black-and-white binary image, and the image is shrunk in the horizontal and vertical directions to obtain a visualization carrying the features of the complete RNA sequence.
CN 200410025035 2004-06-10 2004-06-10 RNA sequential characteristic visual extracting method Expired - Fee Related CN1257974C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200410025035 CN1257974C (en) 2004-06-10 2004-06-10 RNA sequential characteristic visual extracting method

Publications (2)

Publication Number Publication Date
CN1584027A (en) 2005-02-23
CN1257974C (en) 2006-05-31

Family

ID=34601109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200410025035 Expired - Fee Related CN1257974C (en) 2004-06-10 2004-06-10 RNA sequential characteristic visual extracting method

Country Status (1)

Country Link
CN (1) CN1257974C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122371B (en) * 2010-12-15 2014-08-06 西安交通大学 Two-dimensional visualization encryption method for genetic information based on iteration function
CN102546158B (en) * 2011-12-22 2014-05-07 河海大学 Block encryption method based on parity cellular automaton
CN102708308A (en) * 2012-03-31 2012-10-03 常熟市支塘镇新盛技术咨询服务有限公司 Method for realizing visualization of DNA (deoxyribonucleic acid) sequences
CN106295245B (en) * 2016-07-27 2019-08-30 广州麦仑信息科技有限公司 Method of the storehouse noise reduction based on Caffe from coding gene information feature extraction
CN107679551B (en) * 2017-09-11 2020-06-16 电子科技大学 Identification method of emergence phenomenon based on fractal

Also Published As

Publication number Publication date
CN1584027A (en) 2005-02-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060531