CN103544406B

CN103544406B - Method for detecting DNA sequence similarity using a one-dimensional cellular neural network

Info

Publication number: CN103544406B
Application number: CN201310552472.3A
Authority: CN
Inventors: 纪禄平; 郝德水; 周龙; 黄青君; 尹力; 杨洁
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2016-03-23
Anticipated expiration: 2033-11-08
Also published as: CN103544406A

Abstract

The invention discloses a method for detecting DNA sequence similarity using a one-dimensional cellular neural network. A basic one-dimensional cellular neural network model is designed first, and this model is then used to construct a one-dimensional dual cellular neural network. The network is initialized with the two DNA sequences to be compared; while the network runs, the cell states and outputs at each time step are recorded and assembled into an optimal output matrix. The elements of the optimal output matrix are then traversed to determine the best alignment path. Gaps are inserted into the two sequences according to this path so that the sequences are globally aligned. After alignment, the global similarity is computed from the number of aligned bases and the total number of bases. Comparative tests show that, while maintaining detection accuracy, the invention requires substantially less computing time than existing methods for long DNA sequences.

Description

A method for detecting DNA sequence similarity using a one-dimensional cellular neural network

Technical field

The invention belongs to the field of DNA sequence similarity detection in bioinformatics. More specifically, it relates to a method for detecting DNA sequence similarity using a one-dimensional cellular neural network, used to detect the global similarity of a pair of DNA sequences.

Background art

In the 1970s, the advent of DNA sequencing methods produced large volumes of biomolecular sequence data. These data are growing at a geometric rate, and sequencing has become the field of human activity that produces the largest volume of data. Following the successful mapping of the human genome sequence, genome projects for various animals and plants were launched in succession. Data, however, are not the same as knowledge and information; the task of studying and processing these data grows ever heavier, and efficient methods must be found to handle it.

DNA exists in double-stranded form, with the strands joined by base pairing. Base pairing is specific: base G on one strand always pairs with base C on the other, and base T on one strand pairs with base A on the other. A DNA nucleotide sequence is therefore a character string over these four basic elements, and DNA sequence matching amounts to measuring the similarity between two sequences composed of the characters A, C, G and T. Sequence alignment is the process of finding, by a specific algorithm, the maximum number of matching bases between two or more sequences. Alignment reveals structural or functional similarity between sequences, which is of great practical significance for search algorithms over biological databases and for structure prediction, evolutionary analysis and functional analysis of proteins or DNA.

Depending on the number of biological sequences compared, sequence alignment methods are divided into pairwise alignment and multiple sequence alignment. Pairwise alignment methods fall into three kinds: the dot matrix method, dynamic programming algorithms, and heuristic algorithms (the BLAST algorithm, the FASTA algorithm, etc.). Multiple sequence alignment is an NP-complete problem and remains an open difficulty; its methods can be divided into exact alignment algorithms, iterative alignment algorithms, progressive alignment algorithms, heuristic algorithms, and graph-theory-based alignment algorithms, among others.

Among pairwise alignment methods, the dot matrix method was first proposed by McIntyre and Gibbs in 1970 and is the most basic visual pairwise alignment method. Its advantage is that all possible matches between two sequences can be found directly, but its alignment results are not precise enough and it is suitable only for two short sequences, so it is clearly inadequate for today's very large biological sequence data sets. The basic idea of dynamic programming is to decompose the problem into several subproblems, solve each subproblem, store the subproblem solutions to avoid repeated computation, and finally combine them into the solution of the original problem. Dynamic programming yields the optimal alignment under a given scoring system, but it becomes very slow when the problem is very large, and it is sensitive to parameter choice: small changes to the parameters can change the alignment result considerably. The main dynamic programming algorithms for biological sequence alignment are the Needleman-Wunsch algorithm (NW algorithm), a global alignment algorithm proposed by Needleman and Wunsch in 1970, and the Smith-Waterman algorithm (SW algorithm), proposed by Smith and Waterman in 1981 to find regions of local similarity. Heuristic pairwise alignment algorithms include the FASTA algorithm, first proposed in 1985 and improved in 1988 by Pearson and Lipman, and the BLAST algorithm, proposed in 1990 by Altschul et al.
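For orientation, the dynamic programming idea described above (solve subproblems, store them, combine them) can be sketched as a Needleman-Wunsch style score computation. This is a minimal illustration, not part of the patent: the function name `nw_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen only for the example.

```python
def nw_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming (Needleman-Wunsch style).

    The scoring parameters are illustrative assumptions, not values from the patent.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    # score[i][j] stores the best score for aligning s1[:i] with s2[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # s1 prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # s2 prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in s2
                score[i][j - 1] + gap,       # gap in s1
            )
    return score[-1][-1]
```

The table has (len(s1)+1) × (len(s2)+1) entries, which is why time and memory grow quickly for long sequences.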

When traditional alignment algorithms are applied to pairwise alignment problems with large data volumes, the required time and storage grow exponentially with the number and length of the sequences. Better methods are therefore needed to speed up the search and reduce computing time.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and provide a method for detecting DNA sequence similarity using a one-dimensional cellular neural network, so as to reduce computing time.

To achieve the above object, the method of the invention for detecting DNA sequence similarity using a one-dimensional cellular neural network is characterized by comprising the following steps:

(1) Design the basic one-dimensional cellular neural network model

Single cells are connected in a chain, and the cells are numbered in order as "..., i-1, i, i+1, ...", where the letter i denotes the position index of a cell;

In this basic model, the cell state is described by the following system of differential equations:

\begin{cases} C \frac{\partial x_{i}(t)}{\partial t} = -\frac{x_{i}(t)}{R_{x}} + A \otimes Y_{i}(t) + B \otimes U_{i}(t) + I_{i} \\ y_{i}(t) = f(x_{i}(t)) \end{cases} \quad (1)

where, in equation system (1), t denotes time, x_i denotes the state of cell i, A is the feedback template, B is the control template, I_i, R_x and C are three constants, and f(x_i(t)) is the output modulation function of the cell state; Y_i(t) denotes the neighborhood output matrix of cell i, including the cell itself, and U_i(t) denotes the neighborhood input of cell i, including the cell itself, expressed as:

Y_{i}(t) = \begin{bmatrix} y_{i-1}(t) \\ y_{i}(t) \\ y_{i+1}(t) \end{bmatrix}, \quad U_{i}(t) = \begin{bmatrix} u_{i-1}(t) \\ u_{i}(t) \\ u_{i+1}(t) \end{bmatrix}

y_{i-1}(t), y_i(t) and y_{i+1}(t) denote the outputs of cells i-1, i and i+1, respectively; u_{i-1}, u_i and u_{i+1} denote the inputs that cell i receives from cells i-1, i and i+1; and the symbol ⊗ denotes the matrix convolution operation;

The output modulation function f(x_i(t)) of a cell has the specific form:

y_{i}(t) = f(x_{i}(t)) = \frac{1}{2} \left( |x_{i}(t) + 1| - |x_{i}(t) - 1| \right) \quad (2)
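As a quick check of formula (2), the output modulation function is a piecewise-linear saturation: it returns the state unchanged inside [-1, 1] and clips it to -1 or 1 outside. A minimal sketch follows; the function name `output_function` is chosen here for illustration only.

```python
def output_function(x):
    """Formula (2): f(x) = (|x + 1| - |x - 1|) / 2, a saturation to [-1, 1]."""
    return 0.5 * (abs(x + 1) - abs(x - 1))
```

For example, a state of 0.5 maps to an output of 0.5, while states of 2 and -3 saturate to 1 and -1.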

(2) Build the one-dimensional dual cellular neural network

Using the one-dimensional cellular neural network model designed in step (1), first generate a master subnet CNN1 and a slave subnet CNN2, and then build a one-dimensional dual cellular neural network from the two:

In the one-dimensional dual cellular neural network, the master subnet CNN1 is fixed, while the slave subnet CNN2 can move in parallel along CNN1. Each time the time t increases by 1, CNN2 moves one step, and the distance of each step equals the distance between two adjacent cells of CNN1. CNN1 consists of cells 0, 1, 2, ..., m-1, and CNN2 consists of cells 0, 1, 2, ..., n-1;

In the one-dimensional dual cellular neural network, let C=1 and R_x=1; the differential equation describing the cell state then simplifies to:

x_{i}(t + 1) = \sum_{l \in L(i)} A \otimes Y_{l}(t) + \sum_{l \in L(i)} B \otimes U_{l}(t) + I_{i} \quad (3)

In formula (3), L(i) denotes the neighborhood of cell i across the master subnet CNN1 and the slave subnet CNN2; it comprises cell i itself, the preceding cell i-1 and the following cell i+1 in CNN1, and the cell in CNN2 that corresponds to cell i. The index l denotes the l-th cell in the neighborhood of cell i, i.e. l ∈ L(i);

At time T=t+1, the output y_i(t+1) of cell i is correspondingly redefined as:

y_{i}(t + 1) = f(x_{i}(t + 1)) = \frac{1}{2} \left( |x_{i}(t + 1) + 1| - |x_{i}(t + 1) - 1| \right) \quad (4)

(3) Using the one-dimensional dual cellular neural network built in step (2), globally align the bases of the two DNA sequences whose similarity is to be detected;

3.1) Initialize the dual cellular network

Let the two DNA base sequences to be matched, S₁ and S₂, contain K₁ and K₂ bases respectively, and let the base codes of the sequences be denoted S₁(k₁) and S₂(k₂), with 0≤k₁≤K₁-1 and 0≤k₂≤K₂-1. The numbers of cells in the master subnet CNN1 and the slave subnet CNN2 are then initialized to K₁+1 and K₂+1 respectively, i.e. m=K₁+1 and n=K₂+1;

Let u¹(i) and u²(j) denote the input of the i-th cell of the master subnet CNN1 and the input of the j-th cell of the slave subnet CNN2, with 0≤i≤K₁ and 0≤j≤K₂. The cell inputs of CNN1 and CNN2 are assigned by formula (5) and formula (6), respectively:

where the symbol "*" indicates that the cell input u is set to a null value;

The remaining constant parameter in the master subnet CNN1 is initialized as I_i=2; the feedback template A and control template B used in CNN1 are initialized to the following two constant matrices:

A = [0 1 0] and B = [0 1 -1]

In addition, the initial state and output (at t=0) of each cell i, i=0, 1, ..., K₁, in the master subnet CNN1 are set to x_i(0)=0 and y_i(0)=0. Cell 0 of CNN1 is aligned with cell K₂ of the slave subnet CNN2;
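The initialization of step 3.1) can be sketched as follows. This is only an illustration under stated assumptions: formulas (5) and (6) are not reproduced in this text, so the input assignment used here (cell 0 receives the null symbol "*" and cell i receives base i-1) is inferred from the worked example later in the description; the names `init_inputs` and `init_network` and the dictionary layout are hypothetical.

```python
NULL = "*"  # null input symbol, as used in the description


def init_inputs(seq):
    """Cell 0 gets the null input; cell i (i >= 1) gets base seq[i - 1].

    Assumption: inferred from the worked example, not from formulas (5)/(6).
    """
    return [NULL] + list(seq)


def init_network(s1, s2, bias=2):
    """Set up CNN1 and CNN2 for sequences s1 and s2 (hypothetical layout)."""
    u1 = init_inputs(s1)   # inputs of CNN1, m = K1 + 1 cells
    u2 = init_inputs(s2)   # inputs of CNN2, n = K2 + 1 cells
    m, n = len(u1), len(u2)
    return {
        "m": m,
        "n": n,
        "u1": u1,
        "u2": u2,
        "A": [0, 1, 0],    # feedback template
        "B": [0, 1, -1],   # control template
        "I": bias,         # constant I_i = 2
        "x0": [0] * m,     # initial states x_i(0) = 0
        "y0": [0] * m,     # initial outputs y_i(0) = 0
    }
```

For the fragments AAGCTCTG and CAGCAT this gives m=9 and n=7, matching the cell counts stated for the example.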

3.2) Iteratively compute the state and output of each cell in the master subnet CNN1 at every time step

Each time the time t increases by 1, the slave subnet CNN2 moves one step along the master subnet CNN1 in the direction of increasing cell index;

For the master subnet CNN1, if the cell j_l directly below cell i exists, the three neighborhood cells chosen for cell i are cell i-1, cell i, and the cell j_l of the slave subnet CNN2 directly below cell i. At time t, t=1, 2, ..., m+n-1, when t and the cell index i simultaneously satisfy 1≤t≤m+n-1 and 1≤i≤m+1, the optimal state \bar{x}_i(t) and the optimal output of each cell are computed. If the cell j_l directly below cell i does not exist, the optimal state and optimal output are not computed;

The optimal state and optimal output are computed by the following formulas:
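To make the sliding geometry concrete, the sketch below computes which CNN2 cell lies directly under CNN1 cell i at time t. It rests on an interpretation rather than an explicit formula in the text: CNN2 cell K₂ is assumed to sit under CNN1 cell 0 at t=0, and CNN2 is assumed to shift by exactly one cell per time step toward higher CNN1 indices, which gives j = i + K₂ - t. The function name `cell_below` is hypothetical.

```python
def cell_below(i, t, k2):
    """Index j of the CNN2 cell under CNN1 cell i at time t, or None.

    Assumption: at t = 0, CNN2 cell k2 is under CNN1 cell 0, and CNN2 moves
    one cell per time step in the direction of increasing CNN1 index.
    """
    j = i + k2 - t
    return j if 0 <= j <= k2 else None
```

Under this reading, CNN1 cell 1 has a CNN2 cell beneath it for t=1 to t=n (with n=K₂+1), which agrees with the time range stated for column 1 of the optimal output matrix in step 3.3).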

\bar{x}_{i}(t) = \max \{ x_{i-1}(t-2) + 2 I_{i}, \; x_{i-1}(t-1) - I_{i}, \; x_{i}(t-1) - I_{i} \} \quad (7)

where the function max(...) returns the maximum of its arguments, and x_{i-1}(t-2), x_{i-1}(t-1) and x_i(t-1) are all computed by formula (3);
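Formula (7) selects the largest of three candidate values. The sketch below evaluates it for a single cell and time step, assuming the earlier states are already available in a dictionary keyed by (cell index, time); the name `optimal_state`, the dictionary layout and the sample state values are hypothetical and serve only to exercise the formula.

```python
def optimal_state(x, i, t, bias=2):
    """Formula (7): best of three candidates built from earlier cell states.

    x maps (cell index, time) to a state value; bias is the constant I_i.
    """
    return max(
        x[(i - 1, t - 2)] + 2 * bias,  # x_{i-1}(t-2) + 2 I_i
        x[(i - 1, t - 1)] - bias,      # x_{i-1}(t-1) - I_i
        x[(i, t - 1)] - bias,          # x_i(t-1) - I_i
    )


# Hypothetical state values, chosen only to illustrate the evaluation.
states = {(0, 0): 0, (0, 1): -2, (1, 1): -2}
best = optimal_state(states, i=1, t=2)  # max(0 + 4, -2 - 2, -2 - 2)
```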

3.3) Form the optimal cell output matrix

From the optimal states and optimal outputs of all cells of the master subnet CNN1 at every time step, computed in step 3.2), the final optimal cell output matrix S_y of CNN1 is obtained by taking the optimal outputs of cell 1 from t=1 to t=n as column 1, the optimal outputs of cell 2 from t=2 to t=1+n as column 2, ..., and the optimal outputs of cell m from t=m to t=m+n as column m;

3.4) Globally align the bases of the two DNA sequences

Starting from the element in the upper-left corner, traverse the optimal output matrix S_y obtained in step 3.3) from left to right and from top to bottom, and locate the matrix elements whose value is 1; connecting these elements in order forms the base alignment path P;

According to the base alignment path P, the DNA base sequences S₁ and S₂ are processed in one of three ways. Starting from the first element 1: if the next 1 lies directly below it, insert the symbol "*" at the current position of sequence S₁; if the next 1 lies to its right, insert the symbol "*" at the current position of sequence S₂; if the next 1 lies at its lower right, perform no operation on the current positions of S₁ and S₂.

After the first element of S_y has been processed, the second element is processed according to the same three cases, and so on until every element of value 1 in the output matrix S_y has been processed; at that point sequences S₁ and S₂ are globally aligned in base order;
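The three-case rule of step 3.4) can be sketched as a small traversal routine. This is one possible reading, not a definitive implementation: it assumes that the 1-elements, collected row by row, form a single path whose consecutive steps are "down", "right" or "lower right", and that an inserted "*" is paired with the next unconsumed base of the other sequence. The names `path_from_matrix` and `align` are hypothetical; the matrix is the S_y of Table 2 in the worked example.

```python
GAP = "*"


def path_from_matrix(s_y):
    """Positions of all 1-elements, scanned left to right, top to bottom."""
    return [(r, c) for r, row in enumerate(s_y) for c, v in enumerate(row) if v == 1]


def align(s1, s2, s_y):
    """Insert gaps into s1 and s2 by following the path of 1s in s_y."""
    path = path_from_matrix(s_y)
    a1, a2 = [], []
    p1 = p2 = 0  # next unconsumed base in s1 / s2
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)
        if step == (1, 0):        # next 1 directly below: gap in S1
            a1.append(GAP)
            a2.append(s2[p2])
            p2 += 1
        elif step == (0, 1):      # next 1 to the right: gap in S2
            a1.append(s1[p1])
            a2.append(GAP)
            p1 += 1
        elif step == (1, 1):      # next 1 at lower right: no gap
            a1.append(s1[p1])
            a2.append(s2[p2])
            p1 += 1
            p2 += 1
        else:
            raise ValueError("path step is not one of the three cases")
    return "".join(a1), "".join(a2)


# Optimal output matrix of the worked example (Table 2): 7 rows, 9 columns.
S_Y = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
]
aligned_1, aligned_2 = align("AAGCTCTG", "CAGCAT", S_Y)
```

With the fragments of Table 1, this sketch returns `*AAGC*TCTG` and `CA*GCAT***`, two aligned strings of equal length with four matching positions.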

(4) Compute the global similarity of the two DNA base sequences S₁ and S₂

Let the global similarity of sequences S₁ and S₂ be SC(S₁, S₂); the global similarity of the two DNA base sequences is computed by the following formula:

SC(S_{1}, S_{2}) = \frac{2 \times N_{match}}{Len(S_{1}) + Len(S_{2})} \times 100\% \quad (9)

where N_match denotes the number of successfully matched base pairs after global alignment of the two DNA base sequences S₁ and S₂, and Len(S₁) and Len(S₂) denote the actual lengths of S₁ and S₂, respectively.

The object of the invention is achieved as follows:

In the method of the invention for detecting DNA sequence similarity using a one-dimensional cellular neural network, a basic one-dimensional cellular neural network model is designed first, and this model is then used to construct a one-dimensional dual cellular neural network. The network is initialized with the two DNA sequences to be compared; while the network runs, the cell states and outputs at each time step are recorded and assembled into an optimal output matrix. The elements of the optimal output matrix are then traversed to determine the best alignment path. Gaps are inserted into the two sequences according to this path so that the sequences are globally aligned. After alignment, the global similarity is computed from the number of aligned bases and the total number of bases. Comparative tests show that, while maintaining detection accuracy, the invention requires substantially less computing time than existing methods for long DNA sequences.
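Formula (9) can be checked numerically with a short sketch. The names `count_matches` and `global_similarity` are hypothetical, and the aligned pair used below is an assumed illustration obtained by applying the three-case rule of step 3.4) to the worked example; only N_match=4 and the lengths 8 and 6 are stated in the description.

```python
def count_matches(a1, a2, gap="*"):
    """Number of aligned positions holding the same base (gaps never match)."""
    return sum(1 for x, y in zip(a1, a2) if x == y and x != gap)


def global_similarity(n_match, len1, len2):
    """Formula (9): SC = 2 * N_match / (Len(S1) + Len(S2)) * 100 (percent)."""
    return 2 * n_match / (len1 + len2) * 100


# Assumed aligned pair for the fragments AAGCTCTG (length 8) and CAGCAT (length 6).
n_match = count_matches("*AAGC*TCTG", "CA*GCAT***")
similarity = global_similarity(n_match, 8, 6)
```

Here `n_match` is 4 and `similarity` rounds to 57.14, the percentage reported for the worked example.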

Brief description of the drawings

Fig. 1 is a schematic diagram of the basic one-dimensional cellular neural network model of the invention;

Fig. 2 is a structural diagram of a single cell in the basic one-dimensional cellular neural network model shown in Fig. 1;

Fig. 3 is a schematic diagram of the one-dimensional dual cellular neural network constructed by the invention;

Fig. 4 is a flow chart of the global base alignment in the invention;

Fig. 5 shows the relative positions of the two subnets of the invention in the initial state (t=0);

Fig. 6 shows the concatenation rule for the cell state matrix and output matrix of the invention;

Fig. 7 compares the "total number of bases vs. computing time" curves of the three methods.

Embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the invention.

In this embodiment, single cells are connected in a chain, and the resulting basic one-dimensional cellular neural network model is shown in Fig. 1. The cells are numbered in order as "..., i-1, i, i+1, ...", where the letter i denotes the position index of a cell.

The structure of a single cell in the basic one-dimensional cellular neural network model is shown in Fig. 2, where x_i denotes the state of cell i; y_i, y_{i-1} and y_{i+1} denote the feedback outputs that cell i receives from its neighborhood cells i, i-1 and i+1, respectively; u_i, u_{i-1} and u_{i+1} denote the cell inputs that cell i receives from its neighborhood cells i, i-1 and i+1; A is the feedback template; B is the control template; I_i, R_x and C are three constants; and f(x_i) is the output modulation function of the cell state.

In this embodiment, based on the one-dimensional cellular neural network model shown in Fig. 1, a master subnet CNN1 and a slave subnet CNN2 are first generated, and the two are then used to build the one-dimensional dual cellular neural network shown in Fig. 3.

As shown in Fig. 3, in the one-dimensional dual cellular neural network the master subnet CNN1 is fixed, while the slave subnet CNN2 can move in parallel along CNN1. CNN2 starts moving at t=1; each time t increases by 1, CNN2 moves one step, and the distance of each step equals the distance between two adjacent cells of CNN1. CNN1 consists of cells 0, 1, 2, ..., m-1; with i denoting the cell index, 0≤i≤m-1. CNN2 consists of cells 0, 1, 2, ..., n-1; with j denoting the cell index, 0≤j≤n-1. In Fig. 3, the characters u and y (whose subscripts in the figure are cell indices) again denote the inputs and outputs of the cells; solid arrows indicate that a connection input exists between cells, and dotted arrows indicate that no connection input exists between cells.

As shown in Fig. 3, in formula (3), L(i) denotes the neighborhood of cell i across the master subnet CNN1 and the slave subnet CNN2; it comprises cell i itself, the preceding cell i-1 and the following cell i+1 in CNN1, and the cell in CNN2 that corresponds to cell i. These are the four cells shaded black in Fig. 3. The index l denotes the l-th cell in the neighborhood of cell i, i.e. l ∈ L(i);

In this embodiment, the one-dimensional dual cellular neural network built in step (2) is used to globally align the bases of the two DNA sequences whose similarity is to be detected; the corresponding flow is shown in Fig. 4. The process of generating the optimal output matrix from the outputs at each time step and performing the global alignment is identical to that described in the summary of the invention and is not repeated here.

In this process, when the dual cellular network is initialized, the relationship between the master subnet CNN1 and the slave subnet CNN2 is as shown in Fig. 5: cell 0 of CNN1 is aligned with cell K₂ of CNN2.

The states and outputs of all cells of the master subnet CNN1 at every time step, computed in step 3.2), are concatenated in the order shown in Fig. 6 to obtain the final optimal cell output matrix S_y of CNN1.

The concatenation order of the optimal cell output matrix S_y is:

Example

The specific implementation of the invention is further described below using two DNA base sequences that actually exist in the NCBI database.

The identifiers of the selected DNA base sequences are S62051 and NM_008134. For ease of presentation, only one fragment of each sequence is used to illustrate the procedure; the details of the two sequence fragments are shown in Table 1:

DNA base sequence	Number of bases	Sequence fragment code
S₁	8	AAGCTCTG
S₂	6	CAGCAT

Table 1

The two DNA sequence fragments S₁ and S₂ shown in Table 1 contain K₁=8 and K₂=6 bases, respectively. S₁ and S₂ are aligned by the four sub-steps of step (3), as follows:

1. Following step 3.1) of step (3), initialize the designed one-dimensional dual cellular neural network:

m=8+1=9 and n=6+1=7; in CNN1, C=1, R_x=1 and I_i=2.

u¹(0) = *, u¹(1) = S₁(0) = A, u¹(2) = S₁(1) = A, ..., u¹(8) = S₁(7) = G;

u²(0) = *, u²(1) = S₂(0) = C, u²(2) = S₂(1) = A, ..., u²(6) = S₂(5) = T;

2. Following step 3.2) of step (3), run the dual cellular neural network iteratively and compute, according to formulas (7) and (8), the optimal state and optimal output of each cell in the master subnet CNN1 at times t=1, 2, 3, ..., 14;

3. Following step 3.3) of step (3), concatenate the optimal cell outputs of each time step to form the optimal output matrix S_y; the resulting S_y is shown in Table 2.

1	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0
0	1	1	0	0	0	0	0	0
0	0	0	1	0	0	0	0	0
0	0	0	0	1	0	0	0	0
0	0	0	0	1	0	0	0	0
0	0	0	0	0	1	1	1	1

Table 2

4. Following step 3.4) of step (3), globally align the bases of the DNA base sequences S₁ and S₂; the two base sequences after global alignment are shown in Table 3.

S₁	*	A	A	G	C	*	T	C	T	G
S₂	C	A	*	G	C	A	T	*	*	*

Table 3

5. Using formula (9) of step (4), compute the global similarity of sequences S₁ and S₂. From the global alignment result (shown in Table 3), N_match=4, Len(S₁)=8 and Len(S₂)=6, so the global similarity of the two sequences is SC(S₁, S₂) = (2 × 4) ÷ (8 + 6) × 100% = 57.14%.

In this embodiment, the method of the invention was also verified on a large number of real DNA sequences from the NCBI database and compared with the prior-art MILP method and SPA method. The main indicators compared were the computing time of each scheme and the global sequence similarity it obtained; the detailed comparison is shown in Table 4 and Table 5.

Table 4

Table 5

Here, Table 4 compares the sequence similarity obtained by the invention and by the prior art, and Table 5 compares the computing time of the invention and of the prior art (unit: milliseconds).

As shown in Tables 4 and 5, the sum of the lengths of the two DNA base sequences increases from left to right across the columns; the two tables show, respectively, the similarity computed by each method and the computing time required for 6 pairs of DNA base sequences from the NCBI database. The similarity data in Table 4 show that for short sequence pairs (such as S62051, length 226, and NM_008134, length 625), the method of the invention obtains the same similarity as the other two methods, whereas as the total sequence length increases (such as NG_009301, length 42028, and NM_000405, length 3690), the similarity obtained by the method of the invention is slightly higher than that of the other two methods. The main reason is that, when aligning DNA sequences, the method of the invention obtains more aligned base pairs than the other two methods. The computing-time data in Table 5 likewise show that for short sequence pairs (such as S62051, length 226, and NM_008134, length 625) the computing times of the three methods do not differ significantly: the method of the invention takes only about 10 milliseconds less than the other two. As the total sequence length increases (such as NG_009301, length 42028, and NM_000405, length 3690), however, the computing time required by the invention is about 33% of that of the SPA method and only about 45% of that of the MILP method.

To show how computing time varies with the total sequence length, the "total number of bases vs. computing time" curve of each of the three methods was plotted. The comparison curves in Fig. 7 show that when the total length of the two sequences is small (for example, less than 5000), the three curves essentially coincide, meaning that the computing times of the three methods do not differ significantly. As the total length of the two DNA sequences continues to increase, the time curves of the SPA and MILP methods climb steeply; the time curve of the invention also climbs, but much more gently than the other two. It follows that when the total length of the two sequences is large, the computing time required by the invention is substantially lower than that of the SPA and MILP methods.

Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A method for detecting DNA sequence similarity using a one-dimensional cellular neural network, comprising the following steps:

(1) Design the basic one-dimensional cellular neural network model

In this basic model, the cell state is described by the following system of differential equations:

The output modulation function f(x_i(t)) of a cell has the specific form:

(2) Build the one-dimensional dual cellular neural network

3.1) Initialize the dual cellular network

where the symbol "*" indicates that the cell input u is set to a null value;

A = [0 1 0] and B = [0 1 -1];

3.3) Form the optimal cell output matrix

3.4) Globally align the bases of the two DNA sequences

According to the base alignment path P, the DNA base sequences S₁ and S₂ are processed in one of three ways. Starting from the first element 1: if the next 1 lies directly below it, insert the symbol "*" at the current position of sequence S₁; if the next 1 lies to its right, insert the symbol "*" at the current position of sequence S₂; if the next 1 lies at its lower right, perform no operation on the current positions of S₁ and S₂;

(4) Compute the global similarity of the two DNA base sequences S₁ and S₂