CN114550827B

CN114550827B - Gene sequence comparison method and system

Info

Publication number: CN114550827B
Application number: CN202210044384.1A
Authority: CN
Inventors: 张庆科; 李天奇; 汪玉成; 高昊; 卜降龙; 来明旭; 张化祥
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-11-22
Anticipated expiration: 2042-01-14
Also published as: CN114550827A

Abstract

The invention provides a gene sequence comparison method and a system, comprising the following steps: coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until the termination condition is met. The randomness of parameter search of the hidden Markov model is enhanced, the solution is prevented from falling into local optimum, and the solution precision is improved when multiple sequences are compared.

Description

Gene sequence comparison method and system

Technical Field

The invention belongs to the technical field of sequence comparison, and particularly relates to a gene sequence comparison method and system.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

In recent years, biological science and technology have developed rapidly, and analyzing and interpreting the information implicit in biological databases has become a serious challenge. Sequence alignment reveals the information carried by biological sequences and has been widely used to identify related DNA and protein sequences. Sequence alignment has been studied for decades, and a large number of methods have been proposed. One class is the alignment algorithm based on dynamic programming, but it consumes so much time and space that it cannot solve practical problems. Another class is the progressive alignment algorithm, but it tends to fall into a local optimum that cannot be corrected afterwards.

To overcome the drawbacks of the above two types of algorithms, iterative algorithms for multiple sequence alignment have emerged. Iterative alignment algorithms mainly refer to swarm intelligence algorithms built on the collective behavior of organisms, such as the particle swarm algorithm, the genetic algorithm and the artificial bee colony algorithm. The Artificial Bee Colony (ABC) algorithm is a swarm intelligence algorithm based on the honey-collecting behavior of bee colonies. It has few control parameters and is easy to implement; in recent years it has attracted the attention of a growing number of scholars, has been improved repeatedly, and has been applied successfully to optimization problems in many fields. However, further research on the ABC algorithm has found that its probability selection mechanism in the follower bee stage fails in the later iterations of the population, so that the algorithm converges slowly in the later iterations and yields solutions of low accuracy.

Disclosure of Invention

In order to solve the technical problems in the background art, the invention provides a gene sequence comparison method and a gene sequence comparison system, which enhance the randomness of the parameter search of a hidden Markov model, keep the solution from being trapped in a local optimum, and improve the solution precision in multiple sequence comparison.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a method for gene sequence alignment, comprising:

obtaining a plurality of gene sequences;

coding parameters of the hidden Markov model into honey sources, and for each gene sequence, adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met.

Further, the performing difference learning between different populations includes:

for a certain honey source, acquiring global optimal solutions of all the honey sources, and randomly selecting one solution in different populations respectively;

calculating different difference values based on the global optimal solution and the random selected solution;

weighting and summing the different difference values, and then summing the weighted sum with the honey source to obtain leading bees;

and if the fitness value of the leading bee is larger than that of the honey source, replacing the honey source with the leading bee.

weighting and summing the different difference values and the honey source to obtain follower bees;

and if the fitness value of the follower bee is larger than that of the honey source, replacing the honey source with the follower bee.

and for a certain honey source, if the iteration failure times of the honey source reach the set times, generating a plurality of new solutions by adopting a plurality of new solution generation modes, and selecting the optimal solution from the plurality of new solutions according to a greedy selection strategy to replace the honey source.

Further, the new solution generation method is as follows: a new solution is randomly generated.

Further, the new solution generation method is as follows:

selecting a population with the maximum fitness value from a plurality of populations as a base population, and taking the rest populations as auxiliary populations;

randomly selecting a honey source from the basic population as a basic honey source;

randomly selecting a honey source in each auxiliary population as an auxiliary honey source;

and calculating the difference between the auxiliary honey sources, multiplying the difference by a random number, and summing the sum with the basic honey source to obtain a new solution.

Further, the new solution generation method is as follows:

selecting a minimum value, a maximum value and a global optimal solution from all honey sources;

and calculating the difference between the minimum value plus the maximum value and the global optimal solution to obtain a new solution.

In a second aspect of the present invention, there is provided a gene sequence alignment system comprising:

a gene sequence acquisition module configured to: obtaining a plurality of gene sequences;

a gene sequence alignment module configured to: coding parameters of the hidden Markov model into honey sources, and for each gene sequence, adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until the termination condition is met.

A third aspect of the present invention provides a computer readable storage medium, on which a computer program is stored, which program, when executed by a processor, implements the steps in a method for gene sequence alignment as described above.

A fourth aspect of the present invention provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the gene sequence alignment method.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a gene sequence comparison method that optimizes the parameters of a hidden Markov model with an artificial bee colony algorithm based on layered learning, which can avoid the risk of the algorithm being trapped in a local optimum, accelerates convergence and improves the solution precision in multiple sequence comparison.

The invention provides a gene sequence comparison method that constructs a new hierarchical ring topology based on the artificial bee colony algorithm with layered learning, so that populations at different levels can perform difference learning. The search strategies of the leading bee and follower bee stages are thereby improved, and the global optimization capability and the search capability of the algorithm are enhanced. The defect that the ABC algorithm tends to stagnate in the later iterations because of its probability selection mechanism is avoided, the convergence speed is increased, and the accuracy of the solution is improved. Solutions in three different directions are generated in the bee detection stage, which enhances the randomness of the search and keeps the solution from being trapped in a local optimum.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.

FIG. 1 is a flow chart of a method according to a first embodiment of the present invention;

FIG. 2 (a) is a graph showing the convergence of the 1ad2_ref1 gene sequence according to the first embodiment of the present invention;

FIG. 2 (b) is a graph showing the convergence of the 1ivy_ref5 gene sequence according to the first embodiment of the present invention;

FIG. 2 (c) is a graph showing the convergence of the 451c_ref1 gene sequence according to the first embodiment of the present invention;

FIG. 2 (d) is a graph showing the convergence of the kinase_ref1 gene sequence according to the first embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Example one

This example provides a gene sequence alignment method, as shown in fig. 1, which specifically includes the following steps:

step 1, obtaining a plurality of gene sequences;

step 2, initialization: this includes the initialization of parameters and the generation of initial honey sources. The initialized parameters comprise the population size SN (the number of honey sources), the individual dimension D, the threshold limit, the maximum iteration number MCN, the maximum evaluation number MFE, and the maximum value $UB_j$ and minimum value $LB_j$ of each dimension. SN initial honey sources are generated randomly through equation (1):

$$x_{i,j} = LB_j + \mathrm{rand}(0,1)\cdot(UB_j - LB_j) \tag{1}$$

wherein $x_{i,j}$ denotes the j-th dimension of the i-th honey source (individual), $i = 1, 2, 3, \ldots, SN$, $j = 1, 2, 3, \ldots, D$, $[LB_j, UB_j]$ denotes the value range of the j-th dimension variable, and rand(0,1) denotes a random number between 0 and 1. Each honey source $X_i$ represents the parameters of one hidden Markov model, and D is the number of parameters of the hidden Markov model.
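As an illustration of equation (1), the following minimal sketch draws SN honey sources uniformly inside the per-dimension bounds. The function name `init_honey_sources`, the example bounds and the use of Python's `random` module are assumptions made for this sketch and do not come from the patent text.

```python
import random

def init_honey_sources(sn, lb, ub, rng):
    """Return sn honey sources; entry [i][j] follows equation (1).

    lb and ub are per-dimension lower and upper bounds (LB_j, UB_j).
    """
    if len(lb) != len(ub):
        raise ValueError("lb and ub must have the same length")
    # rng.random() plays the role of rand(0, 1), drawn independently per entry.
    return [
        [lb[j] + rng.random() * (ub[j] - lb[j]) for j in range(len(lb))]
        for _ in range(sn)
    ]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sources = init_honey_sources(sn=10, lb=[0.0, 0.0, 0.0], ub=[1.0, 1.0, 2.0], rng=rng)
```

Each row of `sources` would then be decoded into one set of hidden Markov model parameters; how the D values map onto transition and emission probabilities is not specified here.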

Step 3, coding parameters of the hidden Markov models into honey sources, and for each gene sequence, adopting the hidden Markov models corresponding to all the honey sources to obtain a plurality of hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met or not (wherein the termination condition is that the iteration number reaches the maximum iteration number MCN), if so, comparing every two hidden state sequences of all gene sequences obtained by the hidden Markov model corresponding to the honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence (namely, the gene sequence corresponding to the hidden state sequence with the maximum similarity); otherwise, based on the fitness value of each honey source, dividing all the honey sources into a plurality of populations, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met, namely executing steps 301-304:

Step 301, obtaining the hidden state sequences of all gene sequences based on each honey source (the parameters of a hidden Markov model), and calculating the fitness value $fit(X_i)$ of each individual according to formula (2). The value of $fit(X_i)$ is the SPS (sum-of-pairs score) of the hidden state sequences obtained by the hidden Markov model under honey source $X_i$.

In formula (2), $l_i$ denotes the i-th aligned hidden state sequence, $l_j$ denotes the j-th hidden state sequence to be compared, and D is a function representing the similarity of two sequences. In actual operation a similarity score matrix is used to calculate D, so $D(l_i, l_j)$ is generally the substitution score corresponding to the residues of $l_i$ and $l_j$. The higher the SPS score, the more accurate the alignment of the gene sequences.
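Formula (2) is not reproduced in this text, so the sketch below assumes the commonly used sum-of-pairs form, in which $D(l_i, l_j)$ is summed over all unordered pairs $i < j$. The match/mismatch scoring in `pair_score` is a hypothetical stand-in for the similarity score matrix, and the state strings are made-up examples.

```python
from itertools import combinations

def pair_score(seq_a, seq_b, match=1, mismatch=0):
    """Hypothetical D: position-wise score of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))

def sum_of_pairs(sequences, score=pair_score):
    """Sum D(l_i, l_j) over every unordered pair of sequences (assumed form)."""
    return sum(score(a, b) for a, b in combinations(sequences, 2))

# Three made-up hidden state sequences (M = match, I = insert, D = delete).
states = ["MMID", "MMDD", "MIID"]
sps = sum_of_pairs(states)  # pair scores 3 + 3 + 2
```

In practice `pair_score` would look up substitution scores in a similarity matrix instead of counting identical positions.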

According to the constructed hierarchical ring topology, the population is divided by fitness value, in a layered learning manner, into a first population S1, a second population S2 and a third population S3. The sizes of S1, S2 and S3 sum to SN and are in the ratio 1 : 7 : 2. The fitness value of each individual in S1 is greater than that of each individual in S2, and the fitness value of each individual in S2 is greater than that of each individual in S3.

The idea of layered learning is as follows: within the whole population, each outer-layer population learns from the differences of the inner-layer populations, so that the whole population keeps moving toward better solutions. Meanwhile, the innermost population learns toward the global optimum in order to find a better solution.
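The 1 : 7 : 2 split by fitness can be sketched as follows. The function name `split_into_layers` and the rounding rule for population sizes that are not multiples of 10 are assumptions; the text only gives the ratio and the ordering of the layers.

```python
def split_into_layers(fitness, ratios=(1, 7, 2)):
    """Return index lists (S1, S2, S3), ordered from highest to lowest fitness.

    Assumes at least three honey sources so that no layer is empty.
    """
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    total = sum(ratios)
    # Rounding is an assumption of this sketch; S2 absorbs any remainder.
    n1 = max(1, round(len(order) * ratios[0] / total))
    n3 = max(1, round(len(order) * ratios[2] / total))
    n2 = len(order) - n1 - n3
    return order[:n1], order[n1:n1 + n2], order[n1 + n2:]

fitness = [5.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0, 10.0]
s1, s2, s3 = split_into_layers(fitness)
```

With ten honey sources this yields one index in S1, seven in S2 and two in S3, and every fitness in an inner layer is at least as high as every fitness in the next layer out.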

Step 302, leading bee stage: for a certain honey source, acquiring the global optimal solution of all the honey sources, and randomly selecting one solution from each of the different populations; calculating different difference values based on the global optimal solution and the randomly selected solutions; weighting and summing the different difference values, and then adding the weighted sum to the honey source to obtain a leading bee; and if the fitness value of the leading bee is larger than that of the honey source, replacing the honey source with the leading bee. Specifically, based on the idea of layered learning, the search equation of the leading bee stage is improved: the leading bees no longer perform a neighborhood search around a single honey source only, but perform difference learning between different population levels according to formula (3) to obtain a high-quality solution.

wherein $v_{i,j}$ is the newly generated solution, $x_{i,j}$ is the j-th dimension of the i-th honey source, $\phi_{i,j}$ is a random number in [-1, 1], the second coefficient of formula (3) is a random number in [0, 1.5], $x_{gbest,j}$ is the j-th dimension of the global optimal solution, and $x_{S1,j}$, $x_{S2,j}$, $x_{S3,j}$ are three solutions randomly selected from the three layers S1, S2 and S3 respectively.

The fitness value (SPS value) $fit(V_i)$ of the newly generated solution (also written new_fit) is calculated according to formula (2). If it is greater than the SPS value of the current individual, i.e., $fit(X_i) < fit(V_i)$, the new individual replaces the current leading bee individual and $trail_i = 0$; otherwise, $X_i$ is retained and $trail_i = trail_i + 1$.
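The greedy replacement and the failure counter described above can be sketched as below. The helper `greedy_update` and the toy fitness function are hypothetical; in the method itself the fitness would be the SPS value from formula (2), and the candidate would come from formula (3), which is not reproduced in this text.

```python
def greedy_update(sources, fitness, trail, i, candidate, evaluate):
    """Keep the candidate only if it scores strictly higher than source i.

    On success trail[i] is reset to 0; on failure it is incremented by 1.
    """
    cand_fit = evaluate(candidate)
    if cand_fit > fitness[i]:
        sources[i] = candidate
        fitness[i] = cand_fit
        trail[i] = 0
        return True
    trail[i] += 1
    return False

def toy_fitness(x):
    """Stand-in for the SPS value: larger is better, best at the origin."""
    return -sum(v * v for v in x)

sources = [[1.0, 1.0]]
fitness = [toy_fitness(sources[0])]
trail = [3]
accepted = greedy_update(sources, fitness, trail, 0, [0.5, 0.5], toy_fitness)
rejected = greedy_update(sources, fitness, trail, 0, [2.0, 2.0], toy_fitness)
```

The first candidate improves the fitness and resets the counter; the second is worse, so the honey source is kept and the counter rises to 1.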

Step 303, follower bee stage: for a certain honey source, acquiring the global optimal solution of all the honey sources, and randomly selecting one solution from each of the different populations; calculating different difference values based on the global optimal solution and the randomly selected solutions; weighting and summing the different difference values and the honey source to obtain a follower bee; and if the fitness value of the follower bee is greater than that of the honey source, replacing the honey source with the follower bee. Specifically, in the follower bee stage, three elite improvement operators $r_1$, $r_2$, $r_3$ are introduced between the three different layers, and the original probability selection mechanism is replaced by layered learning:

$$v_{i,j} = r_1\cdot x_{i,j} + r_2\cdot(x_{S2,j} - x_{S3,j}) + r_3\cdot(x_{gbest,j} - x_{S1,j}) \tag{4}$$

wherein $r_1$, $r_2$, $r_3$ are three random numbers in [0, 1] with $r_1 + r_2 + r_3 = 1$, $x_{gbest,j}$ is the j-th dimension of the global optimal solution, and $x_{S1,j}$, $x_{S2,j}$, $x_{S3,j}$ are three solutions randomly selected from the three layers S1, S2 and S3 respectively.

The SPS value of the newly generated solution is calculated according to formula (2). If it is greater than the SPS value of the current individual, the new individual replaces the current individual and $trail_i = 0$; otherwise, $X_i$ is retained and $trail_i = trail_i + 1$.
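A minimal sketch of the follower bee update in formula (4) follows. The text requires only that $r_1$, $r_2$, $r_3$ lie in [0, 1] and sum to 1; drawing three uniform numbers and normalising them, and using one weight triple per candidate rather than per dimension, are assumptions of this sketch, as are the function names.

```python
import random

def simplex_weights(rng):
    """Draw r1, r2, r3 in [0, 1] that sum to 1 (normalised uniforms, assumed)."""
    u = [1.0 - rng.random() for _ in range(3)]  # each value lies in (0, 1]
    total = sum(u)
    return tuple(v / total for v in u)

def follower_candidate(x_i, x_s1, x_s2, x_s3, x_gbest, weights):
    """Apply formula (4) dimension by dimension."""
    r1, r2, r3 = weights
    return [
        r1 * x_i[j] + r2 * (x_s2[j] - x_s3[j]) + r3 * (x_gbest[j] - x_s1[j])
        for j in range(len(x_i))
    ]

rng = random.Random(1)
w = simplex_weights(rng)
# With x_s2 == x_s3 and x_gbest == x_s1 both difference terms vanish,
# so the candidate reduces to r1 * x_i.
v = follower_candidate([2.0, 4.0], [1.0, 1.0], [0.5, 0.5], [0.5, 0.5], [1.0, 1.0], w)
```

The candidate `v` would then be scored with formula (2) and kept only if it beats the current honey source.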

Step 304, detection bee (scout bee) stage: for a certain honey source, if the number of iteration failures of the honey source reaches the set number, several new solutions are generated using several new-solution generation modes, and the best of these new solutions is selected by a greedy selection strategy to replace the honey source. Specifically, in the bee detection stage, a method based on layered learning and opposition-based learning is introduced: when the iteration failure count $trail_i$ reaches the set limit, three different solutions are generated instead of using the original single new-solution generation mode.

Randomly generating a new solution: the first solution $m_1$ is still generated according to formula (1).

Selecting the population with the maximum fitness value from the plurality of populations as the base population, and taking the remaining populations as auxiliary populations; randomly selecting a honey source from the base population as the base honey source; randomly selecting a honey source from each auxiliary population as an auxiliary honey source; and calculating the difference between the auxiliary honey sources, multiplying the difference by a random number, and adding the product to the base honey source to obtain a new solution. That is, the second solution $m_2$ is generated, following the idea of layered learning, according to the following formula:

$$m_2 = x_{S1} + \phi_{i,j}\cdot(x_{S2} - x_{S3}) \tag{5}$$

wherein $x_{S1,j}$, $x_{S2,j}$, $x_{S3,j}$ are three solutions randomly selected from the three layers S1, S2 and S3 respectively, and $\phi_{i,j}$ is a random number in [-1, 1].

Selecting a minimum value, a maximum value and a global optimal solution from all honey sources; and subtracting the global optimal solution from the sum of the minimum value and the maximum value to obtain a new solution. That is, the third solution $m_3$ follows the idea of opposition-based learning: a solution is sought on the opposite side of the global optimal solution to keep the search from being trapped in a local optimum, and it is generated by the following formula:

$$m_3 = LB + UB - x_{gbest} \tag{6}$$

wherein LB and UB are respectively the minimum value and the maximum value of the solution, and $x_{gbest}$ is the globally optimal solution.

The fitness values of the three newly generated solutions are calculated by formula (2), and the best of them is selected as the new solution according to the greedy selection strategy.
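The three candidate generators of the detection bee stage, formulas (1), (5) and (6), and the greedy choice among them can be sketched as below. The function names, the per-dimension draw of the random factor in formula (5), the toy fitness function and the example vectors are all assumptions; the sketch also does not clip the second candidate back into the bounds, which the text does not address.

```python
import random

def scout_candidates(lb, ub, x_s1, x_s2, x_s3, x_gbest, rng):
    """Return (m1, m2, m3) following formulas (1), (5) and (6)."""
    dims = range(len(lb))
    # m1: fresh random solution inside the bounds, formula (1).
    m1 = [lb[j] + rng.random() * (ub[j] - lb[j]) for j in dims]
    # m2: base source from S1 plus a random multiple of (S2 - S3), formula (5).
    m2 = [x_s1[j] + rng.uniform(-1.0, 1.0) * (x_s2[j] - x_s3[j]) for j in dims]
    # m3: point opposite the global best within the bounds, formula (6).
    m3 = [lb[j] + ub[j] - x_gbest[j] for j in dims]
    return m1, m2, m3

def pick_best(candidates, evaluate):
    """Greedy selection: the candidate with the highest fitness wins."""
    return max(candidates, key=evaluate)

def toy_fitness(x):
    """Stand-in for the SPS value; its peak at 0.8 is arbitrary."""
    return -sum((v - 0.8) ** 2 for v in x)

rng = random.Random(7)
m1, m2, m3 = scout_candidates(
    lb=[0.0, 0.0], ub=[1.0, 1.0],
    x_s1=[0.6, 0.6], x_s2=[0.5, 0.4], x_s3=[0.1, 0.2],
    x_gbest=[0.2, 0.9], rng=rng,
)
replacement = pick_best([m1, m2, m3], toy_fitness)
```

Here `replacement` is the candidate that would take the place of the exhausted honey source.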

In this example, 4 groups of test experiments were performed on the gene sequence sets 1ad2_ref1, 1ivy_ref5, 451c_ref1 and kinase_ref1, and each set of sequences was aligned with both ABC and the method of the invention. In the experiments, the ABC algorithm and the algorithm of the invention were run under the same experimental conditions; each test case was run independently 10 times for 1000 iterations, and the maximum, minimum and average values were recorded.

TABLE 1 accuracy of multiple sequence alignment test results

The results of the algorithm of the invention are significantly higher than those of the ABC algorithm in terms of the average, best and worst values alike, which demonstrates the superiority of the algorithm of the invention. To show its performance more fully, in addition to the precision results in Table 1, convergence curves of the algorithm of the invention (HLABC) and of ABC are also given. As shown in FIG. 2 (a), 2 (b), 2 (c) and 2 (d), the horizontal axis represents the number of iterations (Iteration) and the vertical axis represents the average SPS value (Score). It can be concluded that the invention can avoid the risk of the algorithm being trapped in a local optimum, accelerates convergence and improves the precision of the solution in multiple sequence comparison.

The invention provides a novel ring topology structure based on the idea of layered learning, improves the original search strategy so as to improve the randomness of search, replaces the original probability selection mechanism with the layered learning method, and improves the optimizing capability and the convergence speed of the algorithm, thereby overcoming the defects of the original ABC algorithm, achieving the optimization effect of the ABC algorithm and improving the solution precision of multi-sequence comparison.

Example two

The embodiment provides a gene sequence comparison system, which specifically comprises the following modules:

a gene sequence alignment module configured to: coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met.

It should be noted that, each module in the present embodiment corresponds to each step in the first embodiment one to one, and the specific implementation process is the same, which is not described herein again.

Example three

This embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in a method for gene sequence alignment as described in the first embodiment above.

Example four

This embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the gene sequence alignment method according to the first embodiment above.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of gene sequence alignment comprising:

obtaining a plurality of gene sequences;

coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met;

wherein the fitness value is the SPS of the hidden state sequences obtained by the hidden Markov model under honey source $X_i$, in which $l_i$ denotes the i-th aligned hidden state sequence, $l_j$ denotes the j-th hidden state sequence to be compared, and D is a function representing the similarity of the two sequences;

the differential learning between different populations comprises:

for a certain honey source, acquiring global optimal solutions of all the honey sources, and randomly selecting one solution in different populations respectively; calculating different difference values based on the global optimal solution and the random selected solution; weighting and summing the different difference values, and then summing the weighted sum with the honey source to obtain leading bees; if the fitness value of the leading bee is larger than that of the honey source, replacing the honey source with the leading bee;

for a certain honey source, acquiring global optimal solutions of all the honey sources, and randomly selecting one solution in different populations respectively; calculating different difference values based on the global optimal solution and the random selected solution; weighting and summing the different difference values and the honey source to obtain follower bees; if the fitness value of the follower bees is larger than that of the honey source, replacing the honey source with the follower bees;

2. The method of claim 1, wherein the new solution is generated by: a new solution is randomly generated.

3. The method of claim 1, wherein the new solution is generated by:

4. The method of claim 1, wherein the new solution is generated by:

5. A system for aligning gene sequences, comprising:

a gene sequence alignment module configured to: coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by the hidden Markov model corresponding to the honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met;

wherein $l_i$ denotes the i-th aligned hidden state sequence, $l_j$ denotes the j-th hidden state sequence to be compared, and D is a function representing the similarity of the two sequences;

the differential learning between different populations comprises:

for a certain honey source, acquiring global optimal solutions of all the honey sources, and randomly selecting one solution from different populations respectively; calculating different difference values based on the global optimal solution and the random selected solution; weighting and summing the different difference values, and then summing the weighted sum with the honey source to obtain leading bees; if the fitness value of the leading bee is larger than that of the honey source, replacing the honey source with the leading bee;

6. A computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, carries out the steps of a method of gene sequence alignment according to any one of claims 1 to 4.

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the steps of a method of gene sequence alignment according to any one of claims 1-4.