CN114550827A

CN114550827A - Gene sequence comparison method and system

Info

Publication number: CN114550827A
Application number: CN202210044384.1A
Authority: CN
Inventors: 张庆科; 李天奇; 汪玉成; 高昊; 卜降龙; 来明旭; 张化祥
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-05-27
Anticipated expiration: 2042-01-14
Also published as: CN114550827B

Abstract

The invention provides a gene sequence alignment method and system. The method comprises: encoding the parameters of a hidden Markov model as honey sources, and using the hidden Markov models corresponding to all the honey sources to obtain the hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met; if so, comparing pairwise the hidden state sequences of all gene sequences obtained by the hidden Markov model corresponding to the honey source with the maximum fitness value, to obtain the gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among the populations to optimize the parameters of the hidden Markov model until the termination condition is met. The method increases the randomness of the parameter search for the hidden Markov model, keeps the solution from becoming trapped in a local optimum, and improves solution accuracy in multiple sequence alignment.

Description

Gene sequence comparison method and system

Technical Field

The invention belongs to the technical field of sequence comparison, and particularly relates to a gene sequence comparison method and system.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

In recent years, biological science and technology have developed rapidly, and how to analyze and interpret the information implicit in biological databases has become a serious challenge. Sequence alignment reveals the information carried by biological sequences and has been widely used to identify related DNA and protein sequences. Sequence alignment has been studied for decades, and a large number of methods have been proposed. One class is alignment algorithms based on dynamic programming, which consume so much time and space that they cannot solve practical problems. Another class is progressive alignment algorithms, which tend to fall into local optima and cannot correct an error once it has been made.

To overcome the drawbacks of these two classes of algorithms, iterative algorithms for multiple sequence alignment have emerged. Iterative alignment algorithms are mainly swarm intelligence algorithms modeled on the collective behavior of organisms, such as the particle swarm algorithm, the genetic algorithm and the artificial bee colony algorithm. The Artificial Bee Colony (ABC) algorithm is a swarm intelligence algorithm based on the honey-collecting behavior of bee colonies. It has few control parameters and is easy to implement; in recent years it has attracted the attention of a growing number of researchers, has been improved repeatedly, and has been applied successfully to optimization problems in many fields. However, further research on the ABC algorithm has found that the probability selection mechanism of the follower bee stage fails in the later iterations of the population, so that the algorithm converges slowly and yields low solution accuracy late in the run.

Disclosure of Invention

To solve the technical problems described in the background, the invention provides a gene sequence alignment method and system that increase the randomness of the parameter search for the hidden Markov model, keep the solution from becoming trapped in a local optimum, and improve solution accuracy in multiple sequence alignment.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a gene sequence alignment method, comprising:

obtaining a plurality of gene sequences;

coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by the hidden Markov model corresponding to the honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met.

Further, the performing difference learning between different populations includes:

for a certain honey source, acquiring global optimal solutions of all the honey sources, and randomly selecting one solution in different populations respectively;

calculating different difference values based on the global optimal solution and the random selected solution;

weighting and summing the different difference values, and then summing the weighted sum with the honey source to obtain a leading bee;

and if the fitness value of the leading bee is larger than that of the honey source, replacing the honey source with the leading bee.

weighting and summing the different difference values and the honey source to obtain follower bees;

and if the fitness value of the follower bee is larger than that of the honey source, replacing the honey source with the follower bee.

and for a certain honey source, if the iteration failure times of the honey source reach the set times, generating a plurality of new solutions by adopting a plurality of new solution generation modes, and selecting the optimal solution from the plurality of new solutions according to a greedy selection strategy to replace the honey source.

Further, the new solution generation method is as follows: a new solution is randomly generated.

Further, the new solution generation method is as follows:

selecting the population with the maximum fitness value from the plurality of populations as a base population, and taking the remaining populations as auxiliary populations;

randomly selecting a honey source in the base population as a base honey source;

randomly selecting a honey source in each auxiliary population as an auxiliary honey source;

and calculating the difference between the auxiliary honey sources, multiplying the difference by a random number, and adding the product to the base honey source to obtain a new solution.

Further, the new solution generation method is as follows:

selecting a minimum value, a maximum value and a global optimal solution from all honey sources;

and subtracting the global optimal solution from the sum of the minimum value and the maximum value to obtain a new solution.

In a second aspect, the present invention provides a gene sequence alignment system, comprising:

a gene sequence acquisition module configured to: obtaining a plurality of gene sequences;

a gene sequence alignment module configured to: coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met.

A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a method of gene sequence alignment as described above.

A fourth aspect of the present invention provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the gene sequence alignment method.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a gene sequence alignment method that optimizes the parameters of a hidden Markov model with an artificial bee colony algorithm based on hierarchical learning; this keeps the algorithm from becoming trapped in a local optimum, speeds up convergence, and improves solution accuracy in multiple sequence alignment.

The invention provides a gene sequence alignment method in which the artificial bee colony algorithm based on hierarchical learning constructs a new hierarchical ring topology, so that populations at different levels can learn from the differences between one another. The search strategies of the two stages are thereby improved, which strengthens the global optimization capability and the search capability of the algorithm. The tendency of the ABC algorithm to stagnate late in the iteration because of its probability selection mechanism is avoided, convergence is faster, and the solution is more accurate. In the scout bee stage, solutions are generated in three different directions, which increases the randomness of the search and keeps the solution from becoming trapped in a local optimum.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.

FIG. 1 is a flow chart of a method according to a first embodiment of the present invention;

FIG. 2(a) is a graph showing the convergence of the 1ad2_ref1 gene sequence in the first embodiment of the present invention;

FIG. 2(b) is a graph showing the convergence of the 1ivy_ref5 gene sequence in the first embodiment of the present invention;

FIG. 2(c) is a graph showing the convergence of the 451c_ref1 gene sequence in the first embodiment of the present invention;

FIG. 2(d) is a graph showing the convergence of the kinase_ref1 gene sequence in the first embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Example one

This example provides a gene sequence alignment method, as shown in fig. 1, which specifically includes the following steps:

step 1, obtaining a plurality of gene sequences;

Step 2, initialization: this includes initializing the parameters and generating the initial honey sources. The initialized parameters comprise the population size SN (which is also the number of honey sources), the individual dimension D, the threshold limit, the maximum number of iterations MCN, the maximum number of evaluations MFE, and the maximum value $UB_j$ and minimum value $LB_j$ of each dimension. The SN initial honey sources are generated randomly by equation (1):

$$x_{i,j} = LB_j + \mathrm{rand}(0,1)\cdot(UB_j - LB_j) \tag{1}$$

where $x_{i,j}$ denotes the j-th component of the i-th honey source (individual), $i = 1, 2, \ldots, SN$, $j = 1, 2, \ldots, D$; $[LB_j, UB_j]$ denotes the value range of the j-th variable; and rand(0, 1) denotes a random number between 0 and 1. Each honey source $X_i$ represents one set of parameters of the hidden Markov model, and D is the number of parameters of the hidden Markov model.
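The initialization in equation (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `init_honey_sources`, the optional `rng` argument and the example bounds are assumptions introduced here.

```python
import random

def init_honey_sources(sn, lb, ub, rng=None):
    """Generate sn honey sources with x_ij = LB_j + rand(0,1) * (UB_j - LB_j)."""
    rng = rng or random.Random()
    d = len(lb)  # D: number of HMM parameters encoded in one honey source
    return [
        [lb[j] + rng.random() * (ub[j] - lb[j]) for j in range(d)]
        for _ in range(sn)
    ]

# Hypothetical setting: SN = 10 honey sources, D = 4 parameters, each in [0, 1].
sources = init_honey_sources(sn=10, lb=[0.0] * 4, ub=[1.0] * 4, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing repeated experiments.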

Step 3, coding parameters of the hidden Markov models into honey sources, and for each gene sequence, adopting the hidden Markov models corresponding to all the honey sources to obtain a plurality of hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met (wherein the termination condition is that the iteration times reach the maximum iteration times MCN), if so, pairwise comparing hidden state sequences of all gene sequences obtained by the hidden Markov model corresponding to the honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence (namely, a gene sequence corresponding to the hidden state sequence with the maximum similarity); otherwise, based on the fitness value of each honey source, dividing all the honey sources into a plurality of populations, and performing difference learning between different populations to optimize the parameters of the hidden markov model until the termination condition is met, i.e. executing step 301-:

Step 301, obtaining the hidden state sequences of all gene sequences based on each honey source (one set of parameters of the hidden Markov model), and calculating the fitness value $fit(X_i)$ of each individual according to formula (2), where $fit(X_i)$ is the SPS of the hidden state sequences produced by the hidden Markov model under honey source $X_i$:

where $l_i$ denotes the i-th hidden state sequence being compared, $l_j$ denotes the j-th hidden state sequence being compared, and D is a function expressing the similarity of two sequences. In practice, D is computed with a similarity score matrix, so $D(l_i, l_j)$ is generally the substitution score of the corresponding residues of $l_i$ and $l_j$. The higher the SPS, the more accurate the alignment of the gene sequences.
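Formula (2) itself is not legible in this text. The sketch below assumes that SPS is the usual sum-of-pairs score, i.e. the sum of $D(l_i, l_j)$ over every unordered pair of sequences. The toy match/mismatch scoring in `pair_score` stands in for a real similarity score matrix, and both function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0):
    """D(l_i, l_j): position-wise score of two equal-length sequences.

    A toy match/mismatch score is used here in place of a substitution matrix.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def sum_of_pairs(seqs):
    """Assumed form of SPS: sum of D over all unordered pairs of sequences."""
    return sum(pair_score(a, b) for a, b in combinations(seqs, 2))

# Three hypothetical hidden state sequences (M = match, I = insert, D = delete).
score = sum_of_pairs(["MMDM", "MMIM", "MMDM"])
```

In this example the three pairs agree at 3, 4 and 3 positions, so `score` is 10.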

According to the constructed hierarchical ring topology, the population is divided by hierarchical learning into a first population S1, a second population S2 and a third population S3 in order of fitness value. The sizes of S1, S2 and S3 sum to SN and are in the ratio 1:7:2. The fitness value of every individual in S1 is greater than that of every individual in S2, and the fitness value of every individual in S2 is greater than that of every individual in S3.
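The 1:7:2 partition can be sketched as a sort by fitness followed by slicing. The function name `split_populations` and the rounding policy (any remainder goes to S3) are assumptions. The sketch also assumes SN is large enough that no population ends up empty.

```python
def split_populations(fitness, ratio=(1, 7, 2)):
    """Return index lists (S1, S2, S3), best fitness first, sized by ratio."""
    sn = len(fitness)
    # Indices of honey sources ordered from highest to lowest fitness.
    order = sorted(range(sn), key=lambda i: fitness[i], reverse=True)
    total = sum(ratio)
    n1 = max(1, sn * ratio[0] // total)
    n2 = max(1, sn * ratio[1] // total)
    # Assumed rounding policy: whatever is left after S1 and S2 forms S3.
    return order[:n1], order[n1:n1 + n2], order[n1 + n2:]

fitness = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
s1, s2, s3 = split_populations(fitness)
```

With SN = 10 this gives one individual in S1, seven in S2 and two in S3, and every index in S1 has higher fitness than every index in S2, and likewise for S2 over S3.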

The idea of hierarchical learning is as follows: within the whole population, the outer-layer populations learn from the differences of the inner-layer populations, so that the whole population moves steadily toward better solutions. Meanwhile, the innermost population learns toward the global optimum in search of a still better solution.

Step 302, leading bee stage: for a given honey source, the global optimal solution of all the honey sources is acquired, and one solution is selected at random from each of the populations; different difference values are calculated from the global optimal solution and the randomly selected solutions; the difference values are weighted and summed, and the weighted sum is added to the honey source to obtain a leading bee; if the fitness value of the leading bee is greater than that of the honey source, the honey source is replaced with the leading bee. Specifically, based on the idea of hierarchical learning, the search equation of the leading bee stage is improved: a leading bee no longer performs only a neighborhood search around a single honey source, but performs difference learning between the population levels according to formula (3) to obtain a high-quality solution.

where $v_{i,j}$ is the newly generated solution, $x_{i,j}$ is the j-th component of the i-th honey source, $\phi_{i,j}$ is a random number in [-1, 1], a second coefficient is a random number in [0, 1.5], $x_{gbest,j}$ is the j-th component of the global optimal solution, and $x_{S1,j}$, $x_{S2,j}$, $x_{S3,j}$ are three solutions selected at random from the three levels S1, S2 and S3, respectively.

The fitness value (SPS value) $fit(V_i)$ (or new_fit) of the newly generated solution is calculated according to equation (2). If it is greater than the SPS value of the current individual, i.e. $fit(X_i) < fit(V_i)$, the current leading bee individual is replaced with the new individual and $trail_i = 0$; otherwise $X_i$ is retained and $trail_i = trail_i + 1$.
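Formula (3) is not reproduced in this text, so the update in the sketch below is an assumed form. It is modeled on the description (the honey source plus weighted difference terms), on the stated coefficient ranges, and on the structure of equation (4). The coefficient name `psi`, the clamping to the bounds, and the update of every dimension at once are all assumptions. The greedy replacement and the trail counter follow the text directly.

```python
import random

def leading_bee_candidate(x_i, gbest, s1, s2, s3, lb, ub, rng):
    """Build a candidate v_i from x_i (assumed update form, see lead-in)."""
    # One random solution from each of the three population levels.
    xs1, xs2, xs3 = rng.choice(s1), rng.choice(s2), rng.choice(s3)
    v = []
    for j in range(len(x_i)):
        phi = rng.uniform(-1.0, 1.0)  # random coefficient in [-1, 1]
        psi = rng.uniform(0.0, 1.5)   # random coefficient in [0, 1.5]
        vj = x_i[j] + phi * (xs2[j] - xs3[j]) + psi * (gbest[j] - xs1[j])
        v.append(min(max(vj, lb[j]), ub[j]))  # clamp to bounds (assumption)
    return v

def greedy_update(x_i, v_i, trail_i, fit):
    """Keep the better of x_i and v_i; reset or increment the failure counter."""
    if fit(v_i) > fit(x_i):
        return v_i, 0
    return x_i, trail_i + 1
```

`greedy_update` takes the fitness function as an argument, so the same helper can serve any stage that applies this replacement rule.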

Step 303, follower bee stage: for a given honey source, the global optimal solution of all the honey sources is acquired, and one solution is selected at random from each of the populations; different difference values are calculated from the global optimal solution and the randomly selected solutions; the difference values and the honey source are weighted and summed to obtain a follower bee; if the fitness value of the follower bee is greater than that of the honey source, the honey source is replaced with the follower bee. Specifically, in the follower bee stage, three elite improvement operators $r_1$, $r_2$, $r_3$ are introduced among the three levels, and hierarchical learning replaces the original probability selection mechanism.

$$v_{i,j} = r_1\cdot x_{i,j} + r_2\cdot(x_{S2,j} - x_{S3,j}) + r_3\cdot(x_{gbest,j} - x_{S1,j}) \tag{4}$$

where $r_1$, $r_2$, $r_3$ are three random numbers in [0, 1] with $r_1 + r_2 + r_3 = 1$, $x_{gbest,j}$ is the j-th component of the global optimal solution, and $x_{S1,j}$, $x_{S2,j}$, $x_{S3,j}$ are three solutions selected at random from the three levels S1, S2 and S3, respectively.

The SPS value of the newly generated solution is calculated according to formula (2). If it is greater than the SPS value of the current individual, the current individual is replaced with the new individual and $trail_i = 0$; otherwise $X_i$ is retained and $trail_i = trail_i + 1$.
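Equation (4) can be sketched as below. The text requires only that $r_1$, $r_2$, $r_3$ lie in [0, 1] and sum to 1. Drawing three uniform numbers and normalizing them is one assumed way to satisfy that. The function names are hypothetical.

```python
import random

def simplex_weights(rng):
    """Three weights in [0, 1] that sum to 1 (normalisation is an assumed scheme)."""
    while True:
        raw = [rng.random() for _ in range(3)]
        total = sum(raw)
        if total > 0.0:  # redraw in the degenerate all-zero case
            return [r / total for r in raw]

def follower_bee_candidate(x_i, gbest, xs1, xs2, xs3, rng):
    """v_ij = r1*x_ij + r2*(x_S2,j - x_S3,j) + r3*(x_gbest,j - x_S1,j)."""
    r1, r2, r3 = simplex_weights(rng)
    return [
        r1 * x_i[j] + r2 * (xs2[j] - xs3[j]) + r3 * (gbest[j] - xs1[j])
        for j in range(len(x_i))
    ]
```

Here `xs1`, `xs2` and `xs3` are the solutions already drawn at random from S1, S2 and S3. The candidate is then accepted or rejected by the greedy rule described above.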

Step 304, scout bee stage: for a given honey source, if the number of failed iterations of the honey source reaches the set number, a plurality of new solutions are generated using a plurality of new-solution generation modes, and the best of these new solutions is selected according to a greedy selection strategy to replace the honey source. Specifically, in the scout bee stage a method based on hierarchical learning and opposition-based learning is introduced: when the iteration failure count $trail_i$ reaches the set limit $limit_i$, three different solutions are generated instead of the single new solution of the original algorithm.

A new solution is generated randomly: the first solution $m_1$ is still generated according to equation (1).

The population with the maximum fitness value is selected from the plurality of populations as the base population, and the remaining populations are taken as auxiliary populations; one honey source is selected at random from the base population as the base honey source; one honey source is selected at random from each auxiliary population as an auxiliary honey source; the difference between the auxiliary honey sources is calculated, multiplied by a random number, and added to the base honey source to obtain a new solution. That is, the second solution $m_2$ is generated by hierarchical learning according to the following formula:

$$m_2 = x_{S1} + \phi_{i,j}\cdot(x_{S2} - x_{S3}) \tag{5}$$

where $x_{S1}$, $x_{S2}$, $x_{S3}$ are three solutions selected at random from the three levels S1, S2 and S3, respectively, and $\phi_{i,j}$ is a random number in [-1, 1].

The minimum value, the maximum value and the global optimal solution are selected from all honey sources, and the global optimal solution is subtracted from the sum of the minimum value and the maximum value to obtain a new solution. That is, the third solution $m_3$ follows the idea of opposition-based learning: a solution is sought on the opposite side of the global optimal solution to keep the search from becoming trapped in a local optimum. The generation formula is:

$$m_3 = LB + UB - x_{gbest} \tag{6}$$

where LB and UB are the minimum value and the maximum value of the solution, respectively, and $x_{gbest}$ is the global optimal solution.

The fitness values of the three newly generated solutions are calculated by formula (2), and the best of them is selected as the new solution according to a greedy selection strategy.
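The three candidates of the scout bee stage and the greedy choice among them can be sketched as follows. The function name `scout_replacement` is hypothetical. Drawing a fresh random coefficient per dimension for $m_2$ and leaving $m_2$ unclamped are assumptions, since the text does not say how out-of-range values are handled.

```python
import random

def scout_replacement(gbest, s1, s2, s3, lb, ub, fit, rng):
    """Return the fittest of the three candidate solutions m1, m2, m3."""
    d = len(lb)
    # m1: fresh random solution, equation (1).
    m1 = [lb[j] + rng.random() * (ub[j] - lb[j]) for j in range(d)]
    # m2: hierarchical learning, equation (5); S1 supplies the base honey source.
    xs1, xs2, xs3 = rng.choice(s1), rng.choice(s2), rng.choice(s3)
    m2 = [xs1[j] + rng.uniform(-1.0, 1.0) * (xs2[j] - xs3[j]) for j in range(d)]
    # m3: opposition-based learning, equation (6).
    m3 = [lb[j] + ub[j] - gbest[j] for j in range(d)]
    # Greedy selection: keep the candidate with the highest fitness.
    return max((m1, m2, m3), key=fit)
```

A caller would invoke this for honey source i once `trail_i` reaches the limit, store the returned solution in place of that honey source, and reset its failure counter.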

In this example, four groups of test experiments were carried out on the gene sequence sets 1ad2_ref1, 1ivy_ref5, 451c_ref1 and kinase_ref1; each set was aligned both by ABC and by the method of the invention. The ABC algorithm and the algorithm of the invention were run under the same experimental conditions; each test was run independently 10 times with 1000 iterations, and the maximum, minimum and average values were recorded.

TABLE 1 accuracy of multiple sequence alignment test results

The results of the algorithm of the invention are significantly higher than those of the ABC algorithm in mean, best and worst values alike, which demonstrates its superiority. To show the performance of the algorithm more fully, convergence curves of the algorithm of the invention (HLABC) and of ABC are given in addition to the accuracy results of Table 1, as shown in FIG. 2(a), FIG. 2(b), FIG. 2(c) and FIG. 2(d), where the horizontal axis is the iteration number (Iteration) and the vertical axis is the average SPS value (Score). It can be concluded that the invention keeps the algorithm from becoming trapped in a local optimum, speeds up convergence, and improves solution accuracy in multiple sequence alignment.

Based on the idea of hierarchical learning, the invention provides a novel ring topology, improves the original search strategy to increase the randomness of the search, and replaces the original probability selection mechanism with hierarchical learning to improve the optimization capability and convergence speed of the algorithm. It thereby overcomes the defects of the original ABC algorithm, optimizes the ABC algorithm, and improves solution accuracy in multiple sequence alignment.

Example two

The embodiment provides a gene sequence alignment system, which specifically comprises the following modules:

It should be noted that, each module in the present embodiment corresponds to each step in the first embodiment one to one, and the specific implementation process is the same, which is not described herein again.

Example three

This embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in a method for gene sequence alignment as described in the first embodiment above.

Example four

This embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the gene sequence alignment method described in the first embodiment above.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of gene sequence alignment comprising:

obtaining a plurality of gene sequences;

coding parameters of the hidden Markov model into honey sources, and adopting the hidden Markov models corresponding to all the honey sources to obtain various hidden state sequences corresponding to each gene sequence; judging whether a termination condition is met, if so, comparing every two hidden state sequences of all gene sequences obtained by a hidden Markov model corresponding to a honey source with the maximum fitness value to obtain a gene sequence most similar to each gene sequence; otherwise, dividing all the honey sources into a plurality of populations based on the fitness value of each honey source, and performing difference learning among different populations to optimize the parameters of the hidden Markov model until a termination condition is met.

2. The method of claim 1, wherein the learning the differences between different populations comprises:

weighting and summing the different difference values, and then summing the weighted sum with the honey source to obtain leading bees;

3. The method of claim 1, wherein the learning the differences between different populations comprises:

4. The method of claim 1, wherein the learning the differences between different populations comprises:

5. The method of claim 4, wherein the new solution is generated by: a new solution is randomly generated.

6. The method of claim 4, wherein the new solution is generated by:

7. The method of claim 4, wherein the new solution is generated by:

8. A system for aligning gene sequences, comprising:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for gene sequence alignment according to any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the steps of a method of gene sequence alignment according to any one of claims 1-7.