CN115148292A

CN115148292A - Artificial intelligence-based DNA (deoxyribonucleic acid) motif prediction method, device, equipment and medium

Info

Publication number: CN115148292A
Application number: CN202210814889.1A
Authority: CN
Inventors: 王文; 王健宗; 黄章成
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-07-12
Filing date: 2022-07-12
Publication date: 2022-10-04

Abstract

The invention relates to the technical field of artificial intelligence, in particular to a DNA motif prediction method, a device, equipment and a medium based on artificial intelligence. The method includes the steps of counting bases in a DNA sequence to be processed, inputting a counting result into a motif prediction model to obtain base probability distribution of preset sites, determining an initial motif, calculating a motif evaluation function according to the initial motif, updating model parameters according to calculated gradients, determining an updated motif, iterating the updated motif as the initial motif until the model converges, using a trained model to obtain a target motif, and performing online learning to adaptively adjust the motif prediction model parameters aiming at different DNA sequences, so that the accuracy of DNA motif prediction is improved.

Description

Artificial intelligence-based DNA (deoxyribonucleic acid) motif prediction method, device, equipment and medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a DNA motif prediction method, a device, equipment and a medium based on artificial intelligence.

Background

Searching for potential DNA motifs in DNA sequences that share the same biological function helps identify the mechanisms that may regulate that function and supports the study of transcriptional regulation omics.

However, the base arrangement of a DNA sequence is complex, and a heuristic search easily becomes trapped in a local optimum. The optimal result is then not found, and the accuracy of DNA motif prediction is low. Avoiding local optima requires repeating the search many times, which greatly reduces the efficiency of DNA motif prediction. How to improve the accuracy of DNA motif prediction while keeping its efficiency high has therefore become an urgent problem.

Disclosure of Invention

In view of this, embodiments of the present invention provide a DNA motif prediction method, apparatus, device, and medium based on artificial intelligence, so as to solve the problem that the accuracy of DNA motif prediction is low when high prediction efficiency must be maintained.

In a first aspect, an embodiment of the present invention provides a DNA motif prediction method based on artificial intelligence, where the DNA motif prediction method includes:

obtaining a DNA sequence to be processed, and counting the number of each type of base in the DNA sequence to be processed to obtain a statistical result;

obtaining the base probability distribution of each preset site in the motif by using a motif prediction model according to the statistical result, and determining an initial motif according to the base probability distribution;

calculating the initial motif by using a motif evaluation function, calculating the gradient of the motif evaluation function according to the calculation result and combining a strategy gradient algorithm, and updating the parameters of the motif prediction model according to the gradient to obtain an updated motif prediction model;

according to the statistical result, obtaining updated base probability distribution of each preset position in the motif by using the updated motif prediction model, determining an updated motif according to the updated base probability distribution, taking the updated motif as an initial motif, and executing the step of calculating the initial motif by using a motif evaluation function again until the motif evaluation function is converged to obtain a trained motif prediction model;

and according to the statistical result, obtaining the target base probability distribution of each preset site in the motif by using the trained motif prediction model, determining a target motif according to the target base probability distribution, and determining the target motif as a DNA motif prediction result.

In a second aspect, an embodiment of the present invention provides an artificial intelligence-based DNA motif prediction device, including:

the sequence counting module is used for acquiring a DNA sequence to be processed, and counting the number of each type of base in the DNA sequence to be processed to obtain a statistical result;

the motif prediction module is used for obtaining the base probability distribution of each preset site in the motif by using the motif prediction model according to the statistical result and determining an initial motif according to the base probability distribution;

the model updating module is used for calculating the initial motif by using a motif evaluation function, calculating the gradient of the motif evaluation function according to the calculation result and combining a strategy gradient algorithm, and updating the parameters of the motif prediction model according to the gradient to obtain an updated motif prediction model;

the iteration training module is used for predicting the updated base probability distribution of each preset position in the motif by using the updated motif prediction model according to the statistical result, determining an updated motif according to the updated base probability distribution, taking the updated motif as an initial motif, and performing the step of calculating the initial motif by using the motif evaluation function again until the motif evaluation function is converged to obtain a trained motif prediction model;

and the motif determining module is used for obtaining the target base probability distribution of each preset position in the motif by using the trained motif prediction model according to the statistical result, determining a target motif according to the target base probability distribution and determining the target motif as a DNA motif prediction result.

In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the DNA motif prediction method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for predicting DNA motifs according to the first aspect is implemented.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

The number of each type of base in the obtained DNA sequence to be processed is counted, the base probability distribution of each preset site in the motif is obtained by using a motif prediction model according to the statistical result, and an initial motif is determined according to the base probability distribution. The initial motif is calculated by using a motif evaluation function in combination with a strategy gradient algorithm to obtain a gradient, and the parameters of the motif prediction model are updated according to the gradient to obtain an updated motif prediction model. An updated motif is obtained by using the updated motif prediction model and taken as the initial motif, and the step of calculating the initial motif by using the motif evaluation function is performed again until the motif evaluation function converges, which yields a trained motif prediction model. A target motif is then determined by using the trained motif prediction model. Because the motif prediction model learns online, its parameters are adaptively adjusted for different DNA sequences to be processed, which improves the accuracy of DNA motif prediction. Because the evaluation function evaluates the initial motif and supplies the gradient for updating the parameters of the prediction model, training of the motif prediction model proceeds together with the prediction itself, which preserves the efficiency of DNA motif prediction.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an application environment of a DNA motif prediction method based on artificial intelligence according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a DNA motif prediction method based on artificial intelligence according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a DNA motif prediction method based on artificial intelligence according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an artificial intelligence-based DNA motif prediction device according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".

Furthermore, in the description of the present specification and the appended claims, the terms "first," "second," "third," and the like are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

The embodiment of the invention can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain the best result.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

It should be understood that, the sequence numbers of the steps in the following embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by the function and the internal logic thereof, and should not limit the implementation process of the embodiments of the present invention in any way.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

The DNA motif prediction method based on artificial intelligence provided in the first embodiment of the present invention can be applied to the application environment shown in fig. 1, in which a client communicates with a server. The client includes, but is not limited to, a palm top computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud terminal device, a Personal Digital Assistant (PDA), and other computer devices. The server can be implemented by an independent server or a server cluster composed of a plurality of servers.

Referring to FIG. 2, which is a schematic flow chart of a DNA motif prediction method based on artificial intelligence according to an embodiment of the present invention, the method may be applied to the client in FIG. 1. The computer device corresponding to the client is connected to a server and obtains the DNA sequence to be processed from the server. A motif prediction model is deployed at the client and can be used to predict the DNA motif corresponding to the DNA sequence to be processed. As shown in FIG. 2, the DNA motif prediction method may include the following steps:

Step S201, obtaining a DNA sequence to be processed, and counting the number of each type of base in the DNA sequence to be processed to obtain a statistical result.

The DNA sequence to be processed may refer to a DNA sequence having a target biological function, and usually the length of the DNA sequence to be processed is several tens to several hundreds of base pairs, and a batch subjected to DNA motif prediction usually contains several thousands of DNA fragments.

A base refers to a constituent base of a DNA sequence: adenine (A), thymine (T), cytosine (C), or guanine (G). The statistical result refers to the specific base arrangement of the DNA sequences to be processed.

Specifically, the statistical result may take the form of a matrix of size N × L × 4, where N is the number of DNA sequences to be processed, L is the uniform length of the DNA sequences to be processed, and 4 is the number of DNA base types.

To unify the matrix format, the length of the longest DNA sequence to be processed is taken as L, i.e., $L = \max(l_1, l_2, \ldots, l_n, \ldots, l_N)$, where max denotes the maximum, l denotes the actual length of a DNA sequence to be processed, $l_n$ is the actual length of the n-th DNA sequence to be processed, and n is an integer in [1, N]. It should be noted that a DNA sequence to be processed whose actual length is smaller than the uniform length is extended to the uniform length from its end, and the element values at the extended sites are all 0.

For example, the bases in the matrix corresponding to the statistical result are ordered A, T, C, G. The element at coordinates [n, l, 2] then indicates whether the l-th site of the n-th DNA sequence to be processed is a T base: a value of 0 means that the site is not a T base, and a value of 1 means that it is.
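The encoding described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a hypothetical helper name `encode_sequences`; note that the text counts base channels from 1 (T is channel 2), while the array below is 0-based (T is index 1).

```python
import numpy as np

BASES = "ATCG"  # channel order A, T, C, G, as in the example above

def encode_sequences(seqs):
    """Return an N x L x 4 one-hot array; short sequences are zero-padded at the end."""
    n = len(seqs)
    length = max(len(s) for s in seqs)  # L = max(l_1, ..., l_N)
    out = np.zeros((n, length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq.upper()):
            # Assumption: only A/T/C/G occur; any other symbol raises ValueError.
            out[i, j, BASES.index(base)] = 1.0
    return out

stat = encode_sequences(["ATCG", "TT"])
```

Here `stat[1, 0, 1]` is 1 because the first site of the second sequence is T, and the two padded sites of that sequence contain only zeros.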

In the step of obtaining the DNA sequence to be processed and counting the number of each type of base to obtain the statistical result, the specific base arrangement of the DNA sequence to be processed is used as the statistical result. The subsequent motif prediction model therefore receives the complete information of the DNA sequence to be processed, which can improve the accuracy of motif prediction.

Step S202, according to the statistical result, a motif prediction model is used to obtain the base probability distribution of each preset position in the motif, and the initial motif is determined according to the base probability distribution.

The motif prediction model may be a neural network model, a logistic regression model, a multilayer perceptron model, or the like. A motif refers to a short DNA fragment, generally 8 to 20 base pairs in length, that is enriched in DNA fragments having the same biological function (such as the promoter region of a gene or the vicinity of a transcription factor binding site).

In this embodiment, the length of the motif is determined to be a preset length, and accordingly, the motif includes a fixed number of preset sites, for example, the length of the motif can be preset to 10 base pairs, and then the motif includes 10 preset sites.

The base probability distribution may refer to the probability of four bases constituting DNA in each preset site, and the initial motif may refer to a DNA motif to be evaluated.

Specifically, the base probability distribution may be in the form of a matrix, where the size of the base probability distribution matrix is M × 4, M may refer to a predetermined length of the motif, and 4 may refer to the number of DNA bases.

It should be noted that, since each row of the matrix represents the probability distribution over the base types at one site, the element values of each row should sum to 1. For example, if the m-th row of the matrix corresponding to the base probability distribution is [0.1, 0.2, 0.4, 0.3], the m-th preset site has a probability of 0.1 of being an A base, 0.2 of being a T base, 0.4 of being a C base, and 0.3 of being a G base.

Optionally, the motif prediction model includes a convolutional layer and a fully-connected layer;

according to the statistical result, obtaining the base probability distribution of each preset position in the motif by using a motif prediction model, and determining the initial motif according to the base probability distribution comprises the following steps:

inputting the statistical result into the convolutional layer for feature extraction to obtain statistical features;

inputting the statistical features into the fully-connected layer for feature mapping to obtain the base probability distribution of each preset site;

and aiming at the base probability distribution of any preset site, determining the base corresponding to the maximum probability in the base probability distribution as the initial base of the preset site, and splicing all the initial bases into an initial motif according to the site sequence.

The statistical features refer to a feature tensor obtained by convolving and aggregating the statistical result, and the initial base refers to the most probable base type at a preset site.

Specifically, a single convolutional layer may include a convolution calculation layer, a batch normalization layer, and an activation layer, and is used to extract features from its input. In this embodiment the number of convolutional layers is set to 2, and an implementer may adjust this number as needed. The network should not be too deep, so as to avoid the degradation phenomenon; a range of [1, 4] layers is suggested. The convolution calculation layer slides a convolution kernel over the input to extract features and thereby aggregates them; the kernel parameters are initialized randomly. The batch normalization layer normalizes the features extracted by the convolution calculation layer, which speeds up network training and convergence. The activation layer improves the nonlinear representation capability of the model; in this embodiment it may use the rectified linear unit function (ReLU function).

The fully-connected layer maps the features to the output space, which in this embodiment is the base probability distribution space of the preset sites. A normalized exponential function (Softmax function) follows the fully-connected layer and maps its output values to probability values.
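A forward pass through such a model might look like the sketch below. It is a simplified NumPy illustration, not the embodiment's network: batch normalization is omitted, the kernel widths and channel counts are arbitrary, and averaging over sequences and positions before the fully-connected layer is an assumption, since the text does not say how the N sequences are aggregated. All names (`conv1d_relu`, `predict_distribution`, `params`) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_relu(x, kernels):
    """x: (N, L, C_in); kernels: (width, C_in, C_out). Valid convolution, then ReLU."""
    width = kernels.shape[0]
    steps = x.shape[1] - width + 1
    out = np.stack(
        [np.tensordot(x[:, t:t + width, :], kernels, axes=([1, 2], [0, 1]))
         for t in range(steps)],
        axis=1,
    )  # shape (N, steps, C_out)
    return np.maximum(out, 0.0)

def predict_distribution(stat, params, motif_len):
    h = conv1d_relu(stat, params["k1"])          # first convolutional layer
    h = conv1d_relu(h, params["k2"])             # second convolutional layer
    pooled = h.mean(axis=(0, 1))                 # assumption: mean over sequences and positions
    logits = pooled @ params["w"] + params["b"]  # fully-connected layer
    return softmax(logits.reshape(motif_len, 4), axis=1)  # one distribution per site

rng = np.random.default_rng(0)
M = 10  # preset motif length
params = {
    "k1": rng.normal(scale=0.1, size=(3, 4, 8)),  # randomly initialised kernels
    "k2": rng.normal(scale=0.1, size=(3, 8, 8)),
    "w": rng.normal(scale=0.1, size=(8, M * 4)),
    "b": np.zeros(M * 4),
}
stat = np.eye(4)[rng.integers(0, 4, size=(5, 30))]  # random one-hot input, N=5, L=30
probs = predict_distribution(stat, params, M)
```

The result `probs` is an M × 4 matrix whose rows each sum to 1, matching the base probability distribution described earlier.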

After the base probability distribution of the preset sites is obtained, the base type corresponding to the maximum probability of each preset site is determined by using a maximum value function (Max function), for example, if the base probability distribution of the preset site 1 is [0.1,0.2,0.4,0.3], the maximum probability of the preset site 1 is 0.4, and the base type corresponding to the maximum probability of the preset site 1 is C base.

Splicing may be performed by concatenation (concat), that is, the part to be spliced is appended to the end of the spliced part. At the start, following the preset site order, the base type of the first preset site is taken as the spliced part and the base type of the second preset site as the part to be spliced; the splicing result then becomes the spliced part. The next part to be spliced is determined in site order in the same way until the base types of all preset sites have been spliced. For example, if the base type of the first preset site is C, that of the second is A, and that of the third is G, the initial motif obtained after splicing is [C, A, G].
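Selecting the most probable base per site and splicing in site order can be sketched in a few lines. The probability values below are made up for illustration (only the first row comes from the text), and `initial_motif` is a hypothetical helper name.

```python
import numpy as np

BASES = "ATCG"  # column order of the base probability distribution matrix

def initial_motif(probs):
    """Pick the most probable base at each site and concatenate in site order."""
    idx = np.argmax(probs, axis=1)
    return "".join(BASES[i] for i in idx)

probs = np.array([
    [0.1, 0.2, 0.4, 0.3],  # site 1: C is most probable
    [0.7, 0.1, 0.1, 0.1],  # site 2: A is most probable
    [0.1, 0.1, 0.2, 0.6],  # site 3: G is most probable
])
motif = initial_motif(probs)
```

With these three rows the sites resolve to C, then A, then G, in that order.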

In one embodiment, the activation layer may also use a sigmoid function, a hyperbolic tangent function (Tanh function), or the like.

In the embodiment, the DNA motif prediction is carried out in the form of the convolutional neural network, so that the algorithm complexity only linearly increases along with the size of the input data, the condition of exponential increase caused by repeated search is avoided, and the efficiency of the DNA motif prediction is improved.

In the step of obtaining the base probability distribution of each preset site in the motif by using the motif prediction model according to the statistical result and determining the initial motif according to the base probability distribution, no labels need to be annotated in advance. The initial motif is obtained by prediction and adjusted in subsequent steps according to feedback on the predicted initial motif, which avoids the difficulty of labeling DNA motifs and effectively improves the efficiency of DNA motif prediction.

Step S203, calculating an initial motif by using a motif evaluation function, calculating the gradient of the motif evaluation function according to the calculation result and in combination with a strategy gradient algorithm, and updating the parameters of the motif prediction model according to the gradient to obtain an updated motif prediction model.

The motif evaluation function can be used for evaluating the interpretation ability of the motif on the DNA sequence to be processed, and the interpretation ability can be characterized by the occurrence frequency of the motif on the DNA sequence to be processed.

The strategy gradient algorithm (known in reinforcement learning as the policy gradient algorithm) is a method for calculating the gradient of the parameters of the motif prediction model. The gradient guides the update of those parameters, that is, the parameters are updated by gradient descent and back propagation. The parameters may be the weight parameters and bias parameters of the neurons in the motif prediction model. The updated motif prediction model refers to the motif prediction model after its parameters have been updated.

Specifically, the motif evaluation function provides an evaluation of the motif, and the higher the evaluation value is, the greater the interpretation ability of the motif with respect to the DNA sequence to be processed is, the more likely the predicted motif is to be a true motif, and the lower the evaluation value is, the lower the interpretation ability of the motif with respect to the DNA sequence to be processed is, the less likely the predicted motif is to be a true motif.

The formula of the strategy gradient algorithm is as follows:

where $\nabla_\theta$ is the gradient, $\theta$ is the model parameter of the motif prediction model, $\pi(\theta)$ is the strategy of the motif prediction model, $G$ is the function value of the motif evaluation function, $E_{\pi(\theta)}[G]$ is the expectation of the function value, $A$ is the initial motif, and $S$ is the statistical result.
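The formula itself does not appear in this text. Assuming the standard likelihood-ratio (REINFORCE) form of the policy gradient, which uses exactly the symbols listed here, it would read as follows; this is an assumption about the missing formula, not a quotation of it.

```latex
\nabla_\theta \, E_{\pi(\theta)}[G]
  = E_{\pi(\theta)}\!\left[\, G \, \nabla_\theta \log \pi_\theta(A \mid S) \,\right]
```

Under this reading, the left-hand side is the gradient of the expected evaluation value, and the right-hand side can be estimated from sampled motifs A without any labels.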

π (θ) may refer to the output strategy that the model takes, given the model parameters, which may include the softmax strategy, the Gaussian strategy, etc.

G is the function value of the motif evaluation function calculated for the currently output motif. With the statistical result fixed, G depends only on the decision of the model, so taking the gradient of the expected function value, with the right-hand side of the equation acting on the model parameters, gives the gradient of the model parameters. Calculating the gradient with the strategy gradient algorithm allows training to be guided without labels, which makes training the motif prediction model feasible.
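A label-free update of this kind can be illustrated with a toy REINFORCE step. The sketch below makes several assumptions that are not in the text: the policy is a softmax over an M × 4 table of logits rather than a neural network, the reward is a toy match count instead of the motif evaluation function, and the update ascends G (equivalently, descends on -G) with no baseline or other variance reduction, so convergence is not guaranteed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One policy-gradient update on an M x 4 logit table (hypothetical parameterisation)."""
    probs = softmax(logits)
    m = logits.shape[0]
    # Sample one base index per site from the current policy.
    action = np.array([rng.choice(4, p=probs[i]) for i in range(m)])
    g = reward_fn(action)          # function value G for the sampled motif
    onehot = np.eye(4)[action]
    grad = g * (onehot - probs)    # G * gradient of log pi(A | S) w.r.t. the logits
    return logits + lr * grad, action, g

rng = np.random.default_rng(0)
target = np.array([2, 0, 3])  # toy target "CAG" in A, T, C, G order
reward = lambda a: float((a == target).sum())  # toy reward, not the patent's function

logits = np.zeros((3, 4))
for _ in range(200):
    logits, _, _ = reinforce_step(logits, reward, rng)
policy = softmax(logits)  # per-site base distribution after the updates
```

The term `onehot - probs` is the exact gradient of the log-probability of a softmax choice with respect to its logits, which is why no labelled motif is needed: only the sampled motif and its score enter the update.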

Optionally, calculating the initial motif by using the motif evaluation function includes:

comparing the initial motif with each DNA sequence to be processed, and determining an optimal matching segment according to a comparison result;

counting the number of bases of each site in the optimal matching fragment to obtain a base counting matrix of the corresponding site;

and calculating a motif evaluation function of the corresponding site according to the base counting matrix to obtain a site evaluation value, and determining the sum of all the site evaluation values as a calculation result.

The comparison may refer to a similarity calculation, and the similarity may be measured by Euclidean distance, cosine similarity, or the like. The best matching fragment refers to the DNA sub-fragment in the DNA sequence to be processed that is most similar to the initial motif.

The base count matrix refers to the base counts of the best matching fragments. The locus evaluation value may be a discrete analog of a K-order Kullback-Leibler (KL) divergence, where K is an integer greater than or equal to zero; in this embodiment K is 0, that is, a single site is the calculation object. The calculation result reflects the ability of the best matching fragments to interpret the DNA sequence to be processed.

Specifically, the size of the base counting matrix is also M × 4, where M is the preset length of the motif and 4 is the number of DNA base types. Because the optimally matched segment may not be completely consistent with the initial motif, the bases of the optimally matched segment must be counted again to obtain its base counting matrix. For an initial motif of M sites, the optimally matched segment also has M sites; the motif evaluation function is calculated for each site, and the sum of the M locus evaluation values is taken as the calculation result.
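Finding the optimally matched segment in each sequence and building the M × 4 base counting matrix can be sketched as below. The similarity measure used here, the number of identical positions, is an assumption chosen for brevity (on one-hot encodings it ranks windows the same way as Euclidean distance); the helper names are hypothetical, and every sequence is assumed to be at least M bases long.

```python
import numpy as np

BASES = "ATCG"

def best_match(motif, seq):
    """Return the length-M window of seq that agrees with motif at the most positions."""
    m = len(motif)
    windows = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    # Ties keep the first (leftmost) window.
    return max(windows, key=lambda w: sum(a == b for a, b in zip(motif, w)))

def count_matrix(fragments):
    """M x 4 matrix: how often each base occurs at each site of the fragments."""
    m = len(fragments[0])
    counts = np.zeros((m, 4), dtype=int)
    for frag in fragments:
        for site, base in enumerate(frag):
            counts[site, BASES.index(base)] += 1
    return counts

seqs = ["GGCAGTT", "ACAGGGA", "TTTCTGA"]
fragments = [best_match("CAG", s) for s in seqs]
counts = count_matrix(fragments)
```

In this toy input the third sequence contains no exact copy of the motif, so its best window differs from the motif at one site, which is why the counts are taken from the matched fragments rather than from the motif itself.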

In this embodiment, the DNA sub-segment most similar to the initial motif is used as the optimally matched segment and its bases are counted to obtain the base counting matrix. The initial motif can thus be evaluated effectively even when its prediction deviates slightly, which greatly reduces the number of iterative updates of the initial motif and improves the efficiency of DNA motif prediction.

The motif evaluation function is:

where $\alpha$ is the locus evaluation value, $N$ is the number of DNA sequences to be processed, $k$ is the number of base types contained in the DNA sequences to be processed, $x_i$ is the number of occurrences of the $i$-th base at the site, and $q_i$ is the proportion of the $i$-th base in all DNA sequences to be processed.

Specifically, to avoid undefined function values, one is added to the argument of each logarithm in the motif evaluation function; for example, log N! is actually computed as log(N! + 1).

The larger the locus evaluation value is, the larger the difference between the base distribution on the optimal matching fragment and the base distribution of the DNA sequence to be processed is, the more the optimal matching fragment does not conform to the characteristics of the motif, and conversely, the smaller the locus evaluation value is, the more similar the base distribution on the optimal matching fragment and the base distribution of the DNA sequence to be processed is, the more the optimal matching fragment conforms to the characteristics of the motif.
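The evaluation formula does not appear in this text, so the sketch below is only a simplified stand-in. It assumes the plain KL divergence between the base frequencies at a site (x_i / N) and the background proportions q_i, which matches the description of the locus evaluation value as a discrete analog of a KL divergence; the factorial terms mentioned above (such as log N!) are not modelled, and the function names are hypothetical.

```python
import numpy as np

def site_score(x, q):
    """KL divergence between site frequencies x / N and background proportions q."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    p = x / x.sum()
    mask = p > 0  # bases that never occur at the site contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def motif_score(counts, q):
    """Sum of the per-site scores over all M sites of a base counting matrix."""
    return sum(site_score(row, q) for row in counts)

background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background, for illustration
uniform_site = site_score([5, 5, 5, 5], background)
conserved_site = site_score([20, 0, 0, 0], background)
total = motif_score([[20, 0, 0, 0], [5, 5, 5, 5]], background)
```

A site whose base frequencies equal the background scores zero, while a fully conserved site scores log 4 under a uniform background, so the value grows with the difference between the two distributions.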

In the embodiment, the initial motif is evaluated by adopting the motif evaluation function, the evaluation value is determined according to the interpretability of the motif on the DNA sequence, a training gradient is provided for the motif prediction model, and the accuracy of the DNA motif prediction is improved.

Optionally, after performing base number statistics on each site in the optimally matched fragment to obtain a base count matrix of the corresponding site, the method further includes:

for any site, forming a K-order sub-segment from the site and the 2K sites adjacent to it;

performing base counting on each K-order sub-segment in the optimally matched segment to obtain a second base count matrix corresponding to the K-order sub-segment;

correspondingly, calculating the motif evaluation function of the corresponding site according to the base count matrix to obtain a site evaluation value, and determining the sum of all site evaluation values as the motif evaluation value, comprises the following steps:

calculating the site evaluation value of the corresponding site according to the base count matrix;

and calculating the sub-segment evaluation value corresponding to the K-order sub-segment according to the second base count matrix, and determining the sum of all site evaluation values and all sub-segment evaluation values as the motif evaluation value.

In this embodiment, K is a non-negative integer and here takes a value greater than zero, for example 1. An adjacent site may refer to a site close to the given site, a K-order sub-segment may refer to a sub-segment of the optimally matched segment, and the second base count matrix may refer to the matrix obtained by performing base counting on the K-order sub-segments. The sub-segment evaluation value may refer to the result of calculating the motif evaluation function on a K-order sub-segment.

Specifically, in calculating the higher-order motif evaluation function, the base type takes into account not only the base itself but also its adjacent bases. For the first-order motif evaluation function, let the site of the target base be C, where C is the serial number of the site. The corresponding first-order sub-segment covers the target base and its two neighbours, that is, sites C-1, C and C+1. The first-order sub-segment is then treated as a whole, and its number of base types M' is 4^3, namely 64.
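A hedged sketch of how K-order sub-segments might be encoded and counted is given below. Treating each sub-segment of 2K+1 bases as one of 4^(2K+1) composite types follows the text; the base-4 index encoding, the decision to skip sites without a full neighbourhood, and the function names are assumptions for illustration.

```python
BASES = "ACGT"  # assumed base order

def subsegment_index(subsegment):
    """Map a sub-segment of 2K+1 bases to an integer in [0, 4**(2K+1))."""
    index = 0
    for base in subsegment:
        index = index * 4 + BASES.index(base)
    return index

def k_order_counts(segments, k=1):
    """Second base count matrix: one row per centre site, 4**(2K+1) columns."""
    m = len(segments[0])
    n_types = 4 ** (2 * k + 1)
    counts = []
    # Only centres with k neighbours on both sides are counted (assumption).
    for center in range(k, m - k):
        row = [0] * n_types
        for segment in segments:
            row[subsegment_index(segment[center - k:center + k + 1])] += 1
        counts.append(row)
    return counts

second_matrix = k_order_counts(["ACGT", "ACGA", "ACCT"], k=1)
```

For K = 1 every row has 64 columns, matching the 4^3 composite base types described above.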

In this embodiment, the DNA sequences to be processed, the initial motif, and the new base types defined by the M' categories of the first-order sub-segment are all converted into tensor form. For a K-order motif evaluation function, the value of K therefore needs no special handling: data matching is performed by two-dimensional convolution, and the number of occurrences of each match is counted with a summation function (Sum function), so the calculation time does not increase significantly with the number of surrounding bases considered. For example, the two-dimensional tensor of the initial motif is used as a convolution kernel and slid over the tensor of a DNA sequence to be processed. If a segment exactly matching the initial motif exists, the convolution result at that position is a fixed value, and the closer a convolution result is to this fixed value, the more similar the corresponding segment is to the initial motif.
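The sliding match can be sketched in plain Python as below. The one-hot encoding is an assumption (the source only says "tensor form"); under that encoding the fixed value for an exact match equals the motif length M. A real implementation would presumably use a tensor library's 2-D convolution instead of explicit loops.

```python
BASES = "ACGT"  # assumed base order

def one_hot(sequence):
    """Encode a DNA string as a len(sequence) x 4 matrix of 0/1 values."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

def sliding_match_scores(sequence, motif):
    """Slide the motif's one-hot matrix over the sequence like a convolution kernel."""
    seq_t, kernel = one_hot(sequence), one_hot(motif)
    m = len(kernel)
    scores = []
    for start in range(len(seq_t) - m + 1):
        # Element-wise product summed over the window, as a Sum function would do.
        score = sum(
            seq_t[start + i][j] * kernel[i][j]
            for i in range(m)
            for j in range(len(BASES))
        )
        scores.append(score)
    return scores

scores = sliding_match_scores("TTACGTAA", "ACGT")
```

The position with the highest score marks the segment most similar to the motif.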

In this embodiment, the motif is evaluated with a higher-order motif evaluation function. This prevents the evaluation from being biased by the differing occurrence probabilities of base combinations in the DNA sequences, so the motif can be better distinguished from the background, the training process is accelerated, and the accuracy of DNA motif prediction is improved.

In the above step of calculating the initial motif with the motif evaluation function, calculating the gradient of the motif evaluation function from the calculation result in combination with a policy gradient algorithm, and updating the parameters of the motif prediction model according to the gradient to obtain the updated motif prediction model, the gradient is computed without supervision from the motif evaluation function and the policy gradient algorithm and guides the update of the model parameters. This ensures that the parameters of the motif prediction model are updated smoothly, avoids the model failing to converge during training, and improves the accuracy of the updated motif prediction model.
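A minimal REINFORCE-style sketch of one policy gradient update follows. It makes several simplifying assumptions that are not in the source: the model is reduced to a table of per-site logits instead of a convolutional and fully-connected network, the motif is sampled from the per-site softmax, and `reward` is a scalar derived from the motif evaluation value with the sign chosen so that larger is better.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one site's four logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sample_motif(logits, rng):
    """Sample one base index per site from the current base probability distribution."""
    probs = [softmax(row) for row in logits]
    choices = [rng.choices(range(4), weights=p)[0] for p in probs]
    return choices, probs

def policy_gradient_step(logits, choices, probs, reward, learning_rate=0.1):
    """Ascend reward * grad log p(choices); for softmax logits that gradient is one_hot - probs."""
    updated = []
    for row, choice, p in zip(logits, choices, probs):
        updated.append([
            value + learning_rate * reward * ((1.0 if j == choice else 0.0) - p[j])
            for j, value in enumerate(row)
        ])
    return updated

rng = random.Random(0)
logits = [[0.0] * 4 for _ in range(3)]  # 3 sites, uniform start
choices, probs = sample_motif(logits, rng)
new_logits = policy_gradient_step(logits, choices, probs, reward=1.0)
```

With a positive reward, the probability of each sampled base rises after the step; a negative reward would push it down.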

Step S204: according to the statistical result, obtaining the updated base probability distribution of each preset site in the motif by using the updated motif prediction model, determining the updated motif according to the updated base probability distribution, taking the updated motif as the initial motif, and returning to the step of calculating the initial motif by using the motif evaluation function until the motif evaluation function converges, so as to obtain the trained motif prediction model.

Here, the updated base probability distribution may refer to the output obtained after the statistical result is input into the updated motif prediction model, and the updated motif may refer to the motif determined according to the updated base probability distribution.

The trained motif prediction model may be a motif prediction model that stably outputs the same updated motif, and at this time, the parameters of the motif prediction model are stable, that is, the training process is completed.

Specifically, the updated base probability distribution may refer to the updated probability of four bases constituting DNA in each predetermined site, and the updated base probability distribution may still be represented in a matrix form, that is, the size of the matrix corresponding to the updated base probability distribution is also M × 4, and the sum of the element values of each row of elements should also be 1.
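The M × 4 representation can be checked and decoded with a short sketch. Taking the most probable base at each site and splicing the bases in site order follows the description of the motif prediction model elsewhere in this document; the A, C, G, T column order and the function names are assumptions.

```python
BASES = "ACGT"  # assumed column order

def is_valid_distribution(matrix, tol=1e-9):
    """Check that every row has 4 non-negative entries summing to 1."""
    return all(
        len(row) == 4 and min(row) >= 0 and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def decode_motif(matrix):
    """Pick the most probable base at each site and splice them in site order."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in matrix)

distribution = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.05, 0.05, 0.8, 0.1],
]
motif = decode_motif(distribution)
```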

In step S204, the updated motif prediction model is used to obtain the updated base probability distribution of each preset site from the statistical result, the updated motif is determined from that distribution and taken as the initial motif, and the step of calculating the initial motif with the motif evaluation function is repeated until the motif evaluation function converges, yielding the trained motif prediction model. Updating the motif prediction model iteratively and obtaining the corresponding updated base probability distribution in each round leads step by step to the optimal motif prediction model, avoids becoming trapped in a local optimum, and allows a more accurate DNA motif to be obtained even when only a small amount of data is input.
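The iteration can be sketched as a generic loop. The three callables `predict`, `evaluate` and `update` are hypothetical stand-ins for the motif prediction model, the motif evaluation function and the parameter update; the stopping rule (the same motif output for `patience` consecutive rounds) is one assumed reading of convergence, based on the statement that a trained model stably outputs the same updated motif.

```python
def train_until_stable(predict, evaluate, update, max_iters=1000, patience=3):
    """Repeat evaluate -> update -> predict until the predicted motif stops changing."""
    motif = predict()  # initial motif
    stable_rounds = 0
    for _ in range(max_iters):
        score = evaluate(motif)      # motif evaluation value
        update(motif, score)         # parameter update driven by the evaluation
        new_motif = predict()        # updated motif
        stable_rounds = stable_rounds + 1 if new_motif == motif else 0
        motif = new_motif            # the updated motif becomes the initial motif
        if stable_rounds >= patience:
            break
    return motif
```

The `max_iters` cap is a safeguard added for the sketch so that the loop always terminates.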

Step S205: according to the statistical result, obtaining the target base probability distribution of each preset site in the motif by using the trained motif prediction model, determining the target motif according to the target base probability distribution, and determining the target motif as the DNA motif prediction result.

Wherein, the target base probability distribution can refer to the optimal base probability distribution, and the target motif can refer to the optimal DNA motif.

Specifically, the target base probability distribution may refer to target probabilities of four bases constituting DNA in each predetermined site, and the target base probability distribution may still be represented in a matrix form, that is, the size of the matrix corresponding to the target base probability distribution is also M × 4, and the sum of the element values of each row of elements should also be 1.

In step S205, the trained motif prediction model is used to obtain the target base probability distribution of each preset site from the statistical result, and the target motif is determined from that distribution. The DNA motif prediction process and the training process of the motif prediction model thus proceed synchronously, which avoids repeated processing and ensures the efficiency of DNA motif prediction.

In this embodiment, the motif prediction model learns online, so its parameters can be adjusted adaptively for different DNA sequences to be processed, which improves the accuracy of its DNA motif prediction. The initial motif is evaluated with the motif evaluation function, which provides the gradient for updating the parameters of the motif prediction model, so the DNA motif prediction process and the training process of the motif prediction model can proceed synchronously, avoiding repeated processing and ensuring the efficiency of DNA motif prediction.

Referring to fig. 3, a schematic flow chart of a DNA motif prediction method based on artificial intelligence according to a second embodiment of the present invention is shown, in which the DNA motif may be of a fixed preset length or of an adjustable preset length.

When the DNA motif has a fixed preset length, the preset length is set by the implementer, and cannot be adjusted after the setting is completed, and the process of setting the preset length is described in the first embodiment, which is not described herein.

When the DNA motif has an adjustable preset length, the preset length is set by the implementer but can be adjusted according to the actual situation. The length adjustment process of the DNA motif comprises the following steps:

step S301, comparing the convergence value of the motif evaluation function with a preset threshold value;

step S302, if the convergence value is smaller than a preset threshold value, adjusting the number of preset positions;

step S303, the step of obtaining the base probability distribution of each preset position in the motif by using the motif prediction model according to the statistical result and obtaining the initial motif according to the base probability distribution is executed again.

The convergence value may be a function value obtained after the motif evaluation function converges, and the preset threshold may be used to represent whether the DNA motif has a better interpretation capability, for example, in this embodiment, the threshold is set to be 2.

The initial number of preset sites can be set by the implementer, with a suggested range of [8, 20]. Accordingly, the number can be adjusted by adding or removing one site at a time. It should be noted that the adjusted number of sites must also fall within the set range.

Optionally, adjusting the number of the predetermined loci includes:

acquiring the number of current sites and a preset site value range, and determining a bit reduction sampling probability and a bit increase sampling probability according to the number of the current sites and the site value range;

sampling is carried out according to the bit reduction sampling probability and the bit increase sampling probability, and the number of the sites is adjusted according to the sampling result.

The current number of sites may refer to the number of sites before adjustment, and the site value range may refer to a preset interval, for example [8, 20]. The bit-reduction sampling probability may refer to the probability that sampling results in removing one site, and the bit-increase sampling probability may refer to the probability that sampling results in adding one site.

Specifically, let the current number of sites be p and the preset site value range be [min, max]. The bit-reduction sampling probability can then be expressed as

The bit-increase sampling probability can be expressed as

It should be noted that, when the adjusted number of sites equals a number of sites that has already been used in training, the number of sites must be adjusted again until the adjusted number is one that has not yet been used in training.
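The adjustment procedure can be sketched as a bounded random walk over site counts. Because the probability expressions are not reproduced here, the bit-reduction probability is passed in as the parameter `p_decrease` rather than computed; that parameter, the resampling of out-of-range steps, the `max_steps` guard and the function names are assumptions. The threshold of 2 and the range [8, 20] are the example values given in the text.

```python
import random

def needs_adjustment(convergence_value, threshold=2.0):
    """The number of sites is adjusted only when the convergence value is below the threshold."""
    return convergence_value < threshold

def adjust_site_count(current, lo, hi, tried, p_decrease, rng, max_steps=10_000):
    """Add or remove one site at a time until reaching a count not yet used in training."""
    if all(n in tried for n in range(lo, hi + 1)):
        return None  # every count in the range has already been trained
    n = current
    for _ in range(max_steps):
        step = -1 if rng.random() < p_decrease else 1
        if not lo <= n + step <= hi:
            continue  # a step outside [lo, hi] is resampled (assumption)
        n += step
        if n not in tried:
            return n
    return None

rng = random.Random(0)
new_count = adjust_site_count(10, 8, 20, tried={9, 10, 11}, p_decrease=0.5, rng=rng)
```

Returning `None` when the range is exhausted lets the caller stop adjusting instead of looping forever.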

The method updates the number of the sites in a probability sampling mode, avoids the condition of invalid updating of the number of the sites, and improves the updating efficiency of the number of the sites.

In the embodiment, the number of the sites of the DNA motif is dynamically adjusted, so that the condition that the optimal DNA motif is difficult to obtain due to the fixed number of the sites is avoided, and the accuracy of the prediction of the DNA motif can be effectively improved.

Corresponding to the artificial intelligence-based DNA motif prediction method of the foregoing embodiments, fig. 4 shows a structural block diagram of an artificial intelligence-based DNA motif prediction apparatus provided in the third embodiment of the present invention. The DNA motif prediction apparatus is applied to a client; the computer device corresponding to the client is connected to a server and obtains the DNA sequences to be processed from the server. A motif prediction model is deployed at the client and may be used to predict the DNA motif corresponding to the DNA sequences to be processed. For convenience of description, only the parts related to the embodiment of the present invention are shown.

Referring to fig. 4, the DNA motif prediction apparatus includes:

the sequence counting module 41 is configured to obtain a DNA sequence to be processed, and count the number of each type of base in the DNA sequence to be processed to obtain a statistical result;

the motif prediction module 42 is used for obtaining the base probability distribution of each preset site in the motif by using the motif prediction model according to the statistical result and determining an initial motif according to the base probability distribution;

the model updating module 43 is configured to calculate the initial motif by using a motif evaluation function, calculate the gradient of the motif evaluation function according to the calculation result in combination with a policy gradient algorithm, and update the parameters of the motif prediction model according to the gradient to obtain an updated motif prediction model;

the iterative training module 44 is configured to obtain, according to the statistical result, the updated base probability distribution of each preset site in the motif by using the updated motif prediction model, determine an updated motif according to the updated base probability distribution, take the updated motif as the initial motif, and return to the step of calculating the initial motif by using the motif evaluation function until the motif evaluation function converges, so as to obtain a trained motif prediction model;

and the motif determining module 45 is used for obtaining the target base probability distribution of each preset position in the motif by using the trained motif prediction model according to the statistical result, determining the target motif according to the target base probability distribution, and determining the target motif as the DNA motif prediction result.

Optionally, the motif prediction module 42 includes:

the feature extraction unit is used for inputting the statistical result into the convolutional layer for feature extraction to obtain statistical features;

the feature mapping unit is used for inputting the statistical features into the fully-connected layer for feature mapping to obtain the base probability distribution of each preset site;

and the base splicing unit is used for determining, for the base probability distribution of any preset site, the base corresponding to the maximum probability in the base probability distribution as the initial base of the preset site, and splicing all the initial bases into an initial motif according to the site sequence.

Optionally, the DNA motif prediction apparatus further includes:

the threshold comparison module is used for comparing the convergence value of the motif evaluation function with a preset threshold;

the site number adjusting module is used for adjusting the number of the preset sites if the convergence value is smaller than the preset threshold value;

and the re-prediction module is used for re-executing the steps of obtaining the base probability distribution of each preset position in the motif by using the motif prediction model according to the statistical result and obtaining the initial motif according to the base probability distribution.

Optionally, the site number adjusting module includes:

the probability determination unit is used for acquiring the number of the current sites and a preset site value range, and determining the bit reduction sampling probability and the bit increase sampling probability according to the number of the current sites and the site value range;

and the probability sampling unit is used for sampling according to the bit reduction sampling probability and the bit increase sampling probability and adjusting the number of the sites according to the sampling result.

Optionally, the DNA motif prediction apparatus further includes:

the sequence comparison module is used for comparing the initial motif with each DNA sequence to be processed and determining an optimal matching segment according to a comparison result;

the site counting module is used for counting the number of bases of each site in the optimal matching fragment to obtain a base counting matrix of the corresponding site;

and the site evaluation module is used for calculating a motif evaluation function of the corresponding site according to the base counting matrix to obtain a site evaluation value and determining the sum of all the site evaluation values as a calculation result.

Optionally, the motif evaluation function is:

where α is the site evaluation value, N is the number of DNA sequences to be processed, M is the number of base types contained in the DNA sequences to be processed, x_i is the number of occurrences of the i-th base at the site, and q_i is the proportion of the i-th base in all DNA sequences to be processed.

Optionally, the DNA motif prediction apparatus further includes:

a sub-fragment composing module for composing a K-order sub-fragment with 2K adjacent sites adjacent to a site and the site aiming at any site, wherein K is an integer larger than zero;

the sub-fragment counting module is used for counting the base number of each K-order sub-fragment in the optimal matching fragment to obtain a second base counting matrix corresponding to the K-order sub-fragment;

accordingly, the locus evaluation module comprises:

the site evaluation unit is used for calculating a site evaluation value of the corresponding site according to the base counting matrix;

and the motif evaluation unit is used for calculating the sub-segment evaluation value corresponding to the K-order sub-segment according to the second base counting matrix, and determining the sum of all the locus evaluation values and all the sub-segment evaluation values as the motif evaluation value.

It should be noted that, because the above modules and units are based on the same concept as the method embodiments of the present invention, their specific functions and technical effects can be found in the method embodiments and are not described here again.

Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 5, the computer apparatus of this embodiment includes: at least one processor (only one shown in FIG. 5), a memory, and a computer program stored in the memory and executable on the at least one processor, the computer program when executed by the processor implementing the steps in any of the various DNA motif prediction method embodiments described above.

The computer device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that fig. 5 is merely an example of a computer device and is not intended to be limiting, and that a computer device may include more or fewer components than those shown, or some components may be combined, or different components may be included, such as a network interface, a display screen, and input devices, etc.

The Processor may be a CPU, or other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory includes readable storage medium, internal memory, etc., where the internal memory may be a memory of the computer device, and the internal memory provides an environment for the operating system and the execution of computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of the computer device, and in other embodiments may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used for storing an operating system, application programs, a BootLoader (BootLoader), data, and other programs, such as program codes of a computer program, and the like. The memory may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. For the specific working processes of the units and modules in the above-mentioned apparatus, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again. The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by a processor to implement the steps of the above method embodiments. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. 
The computer-readable medium may include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

The present invention can also be implemented by a computer program product, which when executed on a computer device causes the computer device to implement all or part of the processes in the method of the above embodiments.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the apparatus/computer device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A DNA motif prediction method based on artificial intelligence, characterized in that the method comprises:

according to the statistical result, obtaining updated base probability distribution of each preset position in the motif by using the updated motif prediction model, determining an updated motif according to the updated base probability distribution, taking the updated motif as an initial motif, and performing the step of calculating the initial motif by using a motif evaluation function again until the motif evaluation function is converged to obtain a trained motif prediction model;

2. The DNA motif prediction method of claim 1, wherein the motif prediction model includes a convolutional layer and a fully-connected layer;

the step of obtaining the base probability distribution of each preset position in the motif by using a motif prediction model according to the statistical result, and the step of determining the initial motif according to the base probability distribution comprises the following steps:

inputting the statistical characteristics into the full-connection layer to perform characteristic mapping to obtain base probability distribution of each preset locus;

and aiming at the base probability distribution of any preset site, determining a base corresponding to the maximum probability in the base probability distribution as an initial base of the preset site, and splicing all the initial bases into the initial motif according to the site sequence.

3. The DNA motif prediction method of claim 1, further comprising, after convergence of the motif evaluation function:

comparing the convergence value of the motif evaluation function with a preset threshold value;

and if the convergence value is smaller than the preset threshold value, adjusting the number of the preset sites, executing the step of obtaining the base probability distribution of each preset site in the motif by using the motif prediction model according to the statistical result again, and obtaining the initial motif according to the base probability distribution.

4. The method of claim 3, wherein adjusting the number of predetermined sites comprises:

sampling is carried out according to the bit reduction sampling probability and the bit increase sampling probability, and the number of the sites is adjusted according to a sampling result.

5. The DNA motif prediction method of any one of claims 1 to 4, wherein the calculating the initial motif using a motif evaluation function comprises:

comparing the initial motif with each DNA sequence to be processed, and determining an optimal matching segment according to a comparison result;

and calculating a motif evaluation function of the corresponding site according to the base counting matrix to obtain a site evaluation value, and determining the sum of all the site evaluation values as the calculation result.

6. The DNA motif prediction method of claim 5, wherein the motif evaluation function is:

wherein α is the evaluation value of the site, N is the number of DNA sequences to be processed, M is the number of base types contained in the DNA sequences to be processed, x_i is the number of occurrences of the i-th base at said site, and q_i is the proportion of the i-th base in all DNA sequences to be processed.

7. The method of predicting DNA motifs according to claim 5, wherein after said counting the number of bases at each site in said optimally matched segment to obtain a base count matrix of the corresponding site, further comprising:

aiming at any site, forming a K-order sub-fragment by using 2K adjacent sites adjacent to the site and the site, wherein K is an integer greater than zero;

performing base number statistics on each K-order sub-segment in the optimal matching segment to obtain a second base counting matrix corresponding to the K-order sub-segment;

correspondingly, the calculating a motif evaluation function of the corresponding site according to the base counting matrix to obtain a site evaluation value, and determining the sum of all the site evaluation values as the motif evaluation value includes:

calculating a locus evaluation value of a corresponding locus according to the base counting matrix;

and calculating the sub-fragment evaluation value corresponding to the K-order sub-fragment according to the second base counting matrix, and determining the sum of all the site evaluation values and all the sub-fragment evaluation values as the motif evaluation value.

8. A DNA motif prediction device based on artificial intelligence, characterized in that the DNA motif prediction device comprises:

the motif prediction module is used for obtaining the base probability distribution of each preset site in the motif by using a motif prediction model according to the statistical result and determining an initial motif according to the base probability distribution;

and the motif determining module is used for obtaining the target base probability distribution of each preset site in the motif by using the trained motif prediction model according to the statistical result, determining the target motif according to the target base probability distribution and determining the target motif as the DNA motif prediction result.

9. A computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, the computer program when executed by the processor implementing the DNA motif prediction method of any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the DNA motif prediction method of any one of claims 1 to 7.