CN113409891A

CN113409891A - Method, device, equipment and storage medium for predicting DNA6mA modification class

Info

Publication number: CN113409891A
Application number: CN202110606033.0A
Authority: CN
Inventors: 邹权; 张昊宇
Original assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Current assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-09-17
Anticipated expiration: 2041-05-25
Also published as: CN113409891B

Abstract

The application provides a method, a device, equipment and a storage medium for predicting a DNA6mA modification class. The method comprises the following steps: acquiring a DNA6mA feature data set; determining a similarity matrix between the sequences in the DNA6mA feature data set; carrying out logarithm processing on the similarity matrix to obtain a distance matrix between the sequences; carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement; and taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification class of the sequence to be predicted based on a support vector machine model. The method can thereby predict the DNA6mA modification class of a sequence.

Description

Method, device, equipment and storage medium for predicting DNA6mA modification class

Technical Field

The application relates to the technical field of bioinformatics, in particular to a method, a device, equipment and a storage medium for predicting a DNA6mA modification class.

Background

DNA methylation was one of the earliest epigenetic regulatory mechanisms discovered in humans. The most prominent DNA modification in mammals is 5mC (5-methylcytosine), which accounts for 3%-6% of the total cytosine in human DNA. In contrast, 5mC is rare in prokaryotes, where 6mA (N6-methyladenine) is the most representative DNA modification; it is mainly involved in restriction-modification systems, which protect the organism from invasion by foreign DNA. The 6mA modification was first found in bacteria in 1951. However, it has not received as much attention as 5mC. One important reason is that 6mA was long thought to be widespread only in prokaryotes and unicellular eukaryotes and rare in multicellular eukaryotes. In recent years, however, experimental methods have identified 6mA in eukaryotic organisms, including mammalian and plant genomes, and have shown that it plays an important role in growth, development and disease regulation. These studies have opened a new chapter in the study of epigenetic modification in eukaryotes. However, as data volumes grow and accuracy requirements rise, the high time consumption and high cost of experimental methods have become apparent, and computational methods have emerged. Machine learning-based prediction tools, including iDNA6mA-PseKNC and i6mA-Pred, continue to be developed, but few studies have used the distance between sequences as the main basis for classification prediction. Therefore, it is necessary to investigate how to classify DNA6mA by sequence distance.

Disclosure of Invention

The application provides a method, a device and a storage medium for predicting a DNA6mA modification class, which can predict a DNA6mA modification class of a sequence.

The first aspect of the embodiments of the present application provides a method for predicting a modification category of DNA6mA, including:

acquiring a DNA6mA characteristic data set;

determining a similarity matrix between each sequence in the DNA6mA feature data set;

carrying out logarithm processing on the similarity matrix to obtain a distance matrix between the sequences;

carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement;

and taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the modification category of the DNA6mA based on a support vector machine model.

Optionally, determining a similarity matrix between the sequences in the DNA6mA feature dataset comprises:

and obtaining a similarity matrix between all sequences in the DNA6mA characteristic data set based on a double sequence alignment model of a suffix tree.

Optionally, obtaining a similarity matrix between sequences in the DNA6mA feature data set based on a suffix tree-based two-sequence alignment model, including:

constructing a first input sequence as a first suffix tree;

obtaining a second input sequence which is compared with the first input sequence;

determining common substrings of the first input sequence and the second input sequence by adopting an LCS model based on the first suffix tree and the second input sequence;

based on a preset qualification standard, rejecting unqualified substrings from the common substrings;

adopting a Needleman-Wunsch model to compare unmatched substrings in the first input sequence and the second input sequence, and forming a comparison result sequence based on a comparison result;

and determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.

Optionally, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.

In a second aspect of the embodiments of the present application, there is provided an apparatus for predicting a modification category of DNA6mA, including:

a first acquisition unit for acquiring a DNA6mA feature data set;

a first determining unit, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;

the logarithm processing unit is used for carrying out logarithm processing on the similarity matrix to obtain a distance matrix between the sequences;

the Gaussian processing unit is used for carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement;

and the prediction unit is used for taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of the support vector machine and predicting the modification category of the DNA6mA based on the support vector machine model.

Optionally, the first determining unit includes:

the first determining subunit is used for obtaining a similarity matrix between each sequence in the DNA6mA characteristic data set based on a double sequence alignment model of a suffix tree.

Optionally, the first determining unit includes:

a first construction subunit for constructing the first input sequence as a first suffix tree;

a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;

a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;

the first removing unit is used for removing unqualified substrings from the common substrings based on a preset qualification standard;

the first comparison unit is used for comparing the unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model and forming a comparison result sequence based on the comparison result;

and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.

A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the present application when executing the computer program.

By adopting the method for predicting the DNA6mA modification class provided by the embodiments of the application, the prediction of the modification class of the DNA6mA is realized.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for predicting the modification category of DNA6mA provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of the data file types supported by the method for predicting the modification category of DNA6mA provided in an embodiment of the present application;

FIG. 3 is a schematic diagram comparing the prediction performance of different methods on the M. musculus data set for the method for predicting the modification category of DNA6mA provided in an embodiment of the present application;

FIG. 4 is a schematic diagram comparing the prediction performance of different methods on the Rice data set for the method for predicting the modification category of DNA6mA provided in an embodiment of the present application;

FIG. 5 is a schematic diagram comparing the prediction performance of different methods on the Cross data set for the method for predicting the modification category of DNA6mA provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a prediction apparatus for the DNA6mA modification category provided in an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

Referring to fig. 1, a flow chart of a method for predicting DNA6mA modification class according to the present application is shown.

As shown in fig. 1, the method comprises the steps of:

S101, acquiring a DNA6mA feature data set.

In some alternative embodiments, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.

In some alternative embodiments, there are three DNA6mA sequence data files: DNA6mA M. musculus (1934 positive DNA6mA sequences and 1934 negative non-DNA6mA sequences), DNA6mA Rice (880 positive DNA6mA sequences and 880 negative non-DNA6mA sequences), and DNA6mA Cross (2768 positive DNA6mA sequences and 2716 negative non-DNA6mA sequences).

In some alternative embodiments, the downloaded DNA6mA sequence data file needs to be checked for format and content before the raw DNA6mA feature data set to be processed is obtained. The format check is as follows: when a line read from the DNA6mA sequence data file begins with the character ">", the data on the next line is taken as sequence text data. The content check is as follows: it is determined whether the sequence text data read consists only of the four letters 'A', 'T', 'C' and 'G'; if not, a prompt is issued that the input text may contain only the letters 'A', 'T', 'C' and 'G'. A raw data set that meets these requirements is shown in fig. 2.
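The format and content checks described above can be sketched as follows. This is a minimal illustration only; the function names `read_sequences` and `is_valid_dna` are hypothetical and do not come from the application.

```python
VALID_BASES = set("ATCG")


def read_sequences(lines):
    """Format check: the line after each '>' header line is taken as sequence text."""
    sequences = []
    expect_sequence = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            expect_sequence = True
        elif expect_sequence and line:
            sequences.append(line)
            expect_sequence = False
    return sequences


def is_valid_dna(seq):
    """Content check: the sequence must consist only of the letters A, T, C and G."""
    return bool(seq) and set(seq) <= VALID_BASES


records = read_sequences([">seq_a", "ATCGGCTA", ">seq_b", "ATNGCTAA"])
for seq in records:
    if not is_valid_dna(seq):
        print("Input text may contain only the letters A, T, C and G:", seq)
```

In this sketch, lowercase letters are treated as invalid because the text names only the uppercase letters; a real implementation might choose to normalise case first.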

S102, determining a similarity matrix among all sequences in the DNA6mA characteristic data set.

In some alternative embodiments, the similarity matrix between the sequences in the DNA6mA feature data set is obtained with a suffix-tree-based pairwise sequence alignment model. The method specifically comprises the following steps:

a, constructing a first input sequence seq1 as a first suffix tree tree1;

b, obtaining a second input sequence seq2 to be aligned with the first input sequence seq1;

c, determining the common substrings of the first input sequence seq1 and the second input sequence seq2 by using an LCS model based on the first suffix tree tree1 and the second input sequence seq2;

d, based on a preset qualification standard, rejecting unqualified substrings from the common substrings; the preset qualification standard is that two mutually matched common substrings must not be far apart, that is, the difference between their starting positions is less than or equal to the length of the substring;

e, adopting a Needleman-Wunsch model to align the unmatched substrings in the first input sequence seq1 and the second input sequence seq2, and forming an alignment result sequence based on the alignment result;

f, determining the similarity between the first input sequence seq1 and the second input sequence seq2 based on the length of the common substrings and the length of the alignment result sequence, with the calculation formula as follows:
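The similarity formula of step f does not appear in this text, so it is not sketched here. Steps d and e can be illustrated as follows, as a minimal sketch only: `is_qualified` encodes the stated qualification standard, and `needleman_wunsch` is a textbook global alignment. The scoring values (match 1, mismatch -1, gap -1) and both function names are assumptions, not values given in the application.

```python
def is_qualified(start1, start2, length):
    """Step d: keep a common substring only if its start positions differ by at most its length."""
    return abs(start1 - start2) <= length


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Step e: global alignment of two strings; scoring values are assumed, not from the source."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned1, aligned2, total = needleman_wunsch("ACGT", "AGT")
```

The two returned strings have equal length, which is the length of the alignment result sequence that step f refers to.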

S103, carrying out logarithm processing on the similarity matrix to obtain a distance matrix between the sequences.

In some optional embodiments, logarithm processing is performed on the similarity matrix to obtain the distance matrix between the sequences, and the calculation formula of the distance matrix is as follows:

D_{12} = -\log(S_{12})

wherein S_{12} represents the similarity between seq1 and seq2, and D_{12} represents the distance between seq1 and seq2.
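The logarithm step can be sketched element by element as below. This is a minimal illustration; the function name `similarity_to_distance` is hypothetical, and the sketch assumes every similarity lies in the interval (0, 1], so that the resulting distances are non-negative.

```python
import math


def similarity_to_distance(similarity):
    """Apply D_ij = -log(S_ij) to every entry of a similarity matrix (list of rows)."""
    distance = []
    for row in similarity:
        for s in row:
            if s <= 0:
                raise ValueError("similarity must be positive to take its logarithm")
        distance.append([-math.log(s) for s in row])
    return distance


S = [[1.0, 0.5],
     [0.5, 1.0]]
D = similarity_to_distance(S)
```

Identical sequences (similarity 1) map to distance 0, and lower similarity maps to a larger distance.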

S104, carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement.

In some optional embodiments, Gaussian processing is performed on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement, and the calculation formula is as follows:

wherein D_ij denotes the distance between sequence i and sequence j, α is a Gaussian constant, and G_ij is the value in the ith row and jth column of the distance matrix meeting the positive definiteness requirement.
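The exact Gaussian formula does not appear in this text. The sketch below therefore assumes one common Gaussian form, G_ij = exp(-α · D_ij²); this form, the default value of `alpha`, and the function name `gaussian_transform` are all assumptions and may differ from what the application uses.

```python
import math


def gaussian_transform(distance, alpha=1.0):
    """ASSUMED form: G_ij = exp(-alpha * D_ij ** 2). The source does not give the exact formula."""
    return [[math.exp(-alpha * d * d) for d in row] for row in distance]


D = [[0.0, 1.0],
     [1.0, 0.0]]
G = gaussian_transform(D, alpha=0.5)
```

Under this assumed form, zero distance maps to 1 and larger distances decay towards 0. Whether the resulting matrix is actually positive definite depends on the distances and on α, so checking its eigenvalues before passing it to a support vector machine would be a reasonable sanity check.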

S105, taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the modification category of the DNA6mA based on the support vector machine model.

In some optional embodiments, a support vector machine algorithm is adopted, and the distance matrix meeting the positive definiteness requirement is used as the custom kernel matrix of the support vector machine to perform classification prediction on the DNA6mA modification, wherein the algorithm flow comprises the following steps:

S51, constructing the Lagrangian function:

S52, calculating the partial derivatives with respect to w and b and setting them equal to 0:

S53, substituting back into the original function to obtain the dual problem of the primal problem:

α_i ≥ 0, i = 1, 2, …, l;

S54, solving the dual problem to obtain α and w, and further obtaining the equation of the classification hyperplane:

f(x_i) = sgn(w^T x_i + b);

S55, performing classification prediction on the DNA6mA data according to the equation: f(x_i) > 0 indicates a positive example, and f(x_i) < 0 indicates a negative example.
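The formulas for steps S51 to S53 do not appear in this text. As a reference sketch only, the standard hard-margin support vector machine derivation, which is consistent with the constraint α_i ≥ 0 shown for the dual problem, would read as follows. The label symbol y_i ∈ {+1, -1} and the kernel notation K are assumptions introduced here; with the custom kernel of S104, K(x_i, x_j) would be the entry G_ij. The application may instead use a soft-margin formulation.

```latex
% S51: Lagrangian of the hard-margin problem (y_i is an assumed label symbol)
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{l} \alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 \right]

% S52: set the partial derivatives with respect to w and b to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i ,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0

% S53: dual problem, with the inner product replaced by a kernel K
\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \;\; \alpha_i \ge 0, \;\; i = 1, 2, \dots, l
```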

Wherein w and b are the parameters of the classification hyperplane, α is the parameter vector of the constructed Lagrangian function, L is the constructed Lagrangian function, and f is the classification hyperplane equation.
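When a custom kernel matrix is used, w is not formed explicitly; the decision value for a new sequence is instead computed from its kernel values against the training sequences. The sketch below illustrates only this prediction step, under the assumption that the multipliers and bias have already been obtained by solving the dual problem. The names `decision_value` and `predict`, and the example numbers, are hypothetical.

```python
def decision_value(kernel_row, alphas, labels, bias):
    """Compute sum_i alpha_i * y_i * K(x_i, x) + b for one sequence x.

    kernel_row[i] is the kernel value between training sequence i and x.
    """
    return sum(a * y * k for a, y, k in zip(alphas, labels, kernel_row)) + bias


def predict(kernel_row, alphas, labels, bias):
    """Return +1 (positive example) if the decision value is > 0, otherwise -1."""
    return 1 if decision_value(kernel_row, alphas, labels, bias) > 0 else -1


# Hypothetical trained values for two training sequences (one positive, one negative).
alphas = [0.5, 0.5]
labels = [1, -1]
bias = 0.0

# A new sequence that is close to the positive training sequence.
print(predict([0.9, 0.1], alphas, labels, bias))
```

A decision value of exactly 0 is assigned to the negative class here; that tie-breaking choice is arbitrary and is not specified in the text.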

In the embodiment of the present invention, the indexes for evaluating the classification effect include SE (sensitivity), SP (specificity), ACC (accuracy), MCC (Matthews correlation coefficient) and F1, and the calculation formulas thereof are as follows:

wherein TP represents the number of DNA6mA sequences predicted correctly, TN represents the number of non-DNA6mA sequences predicted correctly, FP represents the number of non-DNA6mA sequences incorrectly predicted as DNA6mA, and FN represents the number of DNA6mA sequences incorrectly predicted as non-DNA6mA.
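The calculation formulas do not appear in this text. The sketch below uses the standard definitions of these five indexes, on the assumption that the application follows them; the function name `evaluate` and the example counts are hypothetical.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Standard definitions of SE, SP, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity: true positive rate
    sp = tn / (tn + fp)                      # specificity: true negative rate
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts, not results from the application.
metrics = evaluate(tp=90, tn=80, fp=20, fn=10)
```

MCC is set to 0 when its denominator is 0, which is a common convention rather than something stated in the text.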

The predicted effect of the present invention is further described below in a set of specific experimental examples.

For comparison with the published results of existing state-of-the-art prediction algorithms, the same data sets and the same evaluation indexes (namely SE, SP, ACC and MCC) are used.

We first compare the prediction results of the present invention on the M. musculus data set with those of existing machine learning methods, as shown in fig. 3. As can be seen from fig. 3, the present invention achieves higher classification accuracy. On the M. musculus data set, the distance-based support vector machine classifier obtains a classification accuracy (ACC) of 0.982, which is higher than the ACC of 0.966 obtained by csDMA and the ACC obtained by iLM-CNN, showing that the prediction accuracy on the M. musculus data is effectively improved. Meanwhile, the distance-based support vector machine classifier also obtains the highest MCC value of 0.982 and F1 value of 0.982, indicating that its prediction accuracy remains high even when processing unbalanced data sets.

The prediction results of the present invention on the Rice data set are then compared with those of existing machine learning methods, as shown in fig. 4. As can be seen from fig. 4, the present invention achieves higher classification accuracy. On the Rice data set, the distance-based support vector machine classifier obtains a classification accuracy (ACC) of 0.943, which is higher than the ACC of 0.861 obtained by csDMA and the ACC of 0.875 obtained by iLM-CNN, showing that the prediction accuracy on the Rice data is effectively improved. Meanwhile, the distance-based support vector machine classifier also obtains the highest MCC value of 0.944 and the highest F1 value of 0.942, which shows that its prediction accuracy remains high even when processing unbalanced data sets and provides a new approach to processing unbalanced Rice data.

Finally, the results of the present invention on the Cross data set are compared with those of existing machine learning methods, as shown in fig. 5. As can be seen from fig. 5, the present invention achieves higher classification accuracy. On the Cross data set, the distance-based support vector machine classifier obtains an MCC value of 0.838, which is far higher than the 0.603 obtained by csDMA and the 0.651 obtained by iLM-CNN, showing that the prediction accuracy of the invention on the unbalanced Cross data set is significantly improved, which is of great benefit to research on this data set. Meanwhile, the distance-based support vector machine classifier also obtains the highest F1 value of 0.84, which shows that the invention handles this data in a well-balanced way and provides a reference for research on the Cross data.

The invention has the beneficial effects that:

(1) The invention provides a brand-new DNA6mA prediction method, which uses the distance between DNA6mA sequences to classify and predict the sequences and provides support for corresponding theoretical research.

(2) When the support vector machine algorithm is applied, the invention adopts a custom kernel matrix, thereby effectively improving the processing efficiency.

(3) According to the invention, the similarity matrix between DNA6mA sequences is converted into a positive definite distance matrix, from which a support vector machine classifier is constructed, and the prediction effect for DNA6mA is improved.

Based on the same inventive concept, one embodiment of the present application provides a device for predicting the modification category of DNA6mA. Referring to fig. 6, fig. 6 is a schematic diagram of a prediction apparatus for the DNA6mA modification class according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:

a first obtaining unit 601, configured to obtain a DNA6mA feature data set;

a first determining unit 602, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;

a logarithm processing unit 603, configured to perform logarithm processing on the similarity matrix to obtain a distance matrix between the sequences;

a Gaussian processing unit 604, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement;

and the prediction unit 605 is configured to use the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of the support vector machine, and predict the DNA6mA modification category based on the support vector machine model.

Optionally, the first determining unit includes:

Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.

Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for the relevant points, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method, device, equipment and storage medium for predicting the DNA6mA modification class provided by the application are described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above examples is only intended to help in understanding the method and core idea of the application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of the present specification should not be construed as a limitation on the present application.

Claims

1. A method for predicting a modification category of DNA6mA, comprising:

acquiring a DNA6mA characteristic data set;

2. The method of predicting according to claim 1, wherein determining a similarity matrix between sequences in the DNA6mA signature dataset comprises:

3. The prediction method of claim 2, wherein the obtaining of the similarity matrix between the sequences in the DNA6mA feature data set based on a suffix tree-based two-sequence alignment model comprises:

constructing a first input sequence as a first suffix tree;

4. The prediction method of claim 1, wherein the DNA6mA signature dataset comprises a positive case dataset and a negative case dataset, the positive case dataset being a DNA6mA sequence and the negative case dataset being a non-DNA 6mA sequence.

5. A device for predicting a modification type of DNA6mA, comprising:

a first acquisition unit for acquiring a DNA6mA feature data set;

6. The prediction apparatus according to claim 5, wherein the first determination unit includes:

7. The prediction apparatus of claim 6, wherein the first determination unit comprises:

8. The prediction apparatus of claim 5, wherein the DNA6mA feature data set comprises a positive example data set and a negative example data set, the positive example data set consisting of DNA6mA sequences and the negative example data set consisting of non-DNA6mA sequences.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executed implements the steps of the method according to any of claims 1-4.