CN113409891B

CN113409891B - Method, device, equipment and storage medium for predicting DNA6mA modification class

Info

Publication number: CN113409891B
Application number: CN202110606033.0A
Authority: CN
Inventors: 邹权; 张昊宇
Original assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Current assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2023-02-03
Anticipated expiration: 2041-05-25
Also published as: CN113409891A

Abstract

The application provides a method, a device, equipment and a storage medium for predicting a DNA6mA modification class. The method comprises the following steps: obtaining a DNA6mA feature data set; determining a similarity matrix among the sequences in the DNA6mA feature data set; performing logarithm processing on the similarity matrix to obtain a distance matrix among the sequences; performing Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive-definiteness requirement; and taking the distance matrix meeting the positive-definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification class of a sequence to be predicted based on the support vector machine model. The method enables prediction of the DNA6mA modification class of a sequence.

Description

Method, device, equipment and storage medium for predicting DNA6mA modification class

Technical Field

The application relates to the technical field of bioinformatics, in particular to a method, a device, equipment and a storage medium for predicting a DNA6mA modification category.

Background

DNA methylation was one of the earliest epigenetic regulatory mechanisms discovered in humans. The most prominent DNA modification in mammals is 5mC (5-methylcytosine), which accounts for 3%-6% of the total cytosine in human DNA. In contrast, 5mC is rare in prokaryotes, where 6mA (N6-methyladenine) is the most representative DNA modification; it is mainly involved in restriction-modification systems that protect the organism from invasion by foreign DNA. The 6mA modification was first found in bacteria in 1951, but it has received less attention than 5mC. One important reason is that 6mA was long thought to be widespread only in prokaryotes and unicellular eukaryotes and rare in multicellular eukaryotes. In recent years, however, experimental methods have identified 6mA in eukaryotic genomes, including those of mammals and plants, and 6mA has been found to play an important role in growth, development and disease regulation. These studies have opened a new chapter in the study of epigenetic modification in eukaryotes. As data volumes grow and accuracy requirements rise, the high time consumption and high cost of experimental methods have become apparent, and computational methods have emerged. Machine-learning-based prediction tools such as iDNA6mA-PseKNC and i6mA-Pred continue to be developed, but few studies use the distance between sequences as the main basis for classification prediction. It is therefore necessary to study how to classify DNA6mA by sequence distance.

Disclosure of Invention

The application provides a method, a device and a storage medium for predicting a DNA6mA modification class, which can predict a DNA6mA modification class of a sequence.

In a first aspect, the embodiments of the present application provide a method for predicting a DNA6mA modification category, including:

obtaining a DNA6mA characteristic data set;

determining a similarity matrix among all sequences in the DNA6mA characteristic data set;

carrying out logarithm processing on the similarity matrix to obtain a distance matrix among all the sequences;

carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive-definiteness requirement;

and taking the distance matrix meeting the positive-definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification category based on the support vector machine model.

Optionally, determining a similarity matrix between the sequences in the DNA6mA feature dataset comprises:

and obtaining a similarity matrix among all sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model.

Optionally, obtaining a similarity matrix between sequences in the DNA6mA feature data set based on the suffix-tree-based pairwise sequence alignment model includes:

constructing the first input sequence as a first suffix tree;

obtaining a second input sequence which is compared with the first input sequence;

determining common substrings of the first input sequence and the second input sequence by adopting an LCS model based on the first suffix tree and the second input sequence;

based on a preset qualification standard, rejecting unqualified substrings from the common substrings;

adopting a Needleman-Wunsch model to compare unmatched substrings in the first input sequence and the second input sequence, and forming a comparison result sequence based on a comparison result;

and determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.

Optionally, the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set is a DNA6mA sequence, and the negative case data set is a non-DNA 6mA sequence.

In a second aspect of the embodiments of the present application, there is provided an apparatus for predicting a DNA6mA modification class, including:

the first acquisition unit is used for acquiring a DNA6mA characteristic data set;

the first determination unit is used for determining a similarity matrix among sequences in the DNA6mA characteristic data set;

the logarithm processing unit is used for carrying out logarithm processing on the similarity matrix to obtain a distance matrix among all the sequences;

the Gaussian processing unit is used for carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive-definiteness requirement;

and the prediction unit is used for taking the distance matrix meeting the positive-definiteness requirement as a custom kernel matrix of the support vector machine and predicting the DNA6mA modification category based on the support vector machine model.

Optionally, the first determining unit includes:

and the first determining subunit is used for obtaining a similarity matrix among sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model.

Optionally, the first determining unit includes:

a first construction subunit for constructing the first input sequence as a first suffix tree;

a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;

a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;

the first removing unit is used for removing unqualified substrings from the common substrings based on a preset qualification standard;

the first comparison unit is used for comparing unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model and forming a comparison result sequence based on a comparison result;

and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.

A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the present application when executed.

By adopting the method for predicting the DNA6mA modification category provided by the embodiment of the application, the prediction of the DNA6mA modification category is realized.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for predicting the 6mA modification class of DNA provided in the examples of the present application.

FIG. 2 is a schematic diagram of data file types supported by a method for predicting a DNA6mA modification category provided in an embodiment of the present application.

FIG. 3 is a schematic diagram comparing prediction performance on the M. musculus data set, for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.

FIG. 4 is a schematic diagram comparing prediction performance on the Rice data set, for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.

FIG. 5 is a schematic diagram comparing prediction performance on the Cross data set, for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.

FIG. 6 is a schematic structural diagram of a prediction device for a DNA6mA modification category provided in an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

Referring to FIG. 1, a flow chart of a method for predicting a DNA6mA modification class according to the present application is shown. As shown in FIG. 1, the method comprises the following steps:

S101, obtaining a DNA6mA feature data set.

In some alternative embodiments, the DNA6mA feature data set comprises a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.

In some alternative embodiments, there are 3 DNA6mA sequence data files in total: DNA6mA M. musculus (1934 positive example DNA6mA sequences and 1934 negative example non-DNA6mA sequences), DNA6mA Rice (880 positive example DNA6mA sequences and 880 negative example non-DNA6mA sequences), and DNA6mA Cross (2768 positive example DNA6mA sequences and 2716 negative example non-DNA6mA sequences).

In some alternative embodiments, the downloaded DNA6mA sequence data file undergoes format judgment and content judgment before the raw DNA6mA feature data set to be processed is acquired. The format judgment is as follows: when a line read from the DNA6mA sequence data file begins with the character ">", the next line is taken as sequence text data. The content judgment is as follows: it is checked whether the sequence text data read consists only of the four letters 'A', 'T', 'C' and 'G'; if any other character is present, a prompt indicates that the input text may contain only the letters 'A', 'T', 'C' and 'G'. A raw data set that satisfies these requirements is shown in FIG. 2.
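The format judgment and content judgment described above can be sketched as follows. This is a minimal illustration only: the function names `parse_sequence_text` and `validate_sequence` are hypothetical and do not come from the application, and the error handling is an assumption about how the prompt might be raised.

```python
from typing import List, Tuple

VALID_BASES = set("ATCG")


def parse_sequence_text(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-like text.

    Format judgment: a line starting with ">" is a header and the line
    that follows it is taken as the sequence text data.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            raise ValueError(f"expected a '>' header line, got {lines[i]!r}")
        if i + 1 >= len(lines) or lines[i + 1].startswith(">"):
            raise ValueError(f"header {lines[i]!r} has no sequence line")
        records.append((lines[i][1:], lines[i + 1]))
        i += 2
    return records


def validate_sequence(seq: str) -> None:
    """Content judgment: raise ValueError unless seq uses only A, T, C, G."""
    bad = set(seq) - VALID_BASES
    if bad:
        raise ValueError(
            "input text may contain only the letters 'A', 'T', 'C' and 'G'; "
            f"found {sorted(bad)}"
        )
```

A caller would typically run `validate_sequence` on every sequence returned by `parse_sequence_text` and show the error message to the user when it fails.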

S102, determining a similarity matrix among all sequences in the DNA6mA characteristic data set.

In some alternative embodiments, a similarity matrix between the sequences in the DNA6mA feature data set is obtained based on a suffix-tree-based pairwise sequence alignment model. The method specifically comprises the following steps:

a, constructing a first input sequence seq1 as a first suffix tree tree1;

b, acquiring a second input sequence seq2 to be aligned with the first input sequence seq1;

c, determining the common substrings of the first input sequence seq1 and the second input sequence seq2 by adopting an LCS model based on the first suffix tree tree1 and the second input sequence seq2;

d, based on a preset qualification standard, rejecting unqualified substrings from the common substrings; the preset qualification standard is that two mutually matched common substrings must not be far apart, that is, the difference between their starting positions is less than or equal to the length of the substring;

e, adopting a Needleman-Wunsch model to align the unmatched substrings in the first input sequence seq1 and the second input sequence seq2, and forming an alignment result sequence based on the alignment result;

f, determining the similarity between the first input sequence seq1 and the second input sequence seq2 by a similarity calculation formula based on the length of the common substrings and the length of the alignment result sequence.
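Steps a to f can be sketched as below, under several loudly stated assumptions. The standard-library `difflib.SequenceMatcher` is used as a stand-in for the suffix-tree LCS search (it is not a suffix tree); the Needleman-Wunsch scores (match +1, mismatch -1, gap -1) are assumed; and, because the similarity formula is not reproduced in this text, the ratio "total common substring length divided by total alignment length" is only one plausible reading, not the formula of the application. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def nw_alignment_length(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Number of columns in a Needleman-Wunsch global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting alignment columns.
    i, j, columns = n, m, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return columns


def similarity(seq1: str, seq2: str) -> float:
    """Assumed similarity: common substring length / alignment length."""
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    blocks = matcher.get_matching_blocks()
    # Step d: keep a block only if its start positions differ by at most its length.
    kept = [blk for blk in blocks
            if blk.size > 0 and abs(blk.a - blk.b) <= blk.size]
    common_len = sum(blk.size for blk in kept)
    aligned_len = common_len
    prev_a = prev_b = 0
    # Step e: align the unmatched stretches between (and after) kept blocks.
    for blk in kept + [None]:
        end_a = blk.a if blk else len(seq1)
        end_b = blk.b if blk else len(seq2)
        aligned_len += nw_alignment_length(seq1[prev_a:end_a], seq2[prev_b:end_b])
        if blk:
            prev_a, prev_b = blk.a + blk.size, blk.b + blk.size
    return common_len / aligned_len if aligned_len else 1.0
```

Calling `similarity` on every pair of sequences fills the similarity matrix of step S102; identical sequences score 1.0 under this assumed ratio.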

S103, carrying out logarithm processing on the similarity matrix to obtain a distance matrix among the sequences.

In some optional embodiments, logarithm processing is applied to the similarity matrix to obtain a distance matrix between the sequences; the calculation formula takes the similarity between seq1 and seq2 and yields the distance between seq1 and seq2.
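The logarithm processing of step S103 might look like the sketch below. The exact formula is not reproduced in this text, so the negative natural logarithm `d = -ln(s)` is an assumption, chosen because it maps a similarity of 1 to a distance of 0 and smaller similarities to larger distances. The function name and the `eps` floor are hypothetical.

```python
import math


def similarity_to_distance(sim_matrix, eps=1e-12):
    """Map similarities in (0, 1] to distances with d = -ln(s) (assumed form).

    eps guards against log(0) when two sequences share nothing.
    """
    return [[-math.log(max(s, eps)) for s in row] for row in sim_matrix]
```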

S104, carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive-definiteness requirement.

In some optional embodiments, Gaussian processing is applied to the distance matrix to obtain a distance matrix satisfying the positive-definiteness requirement; the calculation formula takes the distance between sequence i and sequence j together with a constant of the Gaussian transformation, and yields the value in the i-th row and j-th column of the distance matrix satisfying the positive-definiteness requirement.
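A sketch of the Gaussian processing of step S104 follows. The formula itself is not reproduced in this text, so the common Gaussian form `exp(-d^2 / (2 * sigma^2))` is an assumption, with `sigma` standing in for the constant of the Gaussian transformation. Since a Gaussian transform of a general alignment-derived distance is not guaranteed in theory to be positive definite, the sketch also includes a hypothetical check based on attempting a Cholesky factorisation.

```python
import math


def gaussian_transform(dist_matrix, sigma=1.0):
    """Apply exp(-d^2 / (2 * sigma^2)) to every entry (assumed form)."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row]
            for row in dist_matrix]


def is_positive_definite(matrix, tol=1e-10):
    """Try a Cholesky factorisation of a symmetric matrix.

    Only the lower triangle is read, so the caller is assumed to pass a
    symmetric matrix. Returns False as soon as a pivot is not positive.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True
```

If the check fails for a given data set, trying a different value of the transformation constant is one possible remedy; this is a suggestion, not a step stated in the application.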

S105, taking the distance matrix meeting the positive-definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification category based on the support vector machine model.
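To illustrate how a precomputed kernel matrix can drive a support vector machine, the sketch below trains on the kernel values alone, never on feature vectors. It is a deliberate simplification and not the solver of the application: it uses dual coordinate ascent on a soft-margin SVM without a bias term, and the names `train_kernel_svm`, `predict`, `c` and `epochs` are hypothetical. In practice a library that accepts a precomputed kernel (for example scikit-learn's `SVC` with `kernel="precomputed"`) would normally be used instead.

```python
def train_kernel_svm(kernel, labels, c=10.0, epochs=200):
    """Dual coordinate ascent for a bias-free soft-margin SVM.

    kernel: n x n precomputed kernel matrix with positive diagonal.
    labels: list of +1 (positive example) or -1 (negative example).
    Returns the dual coefficients alpha, one per training sequence.
    """
    n = len(labels)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Decision value of training sample i under the current alpha.
            f_i = sum(alpha[j] * labels[j] * kernel[i][j] for j in range(n))
            # Exact maximiser along coordinate i, clipped to the box [0, c].
            step = (1.0 - labels[i] * f_i) / kernel[i][i]
            alpha[i] = min(max(alpha[i] + step, 0.0), c)
    return alpha


def predict(alpha, labels, kernel_row):
    """Classify one query sequence.

    kernel_row[j] is the kernel value between the query and training
    sample j. Returns +1 or -1 according to the sign of the score.
    """
    score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_row))
    return 1 if score > 0 else -1
```

For a sequence to be predicted, the kernel row is obtained by running the same similarity, logarithm and Gaussian steps between that sequence and every training sequence.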

In some optional embodiments, a support vector machine algorithm is adopted, the distance matrix is used as the custom kernel matrix of the support vector machine, and classification prediction is performed on the DNA6mA modification. The algorithm flow comprises the following steps:

S51, constructing a Lagrangian function L;

S52, calculating the partial derivatives of L with respect to the parameters of the classification hyperplane and setting them equal to 0;

S53, substituting the results into the original function to obtain the dual problem of the original problem;

S54, solving the dual problem and, from its solution, obtaining the equation of the classification hyperplane;

S55, classifying and predicting the DNA6mA data according to the equation, each sequence being assigned to the positive examples or the negative examples according to the value of the classification hyperplane equation.

Wherein L is the constructed Lagrangian function, which has its own parameter vector in addition to the parameter vectors of the classification hyperplane.
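The formulas of steps S51 to S55 are not reproduced in this text. As a hedged illustration, the textbook hard-margin derivation below matches the roles the description assigns to the Lagrangian function L, the hyperplane parameters and the Lagrangian parameter vector. The symbols w, b, alpha_i, x_i, y_i (label +1 or -1) and K_ij (entry of the custom kernel matrix) are assumptions, and the application may use a different variant, for example a soft-margin form.

```latex
% S51: Lagrangian function
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (w^{\top} x_i + b) - 1 \bigr],
  \qquad \alpha_i \ge 0

% S52: partial derivatives set to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% S53: dual problem, with x_i^{\top} x_j replaced by the kernel entry K_{ij}
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_{ij}
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; \alpha_i \ge 0

% S54: classification hyperplane for a query x, with K(x_i, x) the kernel value
f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b

% S55: decision rule
f(x) > 0 \;\Rightarrow\; \text{positive example}, \qquad
f(x) < 0 \;\Rightarrow\; \text{negative example}
```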

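In the embodiment of the invention, the indexes for evaluating the classification effect comprise SE, SP, ACC, MCC and F1, calculated as follows:

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad F1 = \frac{2TP}{2TP + FP + FN}$$

wherein TP represents the number of DNA6mA sequences correctly predicted as DNA6mA, TN represents the number of non-DNA6mA sequences correctly predicted as non-DNA6mA, FP represents the number of non-DNA6mA sequences wrongly predicted as DNA6mA, and FN represents the number of DNA6mA sequences wrongly predicted as non-DNA6mA.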
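The five indexes can be computed from the four confusion-matrix counts as sketched below, using their standard definitions. Here TP and TN are taken as correctly predicted DNA6mA and non-DNA6mA sequences, and FP and FN as the corresponding wrong predictions; the function name is hypothetical, and returning 0.0 for MCC when the denominator is zero is a convention assumed for this sketch.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return SE, SP, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}
```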

The predicted effect of the present invention is further described below in a set of specific experimental examples.

For comparison with the published results of existing leading prediction algorithms, the same data sets and consistent evaluation indexes (namely SE, SP, ACC and MCC) are used.

The prediction results of the present invention on the M. musculus data set are first compared with existing machine learning methods, as shown in FIG. 3. As can be seen from FIG. 3, the present invention achieves higher classification accuracy. On the M. musculus data set, the distance-based support vector machine classifier obtains a classification accuracy (ACC) of 0.982, which is higher than the 0.966 of csDMA and the 0.969 of iLM-CNN. The distance-based SVM classifier also obtains the highest MCC value of 0.982 and F1 value of 0.982, indicating that its prediction accuracy is high even when processing unbalanced data sets.

The prediction results of the present invention on the Rice data set are then compared with existing machine learning methods, as shown in FIG. 4. As can be seen from FIG. 4, the present invention achieves higher classification accuracy. On the Rice data set, the distance-based support vector machine classifier obtains a classification accuracy (ACC) of 0.943, which is higher than the 0.861 of csDMA and the 0.875 of iLM-CNN; the experiments show that the prediction accuracy on the Rice data is effectively improved. The distance-based SVM classifier also obtains the highest MCC value of 0.944 and the highest F1 value of 0.942, which shows that it retains high prediction accuracy even when processing an unbalanced data set and provides a new approach for processing unbalanced Rice data.

Finally, the prediction results of the present invention on the Cross data set are compared with existing machine learning methods, as shown in FIG. 5. As can be seen from FIG. 5, the present invention achieves higher classification accuracy. On the Cross data set, the distance-based support vector machine classifier obtains an MCC value of 0.838, which is far higher than the 0.603 of csDMA and the 0.651 of iLM-CNN; the experiments show that the prediction accuracy on the unbalanced Cross data set is markedly improved, which benefits research on this data set. The distance-based support vector machine classifier also obtains the highest F1 value of 0.84, which shows that the invention handles this data in a well-balanced way and provides a reference for research on Cross data.

The invention has the following advantages:

(1) The invention provides a new DNA6mA prediction method that uses the distance between DNA6mA sequences to classify and predict the sequences, providing preliminary support for corresponding theoretical research.

(2) When applying the support vector machine algorithm, the invention adopts a custom kernel matrix, thereby effectively improving the processing efficiency.

(3) The invention converts the similarity matrix between DNA6mA sequences into a positive definite distance matrix, from which a support vector machine classifier is constructed, improving the prediction effect for DNA6mA.

Based on the same inventive concept, one embodiment of the present application provides a device for predicting a DNA6mA modification category. Referring to fig. 6, fig. 6 is a schematic diagram of a prediction device for DNA6mA modification class provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:

a first obtaining unit 601, configured to obtain a DNA6mA feature data set;

a first determining unit 602, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;

a logarithm processing unit 603, configured to perform logarithm processing on the similarity matrix to obtain a distance matrix between each sequence;

a Gaussian processing unit 604, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive-definiteness requirement;

and the predicting unit 605 is configured to take the distance matrix meeting the positive-definiteness requirement as a custom kernel matrix of the support vector machine, and predict the DNA6mA modification category based on the support vector machine model.

Optionally, the first determining unit includes:

the first comparison unit is used for comparing the unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model and forming a comparison result sequence based on the comparison result;

Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.

For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.

The method, device, equipment and storage medium for predicting a DNA6mA modification class provided by the application are described in detail above. Specific examples are used to explain the principle and implementation of the application, and the description of the above examples is only intended to help in understanding the method and core idea of the application. Meanwhile, a person skilled in the art may, according to the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A method for predicting the 6mA modification class of DNA, comprising:

acquiring a DNA6mA characteristic data set;

obtaining a similarity matrix among sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model, which comprises the following steps:

constructing a first input sequence as a first suffix tree;

determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence;

2. The prediction method of claim 1, wherein the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set being a DNA6mA sequence and the negative case data set being a non-DNA 6mA sequence.

3. A device for predicting the 6mA modification class of DNA, comprising:

a first determination unit for determining a similarity matrix between the sequences in the DNA6mA characteristic data set,

the first determination unit includes:

the first determining subunit is used for obtaining a similarity matrix among all sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model;

the first determination unit includes:

a third determining subunit, configured to determine a similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence;

4. The apparatus of claim 3, wherein the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set being a DNA6mA sequence and the negative case data set being a non-DNA6mA sequence.

5. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-2.

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing, carries out the steps of the method according to any of claims 1-2.