CN113409891A - Method, device, equipment and storage medium for predicting DNA6mA modification class - Google Patents

Method, device, equipment and storage medium for predicting DNA6mA modification class

Info

Publication number
CN113409891A
Authority
CN
China
Prior art keywords
dna6ma
sequence
input sequence
matrix
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110606033.0A
Other languages
Chinese (zh)
Other versions
CN113409891B (en)
Inventor
邹权
张昊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202110606033.0A
Publication of CN113409891A
Application granted
Publication of CN113409891B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biotechnology (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a method, a device, equipment and a storage medium for predicting a DNA6mA modification class. The method comprises the following steps: acquiring a DNA6mA feature data set; determining a similarity matrix between the sequences in the DNA6mA feature data set; performing logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences; performing Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement; and using the distance matrix that satisfies the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification class of a sequence to be predicted based on a support vector machine model. The method can thereby predict the DNA6mA modification class of a sequence.

Description

Method, device, equipment and storage medium for predicting DNA6mA modification class
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method, a device, equipment and a storage medium for predicting a DNA6mA modification class.
Background
One of the earliest epigenetic regulatory mechanisms found in humans was DNA methylation. The most prominent DNA modification in mammals is 5mC (5-methylcytosine), which accounts for 3%-6% of the total cytosine in human DNA. In contrast, 5mC is rare in prokaryotes, where 6mA (N6-methyladenine) is the most representative DNA modification; it is mainly involved in restriction-modification systems, which protect the organism from invasion by foreign DNA. The 6mA modification was first found in bacteria in 1951. However, it has not been regarded as being as important as 5mC. One important reason is that the 6mA modification was thought to be widespread only in prokaryotes and unicellular eukaryotes and to be rare in multicellular eukaryotes. In recent years, however, 6mA has been identified by experimental methods in eukaryotic organisms, including mammalian and plant genomes, and has been found to play an important role in growth, development and disease regulation. These studies have opened a new chapter in the study of epigenetic modifications in eukaryotes. However, as data volumes keep growing and accuracy requirements rise, the high time consumption and high cost of experimental methods have become apparent, and a number of computational methods have emerged. Machine-learning-based prediction tools continue to be developed, including iDNA6mA-PseKNC, i6mA-Pred, etc., but few studies have used the distance between sequences as the main basis for classification prediction. Therefore, it is necessary to investigate how to classify DNA6mA by sequence distance.
Disclosure of Invention
The application provides a method, a device and a storage medium for predicting a DNA6mA modification class, which can predict a DNA6mA modification class of a sequence.
The first aspect of the embodiments of the present application provides a method for predicting a modification category of DNA6mA, including:
acquiring a DNA6mA characteristic data set;
determining a similarity matrix between each sequence in the DNA6mA feature data set;
performing logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences;
performing Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement;
and using the distance matrix that satisfies the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the modification class of the DNA6mA based on a support vector machine model.
Optionally, determining a similarity matrix between the sequences in the DNA6mA feature dataset comprises:
obtaining a similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, obtaining a similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model includes:
constructing a first suffix tree from a first input sequence;
obtaining a second input sequence to be aligned with the first input sequence;
determining common substrings of the first input sequence and the second input sequence by adopting an LCS model based on the first suffix tree and the second input sequence;
rejecting unqualified substrings from the common substrings based on a preset qualification standard;
aligning the unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model, and forming an alignment result sequence based on the alignment result;
and determining the similarity between the first input sequence and the second input sequence based on the length of the common substrings and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.
In a second aspect of the embodiments of the present application, there is provided an apparatus for predicting a modification class of DNA6mA, including:
a first acquisition unit for acquiring a DNA6mA feature data set;
a first determining unit, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;
a logarithm processing unit, configured to perform logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences;
a Gaussian processing unit, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement;
and a prediction unit, configured to use the distance matrix that satisfies the positive definiteness requirement as a custom kernel matrix of the support vector machine and to predict the modification class of the DNA6mA based on the support vector machine model.
Optionally, the first determining unit includes:
a first determining subunit, configured to obtain a similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, the first determining unit includes:
a first construction subunit for constructing the first input sequence as a first suffix tree;
a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;
a first removing unit, configured to reject unqualified substrings from the common substrings based on a preset qualification standard;
a first alignment unit, configured to align the unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model and to form an alignment result sequence based on the alignment result;
and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.
A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the present application when executed.
By adopting the method for predicting the DNA6mA modification class provided by the embodiments of the application, prediction of the modification class of DNA6mA is realized.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for predicting the modification category of DNA6mA provided in the examples herein;
FIG. 2 is a schematic diagram of data file types supported by a method for predicting a modification category of DNA6mA provided in an embodiment of the present application.
FIG. 3 is a schematic diagram comparing the prediction performance of different methods on the M. musculus data set, for the method for predicting the DNA6mA modification class provided in the embodiments of the present application.
FIG. 4 is a schematic diagram comparing the prediction performance of different methods on the Rice data set, for the method for predicting the DNA6mA modification class provided in the embodiments of the present application.
FIG. 5 is a schematic diagram comparing the prediction performance of different methods on the Cross data set, for the method for predicting the DNA6mA modification class provided in the embodiments of the present application.
FIG. 6 is a schematic structural diagram of a prediction apparatus for the DNA6mA modification class provided in the embodiments of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, a flow chart of a method for predicting DNA6mA modification class according to the present application is shown.
As shown in fig. 1, the method comprises the steps of:
S101, acquiring a DNA6mA feature data set.
In some alternative embodiments, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.
In some alternative embodiments, there are three DNA6mA sequence data files in total: DNA6mA M. musculus (1934 positive DNA6mA sequences and 1934 negative non-DNA6mA sequences), DNA6mA Rice (880 positive DNA6mA sequences and 880 negative non-DNA6mA sequences), and DNA6mA Cross (2768 positive DNA6mA sequences and 2716 negative non-DNA6mA sequences).
In some alternative embodiments, the format and content of the downloaded DNA6mA sequence data file may need to be checked before the raw DNA6mA feature data set to be processed is obtained. The format check is as follows: when a line read from the DNA6mA sequence data file begins with the character ">", the following line is taken as the sequence text data. The content check is as follows: it is determined whether the sequence text data consists only of the four letters 'A', 'T', 'C' and 'G'; if it does not, a prompt is issued stating that the input text may include only the letters 'A', 'T', 'C' and 'G'. A raw data set that meets these requirements is shown in fig. 2.
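The format and content checks described above can be sketched as follows. This is a minimal illustration only: the function names `read_sequences` and `validate_sequence`, the upper-casing of the input, and the use of an exception as the "prompt" are assumptions, not details taken from the application.

```python
VALID_BASES = set("ATCG")

def read_sequences(lines):
    """Format check: keep the line that follows each '>' header line."""
    sequences = []
    expect_sequence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            expect_sequence = True          # the next line holds the sequence text
        elif expect_sequence:
            sequences.append(line.upper())  # upper-casing is an assumption
            expect_sequence = False
    return sequences

def validate_sequence(seq):
    """Content check: reject any character other than A, T, C or G."""
    bad = sorted(set(seq) - VALID_BASES)
    if bad:
        raise ValueError(f"input may include only A, T, C and G; found {bad}")
    return seq

# Hypothetical two-record input in the '>' header / sequence layout.
records = [">pos_1", "ATCGGA", ">pos_2", "ttagca"]
sequences = [validate_sequence(s) for s in read_sequences(records)]
print(sequences)
```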
S102, determining a similarity matrix among all sequences in the DNA6mA characteristic data set.
In some alternative embodiments, a similarity matrix between the sequences in the DNA6mA feature data set is obtained using a suffix-tree-based pairwise sequence alignment model. This specifically comprises the following steps:
a, constructing a first suffix tree tree1 from a first input sequence seq1;
b, obtaining a second input sequence seq2 to be aligned with the first input sequence seq1;
c, determining common substrings of the first input sequence seq1 and the second input sequence seq2 by using an LCS model based on the first suffix tree tree1 and the second input sequence seq2;
d, rejecting unqualified substrings from the common substrings based on a preset qualification standard; the preset qualification standard is that two mutually matched common substrings must not be far apart, that is, the difference between their starting positions is less than or equal to the length of the substring.
e, aligning the unmatched substrings in the first input sequence seq1 and the second input sequence seq2 by adopting a Needleman-Wunsch model, and forming an alignment result sequence based on the alignment result;
f, determining the similarity between the first input sequence seq1 and the second input sequence seq2 based on the length of the common substrings and the length of the alignment result sequence; the calculation formula is as follows:
Figure BDA0003082001320000061
Figure BDA0003082001320000062
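Steps a to f can be illustrated with the simplified sketch below. Several parts are assumptions rather than the application's method: a dynamic-programming search stands in for the suffix tree, only a single longest common substring is used as an anchor, the scoring values (`match`, `mismatch`, `gap`) are arbitrary, and the similarity is taken as the fraction of matching columns because the application's own formula is contained in figures that are not reproduced here.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length); DP stand-in for the suffix-tree search."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def similarity(seq1, seq2):
    """Assumed similarity: matching columns divided by total aligned columns."""
    s1, s2, k = longest_common_substring(seq1, seq2)
    # Qualification rule from step d: start offsets may differ by at most k.
    if k == 0 or abs(s1 - s2) > k:
        s1 = s2 = k = 0  # no usable anchor; align the whole sequences instead
    left = needleman_wunsch(seq1[:s1], seq2[:s2])
    right = needleman_wunsch(seq1[s1 + k:], seq2[s2 + k:])
    matches = k + sum(x == y for part in (left, right) for x, y in zip(*part))
    columns = k + len(left[0]) + len(right[0])
    return matches / columns if columns else 1.0

print(similarity("ACGTTGCA", "ACGTAGCA"))
```

Calling `similarity` for every pair of sequences would fill the similarity matrix used in the following steps.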
S103, performing logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences.
In some optional embodiments, logarithmic processing is performed on the similarity matrix to obtain a distance matrix between the sequences, and the calculation formula of the distance matrix is as follows:
$D_{12} = -\log(S_{12})$
where $S_{12}$ represents the similarity between seq1 and seq2, and $D_{12}$ represents the distance between seq1 and seq2.
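As a small worked example of the logarithmic step, the sketch below applies $D_{ij} = -\log(S_{ij})$ element-wise to a similarity matrix. The function name and the small floor `eps` (used to avoid taking the logarithm of zero) are assumptions added for illustration; the natural logarithm is also assumed, since the base is not stated.

```python
import math

def similarity_to_distance(sim, eps=1e-12):
    """Apply D_ij = -log(S_ij) element-wise; eps guards against log(0)."""
    return [[-math.log(max(s, eps)) for s in row] for row in sim]

# Hypothetical 2 x 2 similarity matrix: identical sequences have similarity 1.
S = [[1.0, 0.875],
     [0.875, 1.0]]
D = similarity_to_distance(S)
print(D)
```

A similarity of 1 maps to a distance of 0, and the distance grows as the similarity falls toward 0.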
S104, performing Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement.
In some optional embodiments, Gaussian processing is performed on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement, and the calculation formula is as follows:
Figure BDA0003082001320000063
where $D_{ij}$ denotes the distance between sequence i and sequence j, α is a Gaussian constant, and $G_{ij}$ is the value in the i-th row and j-th column of the distance matrix that satisfies the positive definiteness requirement.
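The sketch below illustrates one possible Gaussian transform together with a positive definiteness check. The functional form $G_{ij} = \exp(-D_{ij}^2 / \alpha)$ is an assumption: the application's formula is contained in a figure that is not reproduced here and may place α differently. The Cholesky-based check and both function names are likewise illustrative.

```python
import math

def gaussian_kernel(dist, alpha=1.0):
    """Assumed form: G_ij = exp(-D_ij**2 / alpha)."""
    return [[math.exp(-(d * d) / alpha) for d in row] for row in dist]

def is_positive_definite(mat, tol=1e-12):
    """Try a Cholesky factorisation; it succeeds only for positive definite input."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:      # non-positive pivot: not positive definite
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

# Hypothetical symmetric distance matrix with a zero diagonal.
D = [[0.0, 0.5, 1.2],
     [0.5, 0.0, 0.9],
     [1.2, 0.9, 0.0]]
G = gaussian_kernel(D, alpha=1.0)
print(is_positive_definite(G))
```

Because alignment-derived distances are not guaranteed to behave like Euclidean distances, checking the transformed matrix explicitly, as above, seems prudent before passing it to a kernel method.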
S105, using the distance matrix that satisfies the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the modification class of the DNA6mA based on the support vector machine model.
In some optional embodiments, a support vector machine algorithm is adopted, and the distance matrix is used as a custom kernel matrix of the support vector machine to perform classification prediction of the DNA6mA modification, wherein the algorithm flow comprises the following steps:
S51, constructing the Lagrangian function:
Figure BDA0003082001320000064
Figure BDA0003082001320000065
S52, calculating the partial derivatives of w, b and making them equal to 0:
Figure BDA0003082001320000071
Figure BDA0003082001320000072
S53, substituting the results back into the Lagrangian function to obtain the dual problem of the primal problem:
Figure BDA0003082001320000073
Figure BDA0003082001320000074
$\alpha_i \ge 0,\ i = 1, 2, \dots, l$;
S54, solving the dual problem to obtain α and w, and further obtaining the equation of the classification hyperplane:
$f(x_i) = \operatorname{sgn}(w^T x_i + b)$;
S55, performing classification prediction on the DNA6mA data according to the equation: $f(x_i) > 0$ indicates a positive example, and $f(x_i) < 0$ indicates a negative example.
Here w and b are the parameters of the classification hyperplane, α is the parameter vector of the constructed Lagrangian function, L is the constructed Lagrangian function, and f is the classification hyperplane equation.
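For reference, the textbook hard-margin form of steps S51 to S54 is written out below. This is a sketch under stated assumptions, not a transcription: the application's own formulas are contained in figures that are not reproduced here and may differ (for example, by including a soft-margin penalty). The labels $y_i \in \{+1, -1\}$ and the kernel notation $K(\cdot,\cdot)$ are introduced here for illustration, and the multipliers $\alpha_i$ are unrelated to the Gaussian constant α used in S104.

```latex
% S51: Lagrangian of the hard-margin problem  min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
    - \sum_{i=1}^{l} \alpha_i \bigl[ y_i \,(w^{T} x_i + b) - 1 \bigr]

% S52: set the partial derivatives with respect to w and b to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i ,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0

% S53: dual problem obtained by substituting back into L
\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i
    - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i^{T} x_j
\quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \;\; \alpha_i \ge 0, \; i = 1, 2, \dots, l

% With a custom kernel matrix, x_i^T x_j is replaced by the entry G_{ij},
% and the decision function depends on the data only through kernel values:
f(x) = \operatorname{sgn}\Bigl( \sum_{i=1}^{l} \alpha_i y_i \, K(x_i, x) + b \Bigr)
```

The last line shows why a precomputed kernel matrix suffices: neither training nor prediction needs explicit feature vectors for the sequences, only the pairwise kernel values.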
In the embodiment of the present invention, the indexes for evaluating the classification effect include sensitivity (SE), specificity (SP), accuracy (ACC), the Matthews correlation coefficient (MCC) and the F1 score (F1), and their calculation formulas are as follows:
Figure BDA0003082001320000075
Figure BDA0003082001320000076
Figure BDA0003082001320000077
Figure BDA0003082001320000078
Figure BDA0003082001320000079
where TP represents the number of DNA6mA sequences correctly predicted as DNA6mA, TN represents the number of non-DNA6mA sequences correctly predicted as non-DNA6mA, FP represents the number of non-DNA6mA sequences incorrectly predicted as DNA6mA, and FN represents the number of DNA6mA sequences incorrectly predicted as non-DNA6mA.
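The five indexes can be computed from the confusion-matrix counts as sketched below, using the conventional definitions in which TP and TN count correct predictions for DNA6mA and non-DNA6mA sequences and FP and FN count the corresponding errors. The application's own formulas are contained in figures that are not reproduced here, so these standard definitions are an assumption; the function name and the example counts are hypothetical, and both classes are assumed to be non-empty.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard SE, SP, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                        # sensitivity (recall)
    sp = tn / (tn + fp)                        # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)           # harmonic mean of precision and recall
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}

# Hypothetical counts, not results from the application.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```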
The predicted effect of the present invention is further described below in a set of specific experimental examples.
To compare with the results of existing strong prediction algorithms, the same data sets and the same evaluation indexes (namely SE, SP, ACC and MCC) are used throughout the comparison.
We first compare the prediction results of the invention on the M. musculus data set with existing machine learning methods, as shown in fig. 3. As can be seen from fig. 3, the invention achieves higher classification accuracy. On the M. musculus data set, the distance-based support vector machine classifier obtains an ACC value of 0.982, which is higher than the ACC value of 0.966 of csDMA and the ACC value of iLM-CNN; the experiments show that the prediction accuracy for the M. musculus data is effectively improved. Meanwhile, the distance-based support vector machine classifier also obtains the highest MCC value of 0.982 and the highest F1 value of 0.982, indicating that its prediction accuracy remains high even when processing unbalanced data sets.
The prediction results of the invention on the Rice data set are then compared with existing machine learning methods, as shown in fig. 4. As can be seen from fig. 4, the invention achieves higher classification accuracy. On the Rice data set, the distance-based support vector machine classifier obtains an ACC value of 0.943, which is higher than the 0.861 of csDMA and the 0.875 of iLM-CNN; the experiments show that the prediction accuracy for the Rice data is effectively improved. Meanwhile, the distance-based support vector machine classifier also obtains the highest MCC value of 0.944 and the highest F1 value of 0.942, which shows that it has higher prediction accuracy even when processing unbalanced data sets and provides a new approach to processing unbalanced Rice data.
Finally, the results of the invention on the Cross data set are compared with those of existing machine learning methods, as shown in fig. 5. As can be seen from fig. 5, the invention achieves higher classification accuracy. On the Cross data set, the distance-based support vector machine classifier obtains an MCC value of 0.838, far higher than the 0.603 of csDMA and the 0.651 of iLM-CNN; the experiments show that the prediction accuracy of the invention on the unbalanced Cross data set is markedly improved, which benefits research on this data set. Meanwhile, the distance-based support vector machine classifier also obtains the highest F1 value of 0.84, which shows that the invention performs in a more balanced way on these data and provides a reference for research on the Cross data.
The invention has the beneficial effects that:
(1) The invention provides a brand-new DNA6mA prediction method, which uses the distance between DNA6mA sequences to classify the sequences and lays groundwork for corresponding theoretical research.
(2) When the support vector machine algorithm is applied, the invention adopts a custom kernel matrix, thereby effectively improving the processing efficiency.
(3) The invention converts the similarity matrix between DNA6mA sequences into a positive definite distance matrix, from which a support vector machine classifier is constructed, improving the prediction effect for DNA6mA.
Based on the same inventive concept, one embodiment of the present application provides a device for predicting the modification class of DNA6mA. Referring to fig. 6, fig. 6 is a schematic diagram of a prediction apparatus for the DNA6mA modification class according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a first obtaining unit 601, configured to obtain a DNA6mA feature data set;
a first determining unit 602, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;
a logarithm processing unit 603, configured to perform logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences;
a Gaussian processing unit 604, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive definiteness requirement;
and a prediction unit 605, configured to use the distance matrix that satisfies the positive definiteness requirement as a custom kernel matrix of the support vector machine, and to predict the DNA6mA modification class based on the support vector machine model.
Optionally, the first determining unit includes:
a first determining subunit, configured to obtain a similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, the first determining unit includes:
a first construction subunit for constructing the first input sequence as a first suffix tree;
a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;
a first removing unit, configured to reject unqualified substrings from the common substrings based on a preset qualification standard;
a first alignment unit, configured to align the unmatched substrings in the first input sequence and the second input sequence by adopting a Needleman-Wunsch model and to form an alignment result sequence based on the alignment result;
and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set includes a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.
Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.
Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method, device, equipment and storage medium for predicting the DNA6mA modification class provided by the application are described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above examples is intended only to help in understanding the method and core idea of the application. Meanwhile, a person skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for predicting a modification category of DNA6mA, comprising:
acquiring a DNA6mA characteristic data set;
determining a similarity matrix between the sequences in the DNA6mA feature data set;
performing logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences;
performing Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive-definiteness requirement;
and using the distance matrix that satisfies the positive-definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the modification category of DNA6mA based on a support vector machine model.
2. The prediction method of claim 1, wherein determining a similarity matrix between the sequences in the DNA6mA feature data set comprises:
obtaining the similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model.
3. The prediction method of claim 2, wherein obtaining the similarity matrix between the sequences in the DNA6mA feature data set based on the suffix-tree-based pairwise sequence alignment model comprises:
constructing a first suffix tree from a first input sequence;
obtaining a second input sequence to be aligned with the first input sequence;
determining common substrings of the first input sequence and the second input sequence by using an LCS model based on the first suffix tree and the second input sequence;
removing unqualified substrings from the common substrings based on a preset qualification standard;
aligning the unmatched substrings of the first input sequence and the second input sequence by using a Needleman-Wunsch model, and forming an alignment result sequence based on the alignment result;
and determining the similarity between the first input sequence and the second input sequence based on the length of the common substrings and the length of the alignment result sequence.
4. The prediction method of claim 1, wherein the DNA6mA feature data set comprises a positive-example data set and a negative-example data set, the positive-example data set consisting of DNA6mA sequences and the negative-example data set consisting of non-DNA6mA sequences.
5. A device for predicting a modification category of DNA6mA, comprising:
a first acquisition unit, configured to acquire a DNA6mA feature data set;
a first determining unit, configured to determine a similarity matrix between the sequences in the DNA6mA feature data set;
a logarithm processing unit, configured to perform logarithmic processing on the similarity matrix to obtain a distance matrix between the sequences;
a Gaussian processing unit, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix that satisfies the positive-definiteness requirement;
and a prediction unit, configured to use the distance matrix that satisfies the positive-definiteness requirement as a custom kernel matrix of a support vector machine, and to predict the modification category of DNA6mA based on a support vector machine model.
6. The prediction device according to claim 5, wherein the first determining unit comprises:
a first determining subunit, configured to obtain the similarity matrix between the sequences in the DNA6mA feature data set based on a suffix-tree-based pairwise sequence alignment model.
7. The prediction device according to claim 6, wherein the first determining unit comprises:
a first construction subunit, configured to construct a first suffix tree from a first input sequence;
a first obtaining subunit, configured to obtain a second input sequence to be aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, common substrings of the first input sequence and the second input sequence by using an LCS model;
a first removing unit, configured to remove unqualified substrings from the common substrings based on a preset qualification standard;
a first alignment unit, configured to align the unmatched substrings of the first input sequence and the second input sequence by using a Needleman-Wunsch model, and to form an alignment result sequence based on the alignment result;
and a third determining subunit, configured to determine the similarity between the first input sequence and the second input sequence based on the length of the common substrings and the length of the alignment result sequence.
8. The prediction device according to claim 5, wherein the DNA6mA feature data set comprises a positive-example data set and a negative-example data set, the positive-example data set consisting of DNA6mA sequences and the negative-example data set consisting of non-DNA6mA sequences.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
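
Illustrative note (not part of the claims). The matrix steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the claims do not give the exact formulas, so the logarithm step is assumed here to be D = -log(S) and the Gaussian step is assumed to be K = exp(-D^2 / (2 * sigma^2)). The function names, the parameter sigma, and the toy similarity values are all hypothetical. A Gaussian transform of an arbitrary distance matrix is not guaranteed to be positive semidefinite, so the sketch checks the smallest eigenvalue rather than assuming it.

```python
import numpy as np


def similarity_to_distance(S, eps=1e-12):
    """Logarithm step (assumed form): D = -log(S), so S = 1 maps to distance 0."""
    S = np.clip(np.asarray(S, dtype=float), eps, 1.0)
    return -np.log(S)


def gaussian_kernel_from_distance(D, sigma=1.0):
    """Gaussian step (assumed form): K = exp(-D^2 / (2 * sigma^2))."""
    D = np.asarray(D, dtype=float)
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2))


def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized matrix; a value >= 0 means K is
    positive semidefinite and can safely be used as a kernel matrix."""
    K = np.asarray(K, dtype=float)
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())


def fit_precomputed_svm(K_train, y_train, C=1.0):
    """Train an SVM on a precomputed kernel matrix (requires scikit-learn)."""
    # Imported lazily so the matrix helpers above need only NumPy.
    from sklearn.svm import SVC
    model = SVC(kernel="precomputed", C=C)
    model.fit(K_train, y_train)
    return model


# Toy pairwise similarities for three sequences (made-up numbers, not patent data).
S = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.6],
              [0.5, 0.6, 1.0]])
D = similarity_to_distance(S)                    # logarithm processing
K = gaussian_kernel_from_distance(D, sigma=1.0)  # Gaussian processing
```

With a label vector for the training sequences, fit_precomputed_svm(K, labels) would train the classifier. Predicting for new sequences then requires a kernel matrix of shape (number of new sequences, number of training sequences), built with the same two transforms from the similarities between new and training sequences. If min_eigenvalue(K) is negative for a given data set, sigma would need to be adjusted or the matrix otherwise corrected before use.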
CN202110606033.0A 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class Active CN113409891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110606033.0A CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110606033.0A CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Publications (2)

Publication Number Publication Date
CN113409891A (en) 2021-09-17
CN113409891B (en) 2023-02-03

Family

ID=77675555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110606033.0A Active CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Country Status (1)

Country Link
CN (1) CN113409891B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101680872A (en) * 2007-04-13 2010-03-24 塞昆纳姆股份有限公司 Comparative sequence analysis processes and systems
US20160357917A1 (en) * 2008-07-01 2016-12-08 The Board Of Trustees Of The Leland Stanford Junior University Methods and Systems for Assessment of Clinical Infertility
CN107491734A (en) * 2017-07-19 2017-12-19 苏州闻捷传感技术有限公司 Semi-supervised polarimetric SAR image classification method based on multi-kernel integration and spatial Wishart LapSVM
US9932640B1 (en) * 2013-05-02 2018-04-03 George Wyndham Cook, Jr. Clinical use of an Alu element based bioinformatics methodology for the detection and treatment of cancer
CN109961093A (en) * 2019-03-07 2019-07-02 北京工业大学 A kind of image classification method based on many intelligence integrated studies
CN111161793A (en) * 2020-01-09 2020-05-15 青岛科技大学 Stacking integration based method for predicting N6-methyladenosine modification sites in RNA

Also Published As

Publication number Publication date
CN113409891B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US11620567B2 (en) Method, apparatus, device and storage medium for predicting protein binding site
Camargo et al. RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences
Al-Ajlan et al. CNN-MGP: convolutional neural networks for metagenomics gene prediction
Liu et al. PEDLA: predicting enhancers with a deep learning-based algorithmic framework
Li et al. Predicting long noncoding RNA and protein interactions using heterogeneous network model
Yang et al. Prediction of aptamer–protein interacting pairs based on sparse autoencoder feature extraction and an ensemble classifier
Chen et al. CRNET: an efficient sampling approach to infer functional regulatory networks by integrating large-scale ChIP-seq and time-course RNA-seq data
Zeng et al. Developing a multi-layer deep learning based predictive model to identify DNA N4-methylcytosine modifications
Ge et al. Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion
Zeng et al. 4mCPred-MTL: accurate identification of DNA 4mC sites in multiple species using multi-task deep learning based on multi-head attention mechanism
Kim et al. A method to identify differential expression profiles of time-course gene data with Fourier transformation
Li et al. AngClust: angle feature-based clustering for short time series gene expression profiles
Beck et al. Signal analysis for genome-wide maps of histone modifications measured by ChIP-seq
Jia et al. EMDL-ac4C: identifying N4-acetylcytidine based on ensemble two-branch residual connection DenseNet and attention
CN113409891B (en) Method, device, equipment and storage medium for predicting DNA6mA modification class
Tuna et al. Inference from low precision transcriptome data representation
Qin et al. An efficient method to identify differentially expressed genes in microarray experiments
Qu et al. Deep learning approach to biogeographical ancestry inference
Moskowitz et al. Nonparametric analysis of contributions to variance in genomics and epigenomics data
CN111832815A (en) Scientific research hotspot prediction method and system
Polushina et al. Change-point detection in binary Markov DNA sequences by the Cross-Entropy method
Sun et al. A miRNA target prediction model based on distributed representation learning and deep learning
Alam et al. Unveiling the Potential Pattern Representation of RNA 5-Methyluridine Modification Sites Through a Novel Feature Fusion Model Leveraging Convolutional Neural Network and Tetranucleotide Composition
Danda Identification of Cell-types in scRNA-seq Data via Enhanced Local Embedding and Clustering
Ide et al. Function prediction of disease-related long intergenic non-coding RNA using random forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant