CN113409891B - Method, device, equipment and storage medium for predicting DNA6mA modification class - Google Patents

Publication number
CN113409891B
CN113409891B CN202110606033.0A CN202110606033A CN113409891B CN 113409891 B CN113409891 B CN 113409891B CN 202110606033 A CN202110606033 A CN 202110606033A CN 113409891 B CN113409891 B CN 113409891B
Authority
CN
China
Prior art keywords
dna6ma
sequence
input sequence
data set
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110606033.0A
Other languages
Chinese (zh)
Other versions
CN113409891A (en)
Inventor
邹权
张昊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202110606033.0A priority Critical patent/CN113409891B/en
Publication of CN113409891A publication Critical patent/CN113409891A/en
Application granted granted Critical
Publication of CN113409891B publication Critical patent/CN113409891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The application provides a method, a device, equipment and a storage medium for predicting a DNA6mA modification class. The method comprises the following steps: obtaining a DNA6mA feature data set; determining a similarity matrix among the sequences in the DNA6mA feature data set; performing logarithmic processing on the similarity matrix to obtain a distance matrix among the sequences; performing Gaussian processing on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement; and using the distance matrix that meets the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification class of a sequence to be predicted based on the support vector machine model. The method enables prediction of the DNA6mA modification class of a sequence.

Description

Method, device, equipment and storage medium for predicting DNA6mA modification class
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method, a device, equipment and a storage medium for predicting a DNA6mA modification category.
Background
DNA methylation was one of the earliest epigenetic regulatory mechanisms discovered in humans. The most prominent DNA modification in mammals is 5mC (5-methylcytosine), which accounts for 3%-6% of the total cytosine in human DNA. In contrast, 5mC is rare in prokaryotes, where 6mA (N6-methyladenine) is the most representative DNA modification and is mainly involved in restriction-modification systems that protect the organism from invasion by foreign DNA. The 6mA modification was first found in bacteria in 1951, but it has received less attention than 5mC. One important reason is that the 6mA modification was thought to be widespread only in prokaryotes and unicellular eukaryotes and rarely found in multicellular eukaryotes. In recent years, however, 6mA has been identified experimentally in eukaryotic genomes, including those of mammals and plants, and has been found to play an important role in growth, development and disease regulation. These studies have opened a new chapter in the study of epigenetic modifications in eukaryotes. However, as data volumes grow and accuracy requirements rise, the high time and monetary cost of experimental methods has become apparent, and computational methods have emerged. Machine-learning-based prediction tools, including iDNA6mA-PseKNC and i6mA-Pred, continue to be developed, but few studies have used the distance between sequences as the main basis for classification prediction. Therefore, it is necessary to study how to classify DNA6mA by sequence distance.
Disclosure of Invention
The application provides a method, a device and a storage medium for predicting a DNA6mA modification class, which can predict a DNA6mA modification class of a sequence.
In a first aspect, the embodiments of the present application provide a method for predicting a DNA6mA modification category, including:
obtaining a DNA6mA characteristic data set;
determining a similarity matrix among all sequences in the DNA6mA characteristic data set;
carrying out logarithm processing on the similarity matrix to obtain a distance matrix among all the sequences;
performing Gaussian processing on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement;
and using the distance matrix that meets the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification category based on the support vector machine model.
Optionally, determining a similarity matrix between the sequences in the DNA6mA feature dataset comprises:
obtaining a similarity matrix among all sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, obtaining a similarity matrix among the sequences in the DNA6mA feature data set based on the suffix-tree-based pairwise sequence alignment model includes:
constructing the first input sequence as a first suffix tree;
obtaining a second input sequence which is compared with the first input sequence;
determining common substrings of the first input sequence and the second input sequence by adopting an LCS model based on the first suffix tree and the second input sequence;
based on a preset qualification standard, rejecting unqualified substrings from the common substrings;
using a Needleman-Wunsch model to align the unmatched substrings in the first input sequence and the second input sequence, and forming an alignment result sequence based on the alignment result;
and determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set is a DNA6mA sequence, and the negative case data set is a non-DNA 6mA sequence.
In a second aspect of the embodiments of the present application, there is provided an apparatus for predicting a DNA6mA modification class, including:
the first acquisition unit is used for acquiring a DNA6mA characteristic data set;
the first determination unit is used for determining a similarity matrix among sequences in the DNA6mA characteristic data set;
the logarithm processing unit is used for carrying out logarithm processing on the similarity matrix to obtain a distance matrix among all the sequences;
the Gaussian processing unit is used for performing Gaussian processing on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement;
and the prediction unit is used for taking the distance matrix that meets the positive definiteness requirement as a custom kernel matrix of the support vector machine and predicting the DNA6mA modification category based on the support vector machine model.
Optionally, the first determining unit includes:
a first determining subunit, configured to obtain a similarity matrix among the sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, the first determining unit includes:
a first construction subunit for constructing the first input sequence as a first suffix tree;
a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;
a first removing unit, configured to remove unqualified substrings from the common substrings based on a preset qualification standard;
a first alignment unit, configured to align the unmatched substrings in the first input sequence and the second input sequence using a Needleman-Wunsch model and to form an alignment result sequence based on the alignment result;
and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set is a DNA6mA sequence, and the negative case data set is a non-DNA 6mA sequence.
A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the present application when executing the computer program.
By adopting the method for predicting the DNA6mA modification category provided by the embodiment of the application, the prediction of the DNA6mA modification category is realized.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for predicting a DNA6mA modification class provided in an embodiment of the present application.
FIG. 2 is a schematic diagram of the data file types supported by the method for predicting a DNA6mA modification class provided in an embodiment of the present application.
FIG. 3 is a schematic diagram comparing prediction results on the M. musculus data set for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.
FIG. 4 is a schematic diagram comparing prediction results on the Rice data set for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.
FIG. 5 is a schematic diagram comparing prediction results on the Cross data set for the method for predicting a DNA6mA modification class provided in an embodiment of the present application.
FIG. 6 is a schematic structural diagram of a prediction device for a DNA6mA modification category provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to FIG. 1, a flow chart of a method for predicting a DNA6mA modification class according to the present application is shown. As shown in FIG. 1, the method comprises the following steps:
S101, obtaining a DNA6mA feature data set.
In some alternative embodiments, the DNA6mA feature data set comprises a positive example data set consisting of DNA6mA sequences and a negative example data set consisting of non-DNA6mA sequences.
In some alternative embodiments, there are 3 DNA6mA sequence data files in total: DNA6mA M. musculus (1934 positive DNA6mA sequences and 1934 negative non-DNA6mA sequences), DNA6mA Rice (880 positive DNA6mA sequences and 880 negative non-DNA6mA sequences), and DNA6mA Cross (2768 positive DNA6mA sequences and 2716 negative non-DNA6mA sequences).
In some alternative embodiments, the downloaded DNA6mA sequence data files undergo a format check and a content check before the raw DNA6mA feature data set to be processed is obtained. The format check is as follows: when a line read from a DNA6mA sequence data file begins with the character ">", the following line is taken as sequence text data. The content check is as follows: it is determined whether the sequence text data consists only of the four letters "A", "T", "C" and "G"; if any other character is present, the user is prompted that the input text may contain only the letters "A", "T", "C" and "G". A raw data set that passes both checks is shown in FIG. 2.
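The format check and content check described above can be illustrated with a minimal Python sketch. The function name `read_sequences`, the upper-casing of the sequence text and the skipping of stray lines are assumptions made for illustration and are not taken from the application.

```python
from typing import List, Tuple

VALID_BASES = set("ATCG")


def read_sequences(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-like lines.

    Format check: a line starting with '>' is a header, and the line that
    follows it is taken as the sequence text.
    Content check: the sequence text may contain only A, T, C and G.
    """
    records = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">"):
            if i + 1 >= len(lines):
                raise ValueError(f"header {line!r} has no sequence line")
            # Upper-casing is an assumption of this sketch.
            seq = lines[i + 1].strip().upper()
            if not seq or set(seq) - VALID_BASES:
                raise ValueError(
                    f"sequence after {line!r} may contain only A, T, C and G"
                )
            records.append((line[1:], seq))
            i += 2
        else:
            i += 1  # skip blank or stray lines (an assumption of this sketch)
    return records
```

For example, `read_sequences([">s1", "ATCG"])` yields one record, while a sequence line containing a letter such as "N" raises a `ValueError` carrying the prompt text.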
S102, determining a similarity matrix among all sequences in the DNA6mA characteristic data set.
In some alternative embodiments, the similarity matrix among the sequences in the DNA6mA feature data set is obtained with a suffix-tree-based pairwise sequence alignment model. The method specifically comprises the following steps:
a, constructing a first input sequence seq1 as a first suffix tree tree1;
b, acquiring a second input sequence seq2 to be aligned with the first input sequence seq1;
c, determining the common substrings of the first input sequence seq1 and the second input sequence seq2 by adopting an LCS model based on the first suffix tree tree1 and the second input sequence seq2;
d, based on a preset qualification standard, rejecting unqualified substrings from the common substrings; the preset qualification standard is that the two matched occurrences of a common substring must not be far apart, that is, the difference between their starting positions is less than or equal to the length of the substring;
e, using a Needleman-Wunsch model to align the unmatched substrings in the first input sequence seq1 and the second input sequence seq2, and forming an alignment result sequence based on the alignment result;
f, based on the length of the common substrings and the length of the alignment result sequence, determining the similarity between the first input sequence seq1 and the second input sequence seq2, with the calculation formula as follows:
Figure DEST_PATH_IMAGE001
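Steps c to e can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the application: the longest common substring is found with a dynamic-programming table rather than a suffix tree (a simpler stand-in), the Needleman-Wunsch scoring values (`match`, `mismatch`, `gap`) are assumed, and all function names are hypothetical. The similarity formula of step f is not reproduced in this text and is therefore not implemented.

```python
from typing import Tuple


def longest_common_substring(a: str, b: str) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of one longest common substring.

    Uses a dynamic-programming table instead of a suffix tree.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def is_qualified(start_a: int, start_b: int, length: int) -> bool:
    """Step d: keep a match only if its start positions differ by <= its length."""
    return abs(start_a - start_b) <= length


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[str, str]:
    """Step e: globally align a and b, returning the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

In a full pipeline, the qualified common substrings would anchor the alignment, and `needleman_wunsch` would be applied only to the unmatched regions between them.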
S103, performing logarithmic processing on the similarity matrix to obtain a distance matrix among the sequences.
In some optional embodiments, logarithmic processing is performed on the similarity matrix to obtain the distance matrix among the sequences, and the calculation formula of the distance matrix is as follows:
Figure DEST_PATH_IMAGE002
wherein
Figure DEST_PATH_IMAGE003
represents the similarity between seq1 and seq2, and
Figure DEST_PATH_IMAGE004
represents the distance between seq1 and seq2.
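Because the formula image is not reproduced in this text, the following Python sketch shows only one plausible form of the logarithmic processing. It assumes that each similarity lies in (0, 1] and that the distance is the negative natural logarithm of the similarity; both the sign convention and the base are assumptions, and the function name is hypothetical.

```python
import math
from typing import List


def similarity_to_distance(sim: List[List[float]]) -> List[List[float]]:
    """Map each similarity s in (0, 1] to the distance -ln(s).

    The negative natural logarithm is an assumption of this sketch: it sends
    s = 1 to distance 0 and makes the distance grow as s approaches 0.
    """
    dist = []
    for row in sim:
        for s in row:
            if not 0.0 < s <= 1.0:
                raise ValueError("similarities must lie in (0, 1]")
        dist.append([-math.log(s) for s in row])
    return dist
```

Under this assumption, identical sequences (similarity 1) have distance 0, and a similarity of 0.5 maps to a distance of ln 2.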
S104, performing Gaussian processing on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement.
In some optional embodiments, Gaussian processing is performed on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement, and the calculation formula is as follows:
Figure DEST_PATH_IMAGE005
wherein
Figure DEST_PATH_IMAGE006
represents the distance between sequence i and sequence j,
Figure DEST_PATH_IMAGE007
is the constant of the Gaussian transformation, and
Figure DEST_PATH_IMAGE008
is the value in the ith row and jth column of the distance matrix that meets the positive definiteness requirement.
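The Gaussian formula image is likewise not reproduced in this text. As a hedged illustration, the sketch below applies the common Gaussian (RBF-style) transform exp(-d² / (2σ²)) element-wise; the exact functional form, the parameter name `sigma` and the function name are assumptions rather than details confirmed by the application.

```python
import math
from typing import List


def gaussian_kernel(dist: List[List[float]], sigma: float = 1.0) -> List[List[float]]:
    """Apply K[i][j] = exp(-dist[i][j]**2 / (2 * sigma**2)) element-wise.

    Both the functional form and the name `sigma` are assumptions.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    two_sigma_sq = 2.0 * sigma * sigma
    return [[math.exp(-(d * d) / two_sigma_sq) for d in row] for row in dist]
```

With this form, a zero distance maps to a kernel value of 1 and the resulting matrix stays symmetric whenever the distance matrix is symmetric.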
S105, using the distance matrix that meets the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification category based on the support vector machine model.
In some optional embodiments, a support vector machine algorithm is adopted, the distance matrix is used as the custom kernel matrix of the support vector machine, and classification prediction is performed on the DNA6mA modification. The algorithm flow comprises the following steps:
S51, constructing a Lagrangian function:
Figure DEST_PATH_IMAGE009
S52, taking the partial derivatives of the Lagrangian function with respect to
Figure DEST_PATH_IMAGE010
and setting them equal to 0:
Figure DEST_PATH_IMAGE011
Figure DEST_PATH_IMAGE012
S53, substituting the results back into the original function to obtain the dual problem of the primal problem:
Figure DEST_PATH_IMAGE013
Figure DEST_PATH_IMAGE014
Figure DEST_PATH_IMAGE015
S54, solving the dual problem to obtain
Figure DEST_PATH_IMAGE016
and
Figure DEST_PATH_IMAGE017
and further obtaining the equation of the classification hyperplane:
Figure DEST_PATH_IMAGE018
S55, performing classification prediction on the DNA6mA data according to the equation, where
Figure DEST_PATH_IMAGE019
indicates a positive example and
Figure DEST_PATH_IMAGE020
indicates a negative example.
Here,
Figure DEST_PATH_IMAGE021
and
Figure DEST_PATH_IMAGE022
are the parameter vectors of the classification hyperplane,
Figure DEST_PATH_IMAGE023
is the parameter vector of the constructed Lagrangian function, L is the constructed Lagrangian function, and
Figure DEST_PATH_IMAGE024
is the classification hyperplane equation.
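Since the formula images for steps S51 to S55 are not reproduced in this text, the textbook hard-margin support vector machine derivation is given here for orientation only. The symbols $w$, $b$, $\alpha_i$, $y_i$, $K_{ij}$ and $f$ are assumed notation, and the original formulas may differ, for example by using a soft-margin bound on $\alpha_i$.

```latex
% S51: Lagrangian of the hard-margin primal problem
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( w^{\top} x_i + b \right) - 1 \right]

% S52: stationarity conditions
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% S53: dual problem, with inner products replaced by the custom kernel matrix K
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K_{ij}
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0

% S54 and S55: decision function; f(x) > 0 is a positive example, f(x) < 0 a negative one
f(x) = \sum_{i=1}^{n} \alpha_i^{*} y_i K(x_i, x) + b^{*}
```

In this form only the kernel values between sequences enter the dual problem, which is why a precomputed matrix derived from sequence distances can stand in for an explicit feature representation.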
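In the embodiment of the invention, the indexes for evaluating the classification effect comprise sensitivity (SE), specificity (SP), accuracy (ACC), the Matthews correlation coefficient (MCC) and the F1 score, and the calculation formulas are as follows:
$$
\mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}
$$
$$
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}
$$
wherein TP represents the number of DNA6mA sequences correctly predicted as DNA6mA, TN represents the number of non-DNA6mA sequences correctly predicted as non-DNA6mA, FP represents the number of non-DNA6mA sequences wrongly predicted as DNA6mA, and FN represents the number of DNA6mA sequences wrongly predicted as non-DNA6mA.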
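The five evaluation indexes can be computed directly from the four confusion-matrix counts. The following Python sketch uses the standard definitions of SE, SP, ACC, MCC and F1; the function name is hypothetical, and the sketch assumes that both classes are non-empty and that at least one positive prediction or positive example exists.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute SE, SP, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0.0 when its denominator vanishes (a common convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc, "F1": f1}
```

For instance, with 90 true positives, 80 true negatives, 20 false positives and 10 false negatives, SE is 0.9, SP is 0.8 and ACC is 0.85.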
The predicted effect of the present invention is further described below in a set of specific experimental examples.
The method is compared with the published results of existing high-performing prediction algorithms, using the same data sets and the same evaluation indexes (namely SE, SP, ACC and MCC).
The prediction results of the invention on the M. musculus data set are first compared with those of existing machine learning methods, as shown in FIG. 3. As can be seen from FIG. 3, the invention achieves higher classification accuracy. On the M. musculus data set, the distance-based support vector machine classifier obtains an ACC value of 0.982, which is higher than the ACC values of 0.966 for csDMA and 0.969 for iLM-CNN. The distance-based SVM classifier also obtains the highest MCC value of 0.982 and an F1 value of 0.982, indicating that its prediction accuracy remains high even when processing unbalanced data sets.
The prediction results of the invention on the Rice data set are then compared with those of existing machine learning methods, as shown in FIG. 4. As can be seen from FIG. 4, the invention achieves higher classification accuracy. On the Rice data set, the distance-based support vector machine classifier obtains an ACC value of 0.943, which is higher than the ACC values of 0.861 for csDMA and 0.875 for iLM-CNN; the experiments show that the prediction accuracy on the Rice data is effectively improved. The distance-based SVM classifier also obtains the highest MCC value of 0.944 and the highest F1 value of 0.942, which shows that it retains high prediction accuracy even when processing an unbalanced data set and provides a new approach to processing unbalanced Rice data.
Finally, the results of the invention on the Cross data set are compared with those of existing machine learning methods, as shown in FIG. 5. As can be seen from FIG. 5, the invention achieves higher classification accuracy. On the Cross data set, the distance-based support vector machine classifier obtains an MCC value of 0.838, which is far higher than 0.603 for csDMA and 0.651 for iLM-CNN; the experiments show that the prediction accuracy on the unbalanced Cross data set is markedly improved, which benefits research on this data set. The distance-based support vector machine classifier also obtains the highest F1 value of 0.84, which shows that the invention handles this data in a well-balanced way and provides a reference for research on Cross data.
The invention has the following advantages:
(1) The invention provides a new DNA6mA prediction method that uses the distance between DNA6mA sequences to classify the sequences, and provides preliminary support for corresponding theoretical research.
(2) When applying the support vector machine algorithm, the invention adopts a custom kernel matrix, thereby effectively improving processing efficiency.
(3) The invention converts the similarity matrix between DNA6mA sequences into a positive definite distance matrix and constructs a support vector machine classifier from it, improving the prediction of DNA6mA.
Based on the same inventive concept, one embodiment of the present application provides a device for predicting a DNA6mA modification category. Referring to fig. 6, fig. 6 is a schematic diagram of a prediction device for DNA6mA modification class provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a first obtaining unit 601, configured to obtain a DNA6mA feature data set;
a first determining unit 602, configured to determine a similarity matrix between sequences in the DNA6mA feature data set;
a logarithm processing unit 603, configured to perform logarithm processing on the similarity matrix to obtain a distance matrix between each sequence;
a Gaussian processing unit 604, configured to perform Gaussian processing on the distance matrix to obtain a distance matrix that meets the positive definiteness requirement;
and a predicting unit 605, configured to take the distance matrix that meets the positive definiteness requirement as a custom kernel matrix of the support vector machine, and predict the DNA6mA modification category based on the support vector machine model.
Optionally, the first determining unit includes:
a first determining subunit, configured to obtain a similarity matrix among the sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model.
Optionally, the first determining unit includes:
a first construction subunit for constructing the first input sequence as a first suffix tree;
a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;
a first removing unit, configured to remove unqualified substrings from the common substrings based on a preset qualification standard;
a first alignment unit, configured to align the unmatched substrings in the first input sequence and the second input sequence using a Needleman-Wunsch model and to form an alignment result sequence based on the alignment result;
and the third determining subunit is used for determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence.
Optionally, the DNA6mA feature data set comprises a positive case data set and a negative case data set, the positive case data set is a DNA6mA sequence, and the negative case data set is a non-DNA 6mA sequence.
Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.
For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.
The method, device, equipment and storage medium for predicting a DNA6mA modification class provided by the application are described in detail above. Specific examples are used to explain the principle and implementation of the application, and the description of the above examples is only intended to help in understanding the method and core idea of the application. Meanwhile, for a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (6)

1. A method for predicting the 6mA modification class of DNA, comprising:
acquiring a DNA6mA characteristic data set;
obtaining a similarity matrix among the sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model, comprising the following steps:
constructing a first input sequence as a first suffix tree;
obtaining a second input sequence which is compared with the first input sequence;
determining common substrings of the first input sequence and the second input sequence by adopting an LCS model based on the first suffix tree and the second input sequence;
rejecting unqualified substrings from the common substrings based on a preset qualification criterion;
aligning the unmatched substrings of the first input sequence and the second input sequence by adopting a Needleman-Wunsch model, and forming an alignment result sequence based on the alignment result;
determining the similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence;
carrying out logarithmic processing on the similarity matrix to obtain a distance matrix among all the sequences;
carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement;
and taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of a support vector machine, and predicting the DNA6mA modification class based on the support vector machine model.
2. The prediction method of claim 1, wherein the DNA6mA characteristic data set comprises a positive example data set and a negative example data set, the positive example data set consisting of DNA6mA sequences and the negative example data set consisting of non-DNA6mA sequences.
3. A device for predicting the 6mA modification class of DNA, comprising:
a first acquisition unit for acquiring a DNA6mA characteristic data set;
a first determination unit for determining a similarity matrix between the sequences in the DNA6mA characteristic data set,
the first determination unit includes:
a first determining subunit for obtaining the similarity matrix among all sequences in the DNA6mA characteristic data set based on a suffix-tree-based pairwise sequence alignment model;
the first determining subunit includes:
a first construction subunit for constructing the first input sequence as a first suffix tree;
a first obtaining subunit, configured to obtain a second input sequence aligned with the first input sequence;
a second determining subunit, configured to determine, based on the first suffix tree and the second input sequence, a common substring of the first input sequence and the second input sequence by using an LCS model;
a first removing unit for rejecting unqualified substrings from the common substrings based on a preset qualification criterion;
a first alignment unit for aligning the unmatched substrings of the first input sequence and the second input sequence by adopting a Needleman-Wunsch model, and forming an alignment result sequence based on the alignment result;
a third determining subunit, configured to determine a similarity between the first input sequence and the second input sequence based on the length of the common substring and the length of the alignment result sequence;
a logarithmic processing unit for carrying out logarithmic processing on the similarity matrix to obtain a distance matrix among all the sequences;
a Gaussian processing unit for carrying out Gaussian processing on the distance matrix to obtain a distance matrix meeting the positive definiteness requirement;
and a prediction unit for taking the distance matrix meeting the positive definiteness requirement as a custom kernel matrix of a support vector machine and predicting the DNA6mA modification class based on the support vector machine model.
4. The apparatus of claim 3, wherein the DNA6mA characteristic data set comprises a positive example data set and a negative example data set, the positive example data set consisting of DNA6mA sequences and the negative example data set consisting of non-DNA6mA sequences.
5. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-2.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the method according to any one of claims 1-2.
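The steps recited in claim 1 can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the claimed implementation. A dynamic-programming search for the longest common substring stands in for the suffix tree, and only a single common substring is used. The function names, the `min_len` qualification threshold, the similarity formula (identical columns divided by total alignment length), the `-log` distance, the Gaussian width `sigma`, and the clamp `eps` are all hypothetical choices that the claims do not specify.

```python
import math

def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length). DP stand-in for the suffix tree."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment. Returns (alignment_length, identical_columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting alignment columns.
    i, j, length, identical = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return length, identical

def similarity(a, b, min_len=3):
    """Assumed similarity: identical columns / length of the alignment result."""
    ia, ib, k = longest_common_substring(a, b)
    if k < min_len:  # common substring rejected as unqualified (hypothetical rule)
        length, identical = needleman_wunsch(a, b)
        return identical / length if length else 1.0
    # Align only the unmatched flanks around the common substring.
    left_len, left_id = needleman_wunsch(a[:ia], b[:ib])
    right_len, right_id = needleman_wunsch(a[ia + k:], b[ib + k:])
    return (left_id + k + right_id) / (left_len + k + right_len)

def kernel_matrix(seqs, sigma=1.0, eps=1e-6):
    """Similarity -> log distance -> Gaussian transform (assumed forms)."""
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = max(similarity(seqs[i], seqs[j]), eps)  # avoid log(0)
            d = -math.log(s)
            K[i][j] = K[j][i] = math.exp(-d * d / (2 * sigma * sigma))
    return K

if __name__ == "__main__":
    for row in kernel_matrix(["GGACGTTT", "CCACGTAA", "GGACGTTT"]):
        print([round(v, 3) for v in row])
```

The resulting matrix is symmetric with ones on the diagonal. A matrix of this form could then be passed to a support vector machine that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`, which is one possible choice and is not named in the claims.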
CN202110606033.0A 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class Active CN113409891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110606033.0A CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110606033.0A CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Publications (2)

Publication Number Publication Date
CN113409891A CN113409891A (en) 2021-09-17
CN113409891B CN113409891B (en) 2023-02-03

Family

ID=77675555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110606033.0A Active CN113409891B (en) 2021-05-25 2021-05-25 Method, device, equipment and storage medium for predicting DNA6mA modification class

Country Status (1)

Country Link
CN (1) CN113409891B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2684217C (en) * 2007-04-13 2016-12-13 Sequenom, Inc. Comparative sequence analysis processes and systems
BRPI0913924B1 (en) * 2008-07-01 2020-02-04 Univ Leland Stanford Junior method for determining the likelihood that a female individual will experience a life-giving birth event
US9932640B1 (en) * 2013-05-02 2018-04-03 George Wyndham Cook, Jr. Clinical use of an Alu element based bioinformatics methodology for the detection and treatment of cancer
CN107491734B (en) * 2017-07-19 2021-05-07 苏州闻捷传感技术有限公司 Semi-supervised polarimetric SAR image classification method based on multi-core fusion and space Wishart LapSVM
CN109961093B (en) * 2019-03-07 2021-10-15 北京工业大学 Image classification method based on crowd-sourcing integrated learning
CN111161793B (en) * 2020-01-09 2023-02-03 青岛科技大学 Method for predicting N6-methyladenosine modification sites in RNA based on stacking ensemble

Also Published As

Publication number Publication date
CN113409891A (en) 2021-09-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant