CN116052774A - Method and system for identifying key miRNA based on deep learning - Google Patents

Publication number
CN116052774A
Authority
CN
China
Prior art keywords
mirna
deep learning
feature
nucleotide sequence
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210780583.9A
Other languages
Chinese (zh)
Other versions
CN116052774B (en)
Inventor
严承
张蕾
黄辛迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Chinese Medicine
Original Assignee
Hunan University of Chinese Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Chinese Medicine
Priority to CN202210780583.9A
Publication of CN116052774A
Application granted
Publication of CN116052774B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Epidemiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Public Health (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a key miRNA identification method and system based on deep learning. For each of a plurality of miRNAs, the method obtains statistical features of the nucleotide sequences produced by cleavage of the pre-miRNA, structural features, and deep learning features; computes the attention distribution feature of each deep learning feature with respect to the corresponding statistical and structural features; and concatenates the attention distribution feature with the corresponding statistical and structural features to obtain a comprehensive feature. A training set is constructed from the comprehensive features and used to train a pre-constructed classification prediction model.

Description

Method and system for identifying key miRNA based on deep learning
Technical Field
The invention relates to the field of systems biology, and in particular to a key miRNA identification method and system based on deep learning.
Background
Recent studies have shown that non-coding RNAs play a very important role in many life processes such as cell growth and proliferation. miRNAs, a class of non-coding RNAs approximately 22 nt in length, are closely associated with many complex human diseases. To understand the effects of miRNAs on human disease more deeply, it is therefore necessary to identify the key miRNAs involved in life processes. Because biomedical experiments are costly in labor, money, and materials, computational methods are well suited to predicting key miRNAs. With the accumulation of baseline data on key miRNAs and advances in computing technology, computational methods for identifying key miRNAs have accordingly begun to appear.
Currently, there are two main categories of methods for key miRNA identification:
(1) Biological experimental measurement
This is the traditional biological and medical experimental approach, and its advantage is high accuracy. Its disadvantages are also significant: when there are many candidate key miRNAs, it requires a great deal of time and money. Researchers have used this approach to build a baseline dataset of key miRNAs, which also provides the data basis for developing subsequent computational methods.
(2) Prediction based on machine learning
This approach starts from the biogenesis of miRNA. It constructs a feature set covering the sequence and structure of the miRNA from pre-miRNA to mature miRNA, and then builds a classification model on that feature set by combining a machine learning method with a dataset of known key miRNA samples. Such a model converts the identification of key miRNAs into a classification problem and provides an important reference for accelerating subsequent research on miRNA-related pathogenesis.
For example, the miES and PESM methods both integrate the sequence and structural features of pre-miRNAs and miRNAs, and then use models based on logistic regression and gradient boosting (Gradient Boosting Machine, GBM; XGBoost) to predict key miRNAs. The difference is that PESM integrates more pre-miRNA features and nucleotide pair features and achieves better prediction performance. Among the models, XGBoost also achieves better prediction results.
Although the methods described above have achieved good results in identifying key miRNAs and help reduce the cost of biomedical research, they still have drawbacks. For example, they currently use only sequence-based structural and statistical features and do not adequately mine the characteristics of the sequence itself. The accuracy of the resulting prediction models is therefore limited, and key miRNAs cannot be identified accurately.
Disclosure of Invention
The invention provides a key miRNA identification method and a key miRNA identification system based on deep learning, which are used for solving the technical problem of low accuracy of the existing key miRNA identification method.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
A key miRNA identification method based on deep learning comprises the following steps:
obtaining, from historical data, a plurality of miRNAs of known category together with their pre-miRNAs, the categories being key miRNA and non-key miRNA;
performing the following steps for each miRNA:
extracting, from the nucleotide sequence information of the miRNA and its corresponding pre-miRNA, the statistical features $F_{stat}$ of the nucleotide sequences produced by cleavage of the pre-miRNA, and extracting the structural features $F_{struct}$ of the pre-miRNA; concatenating the statistical features $F_{stat}$ and the structural features $F_{struct}$ to obtain the static feature representation vector $F_{static}$ of the pre-miRNA; encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning features $H$ of the pre-miRNA; using an attention mechanism algorithm to extract, from the deep learning features of the pre-miRNA, the attention distribution feature $F_{att}$ with respect to the static feature representation vector of the pre-miRNA; and concatenating the attention distribution feature and the static feature representation vector to obtain the comprehensive feature $F$;
labeling the comprehensive features $F$ of the plurality of miRNAs of known category with classification labels to construct positive and negative training samples, training a pre-constructed classification prediction model with the positive and negative training samples, and identifying the category of a target miRNA with the trained classification prediction model.
Preferably, the statistical features $F_{stat}$ include: the length of the miRNA nucleotide sequence; the length of the portion of the pre-miRNA nucleotide sequence that remains after removal of the miRNA sequence; the statistics of each base in the pre-miRNA nucleotide sequence; the statistics of each base in the miRNA; the statistics of each base in the portion of the pre-miRNA nucleotide sequence that remains after removal of the miRNA nucleotide sequence; the frequency of each dinucleotide pair in the pre-miRNA nucleotide sequence; the frequency of each dinucleotide pair in the miRNA; and the type of cleavage site in the pre-miRNA nucleotide sequence.
Preferably, the structural features $F_{struct}$ comprise: the minimum free energy of the secondary structure of the pre-miRNA; the average minimum free energy per nucleotide, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length; the normalized base pairing property of the pre-miRNA; the normalized base pairing property based on nucleotide length; the normalized base pair Shannon entropy property; the Shannon entropy property based on nucleotide length; the normalized base pair distance property; and the base pair distance property based on nucleotide length.
Preferably, encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning features $H$ of the pre-miRNA comprises:
initially encoding the nucleotide sequence of the pre-miRNA in a 3-gram manner to obtain the initial encoding vector of the pre-miRNA:
$$X = (x_1, x_2, \ldots, x_{|S|-2})$$
wherein $s_i$ is the $i$-th base in the pre-miRNA, $i = 1, \ldots, |S|$; $|S|$ is the length of the pre-miRNA nucleotide sequence; and $x_i$ is the initial encoding vector obtained by 3-gram encoding starting from the $i$-th base, i.e. the encoding of $s_i s_{i+1} s_{i+2}$;
inputting the initial encoding vector of the pre-miRNA into a deep neural network, which performs $t$ rounds of convolution on the initial encoding vector, extracts the deep learning features of the local sub-vectors, and obtains the deep learning features of the initial encoding vector, $H = \{h_1^{(t)}, h_2^{(t)}, \ldots, h_{|L|}^{(t)}\}$, wherein $t$ is the number of CNN layers, $|L| = |S| - 2$ is the length of a sequence of length $|S|$ after 3-gram encoding, and $h_i^{(t)}$ is the feature vector obtained after $t$ rounds of convolution for the 3-gram starting at the $i$-th base.
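The 3-gram encoding step can be illustrated with a minimal sketch. It assumes the four RNA bases A, C, G, U and a randomly initialized lookup table with one vector per 3-gram; the function names, the embedding dimension, and the seed are hypothetical and are not taken from the patent.

```python
import random

def three_grams(seq):
    """Split a nucleotide sequence into overlapping 3-grams (|S| - 2 of them)."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def build_embedding_table(dim, seed=0):
    """Randomly initialise one vector per possible RNA 3-gram (4**3 = 64 entries).

    The dimension and the uniform initialisation are assumptions for illustration.
    """
    rng = random.Random(seed)
    bases = "ACGU"
    table = {}
    for a in bases:
        for b in bases:
            for c in bases:
                table[a + b + c] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table

def encode(seq, table):
    """Map each 3-gram of the sequence to its initial encoding vector."""
    return [table[g] for g in three_grams(seq)]

grams = three_grams("AUUCCG")
vectors = encode("AUUCCG", build_embedding_table(dim=8))
print(grams)         # ['AUU', 'UUC', 'UCC', 'CCG']
print(len(vectors))  # 4, i.e. |S| - 2 for |S| = 6
```

A sequence of length 6 yields 4 encoding vectors, matching the relation |L| = |S| - 2 stated in the text.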
Preferably, the deep learning features of the local sub-vectors are extracted by the following formulas:
$$h_i^{(1)} = f\left(W^{(1)} x_i + b^{(1)}\right)$$
$$h_i^{(k)} = f\left(W^{(k)} h_i^{(k-1)} + b^{(k)}\right), \quad k = 2, \ldots, t$$
$$H = \{h_1^{(t)}, h_2^{(t)}, \ldots, h_{|L|}^{(t)}\}$$
wherein $x_i$ is the initial encoding vector of the 3-gram starting at the $i$-th base; $f$ is the ReLU activation function; $W^{(k)}$ is a weight matrix; and $b^{(k)}$ is a bias term.
Preferably, the attention mechanism algorithm computes the attention weights from dot-product scalar values:
$$h_s = f\left(W F_{static} + b\right), \qquad h_i = f\left(W h_i^{(t)} + b\right), \qquad \alpha_i = \sigma\left(h_s^{\top} h_i\right)$$
$$F_{att} = \sum_{i=1}^{|L|} \alpha_i h_i$$
wherein $W$ and $b$ are the weight matrix and bias vector, respectively; $\alpha_i$ is the weight value representing the degree of association between the nucleotide sequence of a pre-miRNA and the static feature representation of that pre-miRNA; $\sigma$ is the hyperbolic tangent activation function; $f$ is the ReLU activation function; and $h_s$ and $h_i$ are the hidden vectors of the static feature representation vector $F_{static}$ and of $h_i^{(t)}$, respectively, from whose dot product the attention weight is computed.
Preferably, the classification prediction model is an LGBM classification model. When the LGBM classification model is trained, sampling is performed with the GOSS (Gradient-based One-Side Sampling) algorithm and the EFB (Exclusive Feature Bundling) algorithm, a level-wise growth strategy is executed, and the search strategy is a line search, defined as follows:
$$F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)$$
$$\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{N} L\left(y_i, F_{m-1}(x_i) + \alpha h_m(x_i)\right)$$
wherein $m$ is the iteration number; $F_m$ is the strong learner model at the $m$-th iteration; $h_m$ is the base decision tree of the $m$-th iteration; $\alpha_m$ is the weight parameter that combines the current base decision tree with the strong learner model; $N$ is the number of samples; and $L$ is the binary-classification GBDT loss function.
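The additive update with a line-searched step size can be sketched in plain Python. This is a generic gradient-boosting illustration under stated assumptions, not LightGBM itself: it uses one-feature decision stumps as base learners, the logistic loss as the binary loss, and a fixed grid of candidate step sizes; GOSS, EFB, and the tree growth strategy are not modeled, and all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, score):
    """Binary logistic loss for a label y in {0, 1} and a raw score."""
    p = min(max(sigmoid(score), 1e-12), 1 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature to the residuals."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= thr else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv

def line_search(ys, scores, tree_out, grid):
    """Pick the step size alpha that minimises the summed loss over all samples."""
    return min(grid, key=lambda a: sum(
        log_loss(y, s + a * h) for y, s, h in zip(ys, scores, tree_out)))

def boost(xs, ys, rounds=5):
    """Run F_m = F_{m-1} + alpha_m * h_m and record the total loss per round."""
    scores = [0.0] * len(xs)
    grid = [k / 10 for k in range(1, 51)]  # candidate alphas 0.1 .. 5.0 (assumed)
    history = []
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]  # negative gradient
        tree = fit_stump(xs, residuals)
        out = [tree(x) for x in xs]
        alpha = line_search(ys, scores, out, grid)
        scores = [s + alpha * h for s, h in zip(scores, out)]
        history.append(sum(log_loss(y, s) for y, s in zip(ys, scores)))
    return scores, history

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
scores, history = boost(xs, ys)
print([round(v, 3) for v in history])
```

On this toy data the recorded loss does not increase from one round to the next, which is the property the line search is meant to guarantee on the training set.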
Preferably, the LGBM classification model is defined as follows:
$$F_M(x) = \sum_{m=1}^{M} \alpha_m h_m(x)$$
wherein $M$ is the maximum number of iterations and $h_m(x)$ is a base decision tree;
the optimization objective of the LGBM classification model is a specific loss function, defined as follows:
$$\hat{F} = \arg\min_{F} \, E_{x,y}\left[L\left(y, F(x)\right)\right]$$
wherein $x$ is a sample feature, i.e. the comprehensive feature $F$ of any miRNA; $y$ is the label of the sample feature, i.e. the miRNA category corresponding to that comprehensive feature; $F(x)$ is the miRNA category predicted from the sample feature by the LGBM classification model; and $E_{x,y}[L(y, F(x))]$ is the mathematical expectation, averaged over all samples, of the error between the predicted result and the true result.
A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method described above when the computer program is executed.
The invention has the following beneficial effects:
1. The invention discloses a key miRNA identification method (EMDS: predicting the essentiality of miRNAs based on deep learning and sequences) and system based on deep learning. First, because mature miRNAs are cleaved from pre-miRNAs, the statistical features of the miRNA sequences are calculated. Then the structural features of the miRNA are calculated from the stem-loop structure of the pre-miRNA, and the two are concatenated to obtain the statistical and structural features of the miRNA. Next, because a miRNA sequence is ordered, sequential data, sequence features are obtained from the pre-miRNA sequence with a convolutional neural network (CNN). Based on the importance of the nucleotide sequence to the statistical and structural features, a deep learning feature of the miRNA is then obtained with an attention mechanism. Finally, the statistical, structural, and deep learning features of the miRNA are concatenated and input into an LGBM (Light Gradient Boosting Machine) classification model to identify key miRNAs. Compared with the prior art, the invention adds CNN-based sequence features to the original sequence statistical and structural features and, on that basis, proposes an attention-based way of obtaining the deep learning features of the miRNA according to the importance of the sequence features to the statistical and structural features.
The concatenated statistical, structural, and deep learning features are input into the LGBM classification model, which treats identification as a classification problem and computes the probability score that a miRNA is a key miRNA. This yields the final key miRNA prediction result and greatly improves the accuracy of key miRNA identification.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The invention will be described in further detail with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a method for identifying key miRNAs according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the operation of the attention mechanism model according to the embodiment of the present invention;
FIG. 3 is a graph comparing the five-fold cross-validation AUC values of EMDS and the comparison methods according to the embodiment of the present invention.
Detailed Description
Embodiments of the invention are described in detail below with reference to the attached drawings, but the invention can be implemented in a number of different ways, which are defined and covered by the claims.
Embodiment one:
First, because mature miRNAs are all produced by cleavage of precursor miRNAs (pre-miRNAs), nucleotide sequence statistical features based on the pre-miRNA and the miRNA are calculated. The structural features of the pre-miRNA are calculated from its hairpin structure. In addition, because a miRNA sequence is ordered, sequential data, and given the successful application of neural networks to natural language, a deep feature representation of the miRNA sequence is calculated with a CNN. Then, according to the importance of this deep feature representation to the sequence-based statistical and structural features, an attention mechanism is used to obtain the attention probability distribution vector of the miRNA (hereinafter referred to as the attention distribution feature). Finally, the statistical, structural, and deep learning features are concatenated and input into an LGBM (Light Gradient Boosting Machine) classifier to calculate the probability score that a miRNA is a key miRNA.
All known positive and negative sample data for key miRNAs in the invention come from the data published by Bartel in the paper "Metazoan MicroRNAs" in Cell. These data have also been used by other computational methods for key miRNA identification. As in the miES and PESM methods, the experimentally validated key miRNAs are used as positive samples, and the same number of miRNAs not confirmed to be key miRNAs are selected as negative samples. In addition, both the miRNA and the pre-miRNA sequence data come from the miRBase database. miRBase is a comprehensive database that provides miRNA sequence data, annotations, predicted gene targets, and other information; it covers both miRNAs and pre-miRNAs and, in terms of species, includes human, mouse, and others.
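The balanced sampling described above can be sketched as follows. This is a minimal illustration, assuming miRNAs are referred to by string identifiers; the function name, the fixed seed, and the example identifiers are hypothetical.

```python
import random

def build_balanced_samples(key_ids, candidate_ids, seed=0):
    """Label key miRNAs 1 and draw an equal number of unconfirmed miRNAs as label 0."""
    key_set = set(key_ids)
    # Negatives are drawn only from miRNAs not confirmed to be key miRNAs.
    pool = sorted(m for m in set(candidate_ids) if m not in key_set)
    if len(pool) < len(key_set):
        raise ValueError("not enough unconfirmed miRNAs to balance the positives")
    rng = random.Random(seed)
    negatives = rng.sample(pool, len(key_set))
    samples = [(m, 1) for m in sorted(key_set)] + [(m, 0) for m in negatives]
    rng.shuffle(samples)
    return samples

# Hypothetical identifiers, for illustration only.
samples = build_balanced_samples(["m1", "m2"], ["m1", "m2", "m3", "m4", "m5"])
print(sum(label for _, label in samples), len(samples))  # 2 4
```

Fixing the seed makes the negative set reproducible, which matters when comparing methods on the same samples.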
The whole flow of the key miRNA identification method based on deep learning and sequence is shown in the figure 1, and the method can be divided into the following steps:
(1) The statistical features $F_{stat}$ of the miRNA sequence are calculated as follows:
First, the sequence of a pre-miRNA is split into two parts: (1) the mature miRNA (miRNA); (2) the portion of the pre-miRNA sequence remaining after the mature miRNA sequence is removed (non-mature miRNA). The extracted nucleotide sequence statistical features $F_{stat}$ comprise:
1) The statistics of each of the three basic single nucleotides in the pre-miRNA and in the miRNA, each with dimension 3.
2) The length of the miRNA sequence, a feature of dimension 1.
3) The statistics of each of the three basic single nucleotides in the non-mature miRNA, i.e. the sequence remaining in the pre-miRNA after the miRNA is cleaved out, a feature of dimension 3.
4) The length of the non-mature miRNA sequence, a feature of dimension 1.
5) A feature of dimension 1 derived from the cleavage sites of the miRNA in the pre-miRNA, encoded as follows:
1: all cleavage sites of the miRNA in the pre-miRNA are U;
0: not all cleavage sites are U;
-1: all cleavage sites are non-U.
6) The frequencies of the dinucleotide pairs in the miRNA and in the pre-miRNA, calculated separately; the base types considered are U, C, and G.
In total, 30 statistical features are obtained from the pre-miRNA and miRNA sequences; two of them are floating-point numbers, one is an integer taking the value 1, 0, or -1, and the rest are positive integers.
The overall statistical features are summarized in Table 1.
Table 1. Summary of the nucleotide sequence statistical features $F_{stat}$
| Type | Description | Feature dimension |
| --- | --- | --- |
| Base content in pre-miRNA | Statistics of each basic single nucleotide in the pre-miRNA | 3 |
| Base content in miRNA | Statistics of each basic single nucleotide in the miRNA | 3 |
| miRNA sequence length | Length of the miRNA sequence | 1 |
| Base content in non-mature miRNA | Statistics of each basic single nucleotide in the portion of the pre-miRNA sequence remaining after removal of the miRNA sequence | 3 |
| Non-mature miRNA length | Length of the portion of the pre-miRNA sequence remaining after removal of the miRNA sequence | 1 |
| Cleavage site type | Cleavage sites are classified into 3 classes (1: all cleavage sites of the miRNA in the pre-miRNA are U; 0: not all cleavage sites are U; -1: all cleavage sites are non-U) | 1 |
| Dinucleotide pair frequency in pre-miRNA | Frequency of each dinucleotide pair in the pre-miRNA sequence | 9 |
| Dinucleotide pair frequency in miRNA | Frequency of each dinucleotide pair in the miRNA sequence | 9 |
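The 30-dimensional statistical feature vector can be assembled as in the sketch below. Several details are assumptions made only for illustration: the three base types are taken to be U, C, and G; single-nucleotide statistics are raw counts; dinucleotide frequencies are normalized by the number of adjacent pairs; and the cleavage sites are taken to be the first and last bases of the mature miRNA inside the pre-miRNA. The sequences are made up.

```python
from itertools import product

BASES = "UCG"  # assumed base types (the text names U, C, G for dinucleotide pairs)
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]  # 9 dinucleotide pairs

def base_counts(seq):
    """Count each considered base in the sequence (3 values)."""
    return [seq.count(b) for b in BASES]

def pair_frequencies(seq):
    """Frequency of each of the 9 dinucleotide pairs among adjacent positions."""
    total = max(len(seq) - 1, 1)
    return [sum(seq[i:i + 2] == p for i in range(len(seq) - 1)) / total for p in PAIRS]

def cleavage_site_type(pre, start, end):
    """1 if both assumed cleavage-site bases are U, -1 if neither is, 0 otherwise."""
    sites = [pre[start], pre[end - 1]]
    u = sum(b == "U" for b in sites)
    return 1 if u == len(sites) else (-1 if u == 0 else 0)

def statistical_features(pre, mirna):
    """Assemble 3 + 3 + 1 + 3 + 1 + 1 + 9 + 9 = 30 statistical features."""
    start = pre.find(mirna)
    if start < 0:
        raise ValueError("miRNA not found in pre-miRNA")
    end = start + len(mirna)
    rest = pre[:start] + pre[end:]  # non-mature portion of the pre-miRNA
    return (base_counts(pre) + base_counts(mirna) + [len(mirna)]
            + base_counts(rest) + [len(rest)]
            + [cleavage_site_type(pre, start, end)]
            + pair_frequencies(pre) + pair_frequencies(mirna))

feats = statistical_features("GGCUAUUCCGAUGCC", "UAUUCCGA")  # made-up sequences
print(len(feats))  # 30
```

The order of the blocks follows Table 1, so the index of each feature can be read off from the table's dimension column.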
(2) The structural features $F_{struct}$ based on the pre-miRNA are calculated as follows:
1) Because the minimum free energy of the pre-miRNA is an important indicator of the genetic robustness of the miRNA, the minimum free energy of the secondary structure of the pre-miRNA and the minimum free energy normalized by the pre-miRNA sequence length are both calculated, giving a feature of dimension 2.
2) Six base pairing properties are calculated from the pre-miRNA sequence: the normalized base pairing property; the normalized base pairing property based on nucleotide length; the normalized base pair Shannon entropy property; the Shannon entropy property based on nucleotide length; the normalized base pair distance property; and the base pair distance property based on nucleotide length.
This gives structural features $F_{struct}$ of dimension 8, which are summarized in Table 2.
Table 2. Summary of the pre-miRNA structural features $F_{struct}$
| Type | Description | Feature dimension |
| --- | --- | --- |
| Minimum free energy; average minimum free energy per nucleotide in the sequence | Minimum free energy of the secondary structure of the pre-miRNA; average minimum free energy per nucleotide, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length | 2 |
| Secondary structure features of the pre-miRNA | Normalized base pairing property of the pre-miRNA; normalized base pairing property based on nucleotide length; normalized base pair Shannon entropy property; Shannon entropy property based on nucleotide length; normalized base pair distance property; base pair distance property based on nucleotide length | 6 |
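A few of the structural quantities can be sketched once a secondary structure is available. The sketch assumes the structure is given in dot-bracket notation together with its minimum free energy, as an RNA folding tool would report them; the folding itself is not performed here. The patent text does not give exact definitions for the normalized base pairing, entropy, and distance properties, so the helpers below are illustrative assumptions, and the example energy and probabilities are made up.

```python
import math

def count_base_pairs(dot_bracket):
    """Count matched '(' ... ')' pairs in a dot-bracket secondary structure."""
    depth = pairs = 0
    for ch in dot_bracket:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced structure")
            depth -= 1
            pairs += 1
    if depth:
        raise ValueError("unbalanced structure")
    return pairs

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def length_normalised(value, seq):
    """Divide a structural quantity by the nucleotide length of the sequence."""
    return value / len(seq)

seq, structure, mfe = "GGGAAAUCCC", "(((....)))", -3.2  # made-up hairpin and energy
pairs = count_base_pairs(structure)
print(pairs, length_normalised(mfe, seq), length_normalised(pairs, seq))
print(shannon_entropy([0.5, 0.5]))
```

Dividing the minimum free energy by the sequence length corresponds to the "average minimum free energy per nucleotide" row of Table 2; the other helpers only indicate the kind of computation involved.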
(3) Because the nucleotide sequence of a miRNA is ordered, sequential data, CNN-based deep learning features $H = \{h_1^{(t)}, h_2^{(t)}, \ldots, h_{|L|}^{(t)}\}$ are constructed, where $|L|$ is the length of the initialized feature after the CNN iterations. The specific process is as follows:
First, the sequence is randomly initialized and encoded in a 3-gram manner to obtain the initial pre-miRNA sequence features. Taking the sequence "AUUCCG" as an example, its 3-gram representation is "AUU, UUC, UCC, CCG". For a pre-miRNA sequence $S = s_1 s_2 \ldots s_{|S|}$ of length $|S|$, the initial encoding vector expressed in 3-gram form is:
$$X = (x_1, x_2, \ldots, x_{|S|-2})$$
where $x_i$ is the feature of the 3-gram subsequence $s_i s_{i+1} s_{i+2}$ starting at the $i$-th base.
These initialized pre-miRNA sequence features are input into the CNN, which applies a filter function. In the first layer, given the input $x_i$, the CNN outputs a hidden vector $h_i^{(1)}$, calculated as follows:
$$h_i^{(1)} = f\left(W^{(1)} x_i + b^{(1)}\right)$$
where $f$ is the ReLU activation function, $W^{(1)}$ is a weight matrix, and $b^{(1)}$ is a bias term. Following the same procedure, the $t$-th layer is calculated as:
$$h_i^{(t)} = f\left(W^{(t)} h_i^{(t-1)} + b^{(t)}\right)$$
After $t$ layers of iteration, a set of deep learning features $H = \{h_1^{(t)}, h_2^{(t)}, \ldots, h_{|L|}^{(t)}\}$ is obtained.
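The layer-by-layer computation can be sketched in plain Python. This is a simplified, position-wise illustration: each layer applies an affine map followed by ReLU independently at every 3-gram position, whereas a real convolution would also mix neighboring positions within a filter window. The layer sizes, the random initialization, and all function names are assumptions.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    """Compute W x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def random_layer(out_dim, in_dim, rng):
    """Create one layer (weight matrix, zero bias) with assumed uniform initialisation."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim

def deep_features(X, layers):
    """Apply ReLU(W h + b) layer by layer at every 3-gram position."""
    H = X
    for W, b in layers:
        H = [relu(affine(W, h, b)) for h in H]
    return H

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 three-grams, 8-dim each
layers = [random_layer(8, 8, rng), random_layer(6, 8, rng), random_layer(6, 6, rng)]  # t = 3
H = deep_features(X, layers)
print(len(H), len(H[0]))  # 4 6
```

The number of positions is unchanged by the layers, so the output set has one hidden vector per 3-gram, as in the definition of the deep learning feature set.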
(4) Static characteristic expression vector obtained by splicing miRNA sequence statistical characteristics and structural characteristics
Figure 931837DEST_PATH_IMAGE003
Its dimension is 38. Then, the deep learning feature acquired based on CNN is +.>
Figure 875522DEST_PATH_IMAGE004
Deep learning feature of miRNA with equal dimension acquired by attention mechanism>
Figure 44466DEST_PATH_IMAGE005
Its dimension is also 38. The miRNA deep learning characteristic acquisition process comprises the following steps:
first, statistical features of nucleotide sequences based on pre-miRNA and miRNA sequences are obtained
Figure 22787DEST_PATH_IMAGE001
And structural features->
Figure 386028DEST_PATH_IMAGE002
miRNA static characteristic expression vector after splicing (Concate) operation>
Figure 817010DEST_PATH_IMAGE003
. Based on this feature, it was then combined with deep learning features acquired based on CNN and pre-miRNA +.>
Figure 117541DEST_PATH_IMAGE004
Attention distribution characteristics of miRNA obtained by adopting attention mechanism>
Figure 560155DEST_PATH_IMAGE005
The attention mechanism and the feature acquisition process are shown in fig. 2.
Given the static feature representation vector $F_{sn}$ of the miRNA sequence and the set of implicit vectors for the nucleotides of the pre-miRNA sequence, we wish to obtain the attention distribution feature $F_d$. The 3-gram implicit vector of each nucleotide of the pre-miRNA sequence is weighted according to its relevance to the static feature representation vector $F_{sn}$, and the weighted implicit vectors yield the final attention distribution feature $F_d$ of the miRNA. In other words, the attention mechanism computes the significance of each nucleotide in the pre-miRNA sequence to the statistical and structural representation and gives greater weight to the significant nucleotides. The calculation process is as follows:
[formula image not reproduced]
where $W_{inter}$ and $b_{inter}$ are a weight matrix and a bias vector, respectively, and $\alpha_i$ is the weight value, i.e., the attention, which represents the degree of association between the nucleotide sequence of a pre-miRNA and its static feature representation. Based on this weight, the final attention distribution feature $F_d$ of the miRNA is calculated as follows:
[formula image not reproduced]
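The attention step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the patented formulas, most of which are not reproduced in this text: only the projection of the static feature vector, h_m = f(W_inter F_sn + b_inter), appears in the claims. The projection of each implicit vector, the tanh of the dot product as the attention weight, and the weighted sum as the final feature are assumed forms, and the names `attention_feature`, `relu` and `affine` are hypothetical.

```python
import math


def relu(vector):
    return [max(0.0, v) for v in vector]


def affine(weights, bias, vector):
    """Compute W * vector + b for a weight matrix given as nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def attention_feature(static_vec, hidden_vecs, w_inter, b_inter):
    """Return (attention weights, attention distribution feature).

    ASSUMED forms, chosen to match the symbols named in the claims:
      h_m     = ReLU(W_inter F_sn + b_inter)
      h_i     = ReLU(W_inter c_i  + b_inter)
      alpha_i = tanh(h_m . h_i)
      F_d     = sum_i alpha_i * h_i
    """
    h_m = relu(affine(w_inter, b_inter, static_vec))
    weights, feature = [], [0.0] * len(h_m)
    for c_i in hidden_vecs:
        h_i = relu(affine(w_inter, b_inter, c_i))
        alpha_i = math.tanh(sum(a * b for a, b in zip(h_m, h_i)))
        weights.append(alpha_i)
        feature = [f + alpha_i * h for f, h in zip(feature, h_i)]
    return weights, feature


# Toy example with d = 2 instead of the 38 dimensions used in the description.
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, f_d = attention_feature(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, [0.0, 0.0]
)
```

In the toy call, the implicit vector aligned with the static vector receives the larger weight and the orthogonal one receives zero weight, which is the behaviour the paragraph above describes. Because the same projection is applied to both inputs, the sketch assumes the implicit vectors have the same dimension as the static feature vector.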
(5) The statistical and structural features of the miRNA sequence are integrated with the equal-dimension attention distribution feature $F_d$ of the miRNA acquired by the CNN and the attention mechanism. The final features of the miRNA are input into an LGBM classifier to construct a prediction model and identify key miRNAs, as follows:
First, the two features of the same dimension (38 dimensions), $F_{sn}$ and $F_d$, are spliced to obtain the final comprehensive feature $F_f$ of the miRNA. The final comprehensive feature is then input into an LGBM classifier for key miRNA identification. Identifying key miRNAs is a typical binary classification problem. The LGBM model is a typical boosting ensemble model and, like XGBoost (eXtreme Gradient Boosting), is an implementation of GBDT (Gradient Boosting Decision Tree). In the GBDT model, given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the sample feature and $y_i$ is the sample label, the optimization objective of GBDT is to minimize a specific loss function, which is defined as follows:
$\hat{f} = \arg\min_{f} E_{(x,y)}\left[L(y, f(x))\right]$
To reduce the loss function, GBDT uses a line search strategy, which is defined as follows:
$F_\alpha(x) = F_{\alpha-1}(x) + \xi_\alpha h_\alpha(x)$
$\xi_\alpha = \arg\min_{\xi} \sum_{i=1}^{n} L\left(y_i, F_{\alpha-1}(x_i) + \xi h_\alpha(x_i)\right)$
where $\alpha$ and $h_\alpha(x)$ represent the iteration number and the basic decision tree, respectively. Compared with GBDT, LGBM uses GOSS (gradient-based one-side sampling) and EFB (exclusive feature bundling) to improve prediction accuracy when the number of samples and features is large, and it also greatly improves training efficiency. LGBM is also a GBDT method, and following the definition of GBDT, the LGBM model is defined as a weighted combination as follows:
[formula image not reproduced]
where $m$ is the maximum number of iterations and $h_m$ is a basic decision tree.
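The boosting update with a line search can be illustrated with a small, self-contained sketch. It is not LightGBM and uses neither GOSS nor EFB; it is plain gradient boosting under assumptions chosen for brevity: depth-one decision stumps as the basic trees, the logistic loss as the binary loss, and a ternary search over the interval [0, 10] as the numerical line search for the step weight. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(y, score):
    """Binary log-loss for a label y in {0, 1} and a raw score (log-odds)."""
    return math.log(1.0 + math.exp(score)) - y * score


def fit_stump(X, residuals):
    """Least-squares decision stump (a single split) fitted to the residuals."""
    mean_all = sum(residuals) / len(residuals)
    best_err, best_split = float("inf"), None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best_err, best_split = err, (j, thr, lm, rm)
    if best_split is None:            # no valid split: constant tree
        return lambda x: mean_all
    j, thr, lm, rm = best_split
    return lambda x: lm if x[j] <= thr else rm


def line_search(y, scores, h_values, upper=10.0, steps=60):
    """Approximate argmin over xi of sum_i L(y_i, F(x_i) + xi * h(x_i))."""
    def total(xi):
        return sum(logistic_loss(yi, s + xi * h)
                   for yi, s, h in zip(y, scores, h_values))
    lo, hi = 0.0, upper
    for _ in range(steps):            # ternary search on a convex function
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if total(m1) < total(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


def boost(X, y, rounds=3):
    """Return the list of (xi, tree) pairs and the training loss after each round."""
    scores, model, history = [0.0] * len(X), [], []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]  # negative gradient
        tree = fit_stump(X, residuals)
        h_values = [tree(row) for row in X]
        xi = line_search(y, scores, h_values)
        scores = [s + xi * h for s, h in zip(scores, h_values)]    # F <- F + xi * h
        model.append((xi, tree))
        history.append(sum(logistic_loss(yi, s) for yi, s in zip(y, scores)))
    return model, history


def predict(model, x):
    return 1 if sigmoid(sum(xi * tree(x) for xi, tree in model)) > 0.5 else 0


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model, history = boost(X, y)
```

Each round fits a tree to the negative gradient of the loss and then chooses the step weight by the line search, which mirrors the update F_alpha(x) = F_(alpha-1)(x) + xi_alpha h_alpha(x) stated in claim 7.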
Furthermore, the Level-wise (layer-by-layer) growth strategy used by XGBoost keeps growing the tree, and adding computation, until a stop condition is reached, and it introduces many unnecessary splits whose node gain is too small. LGBM instead adopts a Leaf-wise (leaf-by-leaf) growth strategy: at each step it finds, among all current leaves, the leaf with the maximum splitting gain (generally the one with the largest amount of data), splits it, and repeats. Compared with Level-wise growth in XGBoost, Leaf-wise growth can reduce more error for the same number of splits and achieves better prediction accuracy.
To verify the validity of the method, five-fold cross-validation is used. The dataset includes 77 positive samples, which are key miRNAs that have been validated experimentally; the negative samples are an equal number of randomly selected miRNAs that have not yet been experimentally validated as key miRNAs. To evaluate the accuracy of the prediction method, the positive and negative samples are randomly divided into 5 groups; each group is selected in turn as the test set while the remaining 4 groups form the training set, and the predictions of the method are then compared with the key miRNA labels of the samples in the test set. The performance of the algorithm is evaluated using AUC (the area under the ROC curve), F1-score, and ACC (accuracy).
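The evaluation protocol described above can be sketched with the standard library alone. The sketch assumes a stratified split, so that each fold keeps the balance of positives and negatives, and computes AUC as the fraction of correctly ranked positive-negative pairs; the source only states that samples are divided randomly into 5 groups, so the stratification and all function names are assumptions made for illustration.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices round-robin
    return folds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def auc(y_true, scores):
    """Area under the ROC curve as the share of correctly ranked (positive, negative) pairs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1] * 77 + [0] * 77           # 77 positives and 77 negatives, as in the text
folds = stratified_folds(labels)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A model would be trained on train_idx and scored on test_idx here.
```

With 77 samples per class, the five folds cannot be of equal size; the round-robin dealing gives each fold 15 or 16 samples of each class.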
Table 3. Prediction performance of the EMDS method of the invention and other methods
[table image not reproduced]
Table 3 shows that the method of the invention outperforms the other four comparison algorithms in the five-fold cross-validation test. The AUC value of the invention is 0.9335, while the AUC values of the other four algorithms are 0.9117 (PESM), 0.8837 (miES), 0.8720 (GaussianNB, the Gaussian naive Bayes algorithm), and 0.8571 (SVM, support vector machine). In addition, the invention obtains ACC and F1-score values of 0.8768 and 0.8759, respectively, which are also superior to those of the best comparison method, PESM (ACC and F1-score of 0.8516 and 0.8572, respectively).
Fig. 3 shows the ROC curves and AUC values in five-fold cross-validation for the EMDS method of the invention and the other 4 methods, where False Positive Rate and True Positive Rate denote the false positive rate and the true positive rate. As can be seen, EMDS achieves the greatest AUC value, 0.9335, which is above the AUC values achieved by the other comparison methods (PESM: 0.9117, miES: 0.8837, GaussianNB: 0.8720, SVM: 0.8571).
The verification experiments and the comparison with the prediction performance of the other 4 methods show that the invention identifies key miRNAs more accurately and can also provide important help for follow-up research on understanding the pathogenesis of miRNA-related diseases and on their diagnosis, treatment, and drug development.
The above description covers only the preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (9)

1. The key miRNA identification method based on deep learning is characterized by comprising the following steps of:
obtaining, from historical data, a plurality of pre-miRNAs and their miRNAs of known classes, the classes being key miRNAs and non-key miRNAs;
the following steps are performed for each miRNA:
extracting, from the nucleotide sequence information of the miRNA and its corresponding pre-miRNA, the nucleotide sequence statistical features $F_n$ of the miRNA formed by cleavage of the pre-miRNA, and extracting the structural features $F_s$ of the pre-miRNA; splicing the statistical features $F_n$ with the structural features $F_s$ to obtain the static feature representation vector $F_{sn}$ of the pre-miRNA; encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning feature $C$ of the pre-miRNA; extracting, using an attention mechanism algorithm, the attention distribution feature $F_d$ of the deep learning feature of the pre-miRNA on the static feature representation vector of the pre-miRNA; and splicing the attention distribution feature and the static feature representation vector to obtain the comprehensive feature $F_f$ of the miRNA;
labeling the comprehensive features $F_f$ of the plurality of miRNAs of known classes with classification labels to construct positive and negative training samples, training a pre-constructed classification prediction model with the positive and negative training samples, and identifying the class of a target miRNA with the trained classification prediction model.
2. The deep learning-based key miRNA identification method of claim 1, wherein the statistical features $F_n$ include: the length of the miRNA nucleotide sequence; the length of the portion of the pre-miRNA nucleotide sequence remaining after removal of the miRNA sequence; the statistics of each nucleotide base in the pre-miRNA nucleotide sequence; the statistics of each nucleotide base in the miRNA; the statistics of each nucleotide base in the portion of the pre-miRNA nucleotide sequence remaining after removal of the miRNA nucleotide sequence; the frequency of each dinucleotide pair in the miRNA; and the splice site type of the miRNA nucleotide sequence within the pre-miRNA nucleotide sequence.
3. The deep learning-based key miRNA identification method of claim 1 or 2, wherein the structural features $F_s$ include: the minimum free energy of the secondary structure of the pre-miRNA; the average minimum free energy per nucleotide, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length; the standardized base-pairing property of the pre-miRNA; the standardized base-pairing property based on nucleotide length; the standardized base-pairing Shannon entropy property; the Shannon entropy property based on nucleotide length; the standardized base-pair distance property; and the base-pair distance property based on nucleotide length.
4. A deep learning based key miRNA identification method according to claim 3, wherein the nucleotide sequence of the pre-miRNA is encoded, the encoded nucleotide sequence is input into a deep neural network, and the deep learning feature C of the pre-miRNA is extracted, comprising the steps of:
the nucleotide sequence of the pre-miRNA is initially encoded by adopting a 3-gram encoding mode, and an initial encoding vector of the pre-miRNA is obtained:
$[X_1; X_2; X_3], [X_2; X_3; X_4], \ldots, [X_{|S|-2}; X_{|S|-1}; X_{|S|}]$,
where $X_i$ is the $i$-th base in the pre-miRNA, $i = 1, 2, 3, \ldots, |S|$; $|S|$ is the length of the pre-miRNA nucleotide sequence; and $[X_i; X_{i+1}; X_{i+2}] \in R^d$ is the initial coding vector obtained by the 3-gram coding method starting from the $i$-th base;
inputting the initial coding vector of the pre-miRNA into a deep neural network, which extracts the deep learning features of the local sub-vectors through $t$ rounds of convolution operations on the initial coding vector, to obtain the deep learning feature of the initial coding vector:
[formula image not reproduced]
where $t$ is the number of layers of the CNN; $|L| = |S| - 2$ is the length of the sequence of length $|S|$ after 3-gram coding; and [symbol image not reproduced] is the feature vector obtained after $t$ rounds of convolution for the 3-gram code starting from the $i$-th base.
5. The deep learning-based key miRNA identification method of claim 4, wherein extracting the deep learning features of the local sub-vectors is achieved by the following formula:
[three formula images not reproduced]
where [symbol image not reproduced] is the initial coding vector of the 3-gram code starting from the $i$-th base; $f$ is the ReLU activation function; $W_{conv} \in R^{d \times d}$ is a weight matrix; and $b_{conv}$ is a bias term.
6. The deep learning-based key miRNA identification method of claim 5, wherein the attention mechanism algorithm calculates the attention weights based on dot-product scalar values as follows:
$h_m = f(W_{inter} F_{sn} + b_{inter})$,
[three formula images not reproduced]
where $W_{inter}$ and $b_{inter}$ are a weight matrix and a bias vector, respectively; $\alpha_i$ is the weight value representing the degree of association between the nucleotide sequence of a pre-miRNA and its static feature representation; $\sigma$ is the hyperbolic tangent activation function; $f$ is the ReLU activation function; and $h_m$ and $h_i$ are the implicit vectors of $F_{sn}$ and of [symbol image not reproduced], respectively, used to calculate the attention weights from the dot-product scalar values.
7. The deep learning-based key miRNA identification method of claim 6, wherein the classification prediction model is an LGBM classification model; during training, the LGBM classification model samples using the GOSS algorithm and the EFB algorithm and executes a Level-wise growth strategy, and the search strategy is a line search, specifically defined as follows:
$F_\alpha(x) = F_{\alpha-1}(x) + \xi_\alpha h_\alpha(x)$
$\xi_\alpha = \arg\min_{\xi} \sum_{i=1}^{n} L\left(y_i, F_{\alpha-1}(x_i) + \xi h_\alpha(x_i)\right)$
where $\alpha$ is the iteration number; $F_\alpha(x)$ is the strong learner model at the $\alpha$-th iteration; $h_\alpha(x)$ is the basic decision tree of the $\alpha$-th iteration; $\xi$ is the weight parameter for combining the current basic decision tree with the strong learner model; $n$ is the number of samples; and $L$ is the binary GBDT loss function.
8. The deep learning-based key miRNA identification method of claim 7, wherein the LGBM classification model is defined as follows:
[formula image not reproduced]
where $m$ is the maximum number of iterations and $h_m$ is a basic decision tree;
the optimization objective of the LGBM classification model is to minimize a specific loss function, which is defined as follows:
$\hat{f} = \arg\min_{f} E_{(x,y)}\left[L(y, f(x))\right]$
where $x$ is the sample feature, i.e., the comprehensive feature $F_f$ of any miRNA; $y$ is the label of the sample feature, i.e., the class label of the comprehensive feature $F_f$; $f(x)$ is the miRNA class predicted by the LGBM classification model from the sample feature; and $E_{(x,y)}$ is the mathematical expectation of the error between the predicted and true results over all samples.
9. A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of the preceding claims 1 to 8 when the computer program is executed.
CN202210780583.9A 2022-07-04 2022-07-04 Method and system for identifying key miRNA based on deep learning Active CN116052774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210780583.9A CN116052774B (en) 2022-07-04 2022-07-04 Method and system for identifying key miRNA based on deep learning


Publications (2)

Publication Number Publication Date
CN116052774A true CN116052774A (en) 2023-05-02
CN116052774B CN116052774B (en) 2023-11-28

Family

ID=86126083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210780583.9A Active CN116052774B (en) 2022-07-04 2022-07-04 Method and system for identifying key miRNA based on deep learning

Country Status (1)

Country Link
CN (1) CN116052774B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012031033A2 (en) * 2010-08-31 2012-03-08 Lawrence Ganeshalingam Method and systems for processing polymeric sequence data and related information
WO2019077494A1 (en) * 2017-10-16 2019-04-25 King Abdullah University Of Science And Technology System, apparatus, and method for sequence-based enzyme ec number prediction by deep learning
CN112270958A (en) * 2020-10-23 2021-01-26 大连民族大学 Prediction method based on hierarchical deep learning miRNA-lncRNA interaction relation
CN114023376A (en) * 2021-11-02 2022-02-08 四川大学 RNA-protein binding site prediction method and system based on self-attention mechanism
CN114093425A (en) * 2021-11-29 2022-02-25 湖南大学 lncRNA and disease association prediction method fusing heterogeneous network and graph neural network
EP3981003A1 (en) * 2019-06-07 2022-04-13 Leica Microsystems CMS GmbH A system and method for training machine-learning algorithms for processing biology-related data, a microscope and a trained machine learning algorithm
CN114496092A (en) * 2022-02-09 2022-05-13 中南林业科技大学 miRNA and disease association relation prediction method based on graph convolution network
CN114496084A (en) * 2022-02-08 2022-05-13 中南林业科技大学 Efficient prediction method for association relation between circRNA and miRNA
CN114582526A (en) * 2022-03-03 2022-06-03 湖南中医药大学 Similarity and tensor decomposition-based microorganism-disease association relation prediction method
CN114664376A (en) * 2022-03-31 2022-06-24 重庆邮电大学 miRNA-mRNA target prediction method based on sequence statistical characterization learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHENG YAN,ET AL.: "PESM: predicting the essentiality of miRNAs based on gradient boosting machines and sequences", BMC BIOINFORMATICS, pages 1 - 5 *
FEI SONG, ET AL.: "miES: predicting the essentiality of miRNAs with machine learning and sequence features", BIOINFORMATICS, no. 6, pages 1053 - 1054 *
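万美含, et al.: "Gene function prediction based on a hierarchical attention mechanism over heterogeneous networks", Computer Engineering, no. 07, pages 49 - 55 *
何沉峰: "A method for predicting mature microRNA based on energy and structure", Master's Theses Electronic Journal, Basic Sciences Series, no. 11, pages 1 - 21 *
林云光; 陈月辉; 邵光亭: "miRNA prediction based on feedforward artificial neural networks", Computer Technology and Development, no. 05, pages 25 - 28 *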

Also Published As

Publication number Publication date
CN116052774B (en) 2023-11-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant