CN116052774A

CN116052774A - Method and system for identifying key miRNA based on deep learning

Info

Publication number: CN116052774A
Application number: CN202210780583.9A
Authority: CN
Inventors: 严承; 张蕾; 黄辛迪
Original assignee: Hunan University of Chinese Medicine
Current assignee: Hunan University of Chinese Medicine
Priority date: 2022-07-04
Filing date: 2022-07-04
Publication date: 2023-05-02
Anticipated expiration: 2042-07-04
Also published as: CN116052774B

Abstract

The invention discloses a deep-learning-based method and system for identifying key miRNAs. For each of a plurality of miRNAs, the nucleotide sequence statistical features arising from cleavage of the pre-miRNA, the structural features, and the deep learning features are obtained; the attention distribution feature of each deep learning feature over the corresponding statistical and structural features is calculated; and the calculated attention distribution feature is spliced with the corresponding statistical and structural features to obtain a comprehensive feature. A training set is constructed from the comprehensive features and used to train a constructed classification prediction model.

Description

Method and system for identifying key miRNA based on deep learning

Technical Field

The invention relates to the field of system biology, in particular to a key miRNA identification method and system based on deep learning.

Background

Recent studies have shown that non-coding RNAs play a very important role in many essential life processes, such as cell growth and proliferation. miRNAs, a class of non-coding RNAs approximately 22 nt in length, are closely and inextricably associated with many complex human diseases. To understand the effects of miRNAs on human diseases more deeply, it is therefore necessary to identify the key miRNAs involved in life processes. Because biomedical experiments are inherently costly in manpower, funding, and materials, computational methods for predicting key miRNAs have a useful role to play. Accordingly, as the underlying data on key miRNAs and computing technology have developed, computational methods for identifying key miRNAs have begun to appear.

Currently, there are 2 main categories of methods for key miRNA recognition:

(1) Biological experiment measuring method

This is the traditional biological and medical experimental approach, and its advantage is high accuracy. Its disadvantages are also significant: when there are many candidate key miRNAs, it requires a great deal of time and money. Using this approach, researchers have constructed a basic data set of key miRNAs, which also provides the data basis for developing subsequent computational methods.

(2) Prediction method based on machine learning

This approach starts from the biogenesis of miRNA: it constructs a feature set describing the sequence and structure of the miRNA from the pre-miRNA to the mature miRNA, and then builds a classification model on that feature set by combining a machine learning method with a data set of known key miRNA samples. Such models convert the identification of key miRNAs into a classification problem and provide an important reference for accelerating subsequent research on miRNA-related pathogenesis.

For example, the miES and PESM methods both integrate the sequence and structural features of pre-miRNAs and miRNAs, and then predict key miRNAs with models based on logistic regression and gradient boosting (Gradient Boosting Machine, GBM; XGBoost). The difference is that PESM integrates more pre-miRNA features and nucleotide pair features and obtains better prediction performance. In terms of the model, XGBoost also achieves better prediction results.

Although the above methods have achieved good results in identifying key miRNAs and help reduce the cost of biomedical research, they still have drawbacks. For example, they currently use only sequence-based structural and statistical features and do not adequately mine the characteristics of the sequence itself. As a result, the accuracy of the constructed prediction models is limited, and key miRNAs cannot be identified accurately.

Disclosure of Invention

The invention provides a key miRNA identification method and a key miRNA identification system based on deep learning, which are used for solving the technical problem of low accuracy of the existing key miRNA identification method.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

a key miRNA identification method based on deep learning comprises the following steps:

obtaining, from historical data, a plurality of miRNAs of known classes and their pre-miRNAs, the classes being key miRNA and non-key miRNA;

the following steps are performed for each miRNA:

extracting, from the nucleotide sequence information of the miRNA and its corresponding pre-miRNA, the nucleotide sequence statistical features F_n arising from cleavage of the pre-miRNA, and extracting the structural features F_s of the pre-miRNA; splicing the statistical features F_n with the structural features F_s to obtain the static feature representation vector F_sn of the pre-miRNA; encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning feature C of the pre-miRNA; using an attention mechanism algorithm to extract the attention distribution feature F_d of the deep learning feature of the pre-miRNA over the static feature representation vector of the pre-miRNA; and splicing the attention distribution feature with the static feature representation vector to obtain the comprehensive feature F_f of the miRNA;

labeling the comprehensive features F_f of the plurality of miRNAs of known classes with classification labels to construct positive and negative training samples, training a pre-constructed classification prediction model with the positive and negative training samples, and identifying the class of a target miRNA with the trained classification prediction model.

Preferably, the statistical features F_n include: the length of the miRNA nucleotide sequence; the length of the portion of the pre-miRNA nucleotide sequence that remains after removal of the miRNA sequence; the statistics of each basic nucleotide in the pre-miRNA nucleotide sequence; the statistics of each basic nucleotide in the miRNA; the statistics of each basic nucleotide in the portion of the pre-miRNA nucleotide sequence that remains after removal of the miRNA nucleotide sequence; the frequency of each dinucleotide pair in the pre-miRNA nucleotide sequence; the frequency of each dinucleotide pair in the miRNA; and the type of splice site of the miRNA in the pre-miRNA nucleotide sequence.

Preferably, the structural features F_s include: the minimum free energy of the secondary structure of the pre-miRNA; the average minimum free energy per nucleotide, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length; the normalized base pairing property of the pre-miRNA; the normalized base pairing property based on nucleotide length; the normalized base pair Shannon entropy property; the Shannon entropy property based on nucleotide length; the normalized base pair distance property; and the base pair distance property based on nucleotide length.

Preferably, encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning feature C of the pre-miRNA comprises:

initially encoding the nucleotide sequence of the pre-miRNA in a 3-gram manner to obtain the initial coding vectors of the pre-miRNA:

[X_1; X_2; X_3], [X_2; X_3; X_4], ..., [X_{|S|-2}; X_{|S|-1}; X_{|S|}],

where X_i is the i-th base in the pre-miRNA, i = 1, 2, ..., |S|; |S| is the length of the pre-miRNA nucleotide sequence; and [X_i; X_{i+1}; X_{i+2}] ∈ R^d is the initial coding vector obtained by 3-gram coding starting from the i-th base;

inputting the initial coding vectors of the pre-miRNA into a deep neural network, which performs t rounds of convolution on them to extract the deep learning features of the local sub-vectors and thereby obtain the deep learning feature C of the initial coding vectors, where t is the number of CNN layers, |L| = |S| - 2 is the length of a sequence of length |S| after 3-gram coding, and each element of C is the feature vector obtained after t rounds of convolution from the 3-gram starting at the i-th base;

Preferably, the deep learning feature of each local sub-vector is extracted by applying the ReLU activation function to a linear transformation, defined by a weight matrix and a bias term, of the initial coding vector of the 3-gram starting from the i-th base.

Preferably, the attention mechanism algorithm calculates the attention weights from dot-product scalar values. The calculation uses a weight matrix and a bias vector, the hyperbolic tangent activation function, and the ReLU activation function; the attention weight is calculated from the dot-product scalar value of two hidden vectors, namely the hidden vector of the static feature representation vector and the hidden vector of a deep learning feature vector. The weight value expresses the degree of association between the nucleotide sequence of a pre-miRNA and the static feature representation of that pre-miRNA.

Preferably, the classification prediction model is an LGBM classification model. When the LGBM classification model is trained, sampling is performed with the GOSS and EFB algorithms, a Leaf-wise growth strategy is executed, and the search strategy is a line search: at each iteration, the strong learner model is formed by adding to the strong learner model of the previous iteration the basic decision tree of the current iteration, scaled by a combination weight parameter, where the weight parameter is chosen to minimize the binary-classification GBDT loss function summed over all samples.

Preferably, the LGBM classification model is defined as the weighted sum of the basic decision trees over all iterations up to the maximum number of iterations;

the optimization objective of the LGBM classification model is to minimize a specific loss function, namely the mathematical expectation, averaged over all samples, of the error between the predicted result and the true result, where the sample feature is the comprehensive feature F_f of any miRNA, the label of the sample feature is the miRNA class corresponding to that comprehensive feature F_f, and the predicted result is the miRNA class predicted by the LGBM classification model from the sample feature.

A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method described above when the computer program is executed.

The invention has the following beneficial effects:

1. The invention discloses a deep-learning-based method and system for identifying key miRNAs (EMDS: predicting the essentiality of miRNAs based on deep learning and sequences). First, because mature miRNAs are cleaved from pre-miRNAs, the statistical features of the miRNA sequences are calculated. Then, the structural features of the miRNA are calculated from the stem-loop structure of the pre-miRNA, and the two are spliced to obtain the statistical and structural features of the miRNA. Next, based on the sequential (time-series) nature of the miRNA sequence, sequence features are obtained from the pre-miRNA sequence using a convolutional neural network (CNN). Based on the importance of the nucleotide sequence to the statistical and structural features, a deep learning feature of the miRNA is then obtained with an attention mechanism. Finally, the statistical, structural, and deep learning features of the miRNA are spliced together and input into an LGBM (Light Gradient Boosting Machine) classification model, which treats the task as a classification problem and calculates the probability score of a key miRNA to give the final prediction result. Compared with the prior art, the invention adds to the original sequence statistical and structural features both CNN-based sequence features and an attention-based way of obtaining deep learning features that reflects the importance of the sequence features to the statistical and structural features, thereby greatly improving the accuracy of key miRNA identification.

In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The invention will be described in further detail with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:

FIG. 1 is a flowchart of a method for identifying a key miRNA according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the operation of the attention mechanism model according to the embodiment of the present invention;

FIG. 3 is a comparison of the AUC values achieved by EMDS and the comparison methods under five-fold cross-validation, according to an embodiment of the present invention.

Detailed Description

Embodiments of the invention are described in detail below with reference to the attached drawings, but the invention can be implemented in a number of different ways, which are defined and covered by the claims.

Embodiment one:

First, because all mature miRNAs are produced by cleavage of precursor miRNAs (pre-miRNAs), nucleotide sequence statistical features based on the pre-miRNA and the miRNA are calculated. The structural features of the pre-miRNA are calculated from its hairpin structure. In addition, given the sequential (time-series) nature of the miRNA sequence and the successful application of neural networks to natural language, a deep feature representation of the sequence is calculated using a CNN. Then, according to the importance of this deep feature representation to the sequence-based statistical and structural features, an attention distribution vector of the miRNA (hereinafter referred to as the attention distribution feature) is obtained using an attention mechanism. Finally, the statistical, structural, and deep learning features are spliced together and input into an LGBM (Light Gradient Boosting Machine) classifier to calculate the probability score of a key miRNA.

All known positive and negative sample data for key miRNAs used in the invention come from the paper "Metazoan MicroRNAs" published by Bartel in Cell. These data have also been used by other computational methods for key miRNA identification. As in the miES and PESM methods, the experimentally validated key miRNAs are used as positive samples, and the same number of miRNAs that have not been confirmed as key miRNAs are selected as negative samples. In addition, the miRNA and pre-miRNA sequence data both come from the miRBase database, a comprehensive database that provides miRNA sequence data, annotations, predicted gene targets, and related information; it contains both miRNAs and pre-miRNAs and covers species such as human and mouse.

The whole flow of the key miRNA identification method based on deep learning and sequence is shown in the figure 1, and the method can be divided into the following steps:

(1) The statistical features F_n of the miRNA sequences are calculated as follows:

First, the sequence of a pre-miRNA is split into two parts: (1) the mature miRNA (miRNA); and (2) the portion of the pre-miRNA sequence that remains after the mature miRNA sequence is removed (non-mature miRNA). The extracted nucleotide sequence statistical features F_n comprise:

1) The statistics of the basic single nucleotides in the pre-miRNA and in the miRNA; three nucleotide types are counted, giving features of dimension 3 for each.

2) The length of the miRNA sequence, giving a feature of dimension 1.

3) The statistics of the basic single nucleotides in the non-mature miRNA, that is, the sequence remaining in the pre-miRNA after the miRNA is cleaved out, giving features of dimension 3.

4) The length of the non-mature miRNA sequence, giving a feature of dimension 1.

5) A feature of dimension 1 calculated from the splice sites of the miRNA in the pre-miRNA, encoded as follows:

1: all cleavage sites of the miRNA in the pre-miRNA are U;

0: not all cleavage sites are U;

-1: all cleavage sites are non-U.

6) The frequencies of the dinucleotide pairs in the miRNA and in the pre-miRNA, calculated separately; the nucleotide types considered are U, C, and G, giving features of dimension 9 for each.

In total, 30 statistical features are obtained from the pre-miRNA and miRNA sequences: two groups (the dinucleotide pair frequencies in the pre-miRNA and in the miRNA) are floating-point numbers, one feature is an integer taking the value 1, 0, or -1, and the rest are positive integers.

Specifically, the overall statistical profile is summarized in table 1.

TABLE 1 Summary of nucleotide sequence statistical features

Type	Description	Feature dimension
Basic content in pre-miRNA	Statistics of the basic single nucleotides in the pre-miRNA	3
Basic content in miRNA	Statistics of the basic single nucleotides in the miRNA	3
miRNA sequence length	Length of the miRNA sequence	1
Basic content in non-mature miRNA	Statistics of the basic single nucleotides in the portion of the pre-miRNA sequence remaining after removal of the miRNA sequence	3
Non-mature miRNA length	Length of the portion of the pre-miRNA sequence remaining after removal of the miRNA sequence	1
Splice site type	Cleavage sites are classified into 3 classes (1: all cleavage sites of the miRNA in the pre-miRNA are U; 0: not all cleavage sites are U; -1: all cleavage sites are non-U)	1
Frequency of dinucleotide pairs in pre-miRNA	Frequency of the dinucleotide pairs in the pre-miRNA sequence	9
Frequency of dinucleotide pairs in miRNA	Frequency of the dinucleotide pairs in the miRNA sequence	9
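As a non-limiting illustration, the 30 features summarized in Table 1 could be computed along the following lines. This is a sketch under stated assumptions rather than the implementation of the invention: all function names are hypothetical, the cleavage sites are assumed to be the first and last bases of the mature miRNA, the non-mature miRNA is assumed to be the two flanking segments joined together, and dinucleotide frequencies are assumed to be relative to all adjacent pairs in the sequence.

```python
from itertools import product

BASES = "UCG"  # the three nucleotide types named in the text


def base_counts(seq):
    """Count each considered base in seq (3 values)."""
    return [seq.count(b) for b in BASES]


def dinucleotide_freqs(seq):
    """Frequency of each of the 9 ordered pairs over U, C, G (9 values).

    Assumption: frequencies are relative to all adjacent pairs in seq.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs) or 1  # guard against sequences shorter than 2
    return [pairs.count(a + b) / total for a, b in product(BASES, repeat=2)]


def cleavage_site_type(mirna):
    """1 if all sites are U, -1 if none is U, 0 otherwise.

    Assumption: the sites are the first and last bases of the mature miRNA.
    """
    sites = [mirna[0], mirna[-1]]
    n_u = sum(s == "U" for s in sites)
    if n_u == len(sites):
        return 1
    return -1 if n_u == 0 else 0


def statistical_features(pre_mirna, mirna):
    """Return the 30-dimensional statistical feature list (order as in Table 1)."""
    start = pre_mirna.find(mirna)
    if start < 0:
        raise ValueError("mature miRNA not found in pre-miRNA")
    # Assumption: the non-mature part is the two flanks joined together.
    non_mature = pre_mirna[:start] + pre_mirna[start + len(mirna):]
    return (
        base_counts(pre_mirna)             # 3
        + base_counts(mirna)               # 3
        + [len(mirna)]                     # 1
        + base_counts(non_mature)          # 3
        + [len(non_mature)]                # 1
        + [cleavage_site_type(mirna)]      # 1
        + dinucleotide_freqs(pre_mirna)    # 9
        + dinucleotide_freqs(mirna)        # 9
    )
```

For a made-up pair such as pre-miRNA "GGCUUCGAUUCCGGAC" and miRNA "UUCGAUUCCG", the function returns a list of 30 values whose layout follows the rows of Table 1.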

(2) The structural features F_s based on the pre-miRNA are calculated as follows:

1) Since the minimum free energy of the pre-miRNA is an important characteristic of the structural robustness (genetic robustness) of the miRNA, the minimum free energy of the secondary structure of the pre-miRNA and the minimum free energy normalized by the length of the pre-miRNA sequence are calculated, giving features of dimension 2.

2) Six base pairing properties are calculated from the pre-miRNA sequence: the normalized base pairing property; the normalized base pairing property based on nucleotide length; the normalized base pair Shannon entropy property; the Shannon entropy property based on nucleotide length; the normalized base pair distance property; and the base pair distance property based on nucleotide length. This gives the structural features F_s of dimension 8.

Wherein the overall structural feature summary is shown in table 2.

TABLE 2 Summary of pre-miRNA structural features

Type	Description	Feature dimension
Minimum free energy; average minimum free energy per nucleotide in the sequence	Minimum free energy of the secondary structure of the pre-miRNA; average minimum free energy per nucleotide, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length	2
Secondary structure features of the pre-miRNA	Normalized base pairing property of the pre-miRNA; normalized base pairing property based on nucleotide length; normalized base pair Shannon entropy property; Shannon entropy property based on nucleotide length; normalized base pair distance property; base pair distance property based on nucleotide length	6
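As a non-limiting illustration of the length normalization in Table 2, the sketch below derives the per-nucleotide minimum free energy and a simple base pairing count from a secondary structure. It rests on assumptions not stated in the text: the structure is assumed to be given in dot-bracket notation and the minimum free energy is assumed to come from an external folding tool; the function name and the choice of the paired-base count as the base pairing property are hypothetical, and the Shannon entropy and base pair distance properties are not covered.

```python
def structure_summary(dot_bracket, mfe):
    """Length-normalised structural quantities for one pre-miRNA.

    dot_bracket: secondary structure such as "((((...))))" (assumed input).
    mfe: minimum free energy of that structure (assumed precomputed).
    """
    n = len(dot_bracket)
    if n == 0:
        raise ValueError("empty structure")
    if dot_bracket.count("(") != dot_bracket.count(")"):
        raise ValueError("unbalanced dot-bracket string")
    pairs = dot_bracket.count("(")  # each "(" closes with one ")"
    return {
        "mfe": mfe,
        "mfe_per_nt": mfe / n,            # MFE divided by sequence length
        "base_pairs": pairs,
        "base_pairs_per_nt": pairs / n,   # pairing count normalised by length
    }
```

The first two entries correspond to the 2-dimensional minimum free energy features; the remaining entries only illustrate how a pairing quantity might be normalized by nucleotide length.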

(3) CNN-based deep learning features C are constructed from the sequential (time-series) characteristics of the nucleotide sequence; the length of these features is that of the initialization features after the CNN iterations. The specific process is as follows:

First, the sequence is encoded with random initialization in a 3-gram manner to obtain the initial pre-miRNA sequence features. Taking the sequence "AUUCCG" as an example, its 3-gram representation is "AUU, UUC, UCC, CCG". Thus, for a pre-miRNA sequence of length |S|, the initial coding vectors expressed in the 3-gram manner are:

[X_1; X_2; X_3], [X_2; X_3; X_4], ..., [X_{|S|-2}; X_{|S|-1}; X_{|S|}],

where [X_i; X_{i+1}; X_{i+2}] ∈ R^d is the feature of the 3-gram starting from the i-th base of the sequence.
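As a non-limiting illustration, the 3-gram splitting and random initial encoding described above might look as follows. The function names, the embedding dimension, the uniform initialization range, and the choice to share one vector among identical 3-grams are all assumptions made for this sketch.

```python
import random


def three_grams(sequence, n=3):
    """Overlapping n-grams; a sequence of length |S| yields |S| - n + 1 of them."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def random_embeddings(grams, dim, seed=0):
    """Randomly initialised vector of size dim for each n-gram.

    Assumption: identical n-grams share the same vector.
    """
    rng = random.Random(seed)
    table = {}
    for gram in grams:
        if gram not in table:
            table[gram] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [table[gram] for gram in grams]


grams = three_grams("AUUCCG")            # ['AUU', 'UUC', 'UCC', 'CCG']
vectors = random_embeddings(grams, dim=4)
```

A sequence of length 6 gives 4 three-grams, matching |L| = |S| - 2.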

These initialization features of the pre-miRNA sequence are input into the CNN, which applies a filter function. In the first layer, the filter outputs a hidden vector for each input 3-gram vector, calculated by applying the ReLU activation function to a linear transformation of the input defined by a weight matrix and a bias term. Each subsequent layer applies the same calculation to the hidden vectors output by the previous layer. Thus, after t layers of iteration, we obtain a set of deep learning features C.
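As a non-limiting illustration, the layer-by-layer calculation can be sketched in plain Python. The sketch assumes that each layer applies one shared filter (a weight matrix and a bias term followed by ReLU) to the vector at every 3-gram position; the function names, dimensions, and random weights are hypothetical, and no training is shown.

```python
import random


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def cnn_features(inputs, layers):
    """Apply t layers in order; layers is a list of (weights, bias) pairs.

    inputs holds one vector per 3-gram position, and the same filter is
    applied at every position (assumption of this sketch).
    """
    hidden = inputs
    for weights, bias in layers:
        hidden = [relu(affine(weights, bias, x)) for x in hidden]
    return hidden


def random_layer(dim_in, dim_out, rng):
    """Hypothetical helper: random weights and zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim_in)]
               for _ in range(dim_out)]
    return weights, [0.0] * dim_out


rng = random.Random(0)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(4)]
layers = [random_layer(6, 5, rng), random_layer(5, 5, rng)]  # t = 2
C = cnn_features(inputs, layers)  # 4 hidden vectors of size 5
```

The number of output vectors equals the number of 3-gram positions, so the length |L| is preserved across layers.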

(4) The miRNA sequence statistical features and structural features are spliced to obtain the static feature representation vector F_sn, whose dimension is 38. Then, from the CNN-based deep learning features C, an attention mechanism is used to obtain a deep learning feature F_d of the miRNA of equal dimension, that is, also 38. The miRNA deep learning feature is obtained as follows:

First, the nucleotide sequence statistical features F_n, based on the pre-miRNA and miRNA sequences, and the structural features F_s are spliced (concatenated) to obtain the miRNA static feature representation vector F_sn. This vector is then combined with the deep learning features C, obtained from the pre-miRNA by the CNN, and an attention mechanism is applied to obtain the attention distribution feature F_d of the miRNA. The attention mechanism and the feature acquisition process are shown in FIG. 2.

Given the static feature representation vector F_sn of the miRNA sequence and the set C of hidden vectors for the 3-grams of the pre-miRNA sequence, we wish to obtain the attention distribution feature F_d. The importance of each 3-gram hidden vector of the pre-miRNA sequence to the static feature representation vector F_sn is expressed as a weight, and the final attention distribution feature F_d of the miRNA is obtained by weighting the 3-gram hidden vectors accordingly. In other words, the attention mechanism calculates the significance of each nucleotide position in the pre-miRNA sequence to the statistical and structural representation and gives greater weight to the significant positions. The calculation uses a weight matrix and a bias vector. The resulting weight value, that is, the attention, expresses the degree of association between the nucleotide sequence of a pre-miRNA and its static feature representation. The final attention distribution feature F_d of the miRNA is calculated from these weights.
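As a non-limiting illustration, a dot-product attention of the kind described above can be sketched as follows. Several details are assumptions of this sketch and are not confirmed by the text: the same weight matrix and bias vector are used for the static vector and for every CNN vector, the CNN vectors are assumed to have the same dimension as the static vector, ReLU produces the hidden vectors, tanh is applied to the dot product, and the weighted hidden vectors are averaged. All names are hypothetical.

```python
import math


def relu(vector):
    return [max(0.0, x) for x in vector]


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_feature(static_vec, cnn_vecs, weights, bias):
    """Return (attention weights, attention distribution feature).

    static_vec plays the role of F_sn and cnn_vecs the role of C.
    """
    h_static = relu(affine(weights, bias, static_vec))
    hidden = [relu(affine(weights, bias, c)) for c in cnn_vecs]
    # One scalar weight per 3-gram position, from a dot product.
    alphas = [math.tanh(dot(h_static, h)) for h in hidden]
    # Assumption: weighted hidden vectors are averaged over positions.
    feature = [sum(a * h[k] for a, h in zip(alphas, hidden)) / len(hidden)
               for k in range(len(h_static))]
    return alphas, feature


identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, f_d = attention_feature([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                                identity, [0.0, 0.0])
f_f = [1.0, 0.0] + f_d  # splicing F_sn with F_d gives the comprehensive feature
```

With 38-dimensional inputs, the same splicing step would give a 76-dimensional comprehensive feature.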

(5) The miRNA sequence statistical and structural features are integrated with the attention distribution feature F_d of equal dimension obtained by the CNN and the attention mechanism. The final features of the miRNA are input into an LGBM classifier to construct a prediction model and identify key miRNAs, as follows:

First, the features F_sn and F_d, which have the same dimension (38), are spliced to obtain the final comprehensive feature F_f of the miRNA. The comprehensive feature is then input into an LGBM classifier to predict key miRNAs. Identifying key miRNAs is a typical binary classification problem. The LGBM model is a typical boosting ensemble model and, like XGBoost (eXtreme Gradient Boosting), is an implementation of GBDT (Gradient Boosting Decision Tree). In the GBDT model, given a training set of sample features and their sample labels, the optimization objective of GBDT is to minimize a specific loss function.

To reduce the loss function, GBDT uses a line search, defined in terms of the number of iterations and the basic decision tree. Compared with GBDT, LGBM uses GOSS (gradient-based one-side sampling) and EFB (exclusive feature bundling) to improve prediction accuracy when the numbers of samples and features are large, and it also greatly improves training efficiency. LGBM is itself a GBDT method; following the definition of GBDT and combining the weights, the LGBM model is defined as the weighted sum of the basic decision trees over all iterations up to the maximum number of iterations.
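As a non-limiting illustration of the boosting scheme with line search, the sketch below builds a model as a weighted sum of basic decision trees, choosing each combination weight to minimize the binary logistic loss over all samples. It is a plain gradient boosting sketch, not LightGBM: GOSS, EFB, and Leaf-wise growth are not implemented, the basic trees are depth-1 stumps, the line search is a grid search, and all names and the grid of candidate weights are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, score):
    """Binary logistic loss for a label y in {0, 1} and a raw score."""
    return math.log(1.0 + math.exp(score)) - y * score


def fit_stump(X, residuals):
    """Least-squares stump: (feature index, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]


def stump_predict(stump, row):
    j, thr, lv, rv = stump
    return lv if row[j] <= thr else rv


def boost(X, y, n_iter=10):
    """Return a list of (weight, stump); the model is their weighted sum."""
    grid = [0.1 * k for k in range(1, 51)]  # candidate combination weights
    scores = [0.0] * len(y)
    model = []
    for _ in range(n_iter):
        # Negative gradient of the logistic loss at the current scores.
        residuals = [t - sigmoid(s) for t, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        h = [stump_predict(stump, row) for row in X]
        # Line search: weight minimising the total loss over all samples.
        rho = min(grid, key=lambda r: sum(
            log_loss(t, s + r * hv) for t, s, hv in zip(y, scores, h)))
        scores = [s + rho * hv for s, hv in zip(scores, h)]
        model.append((rho, stump))
    return model


def predict_proba(model, row):
    return sigmoid(sum(rho * stump_predict(s, row) for rho, s in model))
```

On a toy one-feature data set such as X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]] with labels [0, 0, 0, 1, 1, 1], the returned model assigns a probability below 0.5 to small feature values and above 0.5 to large ones.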

Furthermore, XGBoost uses a Level-wise (layer-by-layer) growth strategy, which keeps splitting every node of a level until a stopping condition is reached; this adds computation and introduces many unnecessary splits whose node gain is too small. LGBM instead adopts a Leaf-wise growth strategy: at each step it finds, among all current leaves, the leaf with the largest splitting gain (generally the one with the most data) and splits it, and repeats this cycle. Compared with the Level-wise strategy in XGBoost, Leaf-wise growth reduces more error for the same number of splits and achieves better prediction accuracy.

To verify the validity of the method, five-fold cross-validation is used. The data set includes 77 positive samples, namely key miRNAs that have been validated experimentally; the negative samples are an equal number of randomly selected miRNAs that have not yet been validated experimentally as key miRNAs. To evaluate the accuracy of the prediction method, the positive and negative samples are randomly divided into 5 groups; each group in turn is used as the test set while the remaining 4 groups form the training set, and the predictions of the method are compared with the key miRNA samples in the test set. The performance of the algorithm is evaluated using AUC (area under the ROC curve), F1-score, and ACC (accuracy).
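As a non-limiting illustration, the five-fold split and the three evaluation measures can be sketched in plain Python. The function names, the random seed, and the way the folds are cut are assumptions of this sketch; the AUC is computed as the fraction of positive-negative pairs ranked correctly, which equals the area under the ROC curve.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs after a random shuffle."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 77 positive and 77 negative samples give 154 samples in total.
folds = k_fold_indices(154, k=5)
```

Each of the 154 samples appears in exactly one test fold, and the training and test indices of a fold never overlap.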

TABLE 3 predictive performance Table for EMDS and other methods of the invention

Table 3 shows that, in the five-fold cross-validation test, the present invention outperforms the four comparison algorithms. The AUC value of the present invention is 0.9335, while the AUC values of the other four algorithms are 0.9117 (PESM), 0.8837 (miES), 0.8720 (GaussianNB, the Gaussian naive Bayes algorithm), and 0.8571 (SVM, support vector machine). In addition, the ACC and F1-score of the present invention are 0.8768 and 0.8759, respectively, which are also better than those of the best comparison method, PESM (ACC and F1-score of 0.8516 and 0.8572, respectively).

FIG. 3 shows the AUC plots under five-fold cross-validation for the EMDS method of the invention and the other 4 methods, where False Positive Rate denotes the false positive rate and True Positive Rate denotes the true positive rate. As can be seen, EMDS achieves the greatest AUC value, 0.9335, which is above the AUC values achieved by the comparison methods (PESM: 0.9117, miES: 0.8837, GaussianNB: 0.8720, SVM: 0.8571).

The above verification experiments and the comparison with the predictive performance of the other 4 methods show that the method of the invention identifies key miRNAs more accurately, and it can also provide important help for follow-up research on the understanding, diagnosis, treatment, and drug development of miRNA-related diseases.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The key miRNA identification method based on deep learning is characterized by comprising the following steps of:

the following steps are performed for each miRNA:

extracting, from the nucleotide sequence information of the miRNA and its corresponding pre-miRNA, the nucleotide sequence statistical features F_n arising from cleavage of the pre-miRNA, and extracting the structural features F_s of the pre-miRNA; splicing the statistical features F_n with the structural features F_s to obtain the static feature representation vector F_sn of the pre-miRNA; encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning feature C of the pre-miRNA; using an attention mechanism algorithm to extract the attention distribution feature F_d of the deep learning feature of the pre-miRNA over the static feature representation vector of the pre-miRNA; and splicing the attention distribution feature with the static feature representation vector to obtain the comprehensive feature F_f of the miRNA;

labeling the comprehensive features F_f of the plurality of miRNAs of known classes with classification labels to construct positive and negative training samples, training a pre-constructed classification prediction model with the positive and negative training samples, and identifying the class of a target miRNA with the trained classification prediction model.

2. The deep learning-based key miRNA identification method of claim 1, wherein the statistical feature $F_n$ includes: the length of the miRNA nucleotide sequence; the length of the portion of the pre-miRNA nucleotide sequence remaining after removal of the miRNA sequence; the statistics of each nucleotide base in the pre-miRNA nucleotide sequence; the statistics of each nucleotide base in the miRNA; the statistics of each nucleotide base in the portion of the pre-miRNA nucleotide sequence remaining after removal of the miRNA nucleotide sequence; the frequency of each dinucleotide pair in the miRNA; and the cleavage site type of the miRNA nucleotide sequence within the pre-miRNA nucleotide sequence.
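A minimal sketch of how a vector in the spirit of $F_n$ might be computed is given below. Several choices are assumptions made here for illustration: the per-base statistics are taken to be raw counts, the miRNA is assumed to occur as a substring of the pre-miRNA, and the site type is encoded as 0.0 or 1.0 depending on which half of the pre-miRNA the miRNA lies in. The function name `statistical_features` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"


def statistical_features(mirna: str, pre_mirna: str) -> list[float]:
    """Return a 31-dimensional F_n-style vector (layout is an assumption)."""
    start = pre_mirna.find(mirna)
    if start < 0:
        raise ValueError("miRNA sequence not found in pre-miRNA sequence")
    # Portion of the pre-miRNA left after removing the miRNA.
    remainder = pre_mirna[:start] + pre_mirna[start + len(mirna):]

    features = [float(len(mirna)), float(len(remainder))]

    # Per-base counts in the pre-miRNA, the miRNA, and the remainder.
    for seq in (pre_mirna, mirna, remainder):
        counts = Counter(seq)
        features.extend(float(counts[base]) for base in BASES)

    # Frequency of each of the 16 dinucleotide pairs in the miRNA.
    pairs = Counter(mirna[i:i + 2] for i in range(len(mirna) - 1))
    total_pairs = max(len(mirna) - 1, 1)
    features.extend(pairs[a + b] / total_pairs
                    for a, b in product(BASES, repeat=2))

    # Site type (assumed encoding): 0.0 if the miRNA midpoint lies in the
    # first half of the pre-miRNA, 1.0 otherwise.
    midpoint = start + len(mirna) / 2
    features.append(0.0 if midpoint < len(pre_mirna) / 2 else 1.0)
    return features
```

For example, `statistical_features("UGAGG", "AAUGAGGCCCCCCCC")` yields a vector whose first two entries are the miRNA length and the remainder length.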

3. The deep learning-based key miRNA identification method of claim 1 or 2, wherein the structural feature $F_s$ includes: the minimum free energy of the secondary structure of the pre-miRNA; the per-nucleotide average minimum free energy, obtained by dividing the minimum free energy of the secondary structure of the pre-miRNA by the sequence length; the standardized base-pairing attribute of the pre-miRNA; the standardized base-pairing attribute based on nucleotide length; the standardized base-pairing Shannon entropy attribute; the Shannon entropy attribute based on nucleotide length; the standardized base-pair distance attribute; and the base-pair distance attribute based on nucleotide length.
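The length-normalized quantities in claim 3 can be illustrated with a small sketch. It assumes that the secondary structure (in dot-bracket notation) and its minimum free energy have already been produced by an external folding tool such as RNAfold, and it covers only two of the listed attributes. The function name and the reading of the base-pairing attribute as "number of base pairs divided by length" are assumptions made here.

```python
def structural_features(structure: str, mfe: float) -> dict[str, float]:
    """Derive simple length-normalized features from a dot-bracket structure.

    `structure` and `mfe` are assumed to come from an external folding tool.
    """
    length = len(structure)
    if length == 0:
        raise ValueError("empty structure")
    if structure.count("(") != structure.count(")"):
        raise ValueError("unbalanced dot-bracket structure")

    base_pairs = structure.count("(")  # each '(' opens exactly one base pair
    return {
        "mfe": mfe,
        "mfe_per_nt": mfe / length,           # minimum free energy / length
        "pairs_per_nt": base_pairs / length,  # assumed base-pairing attribute
    }
```

The Shannon entropy and base-pair distance attributes depend on base-pair probabilities and are left out of this sketch.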

4. The deep learning-based key miRNA identification method of claim 3, wherein encoding the nucleotide sequence of the pre-miRNA, inputting the encoded nucleotide sequence into a deep neural network, and extracting the deep learning feature $C$ of the pre-miRNA comprises the following steps:

encoding the nucleotide sequence of the pre-miRNA with a 3-gram encoding method to obtain the initial encoding vectors:

$$[X_1; X_2; X_3],\ [X_2; X_3; X_4],\ \ldots,\ [X_{|S|-2}; X_{|S|-1}; X_{|S|}],$$

wherein $X_i$ ($i = 1, 2, 3, \ldots, |S|$) is the $i$-th base in the pre-miRNA; $|S|$ is the length of the pre-miRNA nucleotide sequence; and $[X_i; X_{i+1}; X_{i+2}] \in \mathbb{R}^d$ is the initial encoding vector obtained by the 3-gram encoding method starting from the $i$-th base;

inputting the initial encoding vectors of the pre-miRNA into the deep neural network, which applies $t$ rounds of convolution to the initial encoding vectors and extracts the deep learning features of the local sub-vectors, thereby obtaining the deep learning feature $C$ of the initial encoding vectors,

wherein $t$ is the number of CNN layers; $|L| = |S| - 2$ is the length of a sequence of length $|S|$ after 3-gram encoding; and the $i$-th element of $C$ is the feature vector obtained after $t$ rounds of convolution from the 3-gram code starting at the $i$-th base.
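The 3-gram windowing described in claim 4 can be sketched as follows. The integer vocabulary `TRIGRAM_IDS` is a hypothetical choice made for this illustration; in practice each id would typically be mapped to a $d$-dimensional vector, a step the sketch omits.

```python
from itertools import product

BASES = "ACGU"

# Hypothetical vocabulary: each of the 64 possible 3-grams gets an integer id.
TRIGRAM_IDS = {"".join(p): i for i, p in enumerate(product(BASES, repeat=3))}


def trigram_windows(sequence: str) -> list[str]:
    """Overlapping windows [X_i; X_{i+1}; X_{i+2}] for i = 1 .. |S| - 2."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def encode_trigrams(sequence: str) -> list[int]:
    """Map each 3-gram window to its integer id."""
    return [TRIGRAM_IDS[window] for window in trigram_windows(sequence)]
```

A sequence of length $|S|$ produces $|S| - 2$ windows, which matches $|L| = |S| - 2$ in the claim.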

5. The deep learning-based key miRNA identification method of claim 4, wherein extracting the deep learning features of the local sub-vectors is achieved by the following formula:

wherein the input to the first round of convolution is the initial encoding vector of the 3-gram code starting from the $i$-th base; $f$ is the ReLU activation function; $W_{conv} \in \mathbb{R}^{d \times d}$ is a weight matrix; and $b_{conv}$ is a bias term.
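Based only on the symbols defined in claim 5 (an activation $f$, a weight matrix $W_{conv} \in \mathbb{R}^{d \times d}$, and a bias $b_{conv}$), one round of the feature extraction might look like the sketch below. It assumes that each round applies the same affine map followed by ReLU to the vector at every position, with no mixing between positions; the exact formula is not reproduced in the text, so this form and the helper names are assumptions.

```python
def relu(vector):
    """Element-wise ReLU activation f."""
    return [max(0.0, value) for value in vector]


def affine(W, x, b):
    """Compute W x + b for a d x d matrix W and d-dimensional x and b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def conv_rounds(vectors, W_conv, b_conv, t):
    """Apply c <- f(W_conv c + b_conv) to every position, t times (assumed form)."""
    current = [list(v) for v in vectors]
    for _ in range(t):
        current = [relu(affine(W_conv, c, b_conv)) for c in current]
    return current
```

The output has one $d$-dimensional vector per 3-gram position, so its length equals the number of initial encoding vectors.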

6. The deep learning-based key miRNA identification method of claim 5, wherein the attention mechanism algorithm computes the attention weights based on dot-product scalar values:

$$h_m = f(W_{inter} F_{sn} + b_{inter}),$$

wherein $W_{inter}$ and $b_{inter}$ are a weight matrix and a bias vector, respectively; $\alpha_i$ is a weight value representing the degree of association between the nucleotide sequence of a pre-miRNA and the static feature representation; $\sigma$ is the hyperbolic tangent activation function; $f$ is the ReLU activation function; and $h_m$ and $h_i$ are the hidden vectors of $F_{sn}$ and of the $i$-th deep learning feature vector, respectively, used to compute the attention weights based on dot-product scalar values.
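A sketch of a dot-product attention step consistent with the symbols in claim 6 is given below. Only the formula for $h_m$ appears in the text, so the rest is assumed here: $h_i$ is computed with the same $W_{inter}$ and $b_{inter}$ (which requires the deep learning feature vectors and $F_{sn}$ to share one dimension), $\alpha_i = \sigma(h_m \cdot h_i)$ with $\sigma = \tanh$, and $F_d$ is the $\alpha$-weighted sum of the $h_i$. The function names are hypothetical.

```python
import math


def relu(vector):
    """Element-wise ReLU activation f."""
    return [max(0.0, value) for value in vector]


def affine(W, x, b):
    """Compute W x + b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def attention_feature(C, F_sn, W_inter, b_inter):
    """Return (alphas, F_d) for deep features C and static vector F_sn."""
    h_m = relu(affine(W_inter, F_sn, b_inter))             # as given in the claim
    hidden = [relu(affine(W_inter, c, b_inter)) for c in C]  # assumed form of h_i
    # Attention weight from the dot-product scalar, squashed by tanh (assumed).
    alphas = [math.tanh(sum(a * b for a, b in zip(h_m, h_i)))
              for h_i in hidden]
    # Assumed aggregation: alpha-weighted sum of the hidden vectors.
    F_d = [sum(alpha * h_i[k] for alpha, h_i in zip(alphas, hidden))
           for k in range(len(h_m))]
    return alphas, F_d
```

Concatenating the returned `F_d` with `F_sn` would then give a comprehensive feature in the sense of claim 1.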

7. The deep learning-based key miRNA identification method of claim 6, wherein the classification prediction model is an LGBM classification model; during training, the LGBM classification model samples using the GOSS algorithm and the EFB algorithm and follows a Level-wise growth strategy; and the search strategy uses a line search, defined as follows:

$$F_\alpha(x) = F_{\alpha-1}(x) + \xi_\alpha h_\alpha(x)$$

wherein $\alpha$ is the iteration index; $F_\alpha(x)$ is the strong learner model at the $\alpha$-th iteration; $h_\alpha(x)$ is the basic decision tree of the $\alpha$-th iteration; $\xi_\alpha$ is the weight parameter with which the current basic decision tree is combined with the strong learner model; $n$ is the number of samples; and $L$ is the binary-classification GBDT loss function.
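The additive update $F_\alpha(x) = F_{\alpha-1}(x) + \xi_\alpha h_\alpha(x)$ can be illustrated with a small sketch in which $\xi_\alpha$ is chosen to minimize the total loss over the $n$ samples. The sketch does not use the LightGBM library; the logistic loss, the finite grid of candidate weights, and the function names are assumptions made for illustration.

```python
import math


def log_loss(y, score):
    """Binary logistic loss for a label y in {0, 1} and a raw score."""
    if y == 1:
        return math.log(1.0 + math.exp(-score))
    return math.log(1.0 + math.exp(score))


def line_search_step(y, F_prev, h, candidates):
    """Pick xi from `candidates` minimizing sum_i L(y_i, F_prev_i + xi * h_i).

    y       -- labels of the n samples
    F_prev  -- scores of the strong learner after the previous iteration
    h       -- outputs of the current basic decision tree on the same samples
    """
    def total_loss(xi):
        return sum(log_loss(yi, fi + xi * hi)
                   for yi, fi, hi in zip(y, F_prev, h))

    xi = min(candidates, key=total_loss)
    F_new = [fi + xi * hi for fi, hi in zip(F_prev, h)]
    return xi, F_new
```

A tree whose outputs point in the right direction receives a large weight, while a tree that would increase the loss receives the smallest available weight.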

8. The deep learning-based key miRNA identification method of claim 7, wherein the LGBM classification model is defined as follows:

wherein $m$ is the maximum number of iterations and $h_m$ is a basic decision tree;

wherein $x$ is the sample feature, namely the comprehensive feature $F_f$ of any miRNA; $y$ is the label of the sample feature, namely the label of the comprehensive feature $F_f$; $F(x)$ is the miRNA class predicted by the LGBM classification model from the sample feature; and $E_{(x,y)}$ is the mathematical expectation, over all samples, of the error between the predicted result and the true result.

9. A computer system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.