CN114058691B

CN114058691B - Skeletal muscle early injury time prediction method based on Stacking ensemble learning

Info

Publication number: CN114058691B
Application number: CN202111317633.1A
Authority: CN
Inventors: 李娜; 党丽虹; 李健; 冯娜; 梁芯瑞; 安国帅; 任康; 杜秋香; 曹洁; 靳茜茜; 孙俊红
Original assignee: Shanxi Medical University
Current assignee: Shanxi Medical University
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2023-02-03
Anticipated expiration: 2041-11-09
Also published as: CN114058691A

Abstract

The invention relates to the field of forensic medicine, and in particular to a method for predicting the time of early skeletal muscle injury based on Stacking ensemble learning. The method comprises the following steps: collecting rat skeletal muscle samples at different injury times to obtain expression data for genes related to skeletal muscle injury repair; building prediction models with three base classifiers, stacking the predicted probability values of the three base classifiers to form a new feature set, and training on that feature set to obtain the final Stacking ensemble learning model; and inputting the data of an unknown sample into the Stacking ensemble learning model to predict its injury time. The method integrates the prediction results of the three base classifiers through Stacking ensemble learning and tunes the parameters of the three base classifiers through grid search and cross-validation, thereby improving the accuracy and stability of early skeletal muscle injury time inference.

Description

Skeletal muscle early injury time prediction method based on Stacking ensemble learning

Technical Field

The invention relates to the field of forensic medicine, in particular to a skeletal muscle early injury time prediction method based on Stacking ensemble learning.

Background

In forensic practice and research, accurate inference of injury time is a critical and urgent problem. It is especially difficult for early injuries, because the vital reactions of the body have not yet changed markedly. In general, when human tissue suffers mechanical injury, a series of characteristic changes such as bleeding, wounds, inflammatory reactions and changes in enzyme activity appear on the body surface and in the tissues. However, for individuals who die immediately after injury or survive only a short time, it is difficult to infer the injury time accurately from vital reactions alone. With the development of biological techniques, research on injury time estimation has expanded from histology-based morphological indicators to molecular biological indicators, in which the injury time is estimated by detecting proteins, mRNAs and the like. Because mRNA is produced earlier than protein, changes in mRNA during injury repair are better suited to inferring early injury time. Skeletal muscle repair after injury is a complex process involving multiple genes, multiple pathways and multiple cell types, so a single indicator is rarely enough to infer the injury time accurately. A growing number of researchers therefore hold that identifying more injury-time-related indicators and analyzing them jointly can reduce the error of injury time inference and improve its accuracy.

With the rapid development of science and technology, especially computer technology, machine learning (ML) methods such as the support vector machine (SVM), random forest classifier (RF) and multilayer perceptron (MLP) have gradually been applied in forensic medicine, and they provide algorithmic support for injury time inference and for combining multiple indicators. However, given the variety of algorithms available, choosing a suitable machine learning algorithm and improving its accuracy remain difficult. Each algorithm has a different principle and a different sensitivity to the data, so for the same classification problem the training error and generalization error of the resulting models may differ, which complicates prediction and decision-making. Stacking ensemble learning can integrate several sub-learners and use their combined outputs to compensate for individual errors, giving better decision performance and generalization ability than a single model. To date, no research applying Stacking ensemble learning to injury time inference has been reported.

Disclosure of Invention

The invention provides a skeletal muscle early injury time prediction method based on Stacking ensemble learning, and aims to better integrate multiple mRNA changes to improve the accuracy of early injury time inference and the stability of prediction.

The invention is realized by the following technical scheme: a skeletal muscle early damage time prediction method based on Stacking ensemble learning comprises the following steps:

1) Collecting skeletal muscle samples of rats at different injury times, extracting total RNA in tissues, carrying out reverse transcription on the total RNA to obtain cDNA, and acquiring expression quantity data of genes related to skeletal muscle injury repair at a transcription level by using an RT-qPCR technology;

2) Selecting a support vector machine, a random forest and a multilayer perceptron as the base classifiers for ensemble learning, establishing a prediction model for each of the three base classifiers, stacking the predicted probability values of the three base classifiers to form a new feature set, and training on the new feature set with logistic regression to obtain the final Stacking ensemble learning model;

3) Inputting the expression data of the skeletal muscle injury repair-related genes of an unknown sample into the Stacking ensemble learning model to predict the injury time of the unknown sample.

As a further improvement of the technical scheme of the invention, the RT-qPCR technology comprises using the mRNA of reference genes RPL13 and RPL32 as the normalization reference and applying the 2^(-ΔΔCt) method to the Ct values of the target genes measured by RT-qPCR to obtain the relative expression levels of the target genes.

As a further improvement of the technical scheme of the invention, in the step 2), according to the principle of random sampling, one part of the expression quantity data is selected as a training set, and the other part of the expression quantity data is selected as a test set.

As a further improvement of the technical scheme of the invention, the training set is 70 percent, and the test set is 30 percent.

As a further improvement of the technical means of the present invention, the method for obtaining the expression level data of the skeletal muscle injury repair-related gene of the unknown sample is the same as that in step 1).

As a further improvement of the technical scheme of the invention, in the step 1), the rat skeletal muscle sample comprises an undamaged skeletal muscle sample.

According to the prediction method, an early injury time inference prediction model is established based on skeletal muscle injury repair related gene expression data, the model integrates prediction results of three base classifiers by adopting Stacking ensemble learning, and the three base classifiers are optimized in parameters through grid search and cross validation, so that accuracy and stability of skeletal muscle early injury time inference are effectively improved.

The invention provides a new research thought and method for the damage time prediction method, and provides an algorithm model basis for the human skeletal muscle early damage time inference method.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of the prediction method of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Specific examples of the technical solution of the present invention are given below.

1. Grouping of laboratory animals

Fifty-six male Sprague-Dawley rats, 6-8 weeks old and weighing approximately 180-220 g, were used in this study; they were provided by the Experimental Animal Center of Shanxi Medical University. The rats were randomly divided into a control group and injury groups. The control group consisted of rats with intact skeletal muscle, and the injury groups comprised 4 h, 8 h, 12 h, 16 h, 20 h and 24 h groups, with 8 rats in each group.

2. Preparation of animal model with skeletal muscle injury

After fasting for 12 h, the rats were anesthetized by intraperitoneal injection of 3% sodium pentobarbital (40 mg/kg). A 500 g weight was dropped in free fall from a height of 30 cm inside a plastic sleeve onto the skeletal muscle of the right hind limb to create the rat skeletal muscle injury model. The rats were then given sufficient food and water, and were sacrificed at 4 h, 8 h, 12 h, 16 h, 20 h and 24 h after injury by intraperitoneal injection of a lethal dose of sodium pentobarbital; the control rats were sacrificed in the same way. From each rat in the injury groups, 100 mg of muscle tissue was taken from the center of the injured area of the right hind limb; from each control rat, 100 mg of muscle tissue was taken from the corresponding area of the right hind limb. Each tissue sample was divided into two equal parts, wrapped in tinfoil, snap-frozen in liquid nitrogen and stored in a freezer at -80 °C until use.

3. Total RNA extraction and quality control of samples

Total RNA was extracted from skeletal muscle by the TRIzol method. The purity and concentration of total RNA were measured with an Infinite M200 Pro microplate reader, and RNA with an OD260/280 absorbance ratio between 1.8 and 2.2 was used for subsequent experiments. Total RNA integrity was measured using the Agilent RNA 6000 Nano kit and the Agilent 2100 (Agilent Technologies, USA); samples with an RNA Integrity Number (RIN) greater than 7.0 were considered to have good integrity and were used in subsequent experiments.
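The two acceptance criteria above can be expressed as a simple filter. The following is only a minimal sketch: the `RnaSample` record, its field names and the example values are hypothetical, and the bounds of the OD260/280 range are assumed to be inclusive, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    sample_id: str     # hypothetical identifier
    od260_280: float   # purity ratio from the microplate reader
    rin: float         # RNA Integrity Number from the Agilent 2100


def passes_qc(sample: RnaSample) -> bool:
    """Return True if the sample meets both criteria described in the text."""
    purity_ok = 1.8 <= sample.od260_280 <= 2.2   # assumed inclusive bounds
    integrity_ok = sample.rin > 7.0
    return purity_ok and integrity_ok


# Illustrative values only, not measured data.
samples = [
    RnaSample("ctrl-1", 1.95, 8.4),
    RnaSample("4h-3", 1.72, 8.1),    # fails the purity criterion
    RnaSample("12h-5", 2.01, 6.5),   # fails the integrity criterion
]
accepted = [s.sample_id for s in samples if passes_qc(s)]
print(accepted)
```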

4. Synthesis of cDNA by reverse transcription of RNA

Total RNA that passed quality control was reverse transcribed into cDNA using the PrimeScript™ RT Master Mix (Perfect Real Time) kit (TaKaRa). The specific steps were as follows: on a clean bench, a 10 μL reverse transcription reaction was prepared with the PrimeScript™ RT Master Mix kit, containing 2 μL of 5× PrimeScript™ RT Master Mix and 400 ng of total RNA, made up to 10 μL with RNase-free water. The prepared reaction was placed in a T-1 thermal cycler and run at 37 °C for 15 min and then 85 °C for 15 s to complete reverse transcription. The resulting cDNA was aliquoted and stored at -20 °C for subsequent experiments.

5. RT-qPCR detection of the relative expression level of the target mRNA

The full sequences of the target genes were obtained from GenBank, and the intron positions on each gene sequence were then located using the BLAT function of UCSC. Primers and probes for the reference genes RPL13 and RPL32 and for the target genes were designed with AlleleID 6.0 software so that the sequence between each pair of upstream and downstream primers spans an intron position, which avoids interference from genomic DNA during amplification. The primers and probes for the reference genes RPL13 and RPL32 and for the target genes were synthesized by Shanghai Biotechnology, Inc.; their sequences and amplification efficiencies are shown in Table 1.

TABLE 1 primer and probe sequences for reference and target genes

Following the instructions of the Premix Ex Taq™ (Probe qPCR) kit (TaKaRa), a multiplex amplification reaction (4 genes in total: 2 reference genes and 2 target genes) was prepared as follows: 12.5 μL of Taq DNA polymerase, 0.5 μL each of the forward primers, reverse primers and fluorescent probes (8 primers and 4 probes in total), 2.0 μL of 10% DMSO, 1.5 μL of cDNA and 3 μL of RNase-free water. Real-time fluorescent quantitative PCR was performed on the Bio-Rad CFX384 Touch™ fluorescent quantitative PCR detection system (Bio-Rad, USA), with each sample run in triplicate. The reaction conditions were: pre-denaturation at 95 °C for 30 s, followed by 40 cycles of denaturation at 95 °C for 5 s and annealing/extension at 60 °C for 40 s, with the fluorescence signal collected at the end of each cycle. The expression levels of RPL13 and RPL32 mRNA were used as the normalization reference, and the 2^(-ΔΔCt) method was applied to the Ct values of the target genes measured by RT-qPCR to obtain the relative expression levels of the 9 target genes.
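The relative quantification step can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the inventors' procedure: the text does not say how the two reference genes are combined or which sample serves as the calibrator, so the code averages the Ct values of RPL13 and RPL32 and uses a control-group ΔCt as the calibrator. It also assumes 100% amplification efficiency, and all Ct values are made up for illustration.

```python
from statistics import mean


def delta_ct(ct_target: float, ct_references: list[float]) -> float:
    """ΔCt: target Ct minus the mean Ct of the reference genes (assumed combination)."""
    return ct_target - mean(ct_references)


def relative_expression(ct_target: float,
                        ct_references: list[float],
                        calibrator_delta_ct: float) -> float:
    """Relative expression by the 2^(-ΔΔCt) method."""
    delta_delta_ct = delta_ct(ct_target, ct_references) - calibrator_delta_ct
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: [RPL13, RPL32] are the reference genes.
control_dct = delta_ct(26.0, [18.0, 20.0])                    # 26 - 19 = 7
fold_change = relative_expression(24.0, [18.0, 20.0], control_dct)
print(fold_change)  # ΔCt = 5, ΔΔCt = -2, so 2^2 = 4.0
```

With these example numbers the target transcript is four times more abundant in the injured sample than in the calibrator. Averaging Ct values across reference genes corresponds to taking the geometric mean of their quantities.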

6. Constructing a damage time Stacking prediction model

The Stacking prediction model consists of two stacked layers. The first layer uses random forest, support vector machine and multilayer perceptron models as base classifiers; the second layer uses logistic regression, trained on a new feature set formed by stacking the predicted probability values of the three base classifiers, to obtain the final ensemble model. The specific steps are as follows:

(1) By random sampling, 70% of the data set was selected as the training set and the remaining 30% as the test set.
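A 70/30 random split of this kind could be sketched with scikit-learn as follows. The patent does not name the software used, so the library choice, the random seed, the "0h" label for the control group and the synthetic feature matrix are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 7 groups (control plus six injury times) x 8 rats = 56 samples.
# "0h" is a hypothetical label for the uninjured control group.
time_labels = ["0h", "4h", "8h", "12h", "16h", "20h", "24h"]
y = np.repeat(time_labels, 8)

# Placeholder for the relative expression levels of the 9 target genes.
X = rng.normal(size=(56, 9))

# Plain random sampling as described in the text (no stratification).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```

With 56 samples, scikit-learn rounds the 30% test share (16.8) up to 17, leaving 39 samples for training. Passing `stratify=y` would keep the seven groups balanced across the two sets, but the source describes only random sampling.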

(2) Selecting a support vector machine, a random forest and a multilayer perceptron as ensemble learning base classifiers, and respectively establishing prediction models of the three base classifiers, wherein the specific method comprises the following steps:

1) Support vector machine (SVM) prediction model: the training set data were fed into a support vector machine model for training, and the optimal hyperparameters of the SVM model were selected by grid search and cross-validation. The key parameters were a penalty parameter C of 1, a kernel function (kernel) of 'rbf' and a kernel coefficient (gamma) of 1. The support vector machine model was built with these optimal hyperparameters;

2) Random forest (RF) model: the parameters were optimized by grid search and cross-validation. The number of base decision trees was 400, the maximum depth of each base decision tree was 80, and bootstrap (a Boolean) was True, meaning that bootstrap sampling was used to generate the training data for each decision tree. The random forest classification model was built on the training set data with the optimal parameter combination found by the search.

3) Multilayer perceptron (MLP) model: the training set was fed into an MLP classifier, and grid search and cross-validation were used to optimize two key parameters, the hidden layer configuration (hidden_layer_sizes) and the optimizer (solver). The optimal hyperparameters were three hidden layers with 64, 128 and 256 neurons respectively and the adam solver. The multilayer perceptron model was built with these optimal hyperparameters.
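The hyperparameter tuning of the three base classifiers might look like the sketch below, which uses scikit-learn's `GridSearchCV`. The parameter names reported in the text (kernel, gamma, bootstrap, hidden_layer_sizes) match scikit-learn's, but the library itself, the candidate grids, the 3-fold cross-validation and the synthetic training data are assumptions; the source reports only the selected values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set: 7 time classes, 9 gene features.
y_train = np.repeat(np.arange(7), 6)
X_train = rng.normal(size=(42, 9)) + 0.8 * y_train[:, None]

# Each grid contains the value reported in the text plus hypothetical alternatives.
searches = {
    "svm": GridSearchCV(
        SVC(),
        {"C": [0.1, 1, 10], "kernel": ["rbf"], "gamma": [0.1, 1]},
        cv=3,
    ),
    "rf": GridSearchCV(
        RandomForestClassifier(bootstrap=True, random_state=0),
        {"n_estimators": [100, 400], "max_depth": [10, 80]},
        cv=3,
    ),
    "mlp": GridSearchCV(
        MLPClassifier(max_iter=500, random_state=0),
        {"hidden_layer_sizes": [(64,), (64, 128, 256)],
         "solver": ["adam", "lbfgs"]},
        cv=3,
    ),
}

best_params = {}
for name, search in searches.items():
    search.fit(X_train, y_train)          # cross-validated search, then refit on all data
    best_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 3))
```

On synthetic data the selected values will not necessarily match those reported in the text; the sketch shows only the search procedure.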

(3) Establishing the Stacking ensemble model: the predicted probability values of the three base classifiers were stacked to form a new feature set, and logistic regression was trained on this new feature set to obtain the final Stacking ensemble model.
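One way to realize this two-layer structure is scikit-learn's `StackingClassifier`, sketched below with the hyperparameters reported in the text. This is an illustration, not the inventors' code: the library choice and the synthetic data are assumptions, `probability=True` is added so that the SVM can output probabilities, and `StackingClassifier` builds the meta-features from cross-validated predictions (here 3-fold), a detail the source does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set: 7 time classes, 9 gene features.
y_train = np.repeat(np.arange(7), 6)
X_train = rng.normal(size=(42, 9)) + 0.8 * y_train[:, None]

# First layer: the three base classifiers with the reported hyperparameters.
base_classifiers = [
    ("svm", SVC(C=1, kernel="rbf", gamma=1, probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=400, max_depth=80,
                                  bootstrap=True, random_state=0)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(64, 128, 256), solver="adam",
                          max_iter=500, random_state=0)),
]

# Second layer: logistic regression trained on the stacked class probabilities.
stack = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X_train, y_train)

meta_features = stack.transform(X_train[:5])   # stacked probabilities
print(meta_features.shape)                      # 3 classifiers x 7 classes = 21 columns
print(stack.predict(X_train[:5]))
```

Calling `stack.predict` on the 9-gene expression vector of a new sample corresponds to the inference step for an unknown sample.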

(4) Testing on held-out data: the test set randomly split off in step (1) was fed into the trained Stacking ensemble model and the three base classifiers, and the models were evaluated using the receiver operating characteristic (ROC) curve, the area under the curve (AUC) and accuracy (Table 2). The results show that, among the three classification models built with the optimal hyperparameters, the MLP model had the highest test accuracy, 88.24%, with an AUC of 0.98, whereas the Stacking ensemble model reached a prediction accuracy of 94.12%, outperforming every single model. This indicates that the Stacking ensemble model combines the strengths of the three classification models and is more stable and reliable, with more comprehensive and balanced accuracy and coverage in classifying injury time.
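As a quick check on how the accuracy figures arise, the snippet below computes accuracy as the fraction of correctly classified test samples. The test set size of 17 is an assumption (30% of 56 samples, rounded up); the source does not state the counts, but the reported percentages are consistent with 15 and 16 correct predictions out of 17.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


n_test = 17  # assumed test set size: 30% of 56 samples, rounded up

for n_correct in (15, 16):
    print(f"{n_correct}/{n_test} = {n_correct / n_test:.2%}")
# prints:
# 15/17 = 88.24%
# 16/17 = 94.12%

# Tiny illustrative example with made-up labels.
print(accuracy(["4h", "8h", "12h", "24h"], ["4h", "8h", "16h", "24h"]))  # 0.75
```

On a test set of this size, the difference between the best single model and the ensemble corresponds to a single sample.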

TABLE 2 comparison of results of SVM, RF and MLP classification models with Stacking integration model

7. Unknown sample preparation, detection and result inference

A rat skeletal muscle sample to be tested is processed according to steps 1-5 to obtain the relative expression levels of the 9 target genes; these values are then input into the Stacking ensemble model to infer the skeletal muscle injury time of the rat.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

<110> Shanxi Medical University

<120> skeletal muscle early injury time prediction method based on Stacking ensemble learning

<160>33

<210>1

<211>22

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of RPL13

<400>1

TCGTGAGGTGCCCTACAGTTAG

<210>2

<211>23

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of RPL13

<400>2

CACACCAAGGTCCGGGCTGGCAG

<210>3

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of RPL13

<400>3

GGTGCGTGCCATTTTCTTGTG

<210>4

<211>22

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of RPL32

<400>4

ATCTGGCCCTTGAATCTTCTCC

<210>5

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of RPL32

<400>5

TGTCGATGCCTCTGGGTTTCCGCC

<210>6

<211>23

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of RPL32

<400>6

AGAGGACCAAGAAGTTCATCAGG

<210>7

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Rae1

<400>7

AAGCTGAAGACCTCAGAGCAG

<210>8

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Rae1

<400>8

CCGTGGCAGCGTGTGGCTTCAACC

<210>9

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Rae1

<400>9

TTATAAAACTCATGCCCCTTGGAC

<210>10

<211>20

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Ier3

<400>10

CGTGCGTCCGAACACTTCTC

<210>11

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Ier3

<400>11

CGAAAACGCAGCCGACGGGTGCTC

<210>12

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Ier3

<400>12

AATGTTGGGTTCCTCGGTTGG

<210>13

<211>23

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Leprot

<400>13

GGGATTGTTGTTTCTGCCTTTGG

<210>14

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Leprot

<400>14

TGCCAGCCAGCACAAGACCACAGG

<210>15

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Leprot

<400>15

GCCTTGGATCGTGAGGAAAATAAC

<210>16

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Impact

<400>16

AAGGTTCTTGCCAAGTTGTATGAG

<210>17

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Impact

<400>17

TCGCCAGTGCCACCCACAACATCT

<210>18

<211>22

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Impact

<400>18

GCTGTTTCTCCATCATCTTCGG

<210>19

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Asb5

<400>19

GGTCGTCTTCTTGCTCTGAGG

<210>20

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Asb5

<400>20

CCACATGGTCACCCAGGCAGGCTT

<210>21

<211>20

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Asb5

<400>21

TCCAGCTTCCAGGAGAGTCC

<210>22

<211>22

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Sc65

<400>22

GGAGATGAGTCCCTCACTGATC

<210>23

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Sc65

<400>23

CCGCTCCATGTGTTCTGTGCTGCT

<210>24

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Sc65

<400>24

AGCAAAGACGGTCATATAATCAGC

<210>25

<211>20

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Myg1

<400>25

ACCTCGCAACAACCTCATGG

<210>26

<211>23

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Myg1

<400>26

CGAATCGGGACGCACAACGGCAC

<210>27

<211>20

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Myg1

<400>27

CCGAGTCCGCACAATCTCTG

<210>28

<211>20

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Dennd5a

<400>28

TACCATCCGTCAGCCCAAAC

<210>29

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Dennd5a

<400>29

CCTGTCTCCCTCGGTCATTGCCCA

<210>30

<211>22

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Dennd5a

<400>30

CCCATCTTCTCTACCAGCATCC

<210>31

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> F5 of Slfn3/4

<400>31

AAAGGCCCTCTTCAGTCAAGC

<210>32

<211>24

<212>DNA

<213> Artificial sequence

<220>

<223> P5 of Slfn3/4

<400>32

CTGCCACACAGTCCCCGTAGCTGC

<210>33

<211>21

<212>DNA

<213> Artificial sequence

<220>

<223> R5 of Slfn3/4

<400>33

TGAGAACAGTTTCCCGCAGAG

Claims

1. A skeletal muscle early damage time prediction method based on Stacking ensemble learning is characterized by comprising the following steps:

1) Collecting rat skeletal muscle samples at different injury times, extracting total RNA from the tissue, reverse transcribing the total RNA into cDNA (complementary deoxyribonucleic acid), and obtaining transcript-level expression data for skeletal muscle injury repair-related genes by RT-qPCR (reverse transcription quantitative polymerase chain reaction), wherein the skeletal muscle injury repair-related genes are Rae1, Ier3, Leprot, Impact, Asb5, Sc65, Myg1, Dennd5a and Slfn3/4;

2) Selecting a support vector machine, a random forest and a multilayer perceptron as the base classifiers for ensemble learning, establishing a prediction model for each of the three base classifiers, optimizing their parameters by grid search and cross-validation to obtain the optimal parameters, stacking the predicted probability values of the three base classifiers to form a new feature set, and training on the new feature set with logistic regression to obtain the final Stacking ensemble learning model;

3) Inputting the expression quantity data of skeletal muscle injury repair related genes of an unknown sample into a Stacking ensemble learning model so as to predict the injury time of the unknown sample;

the method is used for purposes other than disease diagnosis.

2. The method for predicting the early skeletal muscle damage time based on Stacking ensemble learning as claimed in claim 1, wherein the RT-qPCR technology comprises using the mRNA of reference genes RPL13 and RPL32 as the normalization reference and applying the 2^(-ΔΔCt) method to the Ct values of the target genes measured by RT-qPCR to obtain the relative expression levels of the target genes.

3. The method for predicting the early skeletal muscle damage time based on Stacking ensemble learning as claimed in claim 1, wherein in step 2), one part of the expression quantity data is selected as a training set and the other part is selected as a testing set according to a random sampling principle.

4. The method for predicting the early skeletal muscle injury time based on Stacking ensemble learning as claimed in claim 3, wherein the training set is 70% and the testing set is 30%.

5. The method for predicting the early skeletal muscle damage time based on Stacking ensemble learning according to claim 1, wherein the method for acquiring the expression level data of the skeletal muscle damage repair-related gene of the unknown sample is the same as that in step 1).