CN111554347B - Method for constructing model for classifying hand-foot-mouth samples and application of method - Google Patents
Method for constructing model for classifying hand-foot-mouth samples and application of method
- Publication number
- CN111554347B CN202010313182.3A
- Authority
- CN
- China
- Prior art keywords
- auxiliary
- marker gene
- gene
- foot
- distinguishing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The application provides a method for distinguishing hand-foot-mouth samples. The method comprises the following steps: determining the expression quantity of each gene in a first marker gene combination of a sample to be tested; and inputting the expression quantity result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-mouth sample between a light symptom group and a heavy symptom group.
Description
Technical Field
The application relates to the field of biological analysis, in particular to a method for constructing a model for classifying hand-foot-and-mouth samples, and a method and equipment for distinguishing hand-foot-and-mouth samples.
Background
Hand, foot and mouth disease (HFMD) is a common infectious disease in children caused by a group of enteroviruses. Severe patients often rapidly develop neurological and systemic complications, and in some severe cases death occurs within 3 to 5 days. In infants and children between 6 months and 5 years of age, the immune system is not yet fully developed and maternally transferred antibodies can no longer be acquired, so these children lack the ability to resist the virus and rely entirely on the development of their own immunity. Therefore, finding marker immune genes that can distinguish mild from severe cases at an early stage of the disease, and thereby predicting whether a case of hand-foot-and-mouth disease will be mild or severe, is of great significance for clinical treatment and may even reduce the mortality caused by severe cases.
Combining high-throughput sequencing and artificial intelligence with medical treatment means applying artificial intelligence analysis to high-throughput sequencing data and reducing diagnostic deviation by adjusting parameters. This adds objectivity to diagnoses that otherwise rely on the experience of the physician, and can also compensate for shortages in current medical resources. In particular, traditional medical means alone offer no good solution for predicting, at an early stage, whether hand-foot-and-mouth disease will be mild or severe, so distinguishing mild from severe cases at an early stage by combining high-throughput sequencing and artificial intelligence is of great significance.
Disclosure of Invention
By combining high-throughput sequencing and artificial intelligence with medical treatment, the application selects a plurality of marker genes, builds models by means such as artificial intelligence and machine learning, and intuitively displays the early prediction of whether hand-foot-and-mouth disease is mild or severe, so that the result is more objective and more accurate.
In a first aspect of the application, the application provides a method for constructing a model for classifying hand-foot-and-mouth samples. According to an embodiment of the application, the method comprises: (1) Sequencing nucleic acid samples from a plurality of hand-foot-and-mouth patients and obtaining sequencing data for each patient, wherein the plurality of hand-foot-and-mouth patients includes a light symptom group and a heavy symptom group; (2) Determining the expression level of each gene in the initial gene set of each patient by comparing the sequencing data with a reference genome; (3) Determining a reference gene set based on the coefficient of variation of the expression level of each gene in the initial gene set in each patient, the coefficient of variation of the reference gene being less than a predetermined threshold; (4) Performing a first classification training using the expression level of the gene determined in step (2) as a training feature and the light symptom group and the heavy symptom group as training sets, so as to obtain a first marker gene combination and a first classification model for distinguishing light symptoms from heavy symptoms; (5) Selecting one reference gene from the reference gene set, taking the ratio of the rest genes in the initial gene set to the reference genes as training characteristics, and adopting the light symptom group and the heavy symptom group as training sets to carry out auxiliary classification training so as to obtain an auxiliary marker gene combination and an auxiliary classification model for distinguishing light symptoms from heavy symptoms.
According to an embodiment of the present application, the above method may further include at least one of the following additional technical features:
According to an embodiment of the present application, the reference genes include GPI and GAPDH.
According to an embodiment of the application, the first marker gene combination comprises FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, UBP1.
According to an embodiment of the application, the first classification training and the auxiliary classification training are each independently random forest model classification training.
According to an embodiment of the present application, in step (5), the auxiliary classification training is performed separately for each reference gene in the reference gene set, so as to obtain a plurality of auxiliary marker gene combinations and a corresponding plurality of auxiliary classification models.
According to an embodiment of the application, the first reference gene is GPI, the first auxiliary marker gene combination comprises GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2, the second reference gene is GAPDH, and the second auxiliary marker gene combination comprises QSOX1, VIM, ZEB2, C9orf16.
In a second aspect of the application, the application proposes a method for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, the method comprises: determining the expression quantity of each gene in a first marker gene combination of a sample to be tested; inputting the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group, wherein the first marker gene combination and the first classification model are established according to the method.
According to an embodiment of the present application, the above method may further include at least one of the following additional technical features:
According to an embodiment of the present application, the expression level of the first marker gene combination is obtained by high-throughput sequencing.
According to an embodiment of the present application, the method further comprises: determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination; distinguishing the hand-foot-mouth sample between a light symptom group and a heavy symptom group by using a first auxiliary classification model based on the gene expression level of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; distinguishing the hand-foot-mouth sample between a light symptom group and a heavy symptom group by using a second auxiliary classification model based on the gene expression level of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and, when the first distinguishing result and the second distinguishing result are the same, taking that result as the judgment result, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established by the method described above.
In a third aspect of the application, the application proposes a device for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, the apparatus comprises: the first expression quantity determining module is used for determining the expression quantity of each gene in a first marker gene combination of the sample to be detected; and the first classification module is used for inputting the expression quantity result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-mouth sample between a light symptom group and a heavy symptom group, wherein the first marker gene combination and the first classification model are established according to the method.
According to an embodiment of the present application, the above apparatus may further include at least one of the following additional technical features:
According to an embodiment of the present application, the expression level of the first marker gene combination is obtained by high-throughput sequencing.
According to an embodiment of the present application, the apparatus further comprises: a second expression level determining module for determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination; a first auxiliary classification module for distinguishing the hand-foot-mouth sample between the light symptom group and the heavy symptom group by using a first auxiliary classification model based on the gene expression level of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; a second auxiliary classification module for distinguishing the hand-foot-mouth sample between the light symptom group and the heavy symptom group by using a second auxiliary classification model based on the gene expression level of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and a judging module for taking the distinguishing result as the judgment result when the first distinguishing result and the second distinguishing result are the same, wherein the first auxiliary marker gene combination and the second auxiliary marker gene combination are established according to the method described above.
According to the method and the device for distinguishing hand-foot-and-mouth disease samples of the present application, combining high-throughput sequencing and artificial intelligence with medical treatment overcomes the limitation that early diagnosis of mild versus severe hand-foot-and-mouth disease depends on the experience of the physician. A plurality of marker genes are selected and models are built by means such as artificial intelligence and machine learning, so that the early prediction of mild or severe disease is displayed intuitively, and the result is more objective and more accurate.
Drawings
Fig. 1 is a flowchart for distinguishing hand-foot-and-mouth light and severe disease samples according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an apparatus for distinguishing hand-foot-and-mouth samples according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus for distinguishing hand-foot-and-mouth samples according to another embodiment of the present application;
FIG. 4 shows that the combination of 7 marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2 and UBP1), selected with a random forest model on the gene expression level FPKM, gives the best accuracy;
FIG. 5 is the ROC curve of the training set, according to an embodiment of the present application, for the model built on the 7 marker genes selected on the gene expression level FPKM, with 4/5 of the samples used as the training set;
FIG. 6 shows, according to an embodiment of the present application, that when the expression level of the GPI gene (FPKM_GPI) is taken as the reference and the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GPI) is calculated, the combination of 5 marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2) selected on this ratio with a random forest model gives the best accuracy;
FIG. 7 is the ROC curve of the training set, according to an embodiment of the present application, for the model built on the 5 marker genes selected with the expression level of the GPI gene (FPKM_GPI) as the reference, with 4/5 of the samples used as the training set;
FIG. 8 shows, according to an embodiment of the present application, that when the expression level of the GAPDH gene (FPKM_GAPDH) is taken as the reference and the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GAPDH) is calculated, the combination of 4 marker genes (QSOX1, VIM, ZEB2 and C9orf16) selected on this ratio with a random forest model gives the best accuracy;
FIG. 9 is the ROC curve of the training set, according to an embodiment of the present application, for the model built on the 4 marker genes selected with the expression level of the GAPDH gene (FPKM_GAPDH) as the reference, with 4/5 of the samples used as the training set.
Detailed Description
Embodiments of the method for distinguishing hand-foot-mouth light and severe cases of the present application are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
1. Sample nucleic acid extraction
After peripheral blood mononuclear cells (PBMCs) are isolated from blood, the nucleic acids of the cells are extracted (RNA extraction) and the genes are quantified by high-throughput sequencing or quantitative PCR (qPCR).
2. Bioinformatic analysis
Step one: sequencing result analysis
1. Low-quality sequences are removed from the raw sequencing data to obtain clean sequences for analysis.
2. The clean sequences are aligned to the human reference gene sequences using Bowtie 2.
3. Gene expression levels (FPKM, Fragments Per Kilobase of exon per Million fragments mapped) are calculated using the RSEM (RNA-Seq by Expectation-Maximization) software package.
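As a reminder of what the FPKM unit means, the following is a minimal sketch of the textbook FPKM definition. It is not RSEM's estimator: RSEM derives its own expected counts and effective lengths, and the function name and example numbers here are purely illustrative.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of exon per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # (fragments / (length / 1e3)) / (total / 1e6) == fragments * 1e9 / (length * total)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Illustrative numbers only: 500 fragments on a 2,000 bp gene
# in a library of 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))
```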
Step two: selection of reference genes
The coefficient of variation of each gene is calculated, and relatively stable reference genes (GeneA, GeneB) are selected.
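The selection of stable reference genes can be sketched as follows. This is only an illustration under stated assumptions: the coefficient of variation is taken as population standard deviation divided by mean, and the threshold value, gene names and expression values are hypothetical, since the text only specifies "a predetermined threshold".

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (population standard deviation assumed)."""
    m = mean(values)
    if m == 0:
        return float("inf")  # a gene with zero mean expression is never "stable"
    return pstdev(values) / m


def select_reference_genes(expression, threshold):
    """expression maps gene -> list of FPKM values, one per patient.

    Returns the genes whose CV is below the threshold, most stable first.
    """
    cvs = {gene: coefficient_of_variation(vals) for gene, vals in expression.items()}
    return sorted((g for g, cv in cvs.items() if cv < threshold), key=cvs.get)


# Hypothetical expression values for three genes across four patients.
expression = {
    "GeneA": [10, 10, 10, 10],
    "GeneB": [10, 11, 9, 10],
    "GeneC": [1, 20, 5, 40],
}
print(select_reference_genes(expression, threshold=0.2))
```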
Step three: selection of marker genes
1. Using the gene expression level FPKM, a set of marker genes (Group 1) for distinguishing the groups is selected from the mild and severe samples with a random forest model.
2. Taking the expression level of GeneA (FPKM_GeneA) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GeneA) is calculated, and a set of marker genes (Group 2) is selected on this ratio with a random forest model.
3. Taking the expression level of GeneB (FPKM_GeneB) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GeneB) is calculated, and a set of marker genes (Group 3) is selected on this ratio with a random forest model.
4. Step 2 or 3 is repeated to obtain ratios against multiple reference genes, and the optimal combination is selected.
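The ratio features used in steps 2 and 3 can be sketched as below. The function name and the expression values are hypothetical; the sketch only builds the FPKM/FPKM_reference features for one sample, which would then be passed to a random forest implementation of the reader's choice.

```python
def ratio_features(sample_fpkm, reference_gene):
    """Return gene -> FPKM / FPKM_reference for every gene except the reference.

    sample_fpkm maps gene name -> FPKM for a single sample.
    """
    reference_value = sample_fpkm[reference_gene]
    if reference_value <= 0:
        raise ValueError("reference gene must have positive expression")
    return {
        gene: value / reference_value
        for gene, value in sample_fpkm.items()
        if gene != reference_gene
    }


# Hypothetical FPKM values for one sample, with GPI as the reference gene.
sample = {"GPI": 50.0, "IFNAR2": 25.0, "UBR4": 100.0}
print(ratio_features(sample, "GPI"))
```

Expressing each gene relative to a stable reference makes the feature independent of the absolute scale of the measurement, which is presumably why the ratio-based models are the ones applied to qPCR data.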
Step four: light and severe disease prediction
1. A model is built with the selected marker genes (Group 1) and used to predict whether a hand-foot-and-mouth sample measured by high-throughput sequencing is mild or severe.
2. A model is built with the marker genes (Group 2) selected with GeneA as the reference and used to predict whether the tested hand-foot-and-mouth sample is mild or severe.
3. A model is built with the marker genes (Group 3) selected with GeneB as the reference and used to predict whether the tested hand-foot-and-mouth sample is mild or severe; this prediction is combined with the result of 2: if the two agree, the common result is adopted, and if they disagree, no prediction is made.
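The agreement rule in item 3 can be written as a very small function. The label strings and the use of `None` for "no prediction" are illustrative choices, not part of the described method.

```python
from typing import Optional


def consensus_call(first_result: str, second_result: str) -> Optional[str]:
    """Combine the two auxiliary model outputs.

    Returns the shared label when both models agree, otherwise None
    (meaning no prediction is made for this sample).
    """
    return first_result if first_result == second_result else None


print(consensus_call("severe", "severe"))  # both models agree
print(consensus_call("mild", "severe"))    # models disagree
```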
For ease of understanding, the flow of the present application for distinguishing mild and severe hand-foot-and-mouth samples is shown in FIG. 1.
In another aspect, the application provides a device for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, referring to fig. 2, the apparatus comprises: a first expression level determining module 100, configured to determine an expression level of each gene in a first marker gene combination of a sample to be tested; a first classification module 200, configured to input the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group, where the first marker gene combination and the first classification model are established according to the foregoing method.
Specifically, according to an embodiment of the present application, referring to FIG. 3, the apparatus further includes: a second expression level determining module 300 for determining the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination by a qPCR method; a first auxiliary classification module 400, configured to distinguish the hand-foot-mouth sample between the light symptom group and the heavy symptom group using a first auxiliary classification model based on the gene expression level of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; a second auxiliary classification module 500, configured to distinguish the hand-foot-mouth sample between the light symptom group and the heavy symptom group using a second auxiliary classification model based on the gene expression level of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and a judging module 600, configured to take the distinguishing result as the judgment result when the first distinguishing result and the second distinguishing result are the same, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established according to the method described above.
The application will be further illustrated with reference to specific examples. The experimental methods used in the following examples are conventional methods unless otherwise specified. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.
Examples
1. Test example sample
PBMC samples from 35 patients with severe hand-foot-and-mouth disease
PBMC samples from 30 patients with mild hand-foot-and-mouth disease
2. Test analysis flow
1) 3 mL of peripheral blood was taken, PBMCs were isolated, RNA was extracted, and the RNA was subjected to library construction and sequencing (next-generation high-throughput sequencing).
2) Low-quality sequences were removed from the raw sequencing data to obtain clean data.
3) The clean data were aligned to the reference gene sequences using Bowtie 2.
4) Gene expression amount FPKM was calculated using RSEM.
5) The coefficient of variation of the gene was calculated and the relatively stable genes (GPI, GAPDH) were selected.
6) Marker genes were selected with 80% of samples:
Using the gene expression level FPKM, a set of marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, UBP1) was selected with a random forest model; the results are shown in FIGS. 4 and 5.
Taking the expression level of the GPI gene (FPKM_GPI) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GPI) was calculated, and a set of marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2) was selected on this ratio with a random forest model; the results are shown in FIGS. 6 and 7.
Taking the expression level of the GAPDH gene (FPKM_GAPDH) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GAPDH) was calculated, and a set of marker genes (QSOX1, VIM, ZEB2, C9orf16) was selected on this ratio with a random forest model; the results are shown in FIGS. 8 and 9.
In FIGS. 4 to 9, ROC stands for receiver operating characteristic; AUC stands for area under the curve; Specificity and Sensitivity denote the specificity and sensitivity of the model; Case denotes severe cases; Control denotes mild cases.
7) Light and severe disease prediction for the remaining 20% of samples
a. A model was built with the selected marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2 and UBP1) and used to predict whether the hand-foot-and-mouth samples measured by high-throughput sequencing were mild or severe.
b. A model was built with the marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2 and YEATS2) selected with the GPI gene as the reference and used to predict whether the hand-foot-and-mouth samples measured by qPCR were mild or severe.
c. A model was built with the marker genes (QSOX1, VIM, ZEB2 and C9orf16) selected with the GAPDH gene as the reference and used to predict whether the hand-foot-and-mouth samples measured by qPCR were mild or severe; this prediction was combined with the result of b: if the two agreed, the common result was adopted, and if they disagreed, no prediction was made.
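The 80%/20% division of the 65 samples used in steps 6) and 7) could be produced along the following lines. The text does not say how the samples were assigned, so the per-class (stratified) split, the fixed random seed and the sample identifiers below are all assumptions made for illustration.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample IDs into training and held-out sets, class by class.

    labels maps sample ID -> class label. Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    by_class = {}
    for sample_id, label in labels.items():
        by_class.setdefault(label, []).append(sample_id)

    train_ids, test_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train_ids.extend(ids[:cut])
        test_ids.extend(ids[cut:])
    return train_ids, test_ids


# Hypothetical IDs for the 35 severe and 30 mild samples of the example.
labels = {f"S{i}": "severe" for i in range(35)}
labels.update({f"M{i}": "mild" for i in range(30)})
train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))
```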
Table 1: individual model efficiency evaluation table
MCC: Matthews correlation coefficient, ranging over [-1, 1]; -1 represents a prediction in complete disagreement with the true classes, 1 represents a completely correct prediction, and 0 represents a random prediction.
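For reference, the MCC can be computed from the four cells of a binary confusion matrix as sketched below. The counts in the example are hypothetical and are not taken from Table 1; returning 0 when the denominator is zero is a common convention assumed here.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion-matrix counts: every sample classified correctly.
print(matthews_corrcoef(tp=7, tn=6, fp=0, fn=0))
# Hypothetical counts: every sample classified into the wrong group.
print(matthews_corrcoef(tp=0, tn=0, fp=6, fn=7))
```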
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict one another.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.
Claims (12)
1. A method of constructing a model for hand-foot-and-mouth sample classification, comprising:
(1) Sequencing nucleic acid samples from a plurality of hand-foot-and-mouth patients and obtaining sequencing data for each patient, wherein the plurality of hand-foot-and-mouth patients includes a light symptom group and a heavy symptom group;
(2) Determining the expression level of each gene in the initial gene set of each patient by comparing the sequencing data with a reference genome;
(3) Determining a reference gene set based on the coefficient of variation, across the patients, of the expression level of each gene in the initial gene set, wherein the coefficient of variation of each reference gene is less than a predetermined threshold;
(4) Performing a first classification training using the expression levels of the genes determined in step (2) as training features and the light symptom group and the heavy symptom group as training sets, so as to obtain a first marker gene combination and a first classification model for distinguishing light symptoms from heavy symptoms;
(5) Selecting one reference gene from the reference gene set, taking the ratio of the expression level of each remaining gene in the initial gene set to that of the selected reference gene as training features, and performing auxiliary classification training with the light symptom group and the heavy symptom group as training sets, so as to obtain an auxiliary marker gene combination and an auxiliary classification model for distinguishing light symptoms from heavy symptoms.
2. The method of claim 1, wherein the reference genes comprise GPI and GAPDH.
3. The method of claim 1, wherein the first marker gene combination comprises FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, and UBP1.
4. The method of claim 1, wherein the first classification training and the auxiliary classification training are each independently random forest model classification training.
5. The method of claim 1, wherein in step (5), the auxiliary classification training is performed separately for each reference gene in the set of reference genes to obtain a plurality of auxiliary marker gene combinations and a corresponding plurality of auxiliary classification models.
6. The method of claim 5, wherein the first reference gene is GPI, and the first auxiliary marker gene combination comprises GAS6-AS2, UBR4, C9orf16, IFNAR2, and YEATS2; and
the second reference gene is GAPDH, and the second auxiliary marker gene combination comprises QSOX1, VIM, ZEB2, and C9orf16.
7. A method for distinguishing hand-foot-and-mouth samples, comprising:
determining the expression level of each gene in a first marker gene combination of a sample to be tested;
inputting the expression levels of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group,
wherein said first marker gene combination and said first classification model are established by the method of any one of claims 1 to 6.
8. The method of claim 7, wherein the expression level of the first marker gene combination is obtained by high throughput sequencing.
9. The method of claim 7, further comprising:
determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination;
distinguishing the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group by using a first auxiliary classification model based on the gene expression levels of the first auxiliary marker gene combination, so as to obtain a first distinguishing result;
distinguishing the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group by using a second auxiliary classification model based on the gene expression levels of the second auxiliary marker gene combination, so as to obtain a second distinguishing result;
selecting, as the judgment result, the distinguishing result on which the first distinguishing result and the second distinguishing result agree, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model, and the second auxiliary classification model are established by the method of claim 5 or 6.
10. An apparatus for distinguishing hand-foot-and-mouth samples, comprising:
a first expression level determining module for determining the expression level of each gene in a first marker gene combination of a sample to be tested;
a first classification module for inputting the expression levels of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group,
wherein said first marker gene combination and said first classification model are established by the method of any one of claims 1 to 6.
11. The apparatus of claim 10, wherein the expression level of the first marker gene combination is obtained by high throughput sequencing.
12. The apparatus of claim 10, further comprising:
a second expression level determining module for determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination;
a first auxiliary classification module for distinguishing the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group by using a first auxiliary classification model based on the gene expression levels of the first auxiliary marker gene combination, so as to obtain a first distinguishing result;
a second auxiliary classification module for distinguishing the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group by using a second auxiliary classification model based on the gene expression levels of the second auxiliary marker gene combination, so as to obtain a second distinguishing result;
a judging module for selecting, as the judgment result, the distinguishing result on which the first distinguishing result and the second distinguishing result agree;
wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model, and the second auxiliary classification model are established by the method of claim 5 or 6.
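Steps (3) and (5) of claim 1 can be illustrated with a short sketch: genes whose coefficient of variation across patients falls below a threshold are kept as reference genes, and the remaining genes are then expressed as ratios to one chosen reference gene. Everything concrete here is an assumption made for illustration: the function names, the use of the population standard deviation, the threshold of 0.05, and the toy expression values, which are invented and are not data from the application. The classification training itself (random forest in claim 4) is not shown.

```python
from statistics import mean, pstdev
from typing import Dict, List

Expression = Dict[str, List[float]]  # gene name -> one value per patient


def coefficient_of_variation(values: List[float]) -> float:
    """Standard deviation divided by the mean (population form, an assumption)."""
    m = mean(values)
    return float("inf") if m == 0 else pstdev(values) / m


def select_reference_genes(expression: Expression, threshold: float) -> List[str]:
    """Step (3): keep genes whose coefficient of variation is below the threshold."""
    return sorted(
        gene for gene, values in expression.items()
        if coefficient_of_variation(values) < threshold
    )


def ratio_features(expression: Expression, reference_gene: str) -> Expression:
    """Step (5): ratio of every other gene to the chosen reference, per patient.

    Assumes the reference gene has non-zero expression in every patient.
    """
    ref = expression[reference_gene]
    return {
        gene: [value / r for value, r in zip(values, ref)]
        for gene, values in expression.items()
        if gene != reference_gene
    }


# Invented toy values for four patients, for illustration only.
toy = {
    "GAPDH": [100.0, 102.0, 98.0, 100.0],
    "VIM": [10.0, 40.0, 20.0, 80.0],
    "ZEB2": [5.0, 50.0, 10.0, 60.0],
}
references = select_reference_genes(toy, threshold=0.05)
features = ratio_features(toy, reference_gene=references[0])
```

With these toy values only the stably expressed gene passes the threshold, and the resulting ratio features are the kind of input that a classifier trained on reference-normalized qPCR measurements would take.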
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010313182.3A CN111554347B (en) | 2020-04-20 | 2020-04-20 | Method for constructing model for classifying hand-foot-mouth samples and application of method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111554347A (en) | 2020-08-18 |
CN111554347B (en) | 2023-10-31 |
Family
ID=72001134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010313182.3A Active CN111554347B (en) | 2020-04-20 | 2020-04-20 | Method for constructing model for classifying hand-foot-mouth samples and application of method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111554347B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103237901A (en) * | 2010-03-01 | 2013-08-07 | 卡里斯生命科学卢森堡控股有限责任公司 | Biomarkers for theranostics |
CN104073569A (en) * | 2014-07-21 | 2014-10-01 | 广州市妇女儿童医疗中心 | Molecular marker used for diagnosing extremely severe case of hand-foot-and-mouth disease and testing method as well as kit |
CN110706749A (en) * | 2019-09-10 | 2020-01-17 | 至本医疗科技(上海)有限公司 | Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation |
CN110904208A (en) * | 2019-10-31 | 2020-03-24 | 华中科技大学 | SNP (single nucleotide polymorphism) site related to CV-A6 type hand-foot-and-mouth disease severe susceptibility and application thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017083564A1 (en) * | 2015-11-11 | 2017-05-18 | Northeastern University | Methods and systems for profiling personalized biomarker expression perturbations |
Non-Patent Citations (3)
Title |
---|
Bravo-Merodio L et al. Machine learning for the detection of early immunological markers as predictors of multi-organ dysfunction. Scientific Data. 2019, vol. 6 (no. 6), full text. * |
Min N et al. Circulating salivary miRNA hsa-miR-221 as clinically validated diagnostic marker for hand, foot, and mouth disease in pediatric patients. EBioMedicine. 2018, vol. 31, full text. * |
Zou Rongrong. IFNAR1 gene SNPs are associated with susceptibility to EV71 hand-foot-and-mouth disease. China Master's Theses Full-text Database. 2017, full text. * |
Also Published As
Publication number | Publication date |
---|---|
CN111554347A (en) | 2020-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110800063B (en) | Detection of tumor-associated variants using cell-free DNA fragment size | |
CN106462670B (en) | Rare variant calling in ultra-deep sequencing | |
JP5938484B2 (en) | Method, system, and computer-readable storage medium for determining presence / absence of genome copy number variation | |
CN106834474B (en) | Utilize gene order-checking diagnosing fetal chromosomal aneuploidy | |
CN109767810B (en) | High-throughput sequencing data analysis method and device | |
CN105132407B (en) | A kind of cast-off cells DNA low frequencies mutation enrichment sequence measurement | |
KR20200093438A (en) | Method and system for determining somatic mutant clonability | |
CN105040111B (en) | The construction method of systemic loupus erythematosus spectrum model | |
CN106778073A (en) | A kind of method and system for assessing tumor load change | |
CN113096728B (en) | Method, device, storage medium and equipment for detecting tiny residual focus | |
CN106350589A (en) | DNA library for detecting pathogenic genes of genetic vascular diseases and application thereof | |
WO2018209625A1 (en) | Analysis system for peripheral blood-based non-invasive detection of lesion immune repertoire diversity and uses of system | |
CN108004304A (en) | A kind of Clonal method for detecting lymphocyte related genes and resetting | |
CN115052994A (en) | Method for determining base type of predetermined site in chromosome of embryonic cell and application thereof | |
CN110904213A (en) | Intestinal flora-based ulcerative colitis biomarker and application thereof | |
CN111554347B (en) | Method for constructing model for classifying hand-foot-mouth samples and application of method | |
CN106156539B (en) | The method and apparatus of the immunity difference of the individual two class states of analysis | |
CN113260710A (en) | Compositions, systems, devices, and methods for validating microbiome sequence processing and differential abundance analysis by multiple custom blended mixtures | |
CN108977533A (en) | It is a kind of for predicting the miRNA combination object of chronic hepatitis B inflammation damnification | |
CN105838720A (en) | PTPRQ gene mutant and application thereof | |
CN105177130B (en) | It is used for assessing the mark of aids patient generation immune reconstitution inflammatory syndrome | |
CN114171116A (en) | Method for evaluating fetal DNA concentration by free and self DNA of pregnant woman and application | |
CN114424291A (en) | Immune repertoire health assessment system and method | |
Lauria | Rank-based miRNA signatures for early cancer detection | |
KR102519739B1 (en) | Non-invasive prenatal testing method and devices based on double Z-score |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||