CN111554347B

CN111554347B - Method for constructing model for classifying hand-foot-mouth samples and application of method

Info

Publication number: CN111554347B
Application number: CN202010313182.3A
Authority: CN
Inventors: 麻锦敏; 李琼芳; 陈唯军
Original assignee: Shenzhen Huada Yinyuan Pharmaceutical Technology Co Ltd
Current assignee: Shenzhen Huada Yinyuan Pharmaceutical Technology Co Ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2023-10-31
Anticipated expiration: 2040-04-20
Also published as: CN111554347A

Abstract

The application provides a method for distinguishing hand-foot-mouth samples. The method comprises the following steps: determining the expression quantity of each gene in a first marker gene combination of a sample to be tested; and inputting the expression quantity result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-mouth sample between a light symptom group and a heavy symptom group.

Description

Method for constructing model for classifying hand-foot-mouth samples and application of method

Technical Field

The application relates to the field of biological analysis, in particular to a method for constructing a model for classifying hand-foot-and-mouth samples, and a method and equipment for distinguishing hand-foot-and-mouth samples.

Background

Hand, foot and mouth disease (HFMD) is a common infectious disease in children caused by a group of enteroviruses. Severe cases often develop neurological and systemic complications rapidly, and in some of them death occurs within 3 to 5 days. In infants and children between 6 months and 5 years of age, the immune system is not yet fully developed and maternally transferred antibodies are no longer available, so these children lack the ability to resist the virus and rely entirely on the development of their own immunity. Finding marker immune genes that can distinguish mild from severe cases at an early stage of the disease, and thereby predicting whether a case of hand-foot-and-mouth disease will be mild or severe, is therefore of great significance for clinical treatment and may even reduce the mortality caused by severe cases.

When high-throughput sequencing and artificial intelligence are combined with medical treatment, artificial intelligence analysis is applied to the high-throughput sequencing data, and diagnostic deviation is reduced by adjusting parameters. This adds objectivity to diagnoses that otherwise rely on the physician's experience and can also compensate for shortages of medical resources. In particular, traditional medical means alone offer no good solution for predicting, at an early stage, whether hand-foot-and-mouth disease will be mild or severe, so combining high-throughput sequencing with artificial intelligence to distinguish mild from severe cases at an early stage is of great significance.

Disclosure of Invention

By combining high-throughput sequencing and artificial intelligence with medical treatment, the application selects a plurality of marker genes and builds models by means of artificial intelligence, machine learning and the like, so that the early prediction of mild or severe hand-foot-and-mouth disease is presented intuitively, the result is more objective and the accuracy is higher.

In a first aspect of the application, the application provides a method for constructing a model for classifying hand-foot-and-mouth samples. According to an embodiment of the application, the method comprises: (1) sequencing nucleic acid samples from a plurality of hand-foot-and-mouth patients and obtaining sequencing data for each patient, wherein the plurality of hand-foot-and-mouth patients includes a light symptom group and a heavy symptom group; (2) determining the expression level of each gene in the initial gene set of each patient by comparing the sequencing data with a reference genome; (3) determining a reference gene set based on the coefficient of variation, across the patients, of the expression level of each gene in the initial gene set, the coefficient of variation of each reference gene being less than a predetermined threshold; (4) performing a first classification training using the expression levels of the genes determined in step (2) as training features and the light symptom group and the heavy symptom group as training sets, so as to obtain a first marker gene combination and a first classification model for distinguishing light symptoms from heavy symptoms; (5) selecting one reference gene from the reference gene set, taking the ratio of the expression level of each remaining gene in the initial gene set to that of the selected reference gene as a training feature, and performing auxiliary classification training with the light symptom group and the heavy symptom group as training sets, so as to obtain an auxiliary marker gene combination and an auxiliary classification model for distinguishing light symptoms from heavy symptoms.

According to an embodiment of the present application, the above method may further include at least one of the following additional technical features:

According to an embodiment of the present application, the reference genes include GPI and GAPDH.

According to an embodiment of the application, the first marker gene combination comprises FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, UBP1.

According to an embodiment of the application, the first classification training and the auxiliary classification training are each independently random forest model classification training.

According to an embodiment of the present application, in step (5), the auxiliary classification training is performed separately for each reference gene in the reference gene set, so as to obtain a plurality of auxiliary marker gene combinations and a corresponding plurality of auxiliary classification models.

According to an embodiment of the application, the first reference gene is GPI, the first auxiliary marker gene combination comprises GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2, the second reference gene is GAPDH, and the second auxiliary marker gene combination comprises QSOX1, VIM, ZEB2, C9orf16.

In a second aspect of the application, the application proposes a method for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, the method comprises: determining the expression quantity of each gene in a first marker gene combination of a sample to be tested; inputting the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group, wherein the first marker gene combination and the first classification model are established according to the method.

According to an embodiment of the present application, the expression level of the first marker gene combination is obtained by high-throughput sequencing.

According to an embodiment of the present application, the method further comprises: determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination; distinguishing the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group by using a first auxiliary classification model based on the gene expression levels of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; distinguishing the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group by using a second auxiliary classification model based on the gene expression levels of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and selecting, as the judging result, the distinguishing result on which the first distinguishing result and the second distinguishing result agree, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established by the method described above.

In a third aspect of the application, the application proposes an apparatus for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, the apparatus comprises: a first expression level determining module for determining the expression level of each gene in a first marker gene combination of a sample to be tested; and a first classification module for inputting the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group, wherein the first marker gene combination and the first classification model are established according to the method described above.

According to an embodiment of the present application, the above apparatus may further include at least one of the following additional technical features:

According to an embodiment of the present application, the apparatus further comprises: a second expression level determining module for determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination; a first auxiliary classification module for distinguishing the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group by using a first auxiliary classification model based on the gene expression levels of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; a second auxiliary classification module for distinguishing the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group by using a second auxiliary classification model based on the gene expression levels of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and a judging module for selecting, as the judging result, the distinguishing result on which the first distinguishing result and the second distinguishing result agree, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established according to the method described above.

According to the method and apparatus for distinguishing hand-foot-and-mouth samples, combining high-throughput sequencing and artificial intelligence with medical treatment overcomes the limitation that early diagnosis of mild versus severe hand-foot-and-mouth disease depends on the physician's experience; a plurality of marker genes are selected and models are built by means of artificial intelligence, machine learning and the like, so that the early prediction of mild or severe hand-foot-and-mouth disease is presented intuitively, the result is more objective and the accuracy is higher.

Drawings

Fig. 1 is a flowchart for distinguishing hand-foot-and-mouth light and severe disease samples according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an apparatus for distinguishing hand-foot-and-mouth samples according to an embodiment of the present application;

Fig. 3 is a schematic diagram of an apparatus for distinguishing hand-foot-and-mouth samples according to another embodiment of the present application;

FIG. 4 shows that 7 marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2 and UBP1) were selected using a random forest model based on the gene expression level FPKM, this combination giving the best accuracy;

FIG. 5 is the ROC curve of the training set according to an embodiment of the present application, in which 4/5 of the samples were used as the training set for the 7 marker genes selected using the gene expression level FPKM;

FIG. 6 shows that, according to an embodiment of the present application, with the expression level of the GPI gene (FPKM_GPI) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GPI) was calculated, and 5 marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2) were selected for this ratio using a random forest model, this combination giving the best accuracy;

FIG. 7 is the ROC curve of the training set according to an embodiment of the present application, in which 4/5 of the samples were used as the training set for modeling with the 5 marker genes selected with the expression level of the GPI gene (FPKM_GPI) as the reference;

FIG. 8 shows that, according to an embodiment of the present application, with the expression level of the GAPDH gene (FPKM_GAPDH) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GAPDH) was calculated, and 4 marker genes (QSOX1, VIM, ZEB2 and C9orf16) were selected for this ratio using a random forest model, this combination giving the best accuracy;

FIG. 9 is the ROC curve of the training set according to an embodiment of the present application, in which 4/5 of the samples were used as the training set for modeling with the 4 marker genes selected with the expression level of the GAPDH gene (FPKM_GAPDH) as the reference.

Detailed Description

Embodiments of the method for distinguishing hand-foot-mouth light and severe cases of the present application are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.

1. Sample nucleic acid extraction

After peripheral blood mononuclear cells (PBMCs) are isolated from blood, RNA is extracted from the cells and subjected to high-throughput sequencing or quantitative PCR (qPCR) for gene quantification.

2. Bioinformatic analysis

Step one: sequencing result analysis

1. Low-quality sequences are removed from the raw off-machine sequencing data to obtain clean sequences for subsequent analysis.

2. The clean sequences are aligned to the human reference genes using the software Bowtie 2.

3. Gene expression levels (fragments per kilobase of exon per million fragments mapped, FPKM) are calculated using the RSEM (RNA-Seq by Expectation Maximization) software package.
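As a reminder of what the FPKM unit means, the following minimal Python sketch applies the standard definition to toy numbers. It is only an illustration: RSEM estimates fragment counts statistically and works with effective transcript lengths, neither of which is reproduced here, and the function name `fpkm`, the gene names and all numbers are hypothetical.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments.

    Equivalent to count / (length_in_kb * library_size_in_millions).
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical toy inputs, not data from the application.
counts = {"geneX": 500, "geneY": 2000}
lengths_bp = {"geneX": 1000, "geneY": 4000}
total_fragments = 10_000_000

fpkm_values = {g: fpkm(counts[g], lengths_bp[g], total_fragments) for g in counts}
print(fpkm_values)
```

Both toy genes end up with the same FPKM even though geneY has four times as many fragments, because it is also four times as long.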

Step two: selection of reference genes

The coefficient of variation of the expression level of each gene across the samples was calculated, and relatively stable genes (GeneA, GeneB) were selected as reference genes.
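The selection of stable reference genes can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient of variation is taken as the population standard deviation divided by the mean, the threshold value 0.05 is arbitrary (the application only speaks of a predetermined threshold), and the expression values and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        return float("inf")  # an unexpressed gene is never a usable reference
    return pstdev(values) / m


def select_reference_genes(expression, cv_threshold):
    """Return genes whose CV across patients is below the threshold, most stable first.

    expression: dict mapping gene name -> list of FPKM values, one per patient.
    """
    cvs = {gene: coefficient_of_variation(v) for gene, v in expression.items()}
    stable = [gene for gene, cv in cvs.items() if cv < cv_threshold]
    return sorted(stable, key=cvs.get)


# Hypothetical FPKM values for four patients.
expression = {
    "GeneA": [100.0, 102.0, 98.0, 101.0],
    "GeneB": [50.0, 49.0, 51.0, 50.0],
    "GeneC": [5.0, 40.0, 12.0, 80.0],
}
reference_genes = select_reference_genes(expression, cv_threshold=0.05)
print(reference_genes)
```

In this toy input GeneA and GeneB vary by only a few percent and pass the threshold, while GeneC fluctuates strongly and is excluded.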

Step three: selection of marker genes

1. Using the gene expression level FPKM, a set of marker genes (Group 1) for distinguishing the groups is selected from the mild and severe samples using a random forest model.

2. With the expression level of GeneA (FPKM_GeneA) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GeneA) is calculated, and a set of marker genes (Group 2) is selected for this ratio using a random forest model.

3. With the expression level of GeneB (FPKM_GeneB) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GeneB) is calculated, and a set of marker genes (Group 3) is selected for this ratio using a random forest model.

4. Step 2 or 3 is repeated with further reference genes to obtain the corresponding ratios, and the optimal combination is selected.
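The ratio features used in items 2 and 3 can be sketched as below. The sketch covers only the feature construction (FPKM divided by the FPKM of the chosen reference gene); the random forest training and marker selection are not shown. The function name `ratio_features`, the gene names other than GeneA and the values are hypothetical.

```python
def ratio_features(sample_fpkm, reference_gene):
    """Return FPKM / FPKM_reference for every gene except the reference itself.

    sample_fpkm: dict mapping gene name -> FPKM for one sample.
    """
    reference_value = sample_fpkm[reference_gene]
    if reference_value <= 0:
        raise ValueError("reference gene must be expressed in the sample")
    return {
        gene: value / reference_value
        for gene, value in sample_fpkm.items()
        if gene != reference_gene
    }


# Hypothetical single-sample expression profile.
sample = {"GeneA": 200.0, "Gene1": 50.0, "Gene2": 400.0}
features = ratio_features(sample, "GeneA")
print(features)
```

Because each feature is a ratio to a gene measured in the same sample, the same feature can in principle be formed from relative qPCR measurements, which is how the auxiliary models are applied later in the description.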

Step four: light and severe disease prediction

1. A model is built with the selected marker genes (Group 1) and used to predict whether a hand-foot-and-mouth sample analysed by high-throughput sequencing is mild or severe.

2. A model is built with the selected marker genes (Group 2), with GeneA as the reference, and used to predict whether the tested hand-foot-and-mouth sample is mild or severe.

3. A model is built with the selected marker genes (Group 3), with GeneB as the reference, and used to predict whether the tested hand-foot-and-mouth sample is mild or severe; the prediction is combined with the result of item 2, the result is adopted when the two predictions agree, and no prediction is made when they disagree.
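The agreement rule in item 3 can be written as a small function. This is a sketch only: the label strings and the use of `None` for "no prediction" are illustrative choices, not terms from the application.

```python
from typing import Optional


def consensus_call(first_result: str, second_result: str) -> Optional[str]:
    """Adopt the call only when both auxiliary models agree; otherwise make no prediction."""
    return first_result if first_result == second_result else None


print(consensus_call("severe", "severe"))  # both models agree
print(consensus_call("mild", "severe"))    # models disagree, so no call is made
```

Returning no call on disagreement trades coverage for confidence: fewer samples receive a prediction, but every reported prediction is backed by both reference-gene models.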

For ease of understanding, the flow of the present application for distinguishing mild and severe hand-foot-and-mouth samples is shown in Fig. 1.

In another aspect, the application provides a device for distinguishing hand-foot-and-mouth samples. According to an embodiment of the application, referring to fig. 2, the apparatus comprises: a first expression level determining module 100, configured to determine an expression level of each gene in a first marker gene combination of a sample to be tested; a first classification module 200, configured to input the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group, where the first marker gene combination and the first classification model are established according to the foregoing method.

Specifically, according to an embodiment of the present application, referring to fig. 3, the apparatus further includes: a second expression level determining module 300 for determining the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination by a qPCR method; a first auxiliary classification module 400, configured to distinguish the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group using a first auxiliary classification model based on the gene expression levels of the first auxiliary marker gene combination, so as to obtain a first distinguishing result; a second auxiliary classification module 500, configured to distinguish the hand-foot-and-mouth sample between the light symptom group and the heavy symptom group using a second auxiliary classification model based on the gene expression levels of the second auxiliary marker gene combination, so as to obtain a second distinguishing result; and a judging module 600, configured to select, as the judging result, the distinguishing result on which the first distinguishing result and the second distinguishing result agree, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established according to the method described above.

The application will be further illustrated with reference to specific examples. The experimental methods used in the following examples are conventional methods unless otherwise specified. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.

Examples

1. Test example sample

PBMC samples from 35 patients with severe hand-foot-and-mouth disease

PBMC samples from 30 patients with mild hand-foot-and-mouth disease

2. Test analysis flow

1) 3 mL of peripheral blood was taken, PBMCs were isolated, RNA was extracted, and the RNA was subjected to library construction and sequencing (second-generation high-throughput sequencing).

2) Low-quality sequences were removed from the raw off-machine sequencing data to obtain clean data.

3) The clean data were aligned to the reference genes using Bowtie 2.

4) Gene expression amount FPKM was calculated using RSEM.

5) The coefficient of variation of the gene was calculated and the relatively stable genes (GPI, GAPDH) were selected.

6) Marker genes were selected with 80% of samples:

Using the gene expression level FPKM, a set of marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, UBP1) was selected using a random forest model; the results are shown in FIGS. 4 and 5.

With the expression level of the GPI gene (FPKM_GPI) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GPI) was calculated, and a set of marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2) was selected for this ratio using a random forest model; the results are shown in FIGS. 6 and 7.

With the expression level of the GAPDH gene (FPKM_GAPDH) as the reference, the ratio of the expression level FPKM of each other gene to it (FPKM/FPKM_GAPDH) was calculated, and a set of marker genes (QSOX1, VIM, ZEB2, C9orf16) was selected for this ratio using a random forest model; the results are shown in FIGS. 8 and 9.

In FIGS. 4 to 9, ROC stands for receiver operating characteristic and AUC for area under the curve; Specificity and Sensitivity have their usual meanings; Case denotes severe cases; Control denotes mild cases.

7) Light and severe disease prediction for the remaining 20% of samples

a. A model was built with the selected marker genes (FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2 and UBP1) and used to predict whether the hand-foot-and-mouth samples analysed by high-throughput sequencing were mild or severe.

b. A model was built with the selected marker genes (GAS6-AS2, UBR4, C9orf16, IFNAR2 and YEATS2), with the GPI gene as the reference, and used to predict whether the hand-foot-and-mouth samples tested by qPCR were mild or severe.

c. A model was built with the selected marker genes (QSOX1, VIM, ZEB2 and C9orf16), with the GAPDH gene as the reference, and used to predict whether the hand-foot-and-mouth samples tested by qPCR were mild or severe; the prediction was combined with the result of b, the result was adopted when the two predictions agreed, and no prediction was made when they disagreed.
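The 80%/20% partition used in items 6) and 7) could be produced along the following lines. The application does not state how samples were assigned to the two parts, so the per-class (stratified) split, the fixed random seed and the function name `stratified_split` are assumptions made purely for illustration; only the group sizes (35 severe, 30 mild) come from the example.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train and test sets, keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Group sizes taken from the test example: 35 severe and 30 mild samples.
labels = ["severe"] * 35 + ["mild"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these group sizes the sketch places 28 severe and 24 mild samples in the training part and leaves 7 severe and 6 mild samples for prediction.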

Table 1: individual model efficiency evaluation table

MCC: Matthews correlation coefficient, ranging over [-1, 1]; -1 represents a completely contradictory prediction, 1 represents a completely correct prediction, and 0 represents a random prediction.
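For reference, the MCC can be computed from a confusion matrix as in the sketch below, with severe cases treated as the positive class in line with the Case/Control convention of the figures. The label strings and the toy predictions are hypothetical and are not the results of Table 1; returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive="severe"):
    """Return (tp, tn, fp, fn) with `positive` as the case class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Hypothetical predictions for six samples.
y_true = ["severe", "severe", "severe", "mild", "mild", "mild"]
y_pred = ["severe", "severe", "mild", "mild", "mild", "severe"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, tn, fp, fn, sensitivity, specificity, mcc(tp, tn, fp, fn))
```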

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict one another.

While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims

1. A method of constructing a model for hand-foot-and-mouth sample classification, comprising:

(1) Sequencing nucleic acid samples from a plurality of hand-foot-and-mouth patients and obtaining sequencing data for each patient, wherein the plurality of hand-foot-and-mouth patients includes a light symptom group and a heavy symptom group;

(2) Determining the expression level of each gene in the initial gene set of each patient by comparing the sequencing data with a reference genome;

(3) Determining a reference gene set based on the coefficient of variation of the expression level of each gene in the initial gene set in each patient, the coefficient of variation of the reference gene being less than a predetermined threshold;

(4) Performing a first classification training using the expression level of the gene determined in step (2) as a training feature and the light symptom group and the heavy symptom group as training sets, so as to obtain a first marker gene combination and a first classification model for distinguishing light symptoms from heavy symptoms;

(5) Selecting one reference gene from the reference gene set, taking the ratio of the rest genes in the initial gene set to the reference genes as training characteristics, and adopting the light symptom group and the heavy symptom group as training sets to carry out auxiliary classification training so as to obtain an auxiliary marker gene combination and an auxiliary classification model for distinguishing light symptoms from heavy symptoms.

2. The method of claim 1, wherein the reference genes comprise GPI and GAPDH.

3. The method of claim 1, wherein the first marker gene combination comprises FGFR1OP2, IFNAR2, PAFAH1B1, PTPRC, HNRNPF, YEATS2, UBP1.

4. The method of claim 1, wherein the first classification training and the auxiliary classification training are each independently random forest model classification training.

5. The method of claim 1, wherein in step (5), the auxiliary classification training is performed separately for each reference gene in the set of reference genes to obtain a plurality of auxiliary marker gene combinations and a corresponding plurality of auxiliary classification models.

6. The method of claim 5, wherein the first reference gene is GPI and the first auxiliary marker gene combination comprises GAS6-AS2, UBR4, C9orf16, IFNAR2, YEATS2,

the second reference gene is GAPDH, and the second auxiliary marker gene combination comprises QSOX1, VIM, ZEB2, C9orf16.

7. A method for distinguishing hand-foot-and-mouth samples, comprising:

determining the expression quantity of each gene in a first marker gene combination of a sample to be tested;

inputting the expression level result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-and-mouth sample between a light symptom group and a heavy symptom group,

wherein said first marker gene combination and said first classification model are established by the method of any one of claims 1 to 6.

8. The method of claim 7, wherein the expression level of the first marker gene combination is obtained by high-throughput sequencing.

9. The method of claim 7, further comprising:

determining, by a qPCR method, the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination;

distinguishing the hand-foot-mouth sample between a light symptom group and a heavy symptom group by using a first auxiliary classification model based on the gene expression quantity of the first auxiliary marker gene combination so as to obtain a first distinguishing result;

distinguishing the hand-foot-mouth sample between a light symptom group and a heavy symptom group by using a second auxiliary classification model based on the gene expression quantity of the second auxiliary marker gene combination so as to obtain a second distinguishing result;

selecting, as a judgment result, the discrimination result on which the first discrimination result and the second discrimination result are identical, wherein the first auxiliary marker gene combination, the second auxiliary marker gene combination, the first auxiliary classification model and the second auxiliary classification model are established by the method of claim 5 or 6.

10. An apparatus for distinguishing hand-foot-and-mouth samples, comprising:

the first expression quantity determining module is used for determining the expression quantity of each gene in a first marker gene combination of the sample to be detected;

a first classification module for inputting the expression quantity result of the first marker gene combination into a first classification model so as to distinguish the hand-foot-mouth sample between a light symptom group and a heavy symptom group,

11. The apparatus of claim 10, wherein the expression level of the first marker gene combination is obtained by high-throughput sequencing.

12. The apparatus of claim 10, further comprising:

a second expression level determining module for determining the expression level of each gene in the first auxiliary marker gene combination and the second auxiliary marker gene combination by qPCR method;

the first auxiliary classification module is used for distinguishing the hand-foot-mouth sample between the light symptom group and the heavy symptom group by utilizing a first auxiliary classification model based on the gene expression quantity of the first auxiliary marker gene combination so as to obtain a first distinguishing result;

the second auxiliary classification module is used for distinguishing the hand-foot-mouth sample between the light symptom group and the heavy symptom group by using a second auxiliary classification model based on the gene expression quantity of the second auxiliary marker gene combination so as to obtain a second distinguishing result;

the judging module is used for selecting the distinguishing result with the same first distinguishing result and the second distinguishing result as judging results;

wherein the first and second auxiliary marker gene combinations and the first and second auxiliary classification models are established by the method of claim 5 or 6.