CN110942808A - Prognosis prediction method and prediction system based on gene big data

Info

Publication number: CN110942808A
Application number: CN201911256723.7A
Authority: CN
Inventors: 张海霞; 刘艺迪; 袁东风
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-03-31

Abstract

The invention relates to a prognosis prediction method and a prediction system based on gene big data, belonging to the technical field of artificial intelligence. The method mainly comprises the following steps: extracting gene information from tissue samples to form a training set; ranking the genes by importance using the Relief algorithm; fitting and classifying the prognosis time with machine learning algorithm models; and selecting the algorithm model with the highest accuracy, together with its corresponding number of gene features, as the gene feature number and prediction method for the specific disease. Once model training is complete, the method can quickly test new gene data and assist prognosis evaluation.

Description

Prognosis prediction method and prediction system based on gene big data

Technical Field

The invention relates to a cancer prognosis prediction method and a prediction system based on gene big data, belonging to the technical field of artificial intelligence.

Background

According to annual statistics reported by the American Cancer Society, one in four cancer deaths is caused by lung cancer. Although researchers have acquired a large amount of data through microarray technology and next-generation sequencing (NGS), the information in these data may not have been fully explored. Traditional survival prediction depends on the patient's clinical pathology and is sometimes inaccurate.

In recent years, the development of next-generation sequencing technology has made large-scale gene sequencing data from cancer samples available, and the development of big-data artificial intelligence has made it possible to mine valuable latent information from such massive data. At present, cancer prognosis is generally predicted from intuitive clinical characteristics combined with traditional statistical methods. Although some studies have shifted their focus to the level of gene characteristics, they still use traditional statistical methods to select gene characteristics according to differences in gene expression, so genes with small expression differences but a large influence on prognosis cannot be found. To improve accuracy, the present application correlates gene features selected from the above data with patient survival time, determines the correlation between the genes and survival time, and thereby obtains a calibrated predictive model.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a novel cancer prognosis modeling method and a novel prediction system, and relates to a method for predicting and classifying cancer prognosis time based on gene big data and to the discovery of the relevant pathogenic genes. The method is simple and efficient, and is applicable to a wide range of cancers on the basis of gene expression. It comprises the steps of screening and cleaning sample data, screening and cleaning gene data, ranking gene importance, training and selecting a model, and finally predicting new samples. It helps doctors estimate the condition of cancer patients in advance and assists in treatment.

The technical scheme of the invention is as follows:

A prognosis prediction method based on gene big data comprises the following steps:

(1) collecting and fusing data: collecting fresh or frozen cancer tissue samples from patients, sequencing them to obtain gene data, and obtaining clinical data on the patients' survival time and survival state through follow-up investigation; fusing the gene data with the clinical data by matching each sample to its corresponding clinical data, namely the survival time data, according to the sample name; deleting samples whose survival time is missing; and, after sequencing has yielded raw count values, standardizing the gene data into FPKM (Fragments Per Kilobase of transcript per Million mapped reads) format for subsequent processing;

(2) screening samples according to the prescribed conditions of the clinical data: selecting samples whose survival state in the clinical data is death, and samples whose survival state is survival with a survival time of more than two years; samples that are surviving but have a survival time of less than two years are discarded, because it cannot be determined whether their final survival time will fall in the longer group (>3 years) or the shorter group (≤3 years).

(3) Screening samples according to the specified conditions of the genetic data:

deleting gene features whose expression is undetected in too many samples, and normalizing the gene data; specifically, if a gene has zero expression in more than 85 percent of the samples, the gene is regarded as undetected in most samples and the feature is discarded; the normalization method is to divide the FPKM value of each gene by the maximum expression value of that gene, so that the FPKM value of each gene lies between 0 and 1; gene expression here refers to the measured amount of functional gene product synthesized from the genetic information of a gene, and is data in FPKM format;

then deleting the non-primary-tumor samples, i.e., the non-cancer tissue samples, from the data set, leaving only the cancer tissue samples;

(4) dividing the samples screened in step (2) and step (3) into two classes according to prognosis time, namely prognosis time of more than three years and prognosis time of less than or equal to three years, and ranking the genes by importance using the Relief algorithm; taking a certain number of gene data, using at least two machine learning algorithms in turn to cross-validate cancer prognosis while gradually increasing the number of gene features, and selecting the optimal model and gene feature number by comparing the results, wherein the gene feature number is the number of gene data used.
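The gene-level part of step (3) can be illustrated with a minimal sketch. It assumes the expression data are held as a list of samples, each a list of non-negative FPKM values aligned with a list of gene names; the function name, the data layout and the toy values are hypothetical and not part of the original disclosure.

```python
def filter_and_normalize(matrix, gene_names, max_zero_fraction=0.85):
    """Drop rarely detected genes, then scale each kept gene to [0, 1].

    matrix: list of samples, each a list of FPKM values aligned with gene_names
    (assumed layout, for illustration only).
    """
    n_samples = len(matrix)
    kept_names, kept_columns = [], []
    for j, name in enumerate(gene_names):
        column = [row[j] for row in matrix]
        zero_fraction = sum(1 for value in column if value == 0) / n_samples
        if zero_fraction > max_zero_fraction:
            continue  # zero in more than 85% of samples: feature is discarded
        peak = max(column)  # positive, because at least one value is non-zero
        kept_names.append(name)
        kept_columns.append([value / peak for value in column])
    # Turn the kept columns back into one row per sample.
    normalized = [list(row) for row in zip(*kept_columns)]
    return kept_names, normalized


# Toy data: 10 samples, 3 genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
fpkm = [[float(i + 1), 0.0, 0.0] for i in range(10)]
fpkm[0][1] = 5.0   # GENE_B is zero in 9 of 10 samples (90%) and is dropped
fpkm[0][2] = 4.0   # GENE_C is zero in 8 of 10 samples (80%) and is kept
fpkm[1][2] = 2.0
names, normalized = filter_and_normalize(fpkm, genes)
```

The sketch covers only the zero-expression filter and the division by the per-gene maximum; removing non-primary-tumor samples depends on how sample types are encoded in the data set and is left out.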

Preferably, in step (4), the importance ranking is as follows: training the Relief algorithm generates a corresponding weight for each feature, namely each gene; the higher the weight, the more the gene contributes to distinguishing the two groups of samples (prognosis time of more than three years and prognosis time of less than or equal to three years), the more important the gene is, and the higher it is ranked.

Preferably, in the step (4), the number of the gene data is selected to be at least one.

Preferably, in step (4), gene data are taken and 8 machine learning algorithm models are used in turn to cross-validate cancer prognosis while gradually increasing the number of gene features, the 8 machine learning algorithm models being a support vector machine, a random forest, Logistic regression, naive Bayes, linear regression, support vector regression with a polynomial kernel function, support vector regression with a linear kernel function, and ridge regression;

the 8 algorithm models are trained separately and the results are recorded; when each algorithm model is trained, 1 gene data is taken for training first, then two gene data, and the number of gene data used for training is increased in sequence; each training run of an algorithm model yields an accuracy, which is recorded, wherein the accuracy is obtained by comparing the prognosis time given by the algorithm model with the survival time recorded in the actual clinical data, and is the ratio of the number of samples with an accurate prognosis to the total number of samples; for each algorithm model, the number of selected gene data corresponding to its highest accuracy is recorded; and the results of the 8 algorithm models are compared, and the algorithm model with the highest accuracy and its corresponding number of selected gene data are chosen.

Preferably, in step (4), ten-fold cross-validation (10-fold cross-validation) is adopted for each training run to test the accuracy of the algorithm model.

A prognosis prediction system based on gene big data comprises a data preprocessing module, a screening module and a training verification module. The data preprocessing module is used for downloading data from the public database TCGA (The Cancer Genome Atlas) and standardizing the data into FPKM format, the data comprising gene data and clinical data. The screening module is used for screening the data according to two types of conditions, namely the prescribed conditions of the clinical data and the specified conditions of the gene data. The training verification module comprises at least two algorithm models and is used for classifying the samples screened by the screening module, ranking the genes by importance using the Relief algorithm, and training the different algorithm models on the input data separately; the training verification module then compares the results of the different algorithm models and selects the algorithm model with the highest accuracy together with its corresponding number of selected gene data.

After the main pathogenic genes and the model for a given cancer have been determined, new gene data can be fed directly into the trained model for prediction, so that a clinical judgment can be made and a reference provided.

The invention has the beneficial effects that:

The invention provides a method for modeling cancer patient prognosis that combines a feature importance ranking algorithm with a plurality of classification and fitting models. The method ranks gene features by their importance in distinguishing two groups with markedly different survival times (the >3 years group and the ≤3 years group) and then screens them in combination with different machine learning models, so that it can not only predict the prognosis time of different cancers accurately but also supplement and support the discovery of oncogenes of different cancers and of key genes influencing prognosis.

Drawings

FIG. 1 is a schematic diagram of a data processing flow;

FIG. 2 is an overall flowchart.

Detailed Description

The present invention is further described below by way of examples and with reference to the accompanying drawings, but is not limited thereto.

Example 1:

(1) collecting and fusing data;

collecting fresh or frozen cancer tissue samples from patients, sequencing them to obtain gene data, and obtaining clinical data on the patients' survival time and survival state through follow-up investigation; the present embodiment uses a public data set: taking lung adenocarcinoma as an example, the lung adenocarcinoma (LUAD) data, including gene data and clinical data, are downloaded from the public database TCGA (https://portal.gdc.cancer.gov/);

fusing the gene data with the clinical data by matching each sample to its corresponding clinical data, namely the survival time data, according to the sample name; deleting samples whose survival time is missing; and, after sequencing has yielded raw count values, standardizing the gene data into FPKM (Fragments Per Kilobase of transcript per Million mapped reads) format for subsequent processing.
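Step (1) can be sketched as follows. The sketch uses the standard definition FPKM = count × 10^9 / (gene length in bp × total mapped fragments) and, as a simplifying assumption, approximates the total mapped fragments of a sample by the sum of its gene counts. The field names `survival_days` and `status`, the gene lengths and the toy data are hypothetical.

```python
def to_fpkm(raw_counts, gene_lengths_bp):
    """Convert one sample's raw fragment counts to FPKM.

    Assumption: total mapped fragments is approximated by the sum of the
    per-gene counts of the sample.
    """
    total = sum(raw_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in raw_counts}
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total)
        for gene, count in raw_counts.items()
    }


def fuse(expression, clinical):
    """Join expression and clinical records by sample name.

    Samples with no clinical record or no survival time are dropped.
    """
    fused = {}
    for sample, genes in expression.items():
        record = clinical.get(sample)
        if record is None or record.get("survival_days") is None:
            continue
        fused[sample] = {"genes": genes, **record}
    return fused


# Hypothetical toy data: two genes, three samples.
lengths = {"GENE_A": 2000, "GENE_B": 1000}
counts = {
    "S1": {"GENE_A": 100, "GENE_B": 300},
    "S2": {"GENE_A": 50, "GENE_B": 50},
    "S3": {"GENE_A": 10, "GENE_B": 0},
}
clinical = {
    "S1": {"survival_days": 1500, "status": "dead"},
    "S2": {"survival_days": None, "status": "alive"},  # missing survival time
}
expression = {sample: to_fpkm(c, lengths) for sample, c in counts.items()}
fused = fuse(expression, clinical)  # only S1 survives the join
```

Public TCGA downloads may already provide FPKM values, in which case only the join would be needed.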

(2) Screening samples according to the prescribed conditions of clinical data:

selecting samples whose survival state in the clinical data is death, and samples whose survival state is survival with a survival time of more than two years; samples that are surviving but have a survival time of less than two years are discarded, because it cannot be determined whether their final survival time will fall in the longer group (>3 years) or the shorter group (≤3 years).
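The clinical screening rule can be sketched as below, under one reading of the text: deceased patients are always kept, and surviving patients are kept only if their recorded survival time exceeds two years. The sketch assumes survival time is recorded in days and that the status values are the strings `"dead"` and `"alive"`; both are illustrative assumptions.

```python
TWO_YEARS_DAYS = 2 * 365  # assumption: survival time is recorded in days


def keep_sample(status, survival_days):
    """Return True if the sample passes the clinical screening rule."""
    if status == "dead":
        return True
    # Surviving patients are kept only with more than two years of follow-up.
    return status == "alive" and survival_days > TWO_YEARS_DAYS


# Hypothetical samples: (name, status, survival time in days).
samples = [
    ("S1", "dead", 300),
    ("S2", "alive", 400),    # alive with short follow-up: discarded
    ("S3", "alive", 1200),
    ("S4", "dead", 1500),
]
kept = [name for name, status, days in samples if keep_sample(status, days)]
```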

(4) dividing the samples screened in step (2) and step (3) into two classes according to prognosis time: samples with a prognosis time of more than three years and samples with a prognosis time of less than or equal to three years;
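The two-class split can be written as a one-line labeling rule. The sketch assumes survival time in days and encodes the longer group as 1 and the shorter group as 0; both the unit and the encoding are illustrative choices.

```python
THREE_YEARS_DAYS = 3 * 365  # assumption: survival time is recorded in days


def prognosis_label(survival_days):
    """Return 1 for prognosis time > 3 years, 0 for <= 3 years."""
    return 1 if survival_days > THREE_YEARS_DAYS else 0


labels = [prognosis_label(days) for days in (300, 1095, 1096, 2000)]
```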

the genes are ranked by importance using the Relief algorithm: training the Relief algorithm generates a corresponding weight for each feature, namely each gene; the higher the weight, the more the gene contributes to distinguishing the two groups of samples (prognosis time of more than three years and prognosis time of less than or equal to three years), the more important the gene is, and the higher it is ranked.
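The text does not state which Relief variant is used. The following is a minimal sketch of the basic two-class Relief update: for each sample, find the nearest same-class neighbor (hit) and the nearest other-class neighbor (miss), then raise the weight of features that differ on the miss and lower it for features that differ on the hit. As a simplification, it visits every sample once instead of drawing random samples, and it assumes features are already scaled to [0, 1]. All names and data are hypothetical.

```python
def relief_weights(X, y):
    """Basic two-class Relief weights (deterministic variant, one pass)."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        # Manhattan distance over all features.
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n_samples):
        hits = [j for j in range(n_samples) if j != i and y[j] == y[i]]
        misses = [j for j in range(n_samples) if y[j] != y[i]]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda j: distance(X[i], X[j]))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            weights[f] += (abs(X[i][f] - X[near_miss][f])
                           - abs(X[i][f] - X[near_hit][f])) / n_samples
    return weights


def rank_genes(weights):
    """Return feature indices sorted from most to least important."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.5], [0.1, 0.9], [0.2, 0.1], [0.8, 0.4], [0.9, 0.8], [1.0, 0.3]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
ranking = rank_genes(weights)
```

On this toy data the separating feature receives a positive weight and the uninformative one a negative weight, so the separating feature is ranked first.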

The 8 machine learning algorithm models are then used in turn to cross-validate cancer prognosis while gradually increasing the number of gene features; the 8 models are a support vector machine, a random forest, Logistic regression, naive Bayes, linear regression, support vector regression with a polynomial kernel function, support vector regression with a linear kernel function, and ridge regression, all of which are existing models.

The 8 algorithm models are trained separately and the results are recorded. When each algorithm model is trained, 1 gene data is taken for training first, then two gene data, and the number of gene data is increased in sequence up to 200. Each training run of an algorithm model yields an accuracy, which is recorded; the accuracy is obtained by comparing the prognosis time given by the algorithm model with the survival time recorded in the actual clinical data, and is the ratio of the number of samples with an accurate prognosis to the total number of samples. For each algorithm model, the number of selected gene data corresponding to its highest accuracy is recorded. The results of the 8 algorithm models are then compared, and the algorithm model with the highest accuracy and its corresponding number of selected gene data are chosen.
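The selection loop can be sketched as follows. To stay self-contained, the sketch swaps the eight models named above for two simple stand-in classifiers (nearest centroid and one-nearest-neighbor); these are not the models of the disclosure and only illustrate the sweep. It also assumes the feature columns are already ordered by Relief rank, so that the first n columns are the n top-ranked genes, and it uses a simple strided fold assignment. All names and data are hypothetical.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Stand-in model: predicts the class whose mean vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        return [min(self.centroids,
                    key=lambda c: squared_distance(x, self.centroids[c]))
                for x in X]


class OneNearestNeighbor:
    """Stand-in model: predicts the label of the closest training sample."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        return [self.y[min(range(len(self.X)),
                           key=lambda i: squared_distance(x, self.X[i]))]
                for x in X]


def cross_validated_accuracy(model_factory, X, y, k=10):
    """Correctly predicted samples divided by all samples, over k folds."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # strided fold assignment
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = model_factory().fit([X[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        predicted = model.predict([X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(predicted, test_idx))
    return correct / n


def select_model_and_gene_count(models, X_ranked, y, max_genes, k=10):
    """Sweep 1..max_genes top-ranked genes for every model; keep the best."""
    best = (None, 0, -1.0)  # (model name, number of genes, accuracy)
    for name, factory in models.items():
        for n_genes in range(1, max_genes + 1):
            X_top = [row[:n_genes] for row in X_ranked]
            acc = cross_validated_accuracy(factory, X_top, y, k)
            if acc > best[2]:  # ties keep the earlier, smaller configuration
                best = (name, n_genes, acc)
    return best


# Toy data: 20 samples; column 0 is informative, columns 1 and 2 are not.
y = [i % 2 for i in range(20)]
X_ranked = [
    [0.1 + 0.01 * i if y[i] == 0 else 0.9 - 0.01 * i,
     (i * 7 % 10) / 10, (i * 3 % 10) / 10]
    for i in range(20)
]
models = {"nearest_centroid": NearestCentroid,
          "one_nearest_neighbor": OneNearestNeighbor}
best = select_model_and_gene_count(models, X_ranked, y, max_genes=3)
```

In the embodiment the sweep would run up to 200 genes; the toy call stops at 3. How the regression-type models in the list are turned into class predictions is not stated in the text, so any thresholding rule would be a further assumption.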

Ten-fold cross-validation is adopted for each training run to test the accuracy of the algorithm model; it is implemented by dividing the data set into ten parts and, in turn, taking 9 parts as training data and 1 part as test data.
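The ten-part split can be sketched as an index generator. The sketch assigns consecutive samples to folds without shuffling, which is an assumption; the text does not say whether the data are shuffled or stratified before splitting. The function name is hypothetical.

```python
def ten_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(ten_fold_splits(25))
```

Each sample appears in exactly one test part, so every sample is predicted once by a model that did not see it during training.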

Example 2:

The optimal prognosis model for different cancers and the corresponding number of gene data can be selected from the training results of the algorithm models. For a new sample, sequencing is performed to obtain gene expression values, the corresponding gene features are then selected according to the determined optimal gene feature number, and the trained model is used for prediction.
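Preparing a new sample for prediction can be sketched as below. The sketch assumes that the selected features are the top-ranked genes from the Relief ranking and that the new sample is scaled by the per-gene maximum observed in the training set; the text does not specify how new samples are normalized, so this is an illustrative choice. Gene names and values are hypothetical.

```python
def prepare_new_sample(fpkm, ranked_genes, n_genes, training_max):
    """Build the feature vector of one new sample.

    fpkm: gene name -> FPKM value of the new sample
    ranked_genes: gene names ordered from most to least important
    n_genes: optimal gene feature number found during training
    training_max: gene name -> maximum FPKM seen in training (assumed scaling)
    """
    selected = ranked_genes[:n_genes]
    return [fpkm.get(gene, 0.0) / training_max[gene] for gene in selected]


ranked_genes = ["GENE_C", "GENE_A", "GENE_B"]
training_max = {"GENE_A": 40.0, "GENE_B": 10.0, "GENE_C": 8.0}
new_sample = {"GENE_A": 10.0, "GENE_B": 3.0, "GENE_C": 6.0}
features = prepare_new_sample(new_sample, ranked_genes, 2, training_max)
```

The resulting vector would then be passed to the prediction function of whichever trained model was selected. Values above 1 are possible if the new sample exceeds the training maximum.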

Claims

1. A prognosis prediction method based on gene big data is characterized by comprising the following steps:

(1) collecting and fusing data: collecting fresh or frozen cancer tissue samples from patients, sequencing them to obtain gene data, and obtaining clinical data on the patients' survival time and survival state through follow-up investigation; fusing the gene data with the clinical data by matching each sample to its corresponding clinical data according to the sample name; deleting samples whose survival time is missing; and, after sequencing has yielded raw count values, standardizing the gene data into FPKM format for subsequent processing;

(2) screening samples according to the prescribed conditions of the clinical data: selecting samples whose survival state in the clinical data is death, and samples whose survival state is survival with a survival time of more than two years;

(3) screening samples according to the specified conditions of the gene data: deleting gene features whose expression is undetected in too many samples, and normalizing the gene data; specifically, if a gene has zero expression in more than 85 percent of the samples, the gene is regarded as undetected in most samples; the normalization method is to divide the FPKM value of each gene by the maximum expression value of that gene, so that the FPKM value of each gene lies between 0 and 1; gene expression here refers to the measured amount of functional gene product synthesized from the genetic information of a gene, and is data in FPKM format;

then deleting the non-cancer tissue sample from the data set, and only keeping the cancer tissue sample;

2. The prognosis prediction method based on gene big data according to claim 1, wherein in step (4) the importance ranking is as follows: training the Relief algorithm generates a corresponding weight for each gene; the higher the weight, the more the gene contributes to distinguishing the two groups of samples, the more important the gene is, and the higher it is ranked.

3. The method according to claim 1, wherein in the step (4), at least one gene data is selected.

4. The prognosis prediction method based on gene big data according to claim 1, wherein in step (4) gene data are taken and 8 machine learning algorithm models are used in turn to cross-validate cancer prognosis while gradually increasing the number of gene features, the 8 machine learning algorithm models being a support vector machine, a random forest, Logistic regression, naive Bayes, linear regression, support vector regression with a polynomial kernel function, support vector regression with a linear kernel function, and ridge regression.

5. The prognosis prediction method based on gene big data according to claim 4, wherein in step (4) ten-fold cross-validation is adopted for each training run to test the accuracy of the algorithm model, and is implemented by dividing the data set into ten parts and, in turn, taking 9 parts as training data and 1 part as test data.

6. A prognosis prediction system based on gene big data, characterized by comprising a data preprocessing module, a screening module and a training verification module, wherein the data preprocessing module is used for downloading data from the public database TCGA (The Cancer Genome Atlas) and standardizing the data into FPKM format, the data comprising gene data and clinical data; the screening module is used for screening the data according to two types of conditions, namely the prescribed conditions of the clinical data and the specified conditions of the gene data; the training verification module comprises at least two algorithm models and is used for classifying the samples screened by the screening module, ranking the genes by importance using the Relief algorithm, and training the different algorithm models on the input data separately; and the training verification module compares the results of the different algorithm models and selects the algorithm model with the highest accuracy together with its corresponding number of selected gene data.