WO2022048071A1 - Tumor risk grading method and system, terminal, and storage medium - Google Patents

Publication number
WO2022048071A1
WO2022048071A1 PCT/CN2020/139298 CN2020139298W WO2022048071A1 WO 2022048071 A1 WO2022048071 A1 WO 2022048071A1 CN 2020139298 W CN2020139298 W CN 2020139298W WO 2022048071 A1 WO2022048071 A1 WO 2022048071A1
Authority
WO
WIPO (PCT)
Prior art keywords
patient
prognosis
gene
value
status information
Application number
PCT/CN2020/139298
Other languages
French (fr)
Chinese (zh)
Inventor
李霞
蔡云鹏
Original Assignee
中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022048071A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present application belongs to the field of biomedical technology, and in particular relates to a method, system, terminal and storage medium for classifying tumor risk levels.
  • the current general method for classifying different risk groups based on gene expression levels mainly includes three steps:
  • Population grouping such as dividing the population into tumor group and normal group, metastasis group and non-metastasis group, and then obtaining differentially expressed genes between groups according to gene expression levels;
  • the shortcomings of the existing risk group classification methods include:
  • the present application provides a method, system, terminal, and storage medium for classifying tumor risk levels, aiming to solve one of the above-mentioned technical problems in the prior art at least to a certain extent.
  • a tumor risk grade classification method comprising the following steps:
  • transcriptome data at least includes the gene expression value of each gene in the patient
  • the patient is scored according to the molecular marker and the gene expression value, and the patient is divided into tumor risk levels according to the score distribution.
  • the technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's transcriptome data and clinical survival status information includes:
  • the gene expression value of each gene in the transcriptome data is log-transformed, genes with low expression levels are removed, and only genes expressed in more than a set proportion of patients are retained; the clinical survival status information of the patient is then obtained from the transcriptome data;
  • the clinical survival status information includes patient ID number, survival status and survival time.
  • the technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information includes:
  • the technical solution adopted in the embodiment of the present application further includes: the sorting of the gene expression values of each gene in the patient, and obtaining the high and low expression populations corresponding to each gene respectively include:
  • the high and low expression populations corresponding to each gene were obtained by the method of dichotomy, trichotomy, or quartile, respectively.
  • the technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information includes:
  • the association between expression level and prognosis is calculated separately for the high and low expression populations obtained by each algorithm. If a gene is associated with prognosis in the high and low expression populations obtained by at least two algorithms, the gene is determined to be a prognosis-related gene.
  • the technical solutions adopted in the embodiments of the present application further include: the inputting the clinical survival status information and prognosis-related genes into a statistical model for molecular marker screening includes:
  • the statistical model is a lasso-penalized Cox proportional hazards model. The model uses cross-validation to compute the Lambda value a set number of times, determines the final Lambda value according to the minimum mean standard error value over all Lambda values, and screens the molecular markers from the prognosis-related genes according to the final Lambda value.
  • the technical solution adopted in the embodiment of the present application further includes: the scoring of the patient according to the molecular marker and the gene expression level includes:
  • the regression coefficient of each molecular marker is multiplied by the gene expression value of that molecular marker in the patient to obtain the score value of each molecular marker in the patient, and the score values of all molecular markers of the patient are then added to obtain the final score of the patient.
  • the classification of tumor risk levels for the patient according to the score distribution includes:
  • a tumor risk grade classification system comprising:
  • Data acquisition module used to acquire the patient's transcriptome data and clinical survival status information; wherein, the transcriptome data at least includes the gene expression value of each gene in the patient;
  • Prognosis-related gene acquisition module used to obtain the prognosis-related genes of the patient according to the transcriptome data and clinical survival status information;
  • Molecular marker screening module used to input the clinical survival status information and prognosis-related genes into a statistical model for molecular marker screening, and obtain the gene expression values of the screened molecular markers in the patient;
  • Risk grading module used to score the patient according to the molecular marker and the gene expression value, and to classify the patient into tumor risk levels according to the score distribution.
  • a terminal includes a processor and a memory coupled to the processor, wherein,
  • the memory stores program instructions for implementing the method for classifying tumor risk levels
  • the processor is configured to execute the program instructions stored in the memory to control tumor risk grading.
  • a storage medium storing program instructions executable by a processor, where the program instructions are used to execute the tumor risk level classification method.
  • the beneficial effects of the embodiments of the present application are: the tumor risk level classification method of the embodiments of the present application obtains the prognosis-related genes of the patients based on their transcriptome data and, on this basis, runs a statistical model multiple times to screen molecular markers from the prognosis-related genes, scores each patient by combining the screened molecular markers with their gene expression levels, and divides the patients into risk levels according to the score distribution.
  • the embodiment of the present application can maximize the retention of all prognosis-related genes, so that the obtained prognosis-related genes have better accuracy, and the acquisition of molecular markers has a lower error rate, and the risk level classification results have better rationality.
  • FIG. 1 is a flowchart of a method for classifying tumor risk levels according to the first embodiment of the present application
  • FIG. 2 is a flowchart of a method for classifying tumor risk levels according to a second embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a tumor risk grade classification system according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the tumor risk grade classification method of the embodiment of the present application uses the transcriptome data of all study patients, combined with their clinical survival information, to obtain the prognosis-related genes of the patients. On this basis, a statistical model is run multiple times to screen molecular markers from the prognosis-related genes, each patient is scored by combining the screened molecular markers with their gene expression levels, and the patients are classified according to the score distribution.
  • FIG. 1 is a flowchart of the method for classifying tumor risk levels according to the first embodiment of the present application.
  • the tumor risk grade classification method according to the first embodiment of the present application includes the following steps:
  • the transcriptome data is input in the form of a matrix: each row is a single patient, each column is a gene, and the value of each cell is the expression level of that gene in that patient (that is, the gene expression value).
  • S110 Preprocess the transcriptome data to obtain the clinical survival status information of all patients
  • the preprocessing operation specifically includes: first, log-transforming the gene expression value of each gene in the transcriptome data, removing genes with low expression levels, and keeping only the genes that are expressed in more than a set proportion of patients (set to 90% in this application; it can be adjusted according to the actual operation); then obtaining clinical survival status information such as the ID number, survival status, and survival time of all patients from the transcriptome data;
  • the clinical survival status information also exists in the form of a matrix. Each row of the matrix is the ID number of a single patient, and each column corresponds to the survival status and survival time of each patient.
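  • The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the function name `preprocess`, the column-wise dictionary layout (gene name mapped to one value per patient), the reading of "expressed" as a raw value greater than zero, and the `+ 1` pseudocount in the log2 transform are all assumptions made for the example.

```python
import math


def preprocess(expr, min_fraction=0.9):
    """Filter and log-transform an expression table.

    expr maps gene name -> list of raw expression values, one per patient.
    A gene is kept only if it is expressed (value > 0, an assumption) in
    more than min_fraction of the patients.
    """
    kept = {}
    for gene, values in expr.items():
        expressed = sum(1 for v in values if v > 0)
        if expressed / len(values) > min_fraction:
            # log2(x + 1): the pseudocount is an assumption that keeps zeros finite.
            kept[gene] = [math.log2(v + 1) for v in values]
    return kept


# Hypothetical toy data: gene A is expressed in 10/10 patients, gene B in 9/10.
toy = {
    "A": [1, 3, 7, 15, 1, 3, 7, 15, 1, 3],
    "B": [0, 3, 7, 15, 1, 3, 7, 15, 1, 3],
}
filtered = preprocess(toy)
```

  • Filtering on the raw values before transforming gives the same result as transforming first, because log2(x + 1) is positive exactly when x is positive.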
  • S120 respectively obtain the high and low expression populations corresponding to each gene according to the gene expression level
  • the algorithms for obtaining high and low gene expression populations include, but are not limited to, dichotomy, trichotomy, and the quartile method.
  • Each algorithm is as follows:
  • Trichotomy: rank the gene expression values of each gene in all patients, and take the top 1/3 and the bottom 1/3 of the population as the high and low expression populations of the corresponding gene;
  • Quartile method: rank the gene expression values of each gene in all patients, use the quartiles as cutoff points, and take the highest quartile and the lowest quartile as the high and low expression populations of the corresponding gene.
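  • The grouping rules can be illustrated with a short sketch. The function name `high_low_groups` is hypothetical, group sizes are rounded down, and the dichotomy rule (top half versus bottom half) is an assumption, since its definition does not appear in this extract.

```python
def high_low_groups(values, method="trichotomy"):
    """Return (high, low) lists of patient indices for one gene.

    values holds the gene's expression value for each patient.
    """
    # Number of equal parts the ranked population is cut into.
    parts = {"dichotomy": 2, "trichotomy": 3, "quartile": 4}[method]
    # Patient indices ranked from highest to lowest expression.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    k = len(values) // parts  # size of each extreme group, rounded down
    return order[:k], order[len(order) - k:]


# Hypothetical toy gene measured in 12 patients (patient i has value i + 1).
toy_values = [float(i + 1) for i in range(12)]
high, low = high_low_groups(toy_values, "trichotomy")
```

  • With trichotomy and the quartile method the middle of the ranking is left out of both groups, so only the extremes are compared in the survival analysis.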
  • S130 Combine the clinical survival status information and the high and low expression populations of each gene to calculate the association between the expression level of each gene and prognosis, obtain the prognosis-related genes, and extract the gene expression values of each prognosis-related gene in all patients to generate a prognosis-related gene expression matrix;
  • the survival tool on the statistical computing platform R is used to calculate the prognosis association of all genes and to output the regression coefficient and P value of each gene.
  • a P-value threshold (0.01 here, which can be set according to the actual operation) is used as the significance condition to filter all genes and obtain the prognosis-related genes. Because different algorithms yield different high and low expression populations, the resulting prognosis-related genes also differ. Therefore, in actual operation, the embodiments of the present application perform the calculation separately on the high and low expression populations obtained by each algorithm; if a gene is associated with prognosis in the high and low expression populations of at least two algorithms, it is determined to be a prognosis-related gene.
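  • The "at least two algorithms" rule amounts to a simple vote over per-algorithm P values. The sketch below assumes the P values have already been computed elsewhere (for example by a survival analysis per grouping algorithm); the function name `prognosis_genes`, the strict `<` comparison, and the numbers in the toy input are illustrative assumptions, not results from the application.

```python
def prognosis_genes(pvalues_by_method, threshold=0.01, min_methods=2):
    """Select genes significant under at least min_methods grouping algorithms.

    pvalues_by_method maps algorithm name -> {gene: P value}.
    """
    counts = {}
    for pvalues in pvalues_by_method.values():
        for gene, p in pvalues.items():
            if p < threshold:  # strict comparison is an assumption
                counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_methods)


# Hypothetical P values for three genes under the three grouping algorithms.
toy_pvalues = {
    "dichotomy": {"G1": 0.002, "G2": 0.200, "G3": 0.008},
    "trichotomy": {"G1": 0.004, "G2": 0.005, "G3": 0.050},
    "quartile": {"G1": 0.030, "G2": 0.300, "G3": 0.009},
}
selected = prognosis_genes(toy_pvalues)
```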
  • S140 Based on the clinical survival status information and the prognosis-related gene expression matrix, screen molecular markers from the prognosis-related genes through a statistical model, obtain the gene expression value of each molecular marker in each patient, and generate the molecular marker expression matrix;
  • the lasso penalty Cox proportional hazards statistical model is used to screen molecular markers.
  • the model uses cross-validation to select a statistic Lambda value, and the corresponding molecular marker is determined according to the Lambda value.
  • the Lambda value obtained differs from run to run, which makes the screened molecular markers highly random. Therefore, in the embodiment of the present application, the model is run in a loop a set number of times (for example, 100 or 1000 times).
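  • One possible reading of the repeated-run selection is sketched below: each cross-validation run reports a candidate Lambda together with its mean cross-validated error, and the candidate with the smallest error across all runs is kept. The function name `select_final_lambda`, the pair layout, and the numbers are assumptions for illustration; the cross-validation itself (done with glmnet in R in the application) is not reproduced here.

```python
def select_final_lambda(cv_runs):
    """Pick the final Lambda from repeated cross-validation runs.

    cv_runs is a list of (lambda_value, mean_cv_error) pairs, one per run.
    Returns the Lambda of the run with the smallest mean error.
    """
    best_lambda, _best_error = min(cv_runs, key=lambda run: run[1])
    return best_lambda


# Hypothetical results of three repeated runs (illustrative numbers only).
runs = [(0.050, 12.3), (0.030, 11.8), (0.040, 12.0)]
final_lambda = select_final_lambda(runs)
```

  • Fixing the final Lambda this way removes the run-to-run variation, so the set of markers with nonzero coefficients at that Lambda is reproducible.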
  • S150 Calculate the score value of each molecular marker in each patient according to the regression coefficient of each molecular marker and its gene expression value in each patient, and then add all the score values of each patient to obtain the final score of each patient;
  • the regression coefficient of each molecular marker is multiplied by its gene expression value in each patient to obtain the score value of each molecular marker in each patient; the score values of all molecular markers are then added to obtain the final score of each patient.
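  • The scoring rule is a weighted sum, which a few lines can make concrete. The function name `risk_score`, the marker names, and the coefficient and expression values below are hypothetical.

```python
def risk_score(coefficients, expression):
    """Sum of (regression coefficient x expression value) over all markers.

    coefficients maps marker gene -> regression coefficient.
    expression maps gene -> expression value for one patient.
    """
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


# Hypothetical markers and one hypothetical patient.
toy_coefficients = {"G1": 0.5, "G2": -0.25}
toy_patient = {"G1": 4.0, "G2": 2.0, "G3": 9.0}  # G3 is not a marker and is ignored
score = risk_score(toy_coefficients, toy_patient)
```

  • Here the score is 0.5 × 4.0 + (−0.25) × 2.0 = 1.5; a marker with a negative coefficient lowers the score as its expression rises.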
  • S160 Combining the high and low score distribution of each patient and the clinical survival status information to classify tumor patients into high and low risk levels;
  • the survminer tool is used to classify the risk levels through the statistical computing platform R.
  • the scores of all patients are sorted from high to low and examined together with the clinical survival status information.
  • a maximally selected rank statistics test is performed to compute the maximally selected log-rank statistic, and the corresponding value is used as the cutoff point to classify patients into different risk levels; the survival tool is then used to evaluate the risk level classification results.
  • the P value is used to determine whether the survival probabilities of the high and low risk levels differ significantly; a significant difference indicates that the risk level classification is reasonable.
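  • The idea of choosing the cutoff that maximizes a log-rank statistic can be sketched in plain Python. This is a simplified stand-in, not the survminer procedure used in the application: it evaluates an ordinary two-group log-rank chi-square at every candidate cutoff and keeps the largest. The function names, the `min_group` parameter, and the toy data are assumptions.

```python
def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (high group vs. the rest)."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)  # at risk overall
        n1 = sum(1 for x, g in zip(times, in_high) if g and x >= t)  # at risk, high
        d = sum(1 for x, e in zip(times, events) if e and x == t)  # events at t
        d1 = sum(1 for x, e, g in zip(times, events, in_high) if g and e and x == t)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_cutoff(scores, times, events, min_group=2):
    """Return (cutoff, statistic); patients with score > cutoff are high risk."""
    best = (None, -1.0)
    for c in sorted(set(scores))[:-1]:
        high = [s > c for s in scores]
        k = sum(high)
        if k < min_group or len(scores) - k < min_group:
            continue  # skip cutoffs that leave a group too small
        stat = logrank_chi2(times, events, high)
        if stat > best[1]:
            best = (c, stat)
    return best


# Hypothetical cohort: the four high scorers die early, the others late.
toy_scores = [0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3]
toy_times = [50, 60, 55, 70, 5, 6, 4, 7]
toy_events = [1, 1, 1, 1, 1, 1, 1, 1]  # 1 = death observed, 0 = censored
cutoff, statistic = best_cutoff(toy_scores, toy_times, toy_events)
```

  • Unlike a median split, the cutoff is driven by where the survival curves separate most. Because many cutoffs are tried, the P value of the chosen split is optimistic unless it is corrected for the selection.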
  • the embodiments of the present application use the transcriptome data of all patients with the same tumor type to screen for prognosis-related genes. Because the transcriptome data includes the expression levels of all genes in the patient, all prognosis-related genes can be retained to the greatest extent, without missing other potential prognosis-related genes. When calculating the prognosis-related genes, the influence of the different methods for obtaining high and low expression populations on the final determination is fully considered, and the results obtained with the common grouping methods are combined, so that the resulting prognosis-related genes are more accurate.
  • the lasso-penalized Cox proportional hazards model is run multiple times, and the final Lambda value is determined by the minimum mean standard error value of the results, so that the prognosis-related genes are screened and molecular markers are obtained with a lower error rate. Finally, based on the distribution of patient scores, statistical methods are used to automatically classify patients into high and low risk levels, which respects the true distribution of the data and is more reasonable.
  • FIG. 2 is a flowchart of a method for classifying tumor risk levels according to the second embodiment of the present application.
  • This embodiment applies the present application to the risk level classification of melanoma patients, and specifically includes the following steps:
  • S210 Preprocess the transcriptome data: apply a log2 transformation to the gene expression value of each gene in the transcriptome data, retain the genes that are expressed in more than 90% of the patients, and obtain clinical survival status information such as the ID number, survival status, and survival time of all patients;
  • S220 Use the survival tool on the statistical computing platform R to calculate the prognosis association of all genes, output the regression coefficient and P value of each gene, and filter all genes using the P-value threshold of 0.01 as the significance condition, yielding 1086 prognosis-related genes; then extract the gene expression values of each prognosis-related gene in all patients to generate a prognosis-related gene expression matrix;
  • S230 Combining the clinical survival status information and the prognosis-related gene expression matrix, run the glmnet tool on the statistical computing platform R in a loop 100 times, calculate the minimum mean standard error value to determine the final Lambda value, and screen out 21 molecular markers based on that Lambda value;
  • S240 Extract the gene expression values of each molecular marker in all patients to build a molecular marker expression matrix, and calculate the score value of each molecular marker in each patient according to the regression coefficient of each molecular marker and its gene expression value in each patient.
  • the survival tool on the statistical computing platform R was used to evaluate the risk level classification results for the melanoma patients; the P value was less than 0.001, indicating significant differences between the risk levels, that is, the risk level classification results are reasonable.
  • FIG. 3 is a schematic structural diagram of a tumor risk grade classification system according to an embodiment of the present application.
  • the tumor risk level classification system 40 according to the embodiment of the present application includes:
  • Data acquisition module 41 used to acquire transcriptome data and clinical survival status information of the patient; wherein, the transcriptome data at least includes the gene expression value of each gene in the patient;
  • Prognosis-related gene acquisition module 42 used to acquire the prognosis-related genes of the patient according to the transcriptome data and the clinical survival state information;
  • Molecular marker screening module 43 used to input the clinical survival status information and prognosis-related genes into a statistical model for molecular marker screening, and obtain the gene expression values of the screened molecular markers in the patient;
  • Risk level classification module 44 used to score the patient according to the molecular marker and the gene expression value, and to classify the patient into tumor risk levels according to the score distribution.
  • FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the above-described tumor risk level classification method.
  • the processor 51 is configured to execute program instructions stored in the memory 52 to control the classification of tumor risk levels.
  • the processor 51 may also be referred to as a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip with signal processing capability.
  • the processor 51 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium of this embodiment of the present application stores a program file 61 capable of implementing all the above methods, wherein the program file 61 may be stored in the above-mentioned storage medium in the form of a software product and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the various embodiments of the present invention.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.

Abstract

A tumor risk grading method and system, a terminal, and a storage medium. The method comprises: acquiring transcriptome data and clinical survival state information of patients; acquiring prognosis-associated genes of the patients according to the transcriptome data and the clinical survival state information; inputting the clinical survival state information and the prognosis-associated genes into a statistical model for molecular marker screening, and acquiring gene expression values of screened molecular markers in the patients; and scoring the patients according to the molecular markers and the gene expression values, and conducting tumor risk grading on the patients according to the score distribution. All prognosis-associated genes can be maximally retained, such that the obtained prognosis-associated genes have better accuracy, the acquisition of the molecular marker has a lower error rate, and risk grading results are more reasonable.

Description

A tumor risk grade classification method, system, terminal and storage medium
Technical Field
The present application belongs to the field of biomedical technology, and in particular relates to a tumor risk grade classification method, system, terminal and storage medium.
Background
Although treatment strategies for tumors now exist, tumor prognosis has remained poor. Because tumors are highly likely to progress rapidly and to metastasize, dividing tumor patients into different risk groups helps identify high-risk groups early and plays an important role in tumor diagnosis and in providing appropriate monitoring measures. Traditionally, clinical phenotypic characteristics such as clinical stage and metastasis status are used to assess patient risk, but this approach does not fully account for the intrinsic molecular-level differences between patients, so the risk assessment lacks accuracy and cannot provide a precise prognosis for tumor patients in practice.
In recent years, molecular markers associated with patient prognosis, derived from gene expression data, have been widely used. The current general method for dividing patients into risk groups based on gene expression levels mainly includes three steps:
1) Population grouping, for example dividing the population into a tumor group and a normal group, or a metastasis group and a non-metastasis group, and then obtaining the genes differentially expressed between groups according to gene expression levels;
2) Combining the patients' survival information, using survival analysis to obtain the genes associated with prognosis, and then using a statistical model to screen for prognosis-associated molecular markers;
3) Scoring each individual according to the gene expression levels of the molecular markers combined with the regression coefficients from the survival analysis, then ranking by score and using the median as the cutoff point to divide the population into different risk groups.
Technical Problem
In summary, the shortcomings of the existing risk group classification methods include:
1. Existing methods rely solely on simple population grouping to obtain differentially expressed genes, which ignores the true differential expression between populations: the molecular makeup of each individual patient is different, and a gene that is differentially expressed between groups may also vary greatly in expression within a group. In addition, some genes are not differentially expressed between the groups obtained by population grouping yet are still associated with prognosis, so existing methods may miss potential prognosis-associated genes.
2. Existing methods use a lasso-penalized Cox proportional hazards model when screening prognosis-associated genes. The model requires cross-validation to select a Lambda value, from which the molecular markers are then determined, but the results tend to be random: each run yields a different Lambda value.
3. When dividing the population by risk, existing methods use the median score as the cutoff point. This value is not necessarily the best point for dividing the scores, which can introduce errors into the final risk classification.
Technical Solution
The present application provides a tumor risk grade classification method, system, terminal and storage medium, aiming to solve, at least to a certain extent, one of the above technical problems in the prior art.
To solve the above problems, the present application provides the following technical solutions:
A tumor risk grade classification method, comprising the following steps:
obtaining the patient's transcriptome data and clinical survival status information, wherein the transcriptome data at least includes the gene expression value of each gene in the patient;
obtaining the prognosis-related genes of the patient according to the transcriptome data and the clinical survival status information;
inputting the clinical survival status information and the prognosis-related genes into a statistical model for molecular marker screening, and obtaining the gene expression values of the screened molecular markers in the patient;
scoring the patient according to the molecular markers and the gene expression values, and classifying the patient into tumor risk levels according to the score distribution.
本申请实施例采取的技术方案还包括:所述获取患者的转录组数据以及临床生存状态信息包括:The technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's transcriptome data and clinical survival status information includes:
采用统计学对所述转录组数据中每一个基因的基因表达值分别进行对数转换,剔除低表达水平的基因,保留在超过设定比例的患者中存在表达的基因;然后从所述转录组数据中获取所述患者的临床生存状态信息;The gene expression value of each gene in the transcriptome data was logarithmically transformed by statistics, the genes with low expression levels were eliminated, and the genes that were expressed in more than a set proportion of patients were retained; Obtain the clinical survival status information of the patient from the data;
所述临床生存状态信息包括患者ID号、生存状态以及生存时间。The clinical survival status information includes patient ID number, survival status and survival time.
本申请实施例采取的技术方案还包括:所述根据所述转录组数据以及临床生存状态信息获取所述患者的预后关联基因包括:The technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information includes:
对每个基因在所述患者中的基因表达值进行排序,分别获取每个基因对应的高低表达人群;Sort the gene expression values of each gene in the patient, and obtain the high and low expression populations corresponding to each gene respectively;
结合所述临床生存状态信息和高低表达人群分别计算每个基因的表达水平高低与预后的关联情况,输出每个基因的回归系数和P值,根据P值大小对所述患者的所有基因进行过滤,得到所述患者的预后关联基因,并提取每个预后关联基因在所述患者中的基因表达值,生成预后关联基因表达矩阵。Calculate the correlation between the expression level of each gene and prognosis in combination with the clinical survival status information and the high and low expression population, output the regression coefficient and P value of each gene, and filter all the genes of the patient according to the size of the P value , obtain the prognosis-related genes of the patient, and extract the gene expression value of each prognosis-related gene in the patient to generate a prognosis-related gene expression matrix.
本申请实施例采取的技术方案还包括:所述对所述对每个基因在所述患者中的基因表达值进行排序,分别获取每个基因对应的高低表达人群包括:The technical solution adopted in the embodiment of the present application further includes: the sorting of the gene expression values of each gene in the patient, and obtaining the high and low expression populations corresponding to each gene respectively include:
分别采用二分法、三分法或四分法获取每个基因对应的高低表达人群。The high and low expression populations corresponding to each gene were obtained by the method of dichotomy, trichotomy, or quartile, respectively.
本申请实施例采取的技术方案还包括:所述根据所述转录组数据以及临床生存状态信息获取所述患者的预后关联基因包括:The technical solution adopted in the embodiment of the present application further includes: the obtaining of the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information includes:
基于不同算法获得的高低表达人群分别进行所述表达水平高低与预后关联情况的计算,如果某一基因在至少两种算法获得的高低表达人群中均与预后相关联,则判定所述基因为预后关联基因。Based on the high and low expression populations obtained by different algorithms, the relationship between the expression level and prognosis is calculated respectively. If a gene is associated with prognosis in the high and low expression populations obtained by at least two algorithms, the gene is determined as a prognosis. associated genes.
The technical solution adopted in the embodiments of the present application further includes: the inputting of the clinical survival status information and the prognosis-related genes into a statistical model for molecular marker screening includes:
the statistical model is a lasso-penalized Cox proportional hazards model; the model uses cross-validation to compute the Lambda value a set number of times, determines the final Lambda value according to the minimum mean standard error value over all Lambda values, and screens the molecular markers from the prognosis-related genes according to the final Lambda value.
The technical solution adopted in the embodiments of the present application further includes: the scoring of the patient according to the molecular markers and the gene expression levels includes:
multiplying the regression coefficient of each molecular marker by the gene expression value of that molecular marker in the patient to obtain the score of each molecular marker in the patient, and then adding up the scores of all molecular markers of the patient to obtain the patient's final score.
The technical solution adopted in the embodiments of the present application further includes: the assigning of tumor risk levels to the patient according to the score distribution includes:
analyzing the relationship between the distribution of the patient's scores and the clinical survival status information, performing a maximally selected test using statistical methods, calculating the value of the maximally selected log-rank statistic, and using this value as the cut-off point to assign the patient a risk level.
Another technical solution adopted in the embodiments of the present application is a tumor risk grading system, including:
a data acquisition module, configured to acquire the patient's transcriptome data and clinical survival status information, wherein the transcriptome data includes at least the gene expression value of each gene in the patient;
a prognosis-related gene acquisition module, configured to obtain the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information;
a molecular marker screening module, configured to input the clinical survival status information and the prognosis-related genes into a statistical model for molecular marker screening, and to obtain the gene expression values of the screened molecular markers in the patient;
a risk grading module, configured to score the patient according to the molecular markers and the gene expression values, and to assign the patient a tumor risk level according to the score distribution.
Yet another technical solution adopted in the embodiments of the present application is a terminal, the terminal including a processor and a memory coupled to the processor, wherein
the memory stores program instructions for implementing the tumor risk grading method;
the processor is configured to execute the program instructions stored in the memory to control tumor risk grading.
Yet another technical solution adopted in the embodiments of the present application is a storage medium storing program instructions executable by a processor, the program instructions being used to execute the tumor risk grading method.
Beneficial effects
Compared with the prior art, the beneficial effects of the embodiments of the present application are as follows. The tumor risk grading method obtains the patients' prognosis-related genes from their transcriptome data, runs a statistical model multiple times on that basis to screen molecular markers from the prognosis-related genes, scores each patient by combining the screened molecular markers with the gene expression levels, and assigns risk levels according to the score distribution. Compared with the prior art, the embodiments of the present application retain as many prognosis-related genes as possible, so that the prognosis-related genes obtained are more accurate, the molecular markers are obtained with a lower error rate, and the risk grading results are more reasonable.
Description of drawings
FIG. 1 is a flowchart of the tumor risk grading method according to the first embodiment of the present application;
FIG. 2 is a flowchart of the tumor risk grading method according to the second embodiment of the present application;
FIG. 3 is a schematic structural diagram of the tumor risk grading system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the terminal according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the storage medium according to an embodiment of the present application.
Embodiments of the present invention
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and not to limit it.
To address the deficiencies of the prior art, the tumor risk grading method of the embodiments of the present application uses the transcriptome data of all study patients, combined with their clinical survival information, to obtain the patients' prognosis-related genes. On that basis, a statistical model is run multiple times to screen molecular markers from the prognosis-related genes, each patient is scored by combining the screened molecular markers with the gene expression levels, and the patients are assigned risk levels according to the score distribution.
Specifically, refer to FIG. 1, which is a flowchart of the tumor risk grading method according to the first embodiment of the present application. The tumor risk grading method of the first embodiment includes the following steps:
S100: Obtain the transcriptome data of all patients with the same tumor type;
In this step, the transcriptome data is input as a matrix in which each row is a single patient and each column is a gene; the value in each cell is the expression level (that is, the gene expression value) of that gene in that patient.
S110: Preprocess the transcriptome data and obtain the clinical survival status information of all patients;
In this step, the preprocessing specifically includes: first, applying a log transformation, using statistical methods, to the gene expression value of each gene in the transcriptome data, removing genes with low expression levels, and retaining only the genes that are expressed in more than a set proportion of patients (this embodiment sets the proportion to 90%; it can be set according to actual operation); then obtaining clinical survival status information such as the ID number, survival status, and survival time of all patients from the transcriptome data. The clinical survival status information is likewise stored as a matrix, in which each row corresponds to the ID number of a single patient and the columns hold that patient's survival status and survival time.
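The preprocessing in S110 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `preprocess`, the nested-dictionary layout, the pseudocount of 1 added before taking log2, and the reading of "expressed" as a raw value greater than zero are all assumptions made for the example.

```python
import math


def preprocess(expr, min_fraction=0.9, pseudocount=1.0):
    """Log-transform expression values and drop rarely expressed genes.

    expr: dict mapping patient ID -> dict mapping gene name -> raw expression.
    A gene is kept only if its raw value is > 0 (assumed meaning of
    "expressed") in more than `min_fraction` of the patients.
    """
    patients = list(expr)
    genes = sorted({g for row in expr.values() for g in row})
    kept = []
    for g in genes:
        n_expressed = sum(1 for p in patients if expr[p].get(g, 0.0) > 0)
        if n_expressed / len(patients) > min_fraction:
            kept.append(g)
    # log2(x + pseudocount) keeps zero counts finite (assumed convention).
    return {
        p: {g: math.log2(expr[p].get(g, 0.0) + pseudocount) for g in kept}
        for p in patients
    }
```

With two hypothetical patients, a gene expressed in both is kept and transformed, while a gene expressed in only one of them is dropped at the 90% threshold.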
S120: Obtain the high- and low-expression populations corresponding to each gene according to the gene expression levels;
In this step, the algorithms for obtaining the high- and low-expression populations include, but are not limited to, dichotomy, trichotomy, and the quartile method. The algorithms are as follows:
1) Dichotomy: sort the expression values of each gene across all patients from high to low, and use the mean or median of those expression values as the cut-off point to obtain the high- and low-expression populations of each gene;
2) Trichotomy: sort the expression values of each gene across all patients from high to low, and take the top third and the bottom third of the ranked patients as the high- and low-expression populations of the corresponding gene;
3) Quartile method: sort the expression values of each gene across all patients from high to low, use the quartiles as cut-off points, and take the highest quartile and the lowest quartile as the high- and low-expression populations of the corresponding gene.
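The three grouping schemes above can be sketched as one helper. This is an illustrative sketch under assumptions: the function name `split_high_low` and the method labels are hypothetical, ties at the cut-off are placed in the low group for the dichotomy, and the quartile boundaries come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quantile definition used in the application.

```python
import statistics


def split_high_low(values, method="median"):
    """Return (high_ids, low_ids) for one gene.

    values: dict mapping patient ID -> expression value of this gene.
    method: "median" or "mean" (dichotomy), "tertile", or "quartile".
    """
    ranked = sorted(values, key=values.get)  # patient IDs, ascending by value
    n = len(ranked)
    if method in ("median", "mean"):
        center = statistics.median if method == "median" else statistics.mean
        cut = center(values.values())
        high = [p for p in ranked if values[p] > cut]
        low = [p for p in ranked if values[p] <= cut]
    elif method == "tertile":
        k = n // 3  # size of the top and bottom thirds
        low, high = ranked[:k], ranked[n - k:]
    elif method == "quartile":
        q1, _, q3 = statistics.quantiles(values.values(), n=4)
        low = [p for p in ranked if values[p] <= q1]
        high = [p for p in ranked if values[p] >= q3]
    else:
        raise ValueError(f"unknown method: {method}")
    return high, low
```

Note that the tertile and quartile schemes leave the middle patients out of both groups, so the downstream survival comparison runs on a subset of the cohort.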
S130: Combine the clinical survival status information with the high- and low-expression populations of each gene to calculate the association between each gene's expression level and prognosis, obtain the prognosis-related genes, extract the gene expression value of each prognosis-related gene in all patients, and generate a prognosis-related gene expression matrix;
In the embodiments of the present application, the survival package on the statistical computing platform R is used to calculate the prognostic association of all genes and to output the regression coefficient and P value of each gene. All genes are then filtered using a P value below a set threshold as the significance criterion (this embodiment sets the threshold to 0.01; it can be set according to actual operation), which yields the prognosis-related genes. Because different algorithms produce different high- and low-expression populations, the resulting prognosis-related genes also differ. In practice, the embodiments of the present application therefore run the calculation separately on the populations obtained by each algorithm; if a gene is associated with prognosis in the high- and low-expression populations of at least two algorithms, it is determined to be a prognosis-related gene.
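The "at least two algorithms" rule can be sketched as a simple vote over per-method P values. The sketch assumes the P values have already been computed elsewhere (for example by a Cox regression per gene and grouping method); the function name `prognosis_genes`, the method labels, and the gene names in the usage example are hypothetical.

```python
def prognosis_genes(p_values_by_method, alpha=0.01, min_methods=2):
    """Select genes that are significant under at least `min_methods` groupings.

    p_values_by_method: dict mapping a grouping method name -> dict mapping
    gene name -> P value of the expression/prognosis association.
    """
    votes = {}
    for pvals in p_values_by_method.values():
        for gene, p in pvals.items():
            if p < alpha:  # significance threshold from the embodiment (0.01)
                votes[gene] = votes.get(gene, 0) + 1
    return sorted(g for g, v in votes.items() if v >= min_methods)


# Usage with made-up P values for three hypothetical genes.
example = {
    "dichotomy": {"G1": 0.001, "G2": 0.200, "G3": 0.004},
    "trichotomy": {"G1": 0.008, "G2": 0.003, "G3": 0.050},
    "quartile": {"G1": 0.030, "G2": 0.500, "G3": 0.002},
}
selected = prognosis_genes(example)
```

Here `G1` and `G3` each pass the threshold under two groupings and are kept, while `G2` passes under only one and is dropped.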
S140: Based on the clinical survival status information and the prognosis-related gene expression matrix, screen molecular markers from the prognosis-related genes using a statistical model, obtain the gene expression value of each molecular marker in each patient, and generate a molecular marker expression matrix;
In this step, a lasso-penalized Cox proportional hazards model is used to screen the molecular markers. The model uses cross-validation to select the Lambda value, and the corresponding molecular markers are determined according to that Lambda value. Because the Lambda value obtained differs from one run of the model to the next, the screened molecular markers are subject to considerable randomness. The embodiments of the present application therefore run the model in a loop for a set number of times (for example, 100 or 1000 times; this can be set according to actual operation), determine the final Lambda value according to the minimum mean standard error value over all Lambda values, and screen the molecular markers from the prognosis-related genes according to that Lambda value. The gene expression values of all molecular markers in all patients are then extracted to build the molecular marker expression matrix.
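The repeated cross-validation step can be read in more than one way. The sketch below assumes one plausible reading: every run evaluates the same grid of candidate Lambda values, the cross-validation error of each candidate is averaged over the runs, and the candidate with the smallest mean error is chosen. Fitting the penalized Cox model itself is out of scope here; the function name `select_lambda` and the input layout are assumptions.

```python
def select_lambda(runs):
    """Pick the Lambda with the smallest cross-validation error averaged over runs.

    runs: list of dicts, one per repeated run, each mapping a candidate
    Lambda value -> cross-validation error measured in that run.
    Returns (best_lambda, mean_error_at_best_lambda).
    """
    # Only compare candidates that were evaluated in every run.
    lambdas = set(runs[0])
    for run in runs[1:]:
        lambdas &= set(run)
    mean_err = {lam: sum(run[lam] for run in runs) / len(runs) for lam in lambdas}
    # Ties are broken in favor of the larger Lambda, i.e. the sparser model.
    best = min(mean_err, key=lambda lam: (mean_err[lam], -lam))
    return best, mean_err[best]
```

Averaging over runs smooths out the fold-assignment randomness that makes a single cross-validation run unstable, which is the motivation stated in the paragraph above.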
S150: Calculate the score of each molecular marker in each patient from the regression coefficient of the marker and its gene expression value in that patient, and then add up all the scores of each patient to obtain the patient's final score;
In this step, the regression coefficient of each molecular marker is first multiplied by its gene expression value in each patient to obtain the score of that marker in that patient; the scores of all molecular markers of a patient are then added up to give that patient's final score.
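The score described in S150 is a weighted sum: for each patient, the sum over markers of regression coefficient times expression value. A minimal sketch follows; the function name `risk_scores`, the data layout, and the marker names and numbers in the usage example are hypothetical.

```python
def risk_scores(coefs, expr):
    """Compute each patient's final score as sum(coefficient * expression).

    coefs: dict mapping marker gene -> regression coefficient.
    expr:  dict mapping patient ID -> dict mapping marker gene -> expression.
    """
    return {
        patient: sum(coef * row[gene] for gene, coef in coefs.items())
        for patient, row in expr.items()
    }


# Usage with two made-up markers and two made-up patients.
coefs = {"G1": 0.5, "G2": -1.0}
expr = {
    "P1": {"G1": 4.0, "G2": 1.0},  # 0.5 * 4.0 + (-1.0) * 1.0
    "P2": {"G1": 2.0, "G2": 3.0},  # 0.5 * 2.0 + (-1.0) * 3.0
}
scores = risk_scores(coefs, expr)
```

A marker with a negative coefficient lowers the score as its expression rises, so higher scores correspond to higher predicted hazard under the Cox model.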
S160: Combine the score distribution of the patients with the clinical survival status information to divide the tumor patients into high and low risk levels;
In this step, the survminer package on the statistical computing platform R is used for risk grading. First, the patients' scores are sorted from high to low and, together with the clinical survival status information, the relationship between the score distribution and the clinical survival status information is examined. Then a maximally selected test is performed using statistical methods, the value of the maximally selected log-rank statistic is calculated, and this value is used as the cut-off point to divide the patients into different risk levels. Finally, the survival package is used to evaluate the risk grading result: the P value is used to judge whether the survival probabilities of the high- and low-risk levels differ significantly, and a significant difference indicates that the risk grading result is reasonable.
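The cut-off search can be illustrated with a plain two-group log-rank test scanned over candidate score thresholds. This is a simplified stand-in, not the procedure used by the R packages named above: it maximizes the ordinary log-rank chi-square statistic, it reads the cut-off as the score at which that statistic peaks, and the function names and the `min_group` parameter are assumptions. The P value of an optimized cut-off is optimistic and is shown only to mirror the evaluation step.

```python
import math


def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)  # at risk overall
        n1 = sum(1 for ti, h in zip(times, in_high) if ti >= t and h)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, h in zip(times, events, in_high)
                 if ti == t and e and h)
        o_minus_e += d1 - d * n1 / n  # observed minus expected in high group
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0


def best_cutpoint(scores, times, events, min_group=2):
    """Return (cutoff, statistic) for the split with the largest statistic.

    Patients with score > cutoff form the high-risk group; splits leaving
    fewer than `min_group` patients on either side are skipped.
    """
    best = (None, -1.0)
    for cut in sorted(set(scores))[:-1]:
        in_high = [s > cut for s in scores]
        n_high = sum(in_high)
        if min(n_high, len(scores) - n_high) < min_group:
            continue
        stat = logrank_chi2(times, events, in_high)
        if stat > best[1]:
            best = (cut, stat)
    return best


def chi2_p_value(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))
```

In a toy cohort where the four highest-scoring patients all die before any of the four lowest-scoring ones, the scan places the cut-off between the two clusters.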
Based on the above, the embodiments of the present application screen prognosis-related genes using the transcriptome data of all patients with the same tumor type. Because the transcriptome data covers the expression levels of all of the patients' genes, as many prognosis-related genes as possible are retained and potential prognosis-related genes are not missed. When calculating the prognosis-related genes, the influence of the different methods of obtaining high- and low-expression populations on the final determination is fully taken into account: the results from the common grouping methods are considered together, which makes the resulting prognosis-related genes more accurate. The lasso-penalized Cox proportional hazards model is run multiple times and the final Lambda value is determined by the minimum mean standard error value of the results, so that the molecular markers are screened from the prognosis-related genes with a lower error rate. Finally, the patients are automatically divided into high and low risk levels by statistical methods based on the distribution of their scores, which respects the true distribution of the data and is therefore more reasonable.
Refer to FIG. 2, which is a flowchart of the tumor risk grading method according to the second embodiment of the present application. This embodiment applies the present application to the risk grading of melanoma patients and specifically includes the following steps:
S200: Download the transcriptome data of melanoma patients from the ICGC database;
S210: Preprocess the transcriptome data: apply a log2 transformation to the gene expression value of each gene in the transcriptome data, retain the genes that are expressed in more than 90% of the patients, and obtain clinical survival status information such as the ID number, survival status, and survival time of all patients;
S220: Use the survival package on the statistical computing platform R to calculate the prognostic association of all genes, output the regression coefficient and P value of each gene, and filter all genes using P < 0.01 as the significance criterion, which yields 1086 prognosis-related genes; then extract the gene expression value of each prognosis-related gene in all patients to generate a prognosis-related gene expression matrix;
S230: Combining the clinical survival status information and the prognosis-related gene expression matrix, run the glmnet package on the statistical computing platform R in a loop 100 times, determine the final Lambda value from the minimum mean standard error value, and screen out 21 molecular markers according to that Lambda value;
S240: Extract the gene expression value of each molecular marker in all patients to build a molecular marker expression matrix, calculate the score of each molecular marker in each patient from the regression coefficient of the marker and its gene expression value in that patient, and divide the melanoma patients into high and low risk levels by statistical methods according to each patient's score;
The risk grading result for the melanoma patients was evaluated with the survival package on the statistical computing platform R. The P value was less than 0.001, indicating a significant difference between the risk levels, that is, the risk grading result is reasonable.
Refer to FIG. 3, which is a schematic structural diagram of the tumor risk grading system according to an embodiment of the present application. The tumor risk grading system 40 of this embodiment includes:
a data acquisition module 41, configured to acquire the patient's transcriptome data and clinical survival status information, wherein the transcriptome data includes at least the gene expression value of each gene in the patient;
a prognosis-related gene acquisition module 42, configured to obtain the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information;
a molecular marker screening module 43, configured to input the clinical survival status information and the prognosis-related genes into a statistical model for molecular marker screening, and to obtain the gene expression values of the screened molecular markers in the patient;
a risk grading module 44, configured to score the patient according to the molecular markers and the gene expression values, and to assign the patient a tumor risk level according to the score distribution.
Refer to FIG. 4, which is a schematic structural diagram of the terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the tumor risk grading method described above.
The processor 51 is configured to execute the program instructions stored in the memory 52 to control tumor risk grading.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Refer to FIG. 5, which is a schematic structural diagram of the storage medium according to an embodiment of the present application. The storage medium of this embodiment stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium in the form of a software product and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined in this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

  1. A tumor risk grading method, characterized by comprising the following steps:
    obtaining a patient's transcriptome data and clinical survival status information, wherein the transcriptome data includes at least the gene expression value of each gene in the patient;
    obtaining the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information;
    inputting the clinical survival status information and the prognosis-related genes into a statistical model for molecular marker screening, and obtaining the gene expression values of the screened molecular markers in the patient;
    scoring the patient according to the molecular markers and the gene expression values, and assigning the patient a tumor risk level according to the score distribution.
  2. The tumor risk grading method according to claim 1, characterized in that the obtaining of the patient's transcriptome data and clinical survival status information comprises:
    applying a logarithmic transformation, using statistical methods, to the gene expression value of each gene in the transcriptome data, removing genes with low expression levels, and retaining the genes that are expressed in more than a set proportion of patients; then obtaining the clinical survival status information of the patient from the transcriptome data;
    the clinical survival status information includes the patient ID number, survival status, and survival time.
  3. The tumor risk grading method according to claim 2, characterized in that the obtaining of the patient's prognosis-related genes according to the transcriptome data and the clinical survival status information comprises:
    sorting the gene expression values of each gene across the patients, and obtaining the high-expression and low-expression populations corresponding to each gene;
    combining the clinical survival status information with the high- and low-expression populations to calculate, for each gene, the association between its expression level and prognosis; outputting the regression coefficient and P value of each gene; filtering all genes of the patient according to the P value to obtain the patient's prognosis-related genes; and extracting the gene expression value of each prognosis-related gene in the patient to generate a prognosis-related gene expression matrix.
  4. 根据权利要求3所述的肿瘤风险等级划分方法,其特征在于,所述对对每个基因在所述患者中的基因表达值进行排序,分别获取每个基因对应的高低表达人群包括:The method for classifying tumor risk levels according to claim 3, wherein the sorting of the gene expression values of each gene in the patient, and obtaining the high and low expression populations corresponding to each gene respectively comprises:
    分别采用二分法、三分法或四分法获取每个基因对应的高低表达人群。The high and low expression populations corresponding to each gene were obtained by the method of dichotomy, trichotomy, or quartile, respectively.
  5. 根据权利要求4所述的肿瘤风险等级划分方法,其特征在于,所述根据所述转录组数据以及临床生存状态信息获取所述患者的预后关联基因包括:The method for classifying tumor risk levels according to claim 4, wherein the obtaining of the patient's prognosis-related genes according to the transcriptome data and clinical survival status information comprises:
    基于不同算法获得的高低表达人群分别进行所述表达水平高低与预后关联情况的计算,如果某一基因在至少两种算法获得的高低表达人群中均与预后相关联,则判定所述基因为预后关联基因。Based on the high and low expression populations obtained by different algorithms, the relationship between the expression level and prognosis is calculated respectively. If a gene is associated with prognosis in the high and low expression populations obtained by at least two algorithms, the gene is determined as a prognosis. associated genes.
  6. 根据权利要求3所述的肿瘤风险等级划分方法,其特征在于,所述将所述临床生存状态信息和预后关联基因输入统计模型进行分子标志物筛选包括:The method for classifying tumor risk levels according to claim 3, wherein the inputting the clinical survival status information and prognosis-related genes into a statistical model for molecular marker screening comprises:
    所述统计模型为lasso惩罚Cox比例风险模型,所述模型采用交叉验证进行设定次数的Lambda值运算,并根据所有Lambda值的最小平均标准误差值确定最终的Lambda值,根据所述最终的Lambda值筛选所述预后关联基因的分子标志物。Described statistical model is lasso punishment Cox proportional hazards model, described model adopts cross-validation to carry out the Lambda value operation of the set number of times, and determines the final Lambda value according to the minimum mean standard error value of all Lambda values, according to the final Lambda value Value screening of molecular markers of the prognosis-related genes.
  7. The tumor risk grading method according to claim 6, wherein scoring the patients according to the molecular markers and the gene expression levels comprises:
    multiplying the regression coefficient of each molecular marker by the gene expression value of that molecular marker in the patient to obtain the score of each molecular marker in the patient, and then summing the scores of all molecular markers of the patient to obtain the final score of the patient.
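For illustration only, the scoring of claim 7 is a weighted sum: each marker's regression coefficient times that marker's expression value in the patient, added over all markers. A minimal sketch follows; the gene names and numbers in the usage note are hypothetical.

```python
from typing import Dict


def risk_score(coefficients: Dict[str, float], expression: Dict[str, float]) -> float:
    """Sum of (regression coefficient x expression value) over the selected markers."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())
```

For a patient with expression values `{"GENE_A": 4.0, "GENE_B": 2.0}` and coefficients `{"GENE_A": 0.5, "GENE_B": -0.25}`, the score is 0.5 x 4.0 + (-0.25) x 2.0 = 1.5. Genes that are not markers are ignored even if present in the expression data.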
  8. The tumor risk grading method according to any one of claims 1 to 7, wherein dividing the patients into tumor risk levels according to the score distribution comprises:
    analyzing the relationship between the distribution of the patients' scores and the clinical survival status information, performing a maximally selected test using a statistical method, calculating the value of the maximally selected log-rank statistic, and using that value as the cut-off point to divide the patients into risk levels.
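For illustration only, the cut-off search of claim 8 can be sketched as follows. The sketch assumes the cut-off is the score at which a two-group log-rank statistic is largest, with each group required to contain a minimum number of patients (`min_group`, an assumed parameter). It computes only the statistic and the cut-off, not the adjusted p-value that a full maximally selected rank test would report. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def logrank_chi2(times: Sequence[float], events: Sequence[int], in_high: Sequence[bool]) -> float:
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    diff, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)                            # at risk, both groups
        n1 = sum(1 for ti, h in zip(times, in_high) if ti >= t and h)    # at risk, high group
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)      # events at t
        d1 = sum(1 for ti, e, h in zip(times, events, in_high) if ti == t and e and h)
        diff += d1 - d * n1 / n                                          # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)       # hypergeometric variance
    return diff * diff / var if var > 0 else 0.0


def best_cutpoint(
    scores: Sequence[float], times: Sequence[float], events: Sequence[int], min_group: int = 2
) -> Tuple[Optional[float], float]:
    """Return (cut-off score, statistic) maximizing the log-rank statistic."""
    best_cut, best_stat = None, 0.0
    for cut in sorted(set(scores))[:-1]:          # the largest score would leave "high" empty
        in_high = [s > cut for s in scores]
        n_high = sum(in_high)
        if n_high < min_group or len(scores) - n_high < min_group:
            continue
        stat = logrank_chi2(times, events, in_high)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


def risk_levels(scores: Sequence[float], cut: float) -> List[str]:
    """Label patients above the cut-off as high risk, the rest as low risk."""
    return ["high" if s > cut else "low" for s in scores]
```

Scanning every observed score keeps the sketch simple; for large cohorts a dedicated statistics package would be the more practical choice.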
  9. A tumor risk grading system, comprising:
    a data acquisition module, configured to acquire transcriptome data and clinical survival status information of patients, wherein the transcriptome data includes at least the gene expression value of each gene in the patients;
    a prognosis-associated gene acquisition module, configured to obtain the prognosis-associated genes of the patients according to the transcriptome data and the clinical survival status information;
    a molecular marker screening module, configured to input the clinical survival status information and the prognosis-associated genes into a statistical model for molecular marker screening, and to obtain the gene expression values of the screened molecular markers in the patients;
    a risk grading module, configured to score the patients according to the molecular markers and the gene expression values, and to divide the patients into tumor risk levels according to the score distribution.
  10. A terminal, comprising a processor and a memory coupled to the processor, wherein:
    the memory stores program instructions for implementing the tumor risk grading method according to any one of claims 1 to 8;
    the processor is configured to execute the program instructions stored in the memory to control tumor risk grading.
  11. A storage medium storing program instructions executable by a processor, wherein the program instructions are used to execute the tumor risk grading method according to any one of claims 1 to 8.
PCT/CN2020/139298 2020-09-03 2020-12-25 Tumor risk grading method and system, terminal, and storage medium WO2022048071A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010915159.1 2020-09-03
CN202010915159.1A CN112117003A (en) 2020-09-03 2020-09-03 Tumor risk grading method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2022048071A1 (en) 2022-03-10

Family

ID=73805163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139298 WO2022048071A1 (en) 2020-09-03 2020-12-25 Tumor risk grading method and system, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN112117003A (en)
WO (1) WO2022048071A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112117003A (en) * 2020-09-03 2020-12-22 中国科学院深圳先进技术研究院 Tumor risk grading method, system, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017156594A1 (en) * 2016-03-18 2017-09-21 University Of Melbourne Use of laminins as biomarkers for cancer diagnosis and prognosis
CN110273002A (en) * 2019-07-18 2019-09-24 北京泱深生物信息技术有限公司 Application of the biomarker in melanoma metastasis diagnosis
CN111575376A (en) * 2020-05-14 2020-08-25 复旦大学附属肿瘤医院 Combined genome for evaluating kidney clear cell carcinoma prognosis and application thereof
CN112117003A (en) * 2020-09-03 2020-12-22 中国科学院深圳先进技术研究院 Tumor risk grading method, system, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202969B (en) * 2016-08-01 2018-10-23 东北大学 A kind of tumor cells parting forecasting system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115346656A (en) * 2022-06-10 2022-11-15 江门市中心医院 Three-omics IDC prognosis model establishing method and prognosis model system based on CAFs, WSIs and clinical information
CN115346656B (en) * 2022-06-10 2023-10-27 江门市中心医院 Three-group IDC prognosis model building method and prognosis model system based on CAFs, WSIs and clinical information
CN115620808A (en) * 2022-12-19 2023-01-17 广东工业大学 Cancer gene prognosis screening method and system based on improved Cox model
CN115620808B (en) * 2022-12-19 2023-03-31 广东工业大学 Cancer gene prognosis screening method and system based on improved Cox model

Also Published As

Publication number Publication date
CN112117003A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
WO2022048071A1 (en) Tumor risk grading method and system, terminal, and storage medium
Badsha et al. Imputation of single‐cell gene expression with an autoencoder neural network
CN112766428B (en) Tumor molecule typing method and device, terminal device and readable storage medium
CN110634563A (en) Differential diagnosis device for diabetic nephropathy and non-diabetic nephropathy
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN115938590B (en) Construction method and prediction system of colorectal cancer postoperative LARS prediction model
WO2021120587A1 (en) Method and apparatus for retina classification based on oct, computer device, and storage medium
WO2022011855A1 (en) False positive structural variation filtering method, storage medium, and computing device
Wang et al. Hybrid density-and partition-based clustering algorithm for data with mixed-type variables
Keenan et al. Cluster analysis and genotype–phenotype assessment of geographic atrophy in age-related macular degeneration: Age-Related Eye Disease Study 2 report 25
CN110739072A (en) Bleeding event occurrence evaluation method and system
Lee et al. Gene-gene interaction analysis for quantitative trait using cluster-based multifactor dimensionality reduction method
CN116705310A (en) Data set construction method, device, equipment and medium for perioperative risk assessment
CN110008972A (en) Method and apparatus for data enhancing
Jung et al. A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data
WO2018209704A1 (en) Sample source detection method, device, and storage medium based on dna sequencing data
Sharma et al. A Comparative Study of Data Mining, Digital Image Processing and Genetical Approach for Early Detection of Liver Cancer
CN111383716A (en) Method and device for screening gene pairs, computer equipment and storage medium
Berghout et al. Single subject transcriptome analysis to identify functionally signed gene set or pathway activity
CN117393171B (en) Method and system for constructing prediction model of LARS development track after rectal cancer operation
TW202029075A (en) Statistic performance evaluation method for different grouping sets
TWI817795B (en) Cancer progression discriminant method and system thereof
Tsai et al. Significance analysis of ROC indices for comparing diagnostic markers: applications to gene microarray data
Lee et al. Cluster-based multifactor dimensionality reduction method to identify gene-gene interactions for quantitative traits in genome-wide studies
KR20190126606A (en) IDENTIFYING METHOD FOR TUMOR PATIENT BASED ON miRNA IN EXOSOME AND APPARATUS FOR THE SAME

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20952312; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 20952312; Country of ref document: EP; Kind code of ref document: A1