CN115873956A - Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject - Google Patents

Publication number
CN115873956A
Authority
CN
China
Prior art keywords
clostridium
colorectal cancer
subject
bacteroides
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211720516.4A
Other languages
Chinese (zh)
Inventor
朱政农
许晓敏
殷晓晨
寇岩
谭验
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Weizhijun Biological Technology Co ltd
Original Assignee
Shenzhen Weizhijun Biological Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Weizhijun Biological Technology Co ltd
Priority to CN202211720516.4A
Publication of CN115873956A
Legal status: Pending

Abstract

The present application relates to a kit, a system, a use, and a modeling method of a prediction model for predicting a subject's risk of having colorectal cancer, in which the biomarkers comprise 25 species-level gut bacteria such as s__Gemella_morbillorum. With the kit, the system, the use of a reagent for detecting presence information of the biomarkers, and the modeling method of the prediction model, whether a subject is at risk of having colorectal cancer can be predicted efficiently and accurately by detecting the presence information of the 25 species-level gut bacteria, in a manner that is minimally invasive to the subject and compatible with existing colorectal cancer-related medical procedures. This includes accurately ruling out low colorectal cancer risk, thereby avoiding further surgical intervention, and identifying high colorectal cancer risk in time so that the subject does not miss the optimal window for treatment such as surgical intervention.

Description

Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject
Technical Field
The application relates to the field of biomedicine, and in particular to a kit, a system, a use, and a modeling method of a prediction model for predicting a subject's risk of having colorectal cancer.
Background
Colorectal cancer refers to malignant tumors occurring in the colon and rectum. With changes in lifestyle and dietary structure, it has become the second most common malignant tumor in China. In recent years its prevalence has not declined, and patients are increasingly young. The disease has no symptoms, or no obvious symptoms, in the early stage, so it is typically detected late; it is highly malignant, progresses rapidly, and has a very poor prognosis.
Traditional colorectal cancer screening and diagnosis rely on colonoscopy. The time cost, the bowel preparation required before the examination, and the invasive nature of the procedure make patients reluctant to undergo screening. Predicting colorectal cancer risk from a subject's stool sample is therefore an important research direction. In the prior art, however, the constructed prediction models are not accurate enough and have not been validated on large population samples, so the predicted risk is often inaccurate and of limited clinical value.
Therefore, a model and a method that can predict colorectal cancer risk accurately, painlessly, noninvasively, conveniently, and quickly are urgently needed.
Disclosure of Invention
It is an object of the present application to provide a biomarker combination that can be used to predict a subject's risk of having colorectal cancer.
It is a further object of the present application to provide a kit for predicting a subject's risk of having colorectal cancer.
It is a further object of the present application to provide a system for predicting a subject's risk of having colorectal cancer.
It is still another object of the present application to provide a use of a reagent for detecting biomarker presence information in the manufacture of a kit for predicting a subject's risk of having colorectal cancer.
It is a further object of the present application to provide a modeling method for a predictive model for predicting a subject's risk of having colorectal cancer.
The above kit, system, use of a reagent for detecting biomarker presence information, and modeling method of a prediction model for predicting a subject's risk of having colorectal cancer are intended to predict, efficiently and accurately, whether a subject is at risk of having colorectal cancer, in a manner that is minimally invasive to the subject and compatible with existing colorectal cancer-related medical procedures. This includes accurately ruling out low colorectal cancer risk, thereby avoiding further surgical intervention, and identifying high colorectal cancer risk in time so that the subject does not miss the optimal window for treatment such as surgical intervention.
The 25 biomarkers are: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
In yet another aspect, the application provides a system for predicting a subject's risk of having colorectal cancer, the system comprising a processor and a display, the processor being configured to: obtain presence information of the following 25 biomarkers of the subject: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum; predict a risk parameter of the subject having colorectal cancer based on the obtained presence information of the 25 biomarkers; and cause the display to present the risk parameter of the subject having colorectal cancer.
In yet another aspect, the application provides a modeling method for a prediction model of a subject's risk of having colorectal cancer, the modeling method comprising the following steps performed by a processor: obtaining a first annotated intestinal flora structure data set, comprising data of patients with colorectal cancer and data of healthy persons; filtering out intestinal flora whose relative abundance in the first annotated intestinal flora structure data set is lower than a first threshold, to obtain a second annotated intestinal flora structure data set; dividing the second annotated intestinal flora structure data set into a training set and a validation set; performing feature coding on the second annotated intestinal flora structure data set using a gradient boosting decision tree model, performing hyper-parameter tuning on the gradient boosting decision tree model through cross validation based on the training set and the validation set, and determining the importance of each feature; sorting the features in descending order of the determined importance, and training a colorectal cancer regression prediction model using the sorted features, the training set and the validation set, to obtain a trained colorectal cancer regression prediction model and a corresponding optimal feature combination; and determining an optimal combination of intestinal bacteria as an optimal biomarker combination based on the optimal feature combination, wherein the biomarkers in the optimal biomarker combination are at the species level. The trained colorectal cancer regression prediction model is configured to give a prediction of the subject's risk of having colorectal cancer based on the presence information of each biomarker in the optimal biomarker combination.
The present application establishes a correlation between 25 features (comprising 25 species-level gut bacteria as biomarkers) and a subject's risk of having colorectal cancer, together with a corresponding prediction model. Acquiring this feature information is minimally invasive to the subject and is well compatible with existing colorectal cancer-related medical procedures. In particular, using only a stool sample of the subject and the biomarker presence information obtained by detection, the application can efficiently and accurately predict whether the subject is at risk of having colorectal cancer. The prediction model has been validated on a large population sample. In validation, its negative prediction rate reaches 80.77%, which identifies samples of subjects without colorectal cancer risk, so that low colorectal cancer risk can be accurately ruled out and unnecessary further detection and treatment avoided; its positive prediction rate reaches 78.85%, which identifies samples with colorectal cancer risk, so that high colorectal cancer risk is identified in time and the subject does not miss the optimal window for treatment such as surgical intervention. The application can therefore predict and assess a subject's risk of having colorectal cancer accurately in clinical testing, in a painless, noninvasive, efficient, convenient, and low-cost manner, guiding clinical decisions and supporting individualized and precise diagnosis and treatment plans.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar parts throughout the different views. Like reference numerals having letter suffixes or different letter suffixes may represent different instances of similar components. The drawings illustrate various embodiments generally by way of example, and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. The same reference numbers will be used throughout the drawings to refer to the same or like parts, where appropriate. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present apparatus or method.
Fig. 1 shows a schematic diagram of a system for predicting a subject's risk of having colorectal cancer according to an embodiment of the application.
Fig. 2 shows a flow chart of a method for modeling a predictive model of a subject's risk of having colorectal cancer according to an embodiment of the application.
FIG. 3 shows a schematic diagram of the structure and training of a decision tree model using gradient boosting according to an embodiment of the present application.
FIG. 4 shows a schematic diagram of accuracy curves corresponding to different feature combinations according to an embodiment of the present application.
Fig. 5 shows the prediction of data in independent validation datasets by a trained colorectal cancer regression prediction model based on optimal biomarker combinations according to an embodiment of the present application.
Detailed Description
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Singular forms include plural forms unless the context clearly indicates otherwise.
While the invention is susceptible to various modifications and alternative forms, specific examples will be described and illustrated in detail below. It should be understood, however, that these are not intended to limit the invention to the particular disclosure, and the invention includes all modifications, equivalents, and alternatives thereto without departing from the spirit and technical scope of the invention.
The biomarkers are the following: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
In the present application, in the names of the above biomarkers, the prefix "s__" indicates that the biomarker is at the species level, the string that follows identifies the genus of the biomarker, and the string after the next "_" identifies the specific species within that genus. Specifically, the biomarkers are intestinal bacteria, all at the species level.
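The naming convention described above can be illustrated with a short parsing sketch. It assumes the prefix is written with two underscores, as in the names used in formula (1); the function name `parse_biomarker_name` is hypothetical and introduced here only for illustration.

```python
def parse_biomarker_name(name):
    """Split a species-level biomarker name into (genus, species).

    Assumes the form "s__<Genus>_<species>"; everything after the first
    underscore that follows the genus is treated as the species part.
    """
    prefix = "s__"
    if not name.startswith(prefix):
        raise ValueError(f"not a species-level name: {name!r}")
    genus, _, species = name[len(prefix):].partition("_")
    return genus, species


genus, species = parse_biomarker_name("s__Gemella_morbillorum")
```

Under this rule a multi-part name such as s__Dasheen_mosaic_virus yields "Dasheen" as the first field and "mosaic_virus" as the remainder.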
In some embodiments, the reagent is for detecting information on the presence of the biomarker contained in a sample comprising the subject's intestinal microbial flora, wherein the sample comprises a subject's intestinal tissue sample or fecal sample.
In some embodiments, the reagent may be, for example, qPCR primers for the biomarkers.
In some embodiments, the presence information of the biomarker may be obtained by a PCR reaction using the primers and genomic DNA of the intestinal microbial flora of the subject as templates.
In some embodiments, the presence of the biomarker can be detected by detecting the presence of a nucleotide sequence selected from the group consisting of seq id no:
1) Gemella morbillorum f: atacagttattctcgccatgagags, r: GGTTAGGTACCGTCTCTTACATG
2) Lactobacillus gasseri f: AATACTCCCCGAAGCACGTCA, r: TCATTGTGTTTGGCAATCGT
3) Parvimonas micra f: TCACAGTAGTCACAAGAGGAGAGGAT, r: GGGAAGCATTGGCGGAAA
4) Rothia dentocariosa f: GGGTTGTAAACCTCTGTTAGCATC, r: CGTACCCACTGCAAAACCAG
5) Bifidobacterium breve f: TCATCATCACGGCAAGGTCAAGA, r: GGCCAGAACAGCTGGAACAA
6) Lactococcus lactis f: CTGTCGTTTCTGTTATGAAT, r: GTGTATTCATCATAACCAAC
7) Clostridium clostridioforme f: GAAGTTTTTTCGGATGGAATCTTGA, r: CACCGAAGGCTTTGCC
8) Solobacterium moorei f: CTCAACCCAATCCAGCCACT, r: TATTGGCTCCCCACGGTTTC
9) Eggerthella lenta f: GAGTTTGATCCTGGCTCAG, r: ACGGCTACCTTGTTACGACTT
10) Fusobacterium nucleatum f: CAACCATTTACTTTAACTCTACCATGTTCA, r: GTTGACTTTACAGAAGGAGATTATGTAAAAATC
11) Haemophilus parainfluenzae f: GAGAGACTGCGGTAGTCGATCC, r: CCATCACTTGGTTTGATGCT
12) f: CTGATGCACACCACAAGTC, r: GGTCATGTCGTAGGGCTTGT
13) f: GGTTTATCCTTAGAAAGGAGGT, r: GAGCATTCGGTTGGGCACTCTAG
14) Clostridium bartlettii f: GTAAGCTCTTGAAACTGGAG, r: GAAAGATGCGATTAGGCATC
15) Bacteroides eggerthii f: CCCGATAGTAGTTAGTTTTCCGC, r: TCCTCTCAGAACCCCTATCCAT
16) Clostridium subterminale f: GCACAAGCAGCAGTGGAGT, r: CTTCCTCCGTTTTGTCA
17) Dasheen mosaic virus f: ATGGTHTGGTGYATHGARAAYGG, r: TGCTGCKGCYTTCATYTG
18) Bacteroides massiliensis f: GCGTTTCCG, r: CCATATTCGG
19) Bacteroides dorei f: AAGCGGCTTCAAGAAACAGG, r: GTGCCCTTTACCTTGGGAAC
20) Clostridium bolteae f: CCTCTTGACCGGCGTGT, r: CAGGTAGAGCTGGGCACTCTAGG
21) Akkermansia muciniphila f: CAGCACGTGAAGGTGGGGAC, r: CCTTGCGGTTGGCTTCAGAT
22) Eubacterium hallii f: GCGTAGGTGGCAGTGCAA, r: GCACCGRAGCCTATACGG
23) Anaerotruncus colihominis f: GGAGCTTACGTTTGAAGTTTC, r: CTGCTGCCTCCGTA
24) Desulfovibrio desulfuricans f: GGCATCTATAAGACCTCCTGTAGAC, r: TGTAGATCGTAGGTAGCAAATGTCG
25) Atopobium parvulum f: AGAGAGTTTGATCCTGGCTCAG, r: TGCGGCACGGAAGAAATACTCCCC.
In some embodiments, the 25 biomarkers can be identified using a prediction model for a subject's risk of having colorectal cancer, based on an annotated gut flora structure data set comprising colorectal cancer patients and healthy persons, as described in detail below in conjunction with Figs. 2-5.
The biomarkers are: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
In a third aspect, the present application provides a system for predicting a subject's risk of having colorectal cancer. Fig. 1 shows a schematic diagram of such a system according to embodiments of the present application.
As shown in Fig. 1, the system 100 includes a processor 101 and a display 102. In some embodiments, the processor 101 may be a processing device including one or more general-purpose processing devices, such as a microprocessor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or the like. More specifically, the processor 101 may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. In some embodiments, the processor 101 may also be one or more special-purpose processing devices, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a System on a Chip (SoC), or the like. In some embodiments, the display 102 may be, for example, an LED or OLED display, which is not described further herein.
In some embodiments, the processor 101 may be configured, for example, to obtain presence information of 25 biomarkers of the subject 103, wherein the 25 biomarkers comprise the following species-level gut bacteria: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
In some embodiments, the processor 101 may be further configured to predict a risk parameter of the subject having colorectal cancer based on the acquired presence information of the 25 biomarkers, and to cause the display 102 to present the risk parameter of the subject 103 having colorectal cancer, for example "risk of having colorectal cancer is 90%".
In some embodiments, the processor 101 may be further configured to: during clinical testing, present on the display 102 the predicted risk parameter, whether the predicted risk parameter is below a low-risk threshold, or whether it is above a high-risk threshold, and issue a corresponding prompt, such as a recommendation to perform colorectal cancer surgery. Taking a threshold of 50% as an example and referring to Fig. 1, if the display 102 shows that the risk parameter of the subject 103 having colorectal cancer is 90%, the risk of the subject 103 having colorectal cancer is extremely high, and this indicates further detection and treatment, including surgery.
In some embodiments, the presence information of the 25 biomarkers is obtained by testing a sample comprising the intestinal microbial flora of the subject 103. Specifically, the sample may include, for example, an intestinal tissue sample or a stool sample of the subject 103. Where no intestinal tissue sample is available from a previous examination, a stool sample, which can be collected conveniently and noninvasively, can be used. This reduces the reluctance of the subject 103, improves the cooperation of the subject 103 and the medical system with the full prediction workflow of the system, reduces the physician's workload, and facilitates adoption of this new prediction technique in the healthcare system.
Further, the processor 101, in predicting the risk parameter of the subject 103 having colorectal cancer based on the acquired presence information of the 25 biomarkers, may perform according to the following formula (1) and formula (2):
score=-0.2208191*s__Gemella_morbillorum_0+
1.92846425*s__Gemella_morbillorum_1+
-0.0287996*s__Lactobacillus_gasseri_0+
1.32441538*s__Lactobacillus_gasseri_1+
-0.1685278*s__Rothia_dentocariosa_0+
1.08832583*s__Rothia_dentocariosa_1+
0.29344154*s__Bifidobacterium_breve_0+
-1.0466365*s__Bifidobacterium_breve_1+
-0.4945794*s__Clostridium_clostridioforme_0+
0.64842325*s__Clostridium_clostridioforme_1+
-0.112998*s__Lactococcus_lactis_0+
0.7986691*s__Lactococcus_lactis_1+
-0.4837138*s__Alistipes_shahii_0+
0.06659259*s__Alistipes_shahii_1+
0.3887919*s__Clostridium_bartlettii_0+
-0.1883331*s__Clostridium_bartlettii_1+
-0.5495225*s__Fusobacterium_nucleatum_0+
0.4539216*s__Fusobacterium_nucleatum_1+
-0.4146932*s__Solobacterium_moorei_0+
0.55847062*s__Solobacterium_moorei_1+
0.12401521*s__Dasheen_mosaic_virus_0+
-0.3558551*s__Dasheen_mosaic_virus_1+
0.53506041*s__Haemophilus_parainfluenzae_0+
-0.3355559*s__Haemophilus_parainfluenzae_1+
-0.4494161*s__Clostridium_leptum_0+
9.07E-05*s__Clostridium_leptum_1+
-0.0830943*s__Bacteroides_eggerthii_0+
0.42819628*s__Bacteroides_eggerthii_1+
0.2788531*s__Bacteroides_massiliensis_0+
-0.3119645*s__Bacteroides_massiliensis_1+
-0.2841483*s__Bacteroides_dorei_0+
1.08E-05*s__Bacteroides_dorei_1+
-0.3334665*s__Granulicatella_adiacens_0+
0.47181429*s__Granulicatella_adiacens_1+
-0.1836578*s__Akkermansia_muciniphila_0+
0.21298044*s__Akkermansia_muciniphila_1+
-0.2561297*s__Parvimonas_micra_0+
1.09095746*s__Parvimonas_micra_1+
-0.140284*s__Atopobium_parvulum_0+
0*s__Atopobium_parvulum_1+
-0.1045567*s__Eubacterium_hallii_0+
0.21065861*s__Eubacterium_hallii_1+
-0.0373602*s__Desulfovibrio_desulfuricans_0+
0.16118485*s__Desulfovibrio_desulfuricans_1+
-0.022273*s__Eggerthella_lenta_0+
0.55154416*s__Eggerthella_lenta_1+
-0.1920693*s__Anaerotruncus_colihominis_0+
0.06595296*s__Anaerotruncus_colihominis_1+
-0.2559028*s__Clostridium_bolteae_0+
0.14338132*s__Clostridium_bolteae_1    formula (1)
[Formula (2): equation image in the original publication, Figure BDA0004029587010000101]
In formula (1), score is a composite score calculated from the presence of each biomarker and its weight coefficient; the suffix _0 denotes absence of the biomarker and _1 denotes presence, and the number preceding each term is the corresponding weight coefficient. In formula (2), p is the risk parameter of the subject 103 having colorectal cancer, calculated from the composite score.
In the above formula (1), each biomarker and its corresponding weight coefficient can be obtained by the modeling method of the prediction model for a subject's risk of having colorectal cancer, based on an annotated intestinal flora structure data set comprising colorectal cancer patients and healthy persons. The implementation of the modeling method is described in detail below with reference to Figs. 2 to 5.
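By way of illustration only, the composite score of formula (1) can be sketched in Python as follows. The weights are copied from formula (1); the names `WEIGHTS`, `composite_score` and `risk_parameter` are hypothetical. Because formula (2) is available only as an image in this text, the mapping from score to p shown here is an assumed logistic function, not a form confirmed by the application.

```python
import math

# (weight when absent "_0", weight when present "_1"), copied from formula (1).
WEIGHTS = {
    "s__Gemella_morbillorum": (-0.2208191, 1.92846425),
    "s__Lactobacillus_gasseri": (-0.0287996, 1.32441538),
    "s__Rothia_dentocariosa": (-0.1685278, 1.08832583),
    "s__Bifidobacterium_breve": (0.29344154, -1.0466365),
    "s__Clostridium_clostridioforme": (-0.4945794, 0.64842325),
    "s__Lactococcus_lactis": (-0.112998, 0.7986691),
    "s__Alistipes_shahii": (-0.4837138, 0.06659259),
    "s__Clostridium_bartlettii": (0.3887919, -0.1883331),
    "s__Fusobacterium_nucleatum": (-0.5495225, 0.4539216),
    "s__Solobacterium_moorei": (-0.4146932, 0.55847062),
    "s__Dasheen_mosaic_virus": (0.12401521, -0.3558551),
    "s__Haemophilus_parainfluenzae": (0.53506041, -0.3355559),
    "s__Clostridium_leptum": (-0.4494161, 9.07e-05),
    "s__Bacteroides_eggerthii": (-0.0830943, 0.42819628),
    "s__Bacteroides_massiliensis": (0.2788531, -0.3119645),
    "s__Bacteroides_dorei": (-0.2841483, 1.08e-05),
    "s__Granulicatella_adiacens": (-0.3334665, 0.47181429),
    "s__Akkermansia_muciniphila": (-0.1836578, 0.21298044),
    "s__Parvimonas_micra": (-0.2561297, 1.09095746),
    "s__Atopobium_parvulum": (-0.140284, 0.0),
    "s__Eubacterium_hallii": (-0.1045567, 0.21065861),
    "s__Desulfovibrio_desulfuricans": (-0.0373602, 0.16118485),
    "s__Eggerthella_lenta": (-0.022273, 0.55154416),
    "s__Anaerotruncus_colihominis": (-0.1920693, 0.06595296),
    "s__Clostridium_bolteae": (-0.2559028, 0.14338132),
}


def composite_score(present):
    """Formula (1): add one weight per biomarker.

    `present` maps biomarker name -> bool; a missing name counts as absent.
    For each biomarker exactly one of its _0/_1 indicators equals 1.
    """
    return sum(
        w1 if present.get(name, False) else w0
        for name, (w0, w1) in WEIGHTS.items()
    )


def risk_parameter(score):
    """ASSUMPTION: logistic mapping; formula (2) is not reproduced in this text."""
    return 1.0 / (1.0 + math.exp(-score))


detected = {"s__Fusobacterium_nucleatum": True, "s__Parvimonas_micra": True}
p = risk_parameter(composite_score(detected))
```

Storing the two weights of each biomarker as a pair makes explicit that the _0 and _1 terms are mutually exclusive indicators, so each biomarker contributes exactly one coefficient to the score.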
After calculating the risk parameter of the subject 103 suffering from colorectal cancer according to the formula (1) and the formula (2), the processor 101 may further determine whether the risk parameter exceeds a preset threshold range, thereby predicting the risk of the subject 103 suffering from colorectal cancer. For example only, the preset threshold range may be set to a single value of 0.5, in which case when the risk parameter p is greater than 0.5, it is indicative that the subject 103 is at risk of having colorectal cancer, and when the risk parameter p is less than or equal to 0.5, it may be considered that the subject 103 is not at risk of having colorectal cancer. In other embodiments, a multi-value threshold range may also be preset, for example, in a case where the risk parameter p is smaller than the low risk threshold, the risk of the subject 103 suffering from colorectal cancer is determined as low risk, in a case where the risk parameter p is larger than the high risk threshold, the risk of the subject 103 suffering from colorectal cancer is determined as high risk, and in a case where the risk parameter p is between the low risk threshold and the high risk threshold, the risk of the subject 103 suffering from colorectal cancer is determined as medium risk, and related diagnosis and treatment advice and the like may be given in combination with other clinical factors. The specific low risk threshold, the high risk threshold, etc. may be specifically set according to statistical data or clinical experience, which is not limited in this application.
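The two thresholding schemes described above can be sketched as follows. This is a minimal illustration; the function names and the low/high values used in the usage lines are hypothetical, since the application leaves the specific thresholds to statistical data or clinical experience.

```python
def classify_single(p, threshold=0.5):
    """Single-value threshold: p greater than the threshold means at risk."""
    return "at risk" if p > threshold else "not at risk"


def classify_banded(p, low, high):
    """Multi-value threshold range: low, medium or high risk."""
    if p < low:
        return "low risk"
    if p > high:
        return "high risk"
    return "medium risk"


# Hypothetical thresholds, chosen only to exercise the three branches.
labels = [classify_banded(p, low=0.3, high=0.7) for p in (0.1, 0.5, 0.9)]
```

Note that with the single threshold a value of exactly 0.5 is classified as not at risk, matching the "less than or equal to" wording above.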
On the basis of predicting the risk of the subject 103 to suffer from colorectal cancer, the processor 101 may further cause the display 102 to present the predicted risk of the subject 103 to suffer from colorectal cancer, for example, "risk of suffering from colorectal cancer", "risk of not suffering from colorectal cancer" may be displayed, and the risk of suffering from or not may be displayed in combination with the risk parameter p, for example, "risk of suffering from colorectal cancer is 90%" and the like, which are not listed herein.
With the system for predicting a subject's risk of having colorectal cancer, the risk parameter of the subject having colorectal cancer can be predicted efficiently, accurately, and at low cost based on the acquired presence information of the subject's 25 biomarkers. On the one hand, low colorectal cancer risk can be accurately ruled out, avoiding unnecessary further detection and treatment; on the other hand, high colorectal cancer risk can be identified in time, so that the subject does not miss the optimal window for treatment such as surgical intervention. Clinical decisions can thus be guided effectively, supporting individualized and precise diagnosis and treatment plans.
In another aspect, the present application provides a modeling method for a prediction model of a subject's risk of having colorectal cancer. Fig. 2 shows a flow chart of the modeling method according to an embodiment of the application. As shown in Fig. 2, the modeling method includes steps S201-S207 performed by a processor.
In step S201, the processor first obtains a first annotated intestinal flora structure data set with annotations, which includes data of a patient with colorectal cancer (CRC) and data of a healthy person (health).
In step S202, intestinal flora whose relative abundance in the first annotated intestinal flora structure data set is lower than a first threshold are filtered out to obtain a second annotated intestinal flora structure data set. Because the annotation of bacteria with very low relative abundance is not considered reliable, where the first threshold is 0.01%, bacteria whose annotated relative abundance is below 0.01% can be pooled. By way of example only, they can be grouped into a single class, "low abundance bacteria", which takes part in model training and the other subsequent steps. This reduces the number of bacterial classes, removes errors introduced by the annotation software, and reduces false positives that would arise if bacteria with very low relative abundance were identified as biomarkers in the subsequent biomarker search.
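The pooling in step S202 can be sketched as below for a single sample. This is only an illustration under stated assumptions: relative abundances are represented as fractions (so 0.01% is 1e-4), and the function name and the dictionary representation are hypothetical rather than taken from the application.

```python
LOW_ABUNDANCE_LABEL = "low abundance bacteria"


def collapse_low_abundance(sample, threshold=1e-4):
    """Pool every taxon whose relative abundance is below the threshold.

    `sample` maps taxon name -> relative abundance as a fraction of 1.
    Taxa below the threshold are summed into one pooled class so that the
    total abundance of the sample is preserved.
    """
    kept = {}
    pooled = 0.0
    for taxon, abundance in sample.items():
        if abundance < threshold:
            pooled += abundance
        else:
            kept[taxon] = abundance
    if pooled > 0.0:
        kept[LOW_ABUNDANCE_LABEL] = pooled
    return kept


sample = {"A": 0.6, "B": 0.39985, "C": 0.0001, "D": 0.00003, "E": 0.00002}
filtered = collapse_low_abundance(sample)
```

Here taxa D and E fall below the threshold and are merged into the pooled class, while C, sitting exactly at the threshold, is kept.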
In step S203, the second annotated intestinal flora structure data set is divided into a training set and a validation set. For example only, in a case where the ratio of the training set to the verification set is 3.
In step S204, feature coding is performed on the second annotated intestinal flora structure data set by using a gradient boosting decision tree model, and based on the training set and the validation set, hyper-parameter tuning is performed on the gradient boosting decision tree model (GBDT) through cross validation, and the importance degree of each feature is determined.
Specifically, a Gradient Boosting Decision Tree (GBDT) is an ensemble algorithm based on decision trees, in which Gradient Boosting (GB) is one of the Boosting ensemble methods and each new learner is obtained iteratively through gradient descent. It should be appreciated that any variant or improved version of the GBDT algorithm, such as an extreme gradient boosting (XGBoost) model, may also be employed according to embodiments of the present application. The classical GBDT algorithm may, for example, first train a CART decision tree (Classification And Regression Tree) using the training set and the ground truth values, then predict the training set using this tree to obtain a predicted value for each sample, and subtract the predicted value from the true value to obtain a "residual". Next, a second tree is trained, where the ground truth is no longer used and the residual serves as the target instead. After two trees have been trained, the residual of each sample can be computed again, a third tree is trained on it, and so on. The total number of trees can be specified manually, or training can be stopped by monitoring some indicator (e.g., the error on the validation set). When a new sample is predicted, the output values of all trees are added to obtain the final prediction result for the sample. In some embodiments, each tree may have a different weight, and the prediction result of the sample may be a weighted sum of the output values of the trees, and so on, which is not limited in this application.
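The residual-fitting loop described above can be illustrated with a small, self-contained sketch. It is a simplification under stated assumptions: depth-1 regression trees (stumps) on a single numeric feature stand in for CART trees, the loss is squared error, and all names and the toy data are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (split threshold, left value, right value)


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Fit a depth-1 regression tree by minimising squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, (threshold, lv, rv))
    if best is None:  # all x identical: predict the mean everywhere
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1]


def stump_predict(stump: Stump, x: float) -> float:
    threshold, lv, rv = stump
    return lv if x <= threshold else rv


def fit_boosted(xs: List[float], ys: List[float], n_trees: int) -> List[Stump]:
    stumps: List[Stump] = []
    residuals = list(ys)  # the first tree is trained on the ground truth
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Every later tree is trained on what is still unexplained.
        residuals = [r - stump_predict(stump, x) for x, r in zip(xs, residuals)]
    return stumps


def boosted_predict(stumps: List[Stump], x: float) -> float:
    # The final prediction is the sum of the outputs of all trees.
    return sum(stump_predict(s, x) for s in stumps)


def training_mse(stumps: List[Stump], xs: List[float], ys: List[float]) -> float:
    return sum((boosted_predict(stumps, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]  # toy regression targets
one_tree = fit_boosted(xs, ys, n_trees=1)
five_trees = fit_boosted(xs, ys, n_trees=5)
```

On this toy data the training error with five trees is lower than with one, which is the effect the iterative residual fitting is meant to achieve. Production libraries such as XGBoost add a learning rate, regularization and second-order information on top of this basic loop.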
Fig. 3 shows a schematic diagram of the structure of, and training with, a gradient boosting decision tree model according to an embodiment of the application. Fig. 3 contains 2 decision trees, a left sub-tree 31 and a right sub-tree 32, wherein the left sub-tree 31 has 3 leaf nodes, namely leaf nodes 311-313, and the right sub-tree 32 has 2 leaf nodes, namely leaf node 321 and leaf node 322. Assume that a data sample x in the second annotated intestinal flora structure data set falls on the first leaf node 311 of the left sub-tree 31. With one-hot encoding, the element of the feature vector corresponding to the leaf node on which the data sample falls is 1, and the elements corresponding to the other leaf nodes are 0, so the feature vector generated by the left sub-tree 31 is [1, 0, 0]. The data sample x also falls on the second leaf node 322 of the right sub-tree 32, so the feature vector generated by the right sub-tree 32 is [0, 1]. Concatenating the two gives the feature vector of the data sample x after feature coding, [1, 0, 0, 0, 1]. In general, the length of the feature vector is the sum of the numbers of leaf nodes of all decision trees in the GBDT model.
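The leaf-based encoding of Fig. 3 can be written as a few lines of code. The sketch below assumes the leaf index reached in each tree is already known (in practice it would come from the trained model); the function name is hypothetical.

```python
from typing import List, Sequence


def encode_leaves(leaf_indices: Sequence[int],
                  leaves_per_tree: Sequence[int]) -> List[int]:
    """One-hot encode the leaf reached in each tree and concatenate."""
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("need exactly one leaf index per tree")
    vector: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf index out of range")
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1
        vector.extend(one_hot)
    return vector


# Fig. 3: left sub-tree 31 has 3 leaves, right sub-tree 32 has 2 leaves.
# Sample x lands on leaf 311 (index 0) and on leaf 322 (index 1).
encoded = encode_leaves([0, 1], [3, 2])
print(encoded)  # [1, 0, 0, 0, 1]
```

The length of the result, 5, equals the total number of leaves across both trees.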
After the gradient boosting decision tree model has been used for feature coding of the second annotated intestinal flora structure data set, hyper-parameter tuning is performed on the model through cross validation based on the training set and the validation set. Depending on the specific implementation of the gradient boosting decision tree model, different hyper-parameters may be available. For example, the XGBoost model includes multiple hyper-parameters, such as learning rate, subsample, colsample_bytree (the fraction of features randomly sampled for training each tree), colsample_bylevel (the fraction of features randomly sampled at each level of each tree), and scale_pos_weight (a parameter for handling class imbalance). subsample controls the proportion of samples randomly drawn for each tree; reducing its value makes the algorithm more conservative and helps avoid overfitting, and typical values may be, for example, between 0.3 and 1. colsample_bytree controls the fraction of columns (each column is a feature) randomly sampled for each tree, and typical values may be, for example, between 0.2 and 1. colsample_bylevel controls the fraction of columns sampled for the splits at each level of the tree, and typical values may be, for example, between 0.2 and 1. scale_pos_weight adjusts for imbalanced classes; when the classes are highly imbalanced, it can be set to a positive value so that the algorithm converges more quickly. In some cases, the hyper-parameters max_depth, which defines the maximum depth of a tree, and nrounds, which defines the number of decision trees in the final model, are also optimized; the larger max_depth is, the more specific and local sample features the model will learn.
The gradient boosting decision tree model trained on the training set is then evaluated on the validation set, and different values of each hyper-parameter are cross-validated using, for example, a grid search strategy to select the optimal combination of hyper-parameter values, which completes the hyper-parameter tuning.
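A grid search with k-fold cross validation can be sketched generically as below. The sketch makes no claim about the actual tuning code: the scoring callback `toy_score` is a hypothetical stand-in for "train the model with these hyper-parameters on the training folds and return its accuracy on the held-out fold", and the grid values are illustrative only.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(n_samples: int, k: int) -> List[Split]:
    """Split indices 0..n-1 into k (train, held-out) pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits: List[Split] = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


def grid_search(param_grid: Dict[str, Sequence],
                score_fold: Callable[[dict, List[int], List[int]], float],
                n_samples: int, k: int = 5) -> Tuple[dict, float]:
    """Return the parameter combination with the best mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(score_fold(params, train, held_out)
                     for train, held_out in k_fold_indices(n_samples, k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: dict, train: List[int], held_out: List[int]) -> float:
    # Hypothetical stand-in: a real callback would fit a model on `train`
    # and measure accuracy on `held_out`. This toy peaks at (0.7, 4).
    return -(abs(params["subsample"] - 0.7) + abs(params["max_depth"] - 4))


grid = {"subsample": [0.3, 0.5, 0.7, 1.0], "max_depth": [2, 4, 6]}
best_params, best_score = grid_search(grid, toy_score, n_samples=249)
```

The grid here has 4 x 3 = 12 combinations, each scored on 5 folds; real grids over the hyper-parameters listed above grow multiplicatively, which is why the ranges are usually kept coarse.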
Furthermore, gradient boosting decision tree implementations usually have several built-in feature importance functions for calculating the importance degree of each feature. As an example, feature_importances_ is one such function used with the XGBoost algorithm. When the contribution of a feature is calculated by gain, the importance is the average improvement of the objective function (information gain) brought by the node splits that use the feature. Alternatively, the importance type can be set to cover, for example by model = XGBRFClassifier(importance_type='cover') and then reading the importances from model. In that case, when the tree model splits, the number of samples covered by the leaf nodes under the feature is divided by the number of times the feature is used to split, so that the closer a split is to the root, the greater the feature contribution. Therefore, the importance degree of each feature can be determined based on the gradient boosting decision tree model after the hyper-parameter tuning.
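The cover-style importance described above (samples covered divided by number of splits) can be illustrated without any library. This is only a sketch of the arithmetic, not XGBoost's internal code; the split records and feature names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def average_cover(splits: List[Tuple[str, int]]) -> Dict[str, float]:
    """Average number of samples covered per split, for each feature.

    `splits` holds one (feature name, samples reaching that split) pair
    for every split node in the model.
    """
    total: Dict[str, int] = defaultdict(int)
    count: Dict[str, int] = defaultdict(int)
    for feature, n_samples in splits:
        total[feature] += n_samples
        count[feature] += 1
    return {feature: total[feature] / count[feature] for feature in total}


# Hypothetical split records from a tiny two-tree model on 100 samples.
# Feature "A" is used at both roots, so each of its splits sees all samples.
split_records = [("A", 100), ("B", 60), ("B", 40), ("A", 100), ("C", 30)]
importance = average_cover(split_records)
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)  # ['A', 'B', 'C']
```

Because splits near the root see more samples, a feature used close to the root ends up with a larger average cover, matching the remark in the text.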
Next, in step S205, the features are sorted in descending order according to the determined importance degree, and the colorectal cancer regression prediction model is trained by using the sorted features, the training set and the verification set, so as to obtain a trained colorectal cancer regression prediction model and a corresponding optimal feature combination.
Specifically, the colorectal cancer regression prediction model is constructed by combining the gradient boosting decision tree model with a Logistic Regression (LR) model. That is, the features in descending order are added cumulatively, several (for example, 10) at a time, into the gradient boosting decision tree model. In this process, the relative abundance of the enterobacteria corresponding to a feature is set to 1 when the bacterium is present and to 0 when it is absent, so that only presence or absence is considered. The colorectal cancer regression prediction model is trained with L1 regularization, again using the training set, the validation set and cross validation, which in particular optimizes the weight of each feature. This yields a plurality of alternative colorectal cancer regression prediction models built on different numbers (for example, 10, 20, 30, and so on) of features.
On this basis, each alternative colorectal cancer regression prediction model is evaluated on the validation set, for example by plotting an accuracy curve, to select the best colorectal cancer regression prediction model; the feature combination of that model is the optimal feature combination.
Fig. 4 shows a schematic diagram of accuracy curves corresponding to different feature combinations according to an embodiment of the present application, and it can be seen from fig. 4 that, in the case of stacking 10 features each time, when the feature number is 60, the performance of the colorectal cancer regression prediction model is best, i.e., the accuracy of the model prediction result is highest.
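The cumulative feature selection of step S205 can be sketched as follows. The sketch covers only the outer loop: `evaluate` is a hypothetical callback standing in for "train the L1-regularized logistic regression on the given features and return validation accuracy", and the toy accuracy function is made up so that it peaks at 60 features purely to mirror the shape reported for Fig. 4. It is not real data.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def to_presence(abundances: Dict[str, float]) -> Dict[str, int]:
    """Binarise relative abundances: 1 if present, 0 if absent."""
    return {taxon: 1 if value > 0 else 0 for taxon, value in abundances.items()}


def select_feature_count(ranked_features: Sequence[str],
                         evaluate: Callable[[Sequence[str]], float],
                         step: int = 10) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Evaluate the top-10, top-20, ... features and keep the best count."""
    curve: List[Tuple[int, float]] = []
    for n in range(step, len(ranked_features) + 1, step):
        curve.append((n, evaluate(ranked_features[:n])))
    if not curve:
        raise ValueError("fewer features than one step")
    # max() returns the first maximum, so ties favour the smaller model.
    best_n, best_accuracy = max(curve, key=lambda point: point[1])
    return best_n, best_accuracy, curve


def toy_accuracy(features: Sequence[str]) -> float:
    # Made-up curve for illustration only; peaks at 60 features.
    return 0.80 - abs(len(features) - 60) / 1000


ranked = [f"feature_{i}" for i in range(100)]  # already sorted by importance
best_n, best_accuracy, curve = select_feature_count(ranked, toy_accuracy)
presence = to_presence({"s__Parvimonas_micra": 0.8, "s__Bifidobacterium_breve": 0.0})
```

The list `curve` holds the (feature count, accuracy) points that would be plotted as the accuracy curve.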
In step S206, an optimal combination of intestinal bacteria is determined as an optimal biomarker combination based on the optimal feature combination, wherein the biomarkers in the optimal biomarker combination reach the species level.
Specifically, since each feature has a definite correspondence with an enteric bacterium, the optimal enteric bacteria combination can be determined based on the optimal feature combination, and since the enteric bacteria according to the present application are annotated at the species level, each biomarker in the optimal biomarker combination reaches the species level. Compared with biomarkers annotated only to the genus level in the prior art, annotation to the species level makes the features, the model and the prediction result more accurate, and makes subsequent development of bacterial strains into drugs more feasible.
In step S207, the trained regression prediction model for colorectal cancer is used to give a prediction result of the risk of the subject to suffer from colorectal cancer based on the presence information of each biomarker in the optimal biomarker combination.
Fig. 5 shows the predictions made on an independent validation data set by the trained colorectal cancer regression prediction model based on the optimal biomarker combination according to an embodiment of the present application. The independent validation data set in Fig. 5 (source study: PRJEB12449, CRC = 52, healthy control = 52) did not participate in the training of the colorectal cancer regression prediction model; 0 represents a healthy sample and 1 represents a disease sample. When prediction was performed using the 25 enterobacteria in the above-described optimal biomarker combination and the trained colorectal cancer regression prediction model, a probability greater than 0.5 was taken as a prediction of disease (suffering from colorectal cancer) and a probability less than 0.5 as a prediction of healthy (not suffering from colorectal cancer). In the prediction result, 42 of the 52 healthy persons and 41 of the 52 CRC patients were predicted correctly, for an accuracy of about 0.8. The colorectal cancer regression prediction model according to the embodiment of the present application therefore performs well, with a prediction accuracy better than most of the prior art.
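The accuracy figure follows directly from the counts reported for Fig. 5. The short calculation below reproduces it; sensitivity and specificity are not stated in the text and are derived here from the same counts only for illustration.

```python
healthy_total, healthy_correct = 52, 42  # healthy persons predicted healthy
crc_total, crc_correct = 52, 41          # CRC patients predicted as CRC

accuracy = (healthy_correct + crc_correct) / (healthy_total + crc_total)
sensitivity = crc_correct / crc_total          # derived, not in the text
specificity = healthy_correct / healthy_total  # derived, not in the text

print(f"accuracy    = {accuracy:.3f}")     # 0.798
print(f"sensitivity = {sensitivity:.3f}")  # 0.788
print(f"specificity = {specificity:.3f}")  # 0.808
```

That is, 83 of 104 samples are classified correctly, which rounds to the accuracy of about 0.8 quoted above.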
In the modeling method for the prediction model of the risk of colorectal cancer of the subject, the GBDT model is first used to rank the features by importance, so that the important head features with high contribution are retained and the tail features with low contribution are removed. This enhances the robustness of the model and at the same time reduces the feature dimension, lowering model complexity and prediction cost, so that the trained colorectal cancer regression prediction model can conveniently and accurately give a prediction result of the subject's risk of colorectal cancer using only an optimal biomarker combination containing a small number of biomarkers.
In some embodiments, obtaining the first annotated gut flora structure data set may further comprise: obtaining a metagenomic raw sequencing dataset, wherein the metagenomic raw sequencing dataset comprises data of patients suffering from colorectal cancer and data of healthy persons, and the ratio of the patient data to the healthy person data is a first ratio.
By way of example only, metagenomic raw sequencing data may be downloaded, for example, from curatedMetagenomicData (DOI: 10.18129/B9.bioc.curatedMetagenomicData), a data package of the open-source Bioconductor project that provides standardized human microbiome data from multiple studies, including gene families, species relative abundance, pathway information, etc., while also providing raw sequencing data. The present application downloaded the metagenomic raw sequencing datasets of two of the CRC-related studies (PRJEB24748, PRJEB6070), comprising 332 samples, of which 165 are data of patients with colorectal cancer and 167 are data of healthy persons, i.e., a first ratio of 165:167, close to 1:1. Next, the downloaded metagenomic raw sequencing data may be input into the MetaPhlAn2 software to obtain the first annotated flora structure data set, wherein MetaPhlAn2 (https://doi.org/10.1038/nmeth.2066) is a computational tool developed by the Huttenhower laboratory specifically for profiling microbial community composition from metagenomic data.
Further, the second annotated intestinal flora structure data set obtained after the filtering process is divided into a training set and a validation set, and the proportions of CRC patients and healthy persons in the training set and the validation set are set according to the first ratio. For example, the training set contains 249 samples, of which 124 are CRC and 125 are healthy, and the validation set contains 83 samples, of which 41 are CRC and 42 are healthy. The ratio of CRC to healthy samples in both the training set and the validation set is thus close to the first ratio, which avoids model bias that could be caused by skewed sample proportions in training and validation.
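A class-stratified split reproduces the sample counts quoted above. The sketch assumes a training fraction of 0.75, which is inferred from the 249/83 set sizes rather than stated explicitly; the function name and the random seed are arbitrary.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_fraction: float = 0.75,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices so each class keeps the same train fraction."""
    rng = random.Random(seed)
    train: List[int] = []
    validation: List[int] = []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_fraction)
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:])
    return train, validation


# 165 CRC samples (label 1) and 167 healthy samples (label 0), as in the text.
labels = [1] * 165 + [0] * 167
train_idx, val_idx = stratified_split(labels)

train_crc = sum(labels[i] for i in train_idx)
val_crc = sum(labels[i] for i in val_idx)
print(len(train_idx), train_crc, len(train_idx) - train_crc)  # 249 124 125
print(len(val_idx), val_crc, len(val_idx) - val_crc)          # 83 41 42
```

Splitting each class separately is what keeps the CRC-to-healthy ratio in both subsets close to the first ratio of 165:167.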
In some embodiments, determining the optimal combination of intestinal bacteria as the optimal biomarker combination based on the optimal feature combination may specifically include: selecting the features with a weight greater than 0 from the optimal feature combination, determining the intestinal bacteria corresponding to those features as the optimal combination of intestinal bacteria, and taking that combination as the optimal biomarker combination.
In the example of fig. 4, the feature with the weight of 0 in the optimal feature combination is further removed, so as to obtain 25 kinds of enteric bacteria corresponding to the optimal feature combination, that is: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
In the regression prediction model for colorectal cancer comprising the optimal biomarker combinations of 25 intestinal bacteria, the weighting coefficients of the respective biomarkers and the specific method for calculating the risk of colorectal cancer of the subject can be expressed by formula (1) and formula (2), and the list is not repeated here.
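How a composite score is turned into a risk parameter can be sketched as below. Two points are assumptions and are flagged as such: only three of the 25 biomarkers are included (their weights are copied from formula (1)), and the conversion from score to probability uses the standard logistic function, because formula (2) is only available as an image and its exact form is not reproduced in the text.

```python
import math
from typing import Dict, Tuple

# (weight when absent "_0", weight when present "_1"), copied from formula (1)
# for three of the 25 biomarkers; the other 22 are omitted for brevity.
WEIGHTS: Dict[str, Tuple[float, float]] = {
    "s__Gemella_morbillorum": (-0.2208191, 1.92846425),
    "s__Fusobacterium_nucleatum": (-0.5495225, 0.4539216),
    "s__Bifidobacterium_breve": (0.29344154, -1.0466365),
}


def composite_score(presence: Dict[str, bool]) -> float:
    """Sum the "_1" weight for present and the "_0" weight for absent taxa."""
    score = 0.0
    for biomarker, (w_absent, w_present) in WEIGHTS.items():
        score += w_present if presence[biomarker] else w_absent
    return score


def risk_parameter(score: float) -> float:
    # ASSUMPTION: standard logistic link, P = 1 / (1 + exp(-score)).
    return 1.0 / (1.0 + math.exp(-score))


subject = {
    "s__Gemella_morbillorum": True,
    "s__Fusobacterium_nucleatum": True,
    "s__Bifidobacterium_breve": False,
}
score = composite_score(subject)
p = risk_parameter(score)
high_risk = p > 0.5  # threshold of 0.5 as used for the Fig. 5 predictions
```

With this partial weight table the numbers are illustrative only; a real risk parameter requires all 25 biomarkers and the published form of formula (2).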
The sample sizes used by prior-art colorectal cancer prediction models to search for biomarkers are often small, so the markers found generally do not apply to real-world populations. Compared with the prior art, the present application enlarges the population and the sample size: model construction and training are carried out on 332 samples, and the model is verified on an independent validation set (not participating in model training) comprising 104 samples (52 CRC patients and 52 healthy persons). The verification results show that the 25 intestinal bacteria serving as the optimal biomarker combination distinguish CRC patients from healthy persons well. The performance on the independent validation set is good: the accuracy of the prediction results reaches 0.80, with 42 of the 52 healthy persons and 41 of the 52 CRC patients predicted correctly, so the accuracy and robustness of the prediction model are high. In addition, whereas most markers found in the prior art only reach the genus level, the 25 biomarkers in the present application are all intestinal bacteria at the species level, which is more precise and provides better guidance for tracking pathogenic strains and for developing and applying formulated bacterial preparations in subsequent diagnosis and treatment.
Moreover, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments based on the present application with equivalent elements, modifications, omissions, combinations (e.g., of various embodiments across), adaptations or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more versions thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the application. This should not be interpreted as an intention that a disclosed feature not claimed is essential to any claim. Rather, subject matter of the present application can lie in less than all features of a particular disclosed embodiment. Thus, the claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that such embodiments can be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (17)

1. A kit for predicting a subject's risk of having colorectal cancer, the kit comprising reagents for detecting information on the presence of biomarkers comprising: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
2. The kit of claim 1, wherein the kit is an in vitro diagnostic kit.
3. A system for predicting a subject's risk of having colorectal cancer, the system comprising a processor and a display, the processor configured to:
obtaining information on the presence of 25 biomarkers from the subject: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum;
predicting a risk parameter for the subject to suffer from colorectal cancer based on the obtained information on the presence of the 25 biomarkers; and
causing the display to present a risk parameter of the subject suffering from colorectal cancer.
4. The system according to claim 3, wherein predicting the risk parameter of the subject for having colorectal cancer based on the obtained information on the presence of the 25 biomarkers comprises in particular:
calculating a risk parameter for the subject to suffer from colorectal cancer according to formula (1) and formula (2):
score = -0.2208191*s__Gemella_morbillorum_0 + 1.92846425*s__Gemella_morbillorum_1
- 0.0287996*s__Lactobacillus_gasseri_0 + 1.32441538*s__Lactobacillus_gasseri_1
- 0.1685278*s__Rothia_dentocariosa_0 + 1.08832583*s__Rothia_dentocariosa_1
+ 0.29344154*s__Bifidobacterium_breve_0 - 1.0466365*s__Bifidobacterium_breve_1
- 0.4945794*s__Clostridium_clostridioforme_0 + 0.64842325*s__Clostridium_clostridioforme_1
- 0.112998*s__Lactococcus_lactis_0 + 0.7986691*s__Lactococcus_lactis_1
- 0.4837138*s__Alistipes_shahii_0 + 0.06659259*s__Alistipes_shahii_1
+ 0.3887919*s__Clostridium_bartlettii_0 - 0.1883331*s__Clostridium_bartlettii_1
- 0.5495225*s__Fusobacterium_nucleatum_0 + 0.4539216*s__Fusobacterium_nucleatum_1
- 0.4146932*s__Solobacterium_moorei_0 + 0.55847062*s__Solobacterium_moorei_1
+ 0.12401521*s__Dasheen_mosaic_virus_0 - 0.3558551*s__Dasheen_mosaic_virus_1
+ 0.53506041*s__Haemophilus_parainfluenzae_0 - 0.3355559*s__Haemophilus_parainfluenzae_1
- 0.4494161*s__Clostridium_leptum_0 + 9.07E-05*s__Clostridium_leptum_1
- 0.0830943*s__Bacteroides_eggerthii_0 + 0.42819628*s__Bacteroides_eggerthii_1
+ 0.2788531*s__Bacteroides_massiliensis_0 - 0.3119645*s__Bacteroides_massiliensis_1
- 0.2841483*s__Bacteroides_dorei_0 + 1.08E-05*s__Bacteroides_dorei_1
- 0.3334665*s__Granulicatella_adiacens_0 + 0.47181429*s__Granulicatella_adiacens_1
- 0.1836578*s__Akkermansia_muciniphila_0 + 0.21298044*s__Akkermansia_muciniphila_1
- 0.2561297*s__Parvimonas_micra_0 + 1.09095746*s__Parvimonas_micra_1
- 0.140284*s__Atopobium_parvulum_0 + 0*s__Atopobium_parvulum_1
- 0.1045567*s__Eubacterium_hallii_0 + 0.21065861*s__Eubacterium_hallii_1
- 0.0373602*s__Desulfovibrio_desulfuricans_0 + 0.16118485*s__Desulfovibrio_desulfuricans_1
- 0.022273*s__Eggerthella_lenta_0 + 0.55154416*s__Eggerthella_lenta_1
- 0.1920693*s__Anaerotruncus_colihominis_0 + 0.06595296*s__Anaerotruncus_colihominis_1
- 0.2559028*s__Clostridium_bolteae_0 + 0.14338132*s__Clostridium_bolteae_1    (1)
[Formula (2) is given as image FDA0004029585000000031 in the original publication]    (2)
wherein score in formula (1) is a composite score calculated from the presence information and weight coefficients of the biomarkers, _0 indicates the absence of the biomarker, _1 indicates the presence of the biomarker, and the number preceding each biomarker is its weight coefficient; P in formula (2) is the risk parameter of the subject suffering from colorectal cancer calculated from the composite score.
5. The system of claim 4, wherein the processor is further configured to:
predicting the risk of the subject suffering from colorectal cancer based on whether the risk parameter of the subject suffering from colorectal cancer exceeds a preset threshold range; and
causing the display to present the predicted risk of the subject for colorectal cancer.
6. The system of claim 5, wherein the preset threshold range is 0.5.
7. The system of claim 4, wherein the individual biomarkers and their corresponding weighting coefficients are derived using a predictive model for the risk of the subject suffering from colorectal cancer based on annotated gut flora structure data sets comprising patients and healthy persons suffering from colorectal cancer.
8. Use of a reagent for detecting presence information of a biomarker in the manufacture of a kit for predicting a subject's risk of having colorectal cancer, wherein the biomarker comprises: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
9. Use according to claim 8, characterized in that: the reagent is used for detecting the existence information of the biomarker contained in a sample of intestinal microflora of the subject, wherein the sample comprises an intestinal tissue sample or a fecal sample of the subject.
10. Use according to claim 8 or 9, wherein the reagents are qPCR primers for the biomarkers.
11. The use according to claim 10, further comprising: detecting presence information of the biomarker by a PCR reaction using the primer and genomic DNA of the subject's intestinal microbial flora as a template.
12. The use according to claim 8, further comprising: detecting the presence of said biomarker by detecting the presence of a nucleotide sequence selected from the group consisting of:
1) Gemella_morbillorum f: atacagttattctcgccatgagags, r: GGTTAGGTACCGTCTCTTACATG
2) Lactobacillus_gasseri f: AATACTCCGAAGCACGTCA, r: TCATTGTGTTTGGCAATCGT
3) Parvimonas_micra f: TCACAGTAGTCACAAGAGGAGAGGAT, r: GGGAAGCATTGGCGGAAA
4) Rothia_dentocariosa f: GGGTTGTAAACCTCTGTTAGCATC, r: CGTACCCACTGCAAAACCAG
5) Bifidobacterium_breve f: TCATCATCACGGCAAGGTCAAGA, r: GGCCAGAACAGCTGGAACAA
6) Lactococcus_lactis f: CTGTCGTTTCTGTTATGAAT, r: GTGTATTCATCATAACCAAC
7) Clostridium_clostridioforme f: GAAGTTTTTTCGGATGGAATCTTGA, r: CACCGAAGGCTTTGCC
8) Solobacterium_moorei f: CTCAACCCAATCCAGCCACT, r: TATTGGCTCCCCACGGTTTC
9) Eggerthella_lenta f: GAGTTTGATCCTGGCTCAG, r: ACGGCTACCTTGTTACGACTT
10) Fusobacterium_nucleatum f: CAACCATTTACTTTAACTCTACCATGTTCA, r: GTTGACTTTACAGAAGGAGATTATGTAAAAATC
11) Haemophilus_parainfluenzae f: GAGAGACTGCGGTAGTCGATCC, r: CCATCACTTGGTTTGATGCT
12) f: CTGATGCACACCACCAAGTC, r: GGTCATGTCGTAGGGCTTGT
13) f: GGTTTATCCTTAGAAAGGAGGT, r: GAGCATTCGGTTGGGCACTCTAG
14) Clostridium_bartlettii f: GTAAGCTCTTGAAACTGGAG, r: GAAAGATGCGATTAGGCATC
15) Bacteroides_eggerthii f: CCCGATAGTAGTTAGTTTTCCGC, r: TCCTCTCAGAACCCCTATCCAT
16) Clostridium_leptum f: GCACAAGCAGCAGTGGAGT, r: CTTCCTCCGTTTTGTCA
17) Dasheen_mosaic_virus f: ATGGTHTGGTGYATHGARAAYGG, r: TGCTGCKGCYTTCATYTG
18) Bacteroides_massiliensis f: GCGTTTCCG, r: CCATATTCGG
19) Bacteroides_dorei f: aagcggcttcaagaaaacagg, r: GTGCCCTTTACCTTGGGAAC
20) Clostridium_bolteae f: CCTCTTGACCGGCGTGT, r: CAGGTAGAGCTGGGCACTCTAGG
21) Akkermansia_muciniphila f: CAGCACGTGAAGGTGGGGAC, r: CCTTGCGGTTGGCTTCAGAT
22) Eubacterium_hallii f: GCGTAGGTGGCAGTGCAA, r: GCACCGRAGCCTATACGG
23) Anaerotruncus_colihominis f: GGAGCTTACGTTTGAAGTTTTTC, r: CTGCTGCCTCCCGTA
24) Desulfovibrio_desulfuricans f: GGCATCTATAAGACCTCCTGTAGAC, r: TGTAGATCGTAGGTAGCAAATGTCG
25) Atopobium_parvulum f: AGAGAGTTTGATCCTGGCTCAG, r: TGCGGCACGGAAGAAATACTCCCC.
13. Use according to claim 8 or 9, wherein said biomarkers are extracted, using a predictive model for the risk of a subject suffering from colorectal cancer, on the basis of annotated gut flora structure data sets comprising patients with colorectal cancer and healthy persons.
14. A modeling method for a predictive model of a subject's risk of having colorectal cancer, the modeling method comprising the steps performed by a processor of:
obtaining a first annotated gut flora structural data set with annotations, comprising patient data with colorectal cancer and data of healthy persons;
filtering out intestinal flora with relative abundance lower than a first threshold value in the first annotated intestinal flora structure data set to obtain a second annotated intestinal flora structure data set;
dividing the second annotated intestinal flora structure data set into a training set and a validation set;
performing feature coding on the second annotated intestinal flora structure data set by using a gradient boosting decision tree model, performing hyper-parameter tuning on the gradient boosting decision tree model through cross validation based on the training set and the validation set, and determining the importance degree of each feature;
performing descending order arrangement on each feature according to the determined importance degree, and training the colorectal cancer regression prediction model by using each feature after descending order arrangement, the training set and the verification set to obtain a trained colorectal cancer regression prediction model and a corresponding optimal feature combination;
and determining an optimal combination of intestinal bacteria as an optimal biomarker combination based on the optimal feature combination, wherein the biomarkers in the optimal biomarker combination reach species levels, and the trained colorectal cancer regression prediction model is configured to give a prediction result of the colorectal cancer risk of the subject based on the presence information of each biomarker in the optimal biomarker combination.
15. The modeling method of claim 14, wherein obtaining a first annotated gut flora structure data set with annotations further comprises:
obtaining a metagenomic original sequencing dataset, wherein the metagenomic original sequencing dataset comprises data of a patient suffering from colorectal cancer and data of a healthy person, and the ratio of the data of the patient to the data of the healthy person is a first ratio;
annotating the intestinal flora structure in the metagenome original sequencing dataset to obtain a first annotated intestinal flora structure dataset;
and dividing the second annotated gut flora structure data set into a training set and a validation set further comprises: dividing the second annotated intestinal flora structure data set into the training set and the validation set according to the first ratio.
16. The modeling method according to claim 14 or 15, wherein determining the optimal combination of gut bacteria as the optimal biomarker combination based on the optimal feature combination specifically comprises:
selecting the features with a weight greater than 0 from the optimal feature combination, determining the intestinal bacteria corresponding to those features as the optimal combination of intestinal bacteria, and taking that combination as the optimal biomarker combination.
17. A modelling method according to claim 14 or 15, wherein said optimal marker combination comprises: s__Gemella_morbillorum, s__Lactobacillus_gasseri, s__Parvimonas_micra, s__Rothia_dentocariosa, s__Bifidobacterium_breve, s__Lactococcus_lactis, s__Clostridium_clostridioforme, s__Solobacterium_moorei, s__Eggerthella_lenta, s__Fusobacterium_nucleatum, s__Haemophilus_parainfluenzae, s__Alistipes_shahii, s__Granulicatella_adiacens, s__Clostridium_leptum, s__Bacteroides_eggerthii, s__Clostridium_bartlettii, s__Dasheen_mosaic_virus, s__Bacteroides_massiliensis, s__Bacteroides_dorei, s__Clostridium_bolteae, s__Akkermansia_muciniphila, s__Eubacterium_hallii, s__Anaerotruncus_colihominis, s__Desulfovibrio_desulfuricans, s__Atopobium_parvulum.
CN202211720516.4A 2022-12-30 2022-12-30 Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject Pending CN115873956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211720516.4A CN115873956A (en) 2022-12-30 2022-12-30 Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject

Publications (1)

Publication Number Publication Date
CN115873956A (en) 2023-03-31

Family

ID=85757464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211720516.4A Pending CN115873956A (en) 2022-12-30 2022-12-30 Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject

Country Status (1)

Country Link
CN (1) CN115873956A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116344040A (en) * 2023-05-22 2023-06-27 北京卡尤迪生物科技股份有限公司 Construction method of integrated model for intestinal flora detection and detection device thereof
CN116344040B (en) * 2023-05-22 2023-09-22 北京卡尤迪生物科技股份有限公司 Construction method of integrated model for intestinal flora detection and detection device thereof

Similar Documents

Publication Publication Date Title
CN109943636B (en) Colorectal cancer microbial marker and application thereof
CN105296590B (en) Large intestine carcinoma marker and its application
CN109852714B (en) Early diagnosis of intestinal cancer and adenoma diagnosis marker and application
CN105132518B (en) Large intestine carcinoma marker and its application
JP2018537754A5 (en)
CN108345768B (en) Method for determining maturity of intestinal flora of infants and marker combination
CN107075453B (en) Biomarkers for coronary artery disease
CN107217089A (en) Determine the method and device of individual state
CN107075446A (en) Biomarker for obesity-related disorder
Carbonetto et al. Human microbiota of the argentine population-a pilot study
CN112852916A (en) Marker combination for intestinal microecology, auxiliary diagnosis model and application of marker combination
US20220293217A1 (en) System and method for risk assessment of multiple sclerosis
CN115873956A (en) Kit, system, use and modeling method of prediction model for predicting risk of colorectal cancer of subject
Tan-Torres Jr et al. Machine learning clustering and classification of human microbiome source body sites
CN107217088A (en) Ankylosing spondylitis microbial markers
CN111254207A (en) Intestinal microbial marker for distinguishing autoimmune hepatitis from healthy people and application thereof
CN114317725B (en) Crohn disease biomarker, kit and screening method of biomarker
CN111755129A (en) Multi-mode osteoporosis layering early warning method and system
KR20210145539A (en) Providing method for health information based on microbiome and analysis apparatus
CN115331737A (en) Method for analyzing pathogenic bacteria in intestinal flora and quantifying regional characteristics of flora
CN105733988B (en) Composition and application
CN114369673A (en) Colorectal adenoma biomarker, kit and screening method of biomarker
CN111192244B (en) Method and system for determining tongue characteristics based on key points
CN107217086A (en) Disease marker and application
JP2021151194A (en) Exercise habit examining method using intestinal bacterial flora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination