CN114974432A - Screening method of biomarker and related application thereof - Google Patents

Publication number
CN114974432A
Authority
CN
China
Legal status
Pending
Application number
CN202210770641.XA
Other languages
Chinese (zh)
Inventor
张陈陈
梁雅俊
朱瑞娟
兰周
常曌
张东亚
蒋先芝
Current Assignee
Moon Guangzhou Biotech Co ltd
Original Assignee
Moon Guangzhou Biotech Co ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a biomarker screening method and related applications thereof, and relates to the technical field of biological detection. In the provided method, the metabolites of strains are not measured with an instrument; instead, they are predicted directly through functional prediction on the genome sequences of the strains, forming a metabolite sequence library. Metagenomic sequencing data are then aligned against the strain sequence library and the metabolite sequence library, respectively, to calculate the abundance of each strain and of each metabolite, and strains are screened based on strain abundance, metabolite abundance, and the correspondence between metabolites and strains. Screening can thus be based on both strains and metabolites even though the metabolites are not actually measured. The method markedly reduces the time and cost of developing markers for disease diagnosis, prediction of therapeutic efficacy, and drug screening, and provides a new approach for research on disease mechanisms and for drug development.

Description

Screening method of biomarker and related application thereof
Technical Field
The invention relates to the technical field of biological detection, and in particular to a biomarker screening method and related applications thereof.
Background
Metagenomics is based mainly on high-throughput sequencing of the genomic DNA of all microorganisms in a given environment. It analyzes the taxonomic classification and functional annotation of the microorganisms in a sample and calculates the relative abundance of strains and functions. Metagenomics requires no isolation or culture: DNA is extracted directly from the environmental sample and sequenced, which avoids the difficulty of studying unculturable microorganisms. Metagenomic sequencing can therefore resolve the structure and composition of a microbial community and mine its gene and functional information in depth. In recent years, with the discovery of immune checkpoints, tumor immunotherapy has developed rapidly, and many studies show that immunotherapy is very effective against solid tumors such as melanoma and non-small cell lung cancer. However, this therapy benefits only a small proportion of patients, so developing methods that improve the efficacy of immunotherapy is of great significance.
Research shows that the intestinal flora can directly influence the effect of cancer immunotherapy. Its presence, composition and diversity directly affect the outcome of tumor therapy, and the efficiency of the immune response can be improved through fecal microbiota transplantation or intervention with beneficial bacteria. For example, the Gajewski team analyzed the intestinal flora of 42 patients with metastatic melanoma and found that its composition was significantly associated with the efficacy of PD-1 inhibitor immunotherapy. In the intestinal flora of patients who responded to treatment, Bifidobacterium longum, Collinsella aerofaciens and Enterococcus faecium were highly abundant. Transplanting the intestinal flora of responding patients into germ-free mice improved tumor control, T cell responses and the therapeutic effect of a PD-L1 inhibitor (Science. 2018, 359: 104-108). The teams of Prof. Guido Kroemer and Prof. Laurence Zitvogel found, by analyzing samples from patients with lung and kidney cancer, that patients who failed to benefit from immunotherapy lacked a bacterium called Akkermansia muciniphila. They then demonstrated the benefit of this bacterium in mouse experiments. First, they used fecal transplantation to implant the flora of patients who responded to immunotherapy and of patients who did not respond into antibiotic-treated mice (which were themselves unresponsive to immunotherapy). As the investigators expected, the former restored the response to immunotherapy, while the latter remained unresponsive. More interestingly, the efficacy of immunotherapy could be restored in the latter mice by additional oral administration of Akkermansia muciniphila.
It is therefore very important to screen for strains that can improve the effect of immunotherapy.
In recent years, it has been found that metabolites of the gut flora can cross the intestinal barrier directly and act on the host, influencing host physiology and activating host immunity. These substances include short-chain fatty acids, indolepropionic acid, tryptamine, secondary bile acids and the like. Among them, short-chain fatty acids and secondary bile acids are important research directions in the field of intestinal flora metabolites and tumor immunity. Short-chain fatty acids (SCFAs) are among the best-characterized classes of microbial metabolites currently known to affect host immunity, and may affect cytokine production, macrophage and dendritic cell function, and B cell class switching. For example, on July 1, 2021, a research team from Philipps University of Marburg, Germany, published a paper in Nature Communications entitled "Microbial short-chain fatty acids modulate CD8+ T cell responses and improve adoptive immunotherapy for cancer". The study demonstrated experimentally, for the first time, that two microbial metabolites, valeric acid and butyric acid, enhance the production of potent cytokines in CD8+ T cells, thereby enhancing the anti-tumor activity of immune cells.
Although many studies show that the intestinal flora can directly influence the effect of cancer immunotherapy, the actual drug development process faces many problems. For example, the number of microbial strains is very large, so selecting the strains that benefit immunotherapy is difficult. If too many strains are selected, the subsequent animal experiments and clinical trials become very long and research and development costs rise sharply. How to screen strains accurately and effectively is therefore an urgent problem in accelerating the development of microbial drugs.
Typically, the abundance of each strain is calculated from metagenomic sequencing data, and strains beneficial to the immunotherapeutic effect are screened based on strain abundance combined with statistical methods. In addition, the metabolite gene clusters of each strain can be detected, and whether the strain is able to produce a metabolite beneficial to immunotherapy can be inferred from the detected values, so that beneficial strains can be screened further. In the prior art, the metabolites of a strain are measured directly. This approach first requires a dedicated detection instrument; second, culturing the strain takes time and the measurement itself incurs cost. It is therefore time-consuming and labor-intensive, and it hinders the screening of strains beneficial to immunotherapy.
In the prior art, although a nuclear magnetic resonance spectrometer is used to measure relevant metabolites in blood, urine or feces, the metabolite gene clusters of strains cannot be obtained in this way, and metabolites measured in blood, urine or feces cannot be treated as metabolites of the strains when screening microbial drugs for treatment.
In view of this, the present invention is proposed.
Disclosure of Invention
The invention aims to provide a biomarker screening method and related application thereof.
The invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for screening biomarkers, comprising: S1, establishing a representative strain genome sequence library: clustering the sequences in an obtained genome sequence library according to a set threshold to obtain strain clusters at the strain level or the species level, and selecting a representative strain sequence from each strain cluster to establish the representative strain genome sequence library; S2, establishing a metabolite gene cluster sequence library: performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each strain or each species, clustering the metabolite gene clusters by similarity to obtain gene cluster families, and merging the gene cluster families to obtain the metabolite gene cluster sequence library, wherein the order of steps S1 and S2 may be interchanged; S3, obtaining metagenomic sequencing data of a sample, and aligning the metagenomic sequencing data against the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2, respectively, to obtain the relative abundance of each strain or each species and the relative abundance of each metabolite; S4, screening for significantly different strains or species and significantly different metabolites; S5, taking, as the biomarker, a candidate strain or candidate species for which both the strain or species and its metabolite differ significantly.
In a second aspect, an embodiment of the present invention provides a biomarker screening apparatus, comprising: a representative strain genome sequence library construction module, configured to establish a representative strain genome sequence library by clustering the sequences in an obtained genome sequence library according to a set threshold to obtain strain clusters at the strain level or the species level, and selecting a representative strain sequence from each strain cluster; a metabolite gene cluster sequence library construction module, configured to establish a metabolite gene cluster sequence library by performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each strain or each species, clustering the metabolite gene clusters by similarity to obtain gene cluster families, and merging the gene cluster families; a relative abundance calculation module, configured to obtain metagenomic sequencing data of a sample and align the data against the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2, respectively, to obtain the relative abundance of each strain or each species and the relative abundance of each metabolite; a screening module, configured to screen for significantly different strains or species and significantly different metabolites; and a marker generation module, configured to take, as the biomarker, a candidate strain or candidate species for which both the strain or species and its metabolite differ significantly.
In a third aspect, an embodiment of the present invention provides a method for training a prediction model, comprising: obtaining detection results of biomarkers in training samples and the corresponding labeling results, wherein the biomarkers are obtained by the screening method of the preceding embodiment or by the biomarker screening apparatus of the preceding embodiment, and each labeling result is a label representing the disease risk and/or prognostic efficacy of the sample; inputting the detection results of the biomarkers into a pre-constructed prediction model to obtain prediction results, the pre-constructed prediction model being a classifier capable of judging the disease risk and/or prognostic efficacy of a sample from the detection results of the biomarkers; and updating the parameters of the prediction model based on the labeling results and the prediction results.
In a fourth aspect, an embodiment of the present invention provides a prediction apparatus for predicting disease risk and/or prognostic efficacy, comprising: an obtaining module, configured to obtain the detection result of a biomarker in a sample to be tested, wherein the biomarker is obtained by the screening method of the preceding embodiment or by the biomarker screening apparatus of the preceding embodiment; and a prediction module, configured to input the obtained detection result into the prediction model trained by the training method of the preceding embodiment to obtain a prediction result.
In a fifth aspect, an embodiment of the present invention provides an electronic device comprising a processor and a memory, the memory storing a program that, when executed by the processor, causes the processor to implement: the biomarker screening method of the preceding embodiment; or the training method of the prediction model of the preceding embodiment; or the following: obtaining the detection result of a biomarker in a sample to be tested, wherein the biomarker is obtained by the screening method of the preceding embodiment, and inputting the obtained detection result into the prediction model trained by the training method of the preceding embodiment to obtain a prediction result.
In a sixth aspect, an embodiment of the present invention provides a computer-readable medium storing a computer program which, when executed by a processor, implements: the biomarker screening method of the preceding embodiment; or the training method of the preceding embodiment; or the following: obtaining the detection result of a biomarker in a sample to be tested, wherein the biomarker is obtained by the screening method of the preceding embodiment or by the biomarker screening apparatus of the preceding embodiment, and inputting the obtained detection result into the prediction model trained by the training method of the preceding embodiment to obtain a prediction result.
In a seventh aspect, the embodiments of the present invention provide a use of a reagent for detecting a biomarker obtained by screening with the screening method described in the previous embodiments or with the biomarker screening apparatus described in the previous embodiments in the preparation of a product for treating a disease or a product for predicting the risk of a disease and/or the prognostic efficacy.
In an eighth aspect, the embodiments of the present invention provide a product for predicting risk of colorectal cancer or prognosis effect, which includes the reagent for detecting biomarkers as described in the previous embodiments.
The invention has the following beneficial effects:
The invention provides a novel method for screening biomarkers. The metabolites of strains are not measured with an instrument; instead, the metabolites of each strain are predicted directly through functional prediction on the genome sequences of the strains, forming a metabolite sequence library. Metagenomic sequencing data are then aligned against the strain sequence library and the metabolite sequence library, respectively, the abundance of each strain and of each metabolite is calculated, and strains are screened based on strain abundance, metabolite abundance, the correspondence between metabolites and strains, and statistical methods. Strains can thus be screened on the basis of both strains and metabolites even though the metabolites are not actually measured. The method has the following advantages:
(1) diagnosis, drug screening and prediction of diagnostic or therapeutic effect can be carried out simply, rapidly, and at any time and place;
(2) the accuracy and effectiveness of screening are improved: screening strains (species) on the basis of both strains (species) and metabolites greatly reduces the number of strains selected compared with screening by strain (species) abundance alone, and because metabolites play an important role in immunotherapy, screening on metabolites makes the result more accurate;
(3) drug development time is shortened and research and development costs are reduced: the more accurately the strains are screened, the more the time for subsequent animal experiments and clinical trials, and the associated research and development costs, can be reduced;
(4) the biomarkers can be applied in disease diagnosis or in predicting prognostic risk.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from them without inventive effort.
FIG. 1 is a schematic diagram of the abundance calculation for strains (species);
FIG. 2 is a flow chart of the technical solution of Example 1;
FIG. 3 is a flow chart of the technical solution of Example 2;
FIG. 4 is a flow chart of the technical solution of Example 3;
FIG. 5 shows the number of strains in the optimal model for predicting whether disease is present in Example 4;
FIG. 6 shows the importance of each strain in the optimal model of Example 4;
FIG. 7 shows the evaluation results on the training set of the random forest model of Example 4;
FIG. 8 shows the evaluation results on the test set of the random forest model of Example 4;
FIG. 9 is a flow chart of the biomarker screening technique of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Where specific conditions are not stated in the examples, conventional conditions or the conditions recommended by the manufacturer were used. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
The embodiment of the invention provides a method for screening biomarkers, which comprises the following steps:
S1, establishing a representative strain genome sequence library: clustering the sequences in an obtained genome sequence library according to a set threshold to obtain strain clusters at the strain level and/or the species level; selecting a representative strain sequence from each strain cluster to establish the representative strain genome sequence library;
S2, establishing a metabolite gene cluster sequence library: performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each strain and/or each species, clustering the metabolite gene clusters by similarity to obtain gene cluster families, and merging the gene cluster families to obtain the metabolite gene cluster sequence library; the order of steps S1 and S2 may be interchanged;
S3, obtaining metagenomic sequencing data of a sample, and aligning the metagenomic sequencing data against the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2, respectively, to obtain the relative abundance of each strain and/or each species and the relative abundance of each metabolite;
S4, screening for significantly different strains and/or species and significantly different metabolites;
S5, taking, as the biomarker, a candidate strain or candidate species for which both the strain or species and its metabolite differ significantly.
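The relative abundance calculation of step S3 can be illustrated with a minimal sketch. The text does not state the abundance formula at this point, so the length-normalized read-count formula, the function name `relative_abundance`, and the toy counts below are assumptions for illustration only, not the method of the invention.

```python
def relative_abundance(read_counts, ref_lengths):
    """Length-normalize mapped read counts, then scale so the values sum to 1.

    Assumption: abundance is proportional to reads per base of reference;
    the source does not fix a particular formula.
    """
    coverage = {
        ref: read_counts.get(ref, 0) / ref_lengths[ref]
        for ref in ref_lengths
    }
    total = sum(coverage.values())
    if total == 0:
        return {ref: 0.0 for ref in coverage}
    return {ref: cov / total for ref, cov in coverage.items()}


# Hypothetical counts of clean reads aligned to two representative genomes.
counts = {"strain_A": 300, "strain_B": 100}
lengths = {"strain_A": 3_000_000, "strain_B": 2_000_000}
abundance = relative_abundance(counts, lengths)
```

The same function could be applied to reads aligned against the metabolite gene cluster sequence library, with gene cluster lengths in place of genome lengths.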
In some embodiments, in the step S1, the set threshold is greater than or equal to 95%.
In some embodiments, the set threshold is ≥ 99%.
In some embodiments, in step S1, strain clusters at the strain level are obtained with a set threshold of 99% or more, strain clusters at the species level are obtained with a set threshold of 95% or more, and a representative strain is obtained for each strain cluster.
In some embodiments, the step of selecting a representative strain sequence from each strain cluster to establish the representative strain genome sequence library comprises: among the strains in each strain cluster, selecting the sequence with the greatest length as the representative strain sequence of that cluster.
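Threshold clustering followed by selection of the longest sequence can be sketched as follows. This is a simplified illustration under stated assumptions: the pairwise identity values (for example, an average nucleotide identity, which the text does not name), the greedy seed-based clustering strategy, and the names `cluster_genomes` and `pick_representatives` are all hypothetical.

```python
def cluster_genomes(genomes, identity, threshold=0.95):
    """Greedy clustering: a genome joins the first cluster whose seed it
    matches at or above `threshold`; otherwise it seeds a new cluster."""
    clusters = []  # each cluster is a list of genome ids; index 0 is the seed
    for g in genomes:
        for cluster in clusters:
            seed = cluster[0]
            if identity.get(frozenset((g, seed)), 0.0) >= threshold:
                cluster.append(g)
                break
        else:
            clusters.append([g])
    return clusters


def pick_representatives(clusters, lengths):
    """Select the longest genome sequence in each cluster as its representative."""
    return [max(cluster, key=lambda g: lengths[g]) for cluster in clusters]


# Hypothetical genome lengths and pairwise identities.
lengths = {"g1": 2_900_000, "g2": 3_100_000, "g3": 2_500_000}
identity = {
    frozenset(("g1", "g2")): 0.97,
    frozenset(("g1", "g3")): 0.80,
    frozenset(("g2", "g3")): 0.81,
}
clusters = cluster_genomes(["g1", "g2", "g3"], identity, threshold=0.95)
representatives = pick_representatives(clusters, lengths)
```

With a threshold of 0.95, g1 and g2 fall into one species-level cluster and g2, the longer sequence, represents it; raising the threshold to 0.99 would place each genome in its own strain-level cluster.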
In some embodiments, in step S1 and/or step S2, the obtained genome sequence library comprises a genome database and/or a strain database.
In some embodiments, the genome database comprises at least one of the UHGG database and a human gut microorganism genome sequence database.
In some embodiments, step S2 further comprises, after the gene annotation, analyzing the annotated genes to predict the metabolite gene clusters of each strain and/or each species.
In some embodiments, the step of obtaining a gene cluster family by similarity clustering of metabolite gene clusters in step S2 includes any one of the steps of (a) to (c):
(a) extracting protein sequences of the metabolite gene clusters, performing redundancy filtration on the predicted metabolite gene clusters according to sequence similarity between the protein sequences to obtain a non-redundant gene cluster set, and clustering the metabolite gene clusters in the non-redundant gene cluster set according to a set similarity threshold to obtain a gene cluster family;
(b) merging the gene clusters of the same metabolite into a gene cluster set, and selecting a representative gene cluster of the gene cluster set; clustering the representative gene clusters according to a set similarity threshold value to obtain the gene cluster families;
(c) extracting protein sequences of metabolite gene clusters, performing redundant filtration on the predicted metabolite gene clusters according to sequence similarity between the protein sequences to obtain a non-redundant gene cluster set, calculating the distance between every two gene clusters in each non-redundant gene cluster set, and selecting the gene cluster with the minimum distance value as a representative gene cluster of the gene cluster set for merging to obtain the metabolite gene cluster sequence library.
In some embodiments, in item (b), the step of selecting a representative gene cluster of the gene cluster set comprises: calculating the distance between every two gene clusters in each gene cluster set, and selecting the gene cluster with the minimum distance value as the representative gene cluster of the set, the representative gene clusters being merged to obtain the metabolite gene cluster sequence library.
In some embodiments, the similarity threshold is ≥ 0.3.
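Grouping gene clusters into families at a similarity threshold and choosing a representative by minimum distance can be sketched as below. The sketch makes several assumptions not fixed by the text: families are formed by single linkage, distance is taken as 1 minus similarity, "minimum distance value" is read as the smallest summed distance to the other members (a medoid), and the gene cluster names and similarity values are invented for illustration.

```python
def group_by_similarity(items, similarity, threshold=0.3):
    """Single-linkage grouping: items joined by any pair with
    similarity >= threshold end up in the same family."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pair, sim in similarity.items():
        a, b = tuple(pair)
        if sim >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for x in items:
        families.setdefault(find(x), []).append(x)
    return sorted(sorted(f) for f in families.values())


def medoid(members, similarity):
    """Return the member with the smallest summed distance (1 - similarity)
    to all other members of the set."""
    def total_distance(x):
        return sum(
            1.0 - similarity.get(frozenset((x, y)), 0.0)
            for y in members if y != x
        )
    return min(members, key=total_distance)


# Hypothetical pairwise similarities between predicted gene clusters.
similarity = {
    frozenset(("bgc1", "bgc2")): 0.6,
    frozenset(("bgc2", "bgc3")): 0.5,
    frozenset(("bgc1", "bgc3")): 0.2,
    frozenset(("bgc3", "bgc4")): 0.1,
}
families = group_by_similarity(["bgc1", "bgc2", "bgc3", "bgc4"], similarity)
library = [medoid(family, similarity) for family in families]
```

Here bgc1, bgc2 and bgc3 form one family through the chain of pairs at or above 0.3, and bgc2 is kept as its representative because it is closest to the other two.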
In some embodiments, the tool used for the gene annotation comprises Prokka.
In some embodiments, the tool used for predicting the metabolite gene clusters of each strain or each species comprises gutSMASH and/or antiSMASH.
In some embodiments, after the metagenomic sequencing data are obtained in step S3, the screening method further comprises filtering the metagenomic sequencing data (raw reads) to remove low-quality sequences and host sequences and obtain clean reads; see FIG. 1 for details.
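The filtering of raw reads into clean reads can be sketched as follows. The quality and length cutoffs, the Phred+33 encoding, and the idea that host reads arrive as a set of read identifiers from an earlier alignment against the host genome are all assumptions; the text does not specify the filtering criteria or the tools used.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads, host_read_ids, min_mean_quality=20, min_length=50):
    """Keep reads that are long enough, of sufficient quality, and not host-derived."""
    clean = []
    for read_id, sequence, quality in reads:
        if read_id in host_read_ids:
            continue  # read aligned to the host genome in an earlier step
        if len(sequence) < min_length:
            continue  # too short to align reliably
        if mean_phred(quality) < min_mean_quality:
            continue  # low-quality read
        clean.append((read_id, sequence, quality))
    return clean


# Hypothetical raw reads as (id, sequence, quality string) tuples.
raw_reads = [
    ("r1", "ACGT" * 15, "I" * 60),  # high quality, kept
    ("r2", "ACGT" * 15, "#" * 60),  # mean Phred 2, removed
    ("r3", "ACGT" * 15, "I" * 60),  # flagged as host, removed
    ("r4", "ACGT" * 5, "I" * 20),   # only 20 bases, removed
]
clean_reads = filter_reads(raw_reads, host_read_ids={"r3"})
```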
In some embodiments, screening for significantly different strains and/or species and significantly different metabolites in step S4 comprises: for the data of each cohort, identifying strains and/or species whose abundance differs significantly between groups as significantly different strains, and identifying metabolites whose abundance differs significantly between groups as significantly different metabolites.
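The text does not name the statistical test used to decide whether an abundance difference is significant. As one possible illustration, the sketch below uses an exact two-sided permutation test on the difference in group means; the test itself, the significance level of 0.05, the group names and the abundance values are assumptions, and no multiple-testing correction is applied.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / (len(pooled) - n_a)

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    count = 0
    # Enumerate every way of assigning n_a samples to the first group.
    for idx in combinations(range(len(pooled)), n_a):
        diff = abs(mean_diff(sum(pooled[i] for i in idx)))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        count += 1
    return extreme / count


def significant_features(abundance_a, abundance_b, alpha=0.05):
    """Return features whose abundance differs between two groups at level alpha."""
    return sorted(
        name for name in abundance_a
        if permutation_p_value(abundance_a[name], abundance_b[name]) < alpha
    )


# Hypothetical relative abundances in a positive and a negative group.
positive = {"strain_A": [0.30, 0.28, 0.35, 0.31], "strain_B": [0.10, 0.12, 0.09, 0.11]}
negative = {"strain_A": [0.05, 0.07, 0.04, 0.06], "strain_B": [0.11, 0.10, 0.12, 0.09]}
significant = significant_features(positive, negative)
p_a = permutation_p_value(positive["strain_A"], negative["strain_A"])
```

Exhaustive enumeration is practical only for small groups; larger cohorts would need a sampled permutation test or a rank-based test instead.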
In step S4, after the significantly different strains and/or species and the significantly different metabolites have been obtained by calculation, the correspondence between the significantly different metabolites and the significantly different strains can be obtained from the gene clusters of the metabolites.
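Because each metabolite gene cluster is predicted from specific genomes, the correspondence can be held as a simple lookup. The sketch below assumes a hypothetical dictionary `cluster_to_strains` recording which strains carry each gene cluster; the names and data are illustrative only.

```python
def link_metabolites_to_strains(sig_strains, sig_metabolites, cluster_to_strains):
    """Map each significant metabolite gene cluster to the significant
    strains whose genomes carry it."""
    links = {}
    for metabolite in sig_metabolites:
        carriers = cluster_to_strains.get(metabolite, set()) & set(sig_strains)
        if carriers:
            links[metabolite] = sorted(carriers)
    return links


# Hypothetical record of which strains carry which predicted gene cluster.
cluster_to_strains = {
    "butyrate_bgc": {"strain_A", "strain_C"},
    "indole_bgc": {"strain_B"},
}
links = link_metabolites_to_strains(
    sig_strains=["strain_A", "strain_B"],
    sig_metabolites=["butyrate_bgc"],
    cluster_to_strains=cluster_to_strains,
)
# Candidate biomarkers: strains that are significant and linked to a significant metabolite.
candidates = sorted({s for strains in links.values() for s in strains})
```

Only strain_A survives: strain_B is significant but its gene cluster is not, and strain_C carries a significant gene cluster but is not itself significant.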
In some embodiments, when data from multiple cohorts are available, step S5 further comprises calculating the heterogeneity of the strains or species among the different cohorts, and retaining candidate strains or species with low heterogeneity as biomarkers.
In some embodiments, the criterion for low heterogeneity comprises: I² < 40% and P > 0.1.
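The I² statistic is commonly derived from Cochran's Q as I² = max(0, (Q - df) / Q) × 100%, where df is the number of cohorts minus one. The sketch below computes both under the assumption that each cohort contributes an effect estimate and its variance for a given strain; the inputs are invented, and the P value of the criterion (presumably the P value of the Q test against a chi-square distribution with df degrees of freedom) is not computed here.

```python
def cochran_q_and_i2(effects, variances):
    """Cochran's Q and the I^2 statistic (in percent) for per-cohort effects."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# Hypothetical effect sizes of one strain in three consistent cohorts.
q_low, i2_low = cochran_q_and_i2([0.50, 0.55, 0.45], [0.04, 0.04, 0.04])

# Hypothetical effect sizes of another strain in two conflicting cohorts.
q_high, i2_high = cochran_q_and_i2([0.1, 0.9], [0.01, 0.01])

keep_first = i2_low < 40    # passes the I^2 part of the criterion
keep_second = i2_high < 40  # fails it
```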
The samples in step S3 may come from cohort data collected for different preset uses. The cohort data may be obtained from collected samples, existing databases, and the like. The preset uses include drug screening and prediction of disease risk and/or prognostic efficacy. The diseases may include known tumors, including but not limited to at least one of esophageal cancer, breast cancer, lung cancer, gastric cancer, liver cancer, pancreatic cancer, gallbladder cancer, small intestine cancer, colon cancer, rectal cancer, kidney cancer, cervical cancer, ovarian cancer, peritoneal cancer, tongue cancer, lip cancer, laryngeal cancer, thyroid cancer, and the like.
In particular, cohort data may be collected according to the preset use of the biomarkers, and each cohort may include negative and positive samples. The cohorts may be sample data collected for different preset uses or for the same preset use. When the preset use of the biomarker is disease diagnosis, the cohort data may include a diseased group and a non-diseased group and, in some embodiments, sample groups at different stages of the disease course; when the preset use is prediction of therapeutic effect, the cohort data may include a good-response group and a poor-response group; when the preset use is drug screening, the groups may include a group in which treatment was effective and a group in which it was ineffective and, in some embodiments, a more effective group and a less effective group.
In some embodiments, the biomarkers obtained by screening are used as features for constructing a prediction model, and the strains or species that predict effectively are selected as the final biomarkers by cross-validation. It is understood that the method for constructing the prediction model may be the same as the method for training the prediction model described in any of the following embodiments, and is not described in detail here. A "strain capable of effective prediction" can be understood as a feature (a strain or combination of strains) used to construct a prediction model whose prediction accuracy reaches 70% to 90% or more. Specifically, the prediction accuracy can be reflected by the ROC curve: if the AUC reaches 0.7 to 0.9 or more, the prediction model is considered able to predict effectively and to be applicable in the clinic.
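The AUC check can be illustrated with a small sketch that computes the area under the ROC curve directly from its rank interpretation: the probability that a positive sample receives a higher score than a negative one. The labels and scores are invented, the cutoff of 0.7 is taken from the lower bound stated above, and in practice the scores would come from held-out folds of the cross-validation rather than from the training data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample scores above a
    negative one, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels (1 = positive sample) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
predicts_effectively = auc >= 0.7
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so the strain combination behind these scores would be retained.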
Alternatively, the flow chart of the biomarker screening technique of the present application may be referred to fig. 9.
The embodiment of the invention also provides a screening device of the biomarker, which comprises:
a representative strain genome sequence library construction module, used for establishing a representative strain genome sequence library: clustering sequences in the obtained genome sequence library according to a set threshold value to obtain strain clusters at the strain level or the species level; and screening to obtain a representative strain sequence of each strain cluster to establish the representative strain genome sequence library;
a metabolite gene cluster sequence library construction module, used for establishing a metabolite gene cluster sequence library: performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each species or each strain, performing similarity clustering on the metabolite gene clusters to obtain gene cluster families, and merging the gene cluster families to obtain the metabolite gene cluster sequence library;
a relative abundance calculation module, used for acquiring metagenome sequencing data of the sample and comparing the data with the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2 respectively, to obtain the relative abundance of each species or each strain and the relative abundance of the metabolites;
a screening module, used for screening out the significantly different species or strains and the significantly different metabolites;
and a marker generation module, used for taking as biomarkers the candidate species or candidate strains for which both the species or strain itself and its metabolites differ significantly.
It is understood that the steps executed by each module correspond to those of any of the foregoing embodiments and are not repeated here.
The embodiment of the invention also provides a training method of the prediction model, which comprises the following steps: obtaining a detection result of a biomarker in a training sample and a corresponding labeling result; wherein the biomarker is obtained by screening with the screening method according to any of the preceding embodiments or with the screening device according to any of the preceding embodiments, and the labeling result is a label representing the disease risk and/or the prognostic efficacy of the sample;
inputting the detection result of the biomarker into a pre-constructed prediction model to obtain a prediction result; the pre-constructed prediction model is a classifier that can judge the disease risk and/or the prognostic efficacy of the sample according to the detection result of the biomarker;
and updating the parameters of the prediction model based on the labeling result and the prediction result.
In some embodiments, when the predictive model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, SGB6013, and MNH 0000.
In some embodiments, the predictive model may be used to predict risk of disease and/or prognostic efficacy.
In some embodiments, the classifier comprises any one of LR, SVM, KNN, RF, GNB, DT, GBDT and AdaBoost, preferably RF.
In some embodiments, the detection of the biomarker comprises detection of the species of the marker and/or its abundance.
In some embodiments, the number of training samples can be chosen according to actual conditions, and the sample size can be at least 10 to 100. The training samples comprise positive samples and negative samples: when used for disease risk prediction, they may comprise a non-diseased group and groups of different disease risk; when used for efficacy prediction, they may comprise a non-diseased group and groups of different efficacy.
In some embodiments, the label may specifically be a character or a character string.
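The three training steps above (obtain detection results and labels, input them into a classifier, update parameters from the labels and the predictions) can be illustrated with a minimal sketch. It assumes logistic regression (LR, one of the listed classifiers) fitted by gradient descent, chosen only because it fits in a few lines; the preferred random forest would normally come from a machine learning library. The function names, learning rate and number of epochs are illustrative assumptions, not part of the described method.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Predicted probability that a sample with abundances x is positive."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent (hypothetical hyperparameters).

    features: list of per-sample marker abundance vectors
    labels:   list of 0/1 labels (the labeling results)
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Compare the prediction result with the labeling result.
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        # Update the model parameters from the accumulated error.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

A new sample is then scored with `predict_proba(weights, bias, abundances)`; the decision threshold applied to that probability is left to the user.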
The embodiment of the invention also provides a prediction device for predicting the disease risk and/or the prognosis curative effect, which comprises:
an obtaining module, configured to obtain a detection result of a biomarker in a sample to be detected, where the biomarker is obtained by screening with the screening method according to any of the foregoing embodiments or by screening with the screening apparatus according to any of the foregoing embodiments;
and a prediction module, used for inputting the obtained detection result into the prediction model trained by the training method according to any of the foregoing embodiments to obtain a prediction result.
In some embodiments, when the predictive model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, and SGB6013.
Optionally, the modules may be stored in a memory in the form of software or firmware, or be built into the operating system (OS) of the electronic device provided in the present application, and may be executed by a processor in the electronic device. The data, program code and the like required to execute the above modules may likewise be stored in the memory.
An embodiment of the present invention further provides an electronic device, which includes a processor and a memory; the memory is used for storing a program that, when executed by the processor, causes the processor to implement: the biomarker screening method described in any of the preceding embodiments;
or the training method of the prediction model described in any of the foregoing embodiments; or a method for predicting disease risk and/or prognostic efficacy, in which the detection result of a biomarker obtained by the screening method according to any of the preceding embodiments is input into the prediction model trained by the training method according to any of the preceding embodiments to obtain a prediction result.
It will be appreciated that the method for predicting the risk of disease and/or the prognostic efficacy corresponds to the steps performed by the prediction means described in any of the preceding embodiments.
The memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In practical applications, the electronic device may be a server, a cloud platform, a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a Personal Digital Assistant (PDA), a wearable electronic device, a virtual reality device, and the like, and therefore, the embodiment of the present application does not limit the type of the electronic device.
In some embodiments, when the predictive model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, and SGB6013.
An embodiment of the present invention provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements: the biomarker screening method described in any of the preceding embodiments; or the training method described in any of the preceding embodiments; or a prediction method comprising: obtaining the detection result of a biomarker in a sample to be detected, wherein the biomarker is obtained by screening with the screening method according to any of the preceding embodiments or with the biomarker screening device according to any of the preceding embodiments; and inputting the obtained detection result into the prediction model trained by the training method according to any of the preceding embodiments to obtain a prediction result.
The computer-readable medium includes a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, and other media capable of storing program code.
The embodiment of the invention also provides application of a reagent for detecting the biomarkers, wherein the biomarkers are obtained by screening with the screening method according to any embodiment or the screening device according to any embodiment.
The embodiment of the invention also provides application of a reagent for detecting biomarkers in preparing a product for treating diseases or a product for predicting colorectal cancer risk and/or prognosis curative effect, wherein the biomarkers comprise: at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326 and SGB 6013.
In some embodiments, the product for treating a disease may be a drug, a reagent or a kit for treating a disease, and the product for predicting the risk of a disease and/or the prognostic efficacy may be a related reagent, kit or prediction model.
In addition, the embodiment of the invention also provides a product for predicting the colorectal cancer risk and/or the prognosis curative effect, which comprises the reagent for detecting the biomarkers, which is described in any of the preceding embodiments.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1 Metagenomic determination of the combination of flora and metabolite gene clusters for disease prediction
The obtained samples were divided into a diseased group and a non-diseased group; biomarkers for disease diagnosis were screened, and disease diagnosis was performed, according to the following method.
1. Establishing the strain genome sequence library
Sequences in the UHGG database and the Munn in-house strain database are clustered into different strain clusters; in each strain cluster, the strain sequence with the longest sequence length is selected as the representative strain sequence of that cluster, and all representative strain sequences are combined into the strain genome sequence library finally used for analysis.
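The representative-selection rule described above (keep the longest sequence in each strain cluster) can be sketched as follows. The clustering itself is assumed to have been done already by an external tool; the function name, the input dictionaries and the tie-breaking on genome identifier are illustrative assumptions.

```python
def pick_representatives(clusters, genome_lengths):
    """Return {cluster_id: genome_id}, keeping the longest genome per cluster.

    clusters:       {cluster_id: [genome_id, ...]} (precomputed clustering)
    genome_lengths: {genome_id: total sequence length in bp}
    Ties on length are broken by genome_id so the result is deterministic
    (an assumption; the source does not say how ties are handled).
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        representatives[cluster_id] = max(
            members, key=lambda g: (genome_lengths[g], g)
        )
    return representatives
```

The values of the returned dictionary together form the list of genomes to be combined into the representative library.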
2. Establishing the metabolite gene cluster library
(1) Performing gene function annotation on the strain genome sequences using Prokka, and running gutSMASH/antiSMASH on the annotated genes to predict the metabolite gene clusters of each strain;
(2) extracting the protein sequences of the gene clusters, and removing redundancy among the predicted gene clusters according to the pairwise similarity of their protein sequences to obtain non-redundant gene cluster sets;
(3) calculating the pairwise distances between the gene clusters within each gene cluster set, and selecting the gene cluster with the minimum distance value as the representative gene cluster of that set;
(4) clustering all representative gene clusters again, treating gene clusters with similarity higher than 0.3 as the same gene cluster family, and combining all gene cluster families to form the metabolite gene cluster sequence library used for subsequent analysis;
(5) obtaining, from the prediction results of the above steps, the metabolite gene clusters of each strain, namely the correspondence between each strain and its metabolite gene clusters, together with the metabolite gene cluster sequence library used for subsequent analysis of metabolites.
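Steps (3) and (4) above can be sketched as follows. The sketch assumes that 'the gene cluster with the minimum distance value' means the member with the smallest total distance to the other members (a medoid), and that families are formed by linking any two representatives whose similarity exceeds 0.3 (single linkage); the source fixes only the 0.3 threshold, so both readings, and the callables `dist` and `similarity`, are assumptions.

```python
def medoid(members, dist):
    """Pick the member with the smallest total distance to all others."""
    return min(
        members,
        key=lambda a: (sum(dist(a, b) for b in members if b != a), a),
    )


def group_families(ids, similarity, threshold=0.3):
    """Group gene clusters into families by single linkage.

    Two clusters end up in the same family if they are connected by a
    chain of pairs whose similarity is higher than the threshold.
    """
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    families = {}
    for i in ids:
        families.setdefault(find(i), []).append(i)
    return sorted(sorted(family) for family in families.values())
```

Single linkage is the simplest choice for a sketch; a dedicated gene cluster networking tool may use a different linkage rule and would change family boundaries accordingly.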
3. Calculating the abundance of the species (strains) and of the metabolites
(1) Acquiring metagenome sequencing data, namely raw reads, of a related sample set; the related sample set may be multiple cohorts collected from public databases;
(2) removing low-quality sequences and host sequences from the raw reads to obtain clean reads;
(3) aligning the clean reads against the strain genome sequence library to obtain candidate species (strains), and calculating the relative abundance of each candidate species (strain);
(4) aligning the clean reads against the metabolite sequence library to obtain the candidate metabolites in the clean reads, and calculating the relative abundance of each candidate metabolite.
4. Calculating the significantly different species (strains) and the significantly different metabolites
(1) Calculating the significance of the difference in each candidate species (strain) between the diseased group and the non-diseased group using a statistical method;
(2) calculating the heterogeneity of each candidate species (strain) across the different cohorts using a statistical method;
(3) calculating the significance of the difference in each candidate metabolite between the diseased group and the non-diseased group using a statistical method;
(4) using the correspondence between the candidate species (strains) and the metabolites, screening out the candidate species (strains) for which both the species (strain) and its metabolites differ significantly;
(5) according to the heterogeneity of the species (strains) across the different cohorts, removing the species (strains) with high heterogeneity and keeping only those with low heterogeneity.
5. Modeling using machine learning algorithms
Based on the species (strains) screened in step 4, all samples are merged, 70% of the samples are randomly selected as the training set and the remaining samples are used as the test set. A model is established on the training set data, and the species (strains) of higher importance to the model are selected by cross-validation as the final markers. A model built on these species (strains) is then used to predict the test set data, and the AUC is calculated to evaluate whether the selected markers can be used as markers for disease prediction; when a new sample is available, the model can be used to predict disease for the new sample.
The flow chart is shown in detail in fig. 2.
Example 2 Metagenomic determination of the combination of flora and metabolite gene clusters for predicting therapeutic efficacy
The obtained samples were divided into an effective group and an ineffective group; biomarkers for efficacy prediction were screened, and efficacy prediction was performed, according to the following method.
1. Establishing the strain genome sequence library
Sequences in the UHGG database and the Munn in-house strain database are clustered into different strain clusters; in each strain cluster, the strain sequence with the longest sequence length is selected as the representative strain sequence of that cluster, and all representative strain sequences are combined into the strain genome sequence library finally used for analysis.
2. Establishing the metabolite gene cluster library
(1) Performing gene function annotation on the strain genome sequences using Prokka, and running gutSMASH/antiSMASH on the annotated genes to predict the metabolite gene clusters of each strain;
(2) extracting the protein sequences of the gene clusters, and removing redundancy among the predicted gene clusters according to the pairwise similarity of their protein sequences to obtain non-redundant gene cluster sets;
(3) calculating the pairwise distances between the gene clusters within each gene cluster set, and selecting the gene cluster with the minimum distance value as the representative gene cluster of that set;
(4) clustering all representative gene clusters again, treating gene clusters with similarity higher than 0.3 as the same gene cluster family, and combining all gene cluster families to form the metabolite gene cluster sequence library used for subsequent analysis;
(5) obtaining, from the prediction results of the above steps, the metabolite gene clusters of each species (strain), namely the correspondence between each species (strain) and its metabolite gene clusters, together with the metabolite gene cluster sequence library used for subsequent analysis of metabolites.
3. Calculating the abundance of the species (strains) and of the metabolites
(1) Acquiring metagenome sequencing data, namely raw reads, of a related sample set; the related sample set may be multiple cohorts collected from public databases;
(2) removing low-quality sequences and host sequences from the raw reads to obtain clean reads;
(3) aligning the clean reads against the strain genome sequence library to obtain candidate species (strains), and calculating the relative abundance of each candidate species (strain);
(4) aligning the clean reads against the metabolite sequence library to obtain the candidate metabolites in the clean reads, and calculating the relative abundance of each candidate metabolite.
4. Calculating the significantly different species (strains) and the significantly different metabolites
(1) Calculating the significance of the difference in each candidate species (strain) between the effective group and the ineffective group using a statistical method;
(2) calculating the heterogeneity of each candidate species (strain) across the different cohorts using a statistical method;
(3) calculating the significance of the difference in each candidate metabolite between the effective group and the ineffective group using a statistical method;
(4) using the correspondence between the candidate species (strains) and the metabolites, screening out the candidate species (strains) for which both the species (strain) and its metabolites differ significantly;
(5) according to the heterogeneity of the species (strains) across the different cohorts, removing the species (strains) with high heterogeneity and keeping only those with low heterogeneity.
5. Modeling using machine learning algorithms
Based on the species (strains) screened in step 4, all samples are merged, 70% of the samples are randomly selected as the training set and the remaining samples are used as the test set. A model is established on the training set data, and the species (strains) of higher importance to the model are selected by cross-validation as the final markers. A model built on these species (strains) is then used to predict the test set data, and the AUC is calculated to evaluate whether the selected markers can be used as markers for predicting therapeutic efficacy; when a new sample is available, the model can be used to predict the therapeutic efficacy for the new sample.
The flow chart is shown in detail in fig. 3.
Example 3 Metagenomic determination of the combination of flora and metabolite gene clusters for drug screening
The obtained samples were divided into a treatment-effective group and a treatment-ineffective group, and biomarkers for drug screening were screened according to the following method.
1. Establishing the strain genome sequence library
Sequences in the UHGG database and the Munn in-house strain database are clustered into different strain clusters; in each strain cluster, the strain sequence with the longest sequence length is selected as the representative strain sequence of that cluster, and all representative strain sequences are combined into the strain genome sequence library finally used for analysis.
2. Establishing the metabolite gene cluster library
(1) Performing gene function annotation on the strain genome sequences using Prokka, and running gutSMASH/antiSMASH on the annotated genes to predict the metabolite gene clusters of each strain;
(2) extracting the protein sequences of the gene clusters, and removing redundancy among the predicted gene clusters according to the pairwise similarity of their protein sequences to obtain non-redundant gene cluster sets;
(3) calculating the pairwise distances between the gene clusters within each gene cluster set, and selecting the gene cluster with the minimum distance value as the representative gene cluster of that set;
(4) clustering all representative gene clusters again, treating gene clusters with similarity higher than 0.3 as the same gene cluster family, and combining all gene cluster families to form the metabolite gene cluster sequence library used for subsequent analysis;
(5) obtaining, from the prediction results of the above steps, the metabolite gene clusters of each strain, namely the correspondence between each strain and its metabolite gene clusters, together with the metabolite gene cluster sequence library used for subsequent analysis of metabolites.
3. Calculating the abundance of the species (strains) and of the metabolites
(1) Acquiring metagenome sequencing data, namely raw reads, of a related sample set; the related sample set is cohort data, retrieved and downloaded, on whether a treatment for the disease was effective or not;
(2) removing low-quality sequences and host sequences from the raw reads to obtain clean reads;
(3) aligning the clean reads against the strain genome sequence library to obtain candidate species (strains), and calculating the relative abundance of each candidate species (strain);
(4) aligning the clean reads against the metabolite sequence library to obtain the candidate metabolites in the clean reads, and calculating the relative abundance of each candidate metabolite.
4. Calculating the significantly different species (strains) and the significantly different metabolites
(1) Calculating the significance of the difference in each candidate species (strain) between the treatment-effective group and the treatment-ineffective group using a statistical method;
(2) calculating the heterogeneity of each candidate species (strain) across the different cohorts using a statistical method;
(3) calculating the significance of the difference in each candidate metabolite between the treatment-effective group and the treatment-ineffective group using a statistical method;
(4) using the correspondence between the candidate species (strains) and the metabolites, screening out the candidate species (strains) for which both the species (strain) and its metabolites differ significantly;
(5) according to the heterogeneity of the species (strains) across the different cohorts, removing the species (strains) with high heterogeneity and keeping only those with low heterogeneity.
5. Modeling using machine learning algorithms
Based on the species (strains) screened in step 4, all samples are merged, 70% of the samples are randomly selected as the training set and the remaining samples are used as the test set. A model is established on the training set data, and the species (strains) of higher importance to the model are selected by cross-validation as the final markers. A model built on these species (strains) is then used to predict the test set data, and the AUC is calculated to evaluate whether the selected markers can be used for drug screening; when a new sample is available, the model can be used to carry out drug screening for the new sample.
The flow chart is shown in fig. 4.
Example 4
Steps 1 and 2 are the same as in examples 1 to 3.
3. Calculating the abundance of the strain and the abundance of the metabolite
3.1 sample Collection
Metagenome sequencing data for 3 cohorts, 341 samples in total, were collected and downloaded from public databases. The cohort accession numbers are PRJEB12449, PRJEB10878 and ERP008279, and the grouping information of the samples is shown in Table 1.
TABLE 1 Sample grouping information
[Table 1 is an image in the original publication; its contents are not reproduced here.]
3.2 removal of Low quality sequences and host sequences
The sequencing data of each sample were filtered to remove adapter contamination sequences, low-quality sequences and host genome sequences, yielding high-quality sequences.
3.3 Calculation of strain abundance
The high-quality sequences obtained above were aligned against the representative strain genome sequence library. The number of reads aligned to each strain was counted from the alignment results, and the relative abundance of each strain was then calculated taking the length of its representative genome sequence into account. Table 2 shows partial results.
Table 2 Partial results
[Table 2 is an image in the original publication; its contents are not reproduced here.]
Note: in the table, the header row gives the sample numbers and the first column gives the strain genome sequences.
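The abundance calculation in 3.3 combines read counts with genome length but the exact formula is not given. One common convention, shown here purely as an assumption, divides each read count by the genome length and then normalizes the resulting read densities so that they sum to 1 within a sample. The function and argument names are illustrative.

```python
def relative_abundance(read_counts, genome_lengths):
    """Length-normalized relative abundance within one sample.

    read_counts:    {genome_id: number of reads aligned to that genome}
    genome_lengths: {genome_id: representative genome length in bp}
    Returns {genome_id: fraction}, with fractions summing to 1 when any
    reads were aligned. The normalization scheme is an assumption.
    """
    density = {g: read_counts[g] / genome_lengths[g] for g in read_counts}
    total = sum(density.values())
    if total == 0:
        return {g: 0.0 for g in density}
    return {g: d / total for g, d in density.items()}
```

The same function applies unchanged to metabolite gene clusters by passing gene cluster read counts and gene cluster sequence lengths.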
3.4 calculation of metabolite abundance
The high-quality sequences obtained above were aligned against the metabolite sequence library. The number of reads aligned to each metabolite was counted from the alignment results, and the relative abundance of each metabolite was then calculated taking the length of the metabolite sequence into account. Table 3 shows partial results.
Table 3 Partial results
[Table 3 is an image in the original publication; its contents are not reproduced here.]
4. Screening of significantly different strains and metabolites
4.1 Calculation of strain difference significance
For each cohort, the significance of the difference in each strain between the groups was calculated with the Wilcoxon rank-sum test, based on strain abundance and grouping information (p-value < 0.05 was considered significant; p-value ≥ 0.05 was considered not significant). Table 4 shows partial results.
Table 4 partial results
Species ID qvalue pvalue CRC_mean control_mean enriched
GCA_003466705 1.15E-02 5.93E-04 1.44E-03 3.60E-03 control
GCA_003482185 3.42E-02 3.88E-03 9.42E-04 1.58E-03 control
GCA_014287335 4.84E-02 6.85E-03 1.24E-04 1.61E-05 CRC
GCA_014287475 2.47E-02 2.13E-03 2.29E-05 4.80E-06 CRC
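The Wilcoxon rank-sum test used in 4.1 can be sketched in plain Python as below. This version uses the large-sample normal approximation with a tie correction and no continuity correction; those choices, and the function name, are assumptions, and a statistics library would normally be used in practice. Table 4 also reports q-values, whose multiple-testing correction is not described and is not covered here.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). Requires at least two values in total.
    """
    values = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    order = sorted(range(n), key=lambda i: values[i])

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        run = j - i + 1
        tie_term += run ** 3 - run
        i = j + 1

    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 0.0, 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p
```

A strain would then be flagged as significant in a cohort when the returned p is below 0.05, following the rule stated in 4.1.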
4.2 calculation of metabolite differential significance
For each cohort, the significance of the difference in each metabolite between the groups was calculated with the Wilcoxon rank-sum test, based on metabolite abundance and grouping information (p-value < 0.05 was considered significant; p-value ≥ 0.05 was considered not significant). Table 5 shows partial results.
Table 5 Partial results
[This table is an image in the original publication; its contents are not reproduced here.]
4.3 Preliminary screening of differential species (strains)
For each single cohort, differential species (strains) were preliminarily screened by combining the strain difference significance results, the metabolite difference significance results and the correspondence between species (strains) and metabolites (Table 5). The screening rule is that the species (strain) and its metabolites must both be significantly different. Under this rule, 792 differential species (strains) were screened out in the PRJEB10878 cohort, 361 in the PRJEB12449 cohort, and 1239 in the ERP008279 cohort. Table 6 shows partial results.
TABLE 5 Correspondence between strains and metabolites
[This table is an image in the original publication; its contents are not reproduced here.]
TABLE 6 Partial results of differential species (strains)
[Table 6 is an image in the original publication; its contents are not reproduced here.]
The differential species (strains) of the 3 cohorts were each screened out with the above steps and the results were pooled. The pooling rule is that a species (strain) screened out in at least one of the cohorts is considered a differential species (strain). Under this rule, 2060 differential species (strains) were screened out in total.
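The per-cohort screening rule and the pooling rule can be sketched with set operations. The sketch assumes that a species (strain) passes when its own p-value is below 0.05 and at least one of its associated metabolites also has a p-value below 0.05; the source does not say whether one or all metabolites must be significant, so that reading, like the function and argument names, is an assumption.

```python
def screen_cohort(species_p, metabolite_p, species_to_metabolites, alpha=0.05):
    """Return the species (strains) that pass the rule in one cohort.

    species_p:              {species_id: p-value}
    metabolite_p:           {metabolite_id: p-value}
    species_to_metabolites: {species_id: [metabolite_id, ...]}
    """
    hits = set()
    for species, p in species_p.items():
        if p >= alpha:
            continue
        metabolites = species_to_metabolites.get(species, ())
        # Assumed rule: at least one linked metabolite is also significant.
        if any(metabolite_p.get(m, 1.0) < alpha for m in metabolites):
            hits.add(species)
    return hits


def pool_cohorts(per_cohort_hits):
    """Keep every species (strain) screened out in at least one cohort."""
    return set().union(*per_cohort_hits)
```

Because pooling is a set union, species (strains) found in more than one cohort are counted once, which is why the pooled total is smaller than the sum of the per-cohort counts.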
4.4 Further screening of differential strains by strain heterogeneity
The heterogeneity of each strain across the three cohorts was calculated, and the strains were further screened accordingly. The screening rule is that strains with high heterogeneity are removed and only strains with low heterogeneity are retained as differential strains; 1320 differential strains were finally screened out. Heterogeneity is classified as follows: I² < 40% and P > 0.1 is considered low heterogeneity; otherwise the heterogeneity is considered high. Table 7 shows partial results.
TABLE 7 Partial results of strain heterogeneity
[Table 7 is an image in the original publication; its contents are not reproduced here.]
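The I² and P criteria in 4.4 are the usual outputs of a heterogeneity test in meta-analysis. The sketch below assumes Cochran's Q with inverse-variance weights, computed from one effect size and one variance per cohort; the source does not state which effect size was used, so the inputs and function names are hypothetical. The chi-square tail probability is implemented only for an even number of degrees of freedom, which covers the three-cohort case (2 degrees of freedom).

```python
import math


def cochran_q_i2(effects, variances):
    """Return (Q, I2 in percent) for per-cohort effects and variances."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch supports positive even df only")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def is_low_heterogeneity(effects, variances):
    """Apply the rule from 4.4: I2 < 40% and P > 0.1."""
    q, i2 = cochran_q_i2(effects, variances)
    p = chi2_sf_even_df(q, len(effects) - 1)
    return i2 < 40.0 and p > 0.1
```

Strains for which `is_low_heterogeneity` returns False would be removed under the rule in 4.4.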
5. Selecting candidate strains for disease treatment by using machine learning algorithm to establish model
5.1 training set and test set samples
The 341 samples (172 Case group samples; 169 Control group samples) were randomly divided into a training set and a test set in a 7:3 ratio.
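A random 7:3 split can be sketched as follows. The fixed seed, the rounding rule and the function name are illustrative assumptions; the source does not say whether the split was stratified by group, and this sketch does not stratify.

```python
import random


def split_samples(sample_ids, train_fraction=0.7, seed=0):
    """Shuffle sample IDs reproducibly and split them into train and test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]
```

Passing a different `seed` gives a different but equally valid random partition; keeping the seed fixed makes the split repeatable across runs.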
5.2 Screening of strain markers using the training set data
Based on the abundance of the species (strains) in the training set, five rounds of cross-validation were performed with a random forest classifier and the average error over the five rounds was calculated for each candidate marker set. The minimum average error plus its standard deviation was taken as the critical value; all species (strain) marker sets with an average error below the critical value were listed, and the set with the fewest species (strains) was selected as the optimal set, namely the species (strain) marker set. The model also outputs the importance of each species (strain); the higher the importance, the more that species (strain) contributes to predicting whether a sample is diseased.
With this model, 14 species (strains) that predict disease status well were screened out, and the importance of each species (strain) in the optimal model was output. The number of species (strains) in the optimal model for predicting disease status is shown in FIG. 5, and the importance of each species (strain) in the optimal model is shown in FIG. 6.
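The selection rule in 5.2 (minimum average error plus its standard deviation as the critical value, then the smallest eligible marker set) can be sketched as follows. The input layout, the use of the sample standard deviation, and the use of 'less than or equal to' so that the best set itself always qualifies are assumptions; the cross-validation errors are assumed to have been computed beforehand with the random forest classifier.

```python
import statistics


def select_marker_count(cv_errors):
    """Pick the smallest marker set whose mean CV error is within the cutoff.

    cv_errors: {number_of_markers: [error in each cross-validation round]}
    Each list needs at least two values so a standard deviation exists.
    """
    summary = {
        k: (statistics.mean(errs), statistics.stdev(errs))
        for k, errs in cv_errors.items()
    }
    best = min(summary, key=lambda k: summary[k][0])
    cutoff = summary[best][0] + summary[best][1]
    eligible = [k for k, (mean_err, _) in summary.items() if mean_err <= cutoff]
    return min(eligible)
```

This trades a small amount of cross-validated accuracy for a shorter marker list, which is usually easier to measure and less prone to overfitting.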
5.3 Prediction of the training set and test set using the model
A model was established based on the abundance of the strain markers. The model was used to predict the CRC probability of the training set samples and an ROC curve was drawn; the model was then used to predict the CRC probability of the test set samples and an ROC curve was drawn.
The evaluation results of the random forest model on the training set are shown in fig. 7. The evaluation results of the random forest model on the test set are shown in fig. 8.
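The area under the ROC curve used to evaluate the model can be computed directly from predicted probabilities and true labels. The sketch below uses the pairwise-comparison definition of AUC (the fraction of positive and negative sample pairs ranked correctly, with ties counted as half); the function name is an assumption, and the quadratic loop is adequate only for modest sample sizes such as those in this example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    labels: iterable of 0/1 true labels (1 = CRC, 0 = control)
    scores: iterable of predicted CRC probabilities
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The resulting value can be compared against the 0.7 to 0.9 AUC range that the description gives as the criterion for effective prediction.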
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for screening biomarkers, comprising the steps of:
S1, establishing a representative strain genome sequence library: clustering the sequences in an obtained genome sequence library according to a set threshold to obtain strain clusters at the strain level and/or the species level; and screening out a representative strain sequence of each strain cluster to establish the representative strain genome sequence library;
S2, establishing a metabolite gene cluster sequence library: performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each strain and/or each species, performing similarity clustering on the metabolite gene clusters to obtain gene cluster families, and merging the gene cluster families to obtain the metabolite gene cluster sequence library; wherein step S1 and step S2 may be performed in either order or simultaneously;
S3, obtaining metagenomic sequencing data of a sample, and aligning the metagenomic sequencing data against the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2, respectively, to obtain the relative abundance of each strain and/or each species and the relative abundance of the metabolites;
S4, screening for significantly different strains and/or species and significantly different metabolites;
S5, taking, as biomarkers, candidate strains or candidate species for which both the strain or species and its metabolites are significantly different;
preferably, the biomarkers obtained by screening are used as features for constructing a prediction model, and the strains or species with effective predictive power are screened out by cross-validation as the final biomarkers.
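As an illustration of the relative abundance calculation in step S3, the sketch below converts mapped read counts into relative abundances. It is only a sketch under assumptions: the claim does not state how abundance is normalized, so normalization by reference length is assumed here, and the function name, reference names, and numbers are hypothetical.

```python
def relative_abundance(read_counts, ref_lengths):
    """Length-normalized share of each reference among all mapped reads.

    read_counts: reads aligned to each representative sequence.
    ref_lengths: length in base pairs of each representative sequence.
    Normalizing by length is an assumption, not something the claim specifies.
    """
    coverage = {ref: read_counts[ref] / ref_lengths[ref] for ref in read_counts}
    total = sum(coverage.values())
    if total == 0:
        return {ref: 0.0 for ref in coverage}
    return {ref: value / total for ref, value in coverage.items()}


# Hypothetical counts for three representative strains in one sample.
read_counts = {"strain_A": 3000, "strain_B": 1000, "strain_C": 0}
ref_lengths = {"strain_A": 3_000_000, "strain_B": 2_000_000, "strain_C": 4_000_000}
abundance = relative_abundance(read_counts, ref_lengths)
print(abundance)
```

The same function could be applied to reads aligned against the metabolite gene cluster sequence library, with gene cluster lengths in place of genome lengths.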
2. The method for screening biomarkers according to claim 1, wherein in step S1, the set threshold is greater than or equal to 95%; preferably, the set threshold is greater than or equal to 99%;
preferably, in step S1, strain clusters at the strain level are obtained with a set threshold of not less than 99%, strain clusters at the species level are obtained with a set threshold of not less than 95%, and representative strains of the strain clusters are then obtained;
preferably, the step of screening out a representative strain sequence of each strain cluster to establish the representative strain genome sequence library comprises: for the strains in each strain cluster, selecting the sequence with the greatest length as the representative strain sequence of that strain cluster;
preferably, in step S1 and/or step S2, the obtained genome sequence library comprises a genome database and/or a strain database;
preferably, the genome database comprises at least one of the UHGG database and a human gut microbial genome sequence database.
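The clustering and representative selection of claim 2 can be pictured with the following sketch. It assumes that pairwise similarities between genomes have already been computed by some external tool and uses single-linkage grouping, which is an assumption; the genome names, lengths, and similarity values are invented.

```python
def cluster_genomes(lengths, similarity, threshold):
    """Group genomes whose pairwise similarity meets the threshold and
    return {representative: members}, the representative being the longest
    sequence in each cluster. Single linkage is an illustrative assumption."""
    parent = {genome: genome for genome in lengths}

    def find(genome):
        # Union-find lookup with path halving.
        while parent[genome] != genome:
            parent[genome] = parent[parent[genome]]
            genome = parent[genome]
        return genome

    for (a, b), value in similarity.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for genome in lengths:
        clusters.setdefault(find(genome), []).append(genome)
    return {max(members, key=lengths.get): sorted(members)
            for members in clusters.values()}


# Invented genomes: lengths in base pairs and pairwise similarities (0 to 1).
lengths = {"g1": 4_100_000, "g2": 4_300_000, "g3": 2_900_000, "g4": 3_000_000}
similarity = {("g1", "g2"): 0.995, ("g1", "g3"): 0.80,
              ("g3", "g4"): 0.97, ("g2", "g4"): 0.78}
strain_level = cluster_genomes(lengths, similarity, 0.99)
species_level = cluster_genomes(lengths, similarity, 0.95)
print(strain_level)
print(species_level)
```

At the 99% threshold only g1 and g2 merge, while at the 95% threshold g3 and g4 also merge; in each cluster the longer genome becomes the representative.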
3. The method for screening biomarkers according to claim 1, wherein after the gene annotation in step S2, the method further comprises parsing and analyzing the annotated genes to predict the metabolite gene clusters of each strain and/or each species;
preferably, in step S2, the step of obtaining a gene cluster family by similarity clustering of metabolite gene clusters includes any one of (a) to (c):
(a) extracting protein sequences of the metabolite gene clusters, performing redundancy filtration on the predicted metabolite gene clusters according to sequence similarity between the protein sequences to obtain a non-redundant gene cluster set, and clustering the metabolite gene clusters in the non-redundant gene cluster set according to a set similarity threshold to obtain a gene cluster family;
(b) merging the gene clusters of the same metabolite into a gene cluster set, and selecting a representative gene cluster of the gene cluster set; clustering the representative gene clusters according to a set similarity threshold value to obtain the gene cluster families;
(c) extracting protein sequences of metabolite gene clusters, performing redundant filtration on the predicted metabolite gene clusters according to sequence similarity between the protein sequences to obtain a non-redundant gene cluster set, calculating the distance between every two gene clusters in each non-redundant gene cluster set, and selecting the gene cluster with the minimum distance value as a representative gene cluster of the gene cluster set for merging to obtain a metabolite gene cluster sequence library;
preferably, in item (b), the step of selecting a representative gene cluster of the set of gene clusters comprises: calculating the distance between every two gene clusters in each gene cluster set, selecting the gene cluster with the minimum distance value as a representative gene cluster of the gene cluster set, and merging to obtain the metabolite gene cluster sequence library;
preferably, the similarity threshold is greater than or equal to 0.3;
preferably, the tool for performing the gene annotation comprises Prokka;
preferably, the tool for predicting the metabolite gene clusters of each strain or each species comprises gutSMASH and/or antiSMASH;
preferably, in step S4, the step of screening for significantly different strains and significantly different metabolites comprises: for the data of each cohort, taking the strains and/or species that differ significantly, or whose abundance differs significantly, as the significantly different strains, and taking the metabolites that differ significantly, or whose abundance differs significantly, as the significantly different metabolites;
preferably, in step S5, when data from multiple cohorts are available, the method further comprises calculating the heterogeneity of the strains and/or species across the cohorts, and retaining the candidate strains or species with low heterogeneity as biomarkers;
preferably, the criterion for low heterogeneity is: I² < 40% and P > 0.1.
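To make the heterogeneity criterion concrete, the sketch below computes Cochran's Q, the I² statistic, and the chi-square P value for one candidate across cohorts, then applies the I² < 40% and P > 0.1 rule. It assumes each cohort contributes an effect size and its variance and that fixed-effect inverse-variance weights are used; the claim does not specify these details, and all names and numbers are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term = total = 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: closed form built on the normal tail.
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(math.sqrt(x / 2)) + 2 * normal_pdf * total


def heterogeneity(effects, variances):
    """Return (I2 in percent, P value of Cochran's Q); needs >= 2 cohorts."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return i2, chi2_sf(q, df)


def has_low_heterogeneity(effects, variances):
    i2, p_value = heterogeneity(effects, variances)
    return i2 < 40 and p_value > 0.1


# Hypothetical per-cohort effect sizes and variances for two candidates.
consistent = has_low_heterogeneity([0.50, 0.55, 0.45], [0.04, 0.04, 0.04])
inconsistent = has_low_heterogeneity([0.10, 0.90, -0.50], [0.04, 0.04, 0.04])
print(consistent, inconsistent)
```

The first candidate shows nearly identical effects in all three cohorts and would be retained; the second changes direction between cohorts and would be dropped.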
4. A screening device for biomarkers, comprising:
a representative strain genome sequence library construction module, configured to establish a representative strain genome sequence library: clustering the sequences in an obtained genome sequence library according to a set threshold to obtain strain clusters at the strain level or the species level; and screening out a representative strain sequence of each strain cluster to establish the representative strain genome sequence library;
a metabolite gene cluster sequence library construction module, configured to establish a metabolite gene cluster sequence library: performing gene annotation on the sequences of the obtained genome sequence library, predicting the metabolite gene clusters of each strain or each species, performing similarity clustering on the metabolite gene clusters to obtain gene cluster families, and merging the gene cluster families to obtain the metabolite gene cluster sequence library;
a relative abundance calculation module, configured to obtain metagenomic sequencing data of a sample, and to align the metagenomic sequencing data against the representative strain genome sequence library of step S1 and the metabolite gene cluster sequence library of step S2, respectively, to obtain the relative abundance of each strain or each species and the relative abundance of the metabolites;
a screening module, configured to screen for significantly different strains and significantly different metabolites;
and a marker generation module, configured to take, as biomarkers, candidate strains or candidate species for which both the strain or species and its metabolites are significantly different.
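One possible way for a screening module to flag significantly different strains is a rank-based two-group test on abundances. The sketch below uses a Mann-Whitney U test with a normal approximation; the choice of test, the 0.05 threshold, and all names and values are assumptions made for illustration, since the claim does not name a statistical test. Tie correction and multiple-testing correction are omitted for brevity.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p(group_a, group_b):
    """Two-sided Mann-Whitney U P value (normal approximation, no tie correction)."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n1, n2 = len(group_a), len(group_b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def screen_features(case, control, alpha=0.05):
    """Names of features whose abundance differs between the two groups."""
    return sorted(name for name in case
                  if rank_sum_p(case[name], control[name]) < alpha)


# Invented relative abundances for two strains in cases and controls.
case = {"strain_X": [0.30, 0.28, 0.35, 0.33, 0.31, 0.29],
        "strain_Y": [0.10, 0.12, 0.11, 0.13, 0.09, 0.14]}
control = {"strain_X": [0.10, 0.12, 0.09, 0.11, 0.13, 0.08],
           "strain_Y": [0.11, 0.10, 0.13, 0.12, 0.14, 0.09]}
selected = screen_features(case, control)
print(selected)
```

Only strain_X is selected here, because its abundance is higher in every case sample than in any control sample, while strain_Y has the same distribution in both groups.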
5. A method for training a predictive model, comprising:
obtaining a detection result of a biomarker in a training sample and a corresponding label; wherein the biomarker is obtained by screening with the screening method according to any one of claims 1 to 3 or the screening device for biomarkers according to claim 4, and the label represents the disease risk and/or the prognostic efficacy of the sample;
inputting the detection result of the biomarker into a pre-constructed prediction model to obtain a prediction result; wherein the pre-constructed prediction model is a classifier capable of judging the disease risk and/or the prognostic efficacy of the sample from the detection result of the biomarker;
updating the parameters of the prediction model based on the label and the prediction result;
preferably, the classifier comprises any one of LR, SVM, KNN, RF, GNB, DT, GBDT and AdaBoost, and is preferably RF;
preferably, the detection result of the biomarker comprises the type of the marker and/or its abundance.
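The three training steps of claim 5 (obtain labeled detection results, predict, update parameters from the label and the prediction) can be illustrated with logistic regression (LR), one of the listed classifiers. LR is used here only because its update rule is short to write out; the claim prefers RF. The abundance values, learning rate, and epoch count are hypothetical.

```python
import math


def predict(weights, bias, features):
    """Predicted probability that the sample carries the disease label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


def train(samples, labels, epochs=500, learning_rate=0.5):
    """Stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            # Difference between the prediction result and the label.
            error = predict(weights, bias, features) - label
            # Update the model parameters from that difference.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Invented abundances of two marker strains; label 1 = disease, 0 = control.
samples = [[0.30, 0.02], [0.25, 0.05], [0.05, 0.20], [0.02, 0.30]]
labels = [1, 1, 0, 0]
weights, bias = train(samples, labels)
predictions = [round(predict(weights, bias, s)) for s in samples]
print(predictions)
```

The predictions shown are for the training samples themselves, so they only demonstrate that the update loop fits the data; evaluating a model requires held-out samples.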
6. A prediction device for predicting risk of disease and/or prognostic efficacy, characterized in that it comprises:
an obtaining module, configured to obtain a detection result of a biomarker in a sample to be detected, where the biomarker is obtained by screening with the screening method according to any one of claims 1 to 3 or by screening with the screening device according to claim 4;
a prediction module, configured to input the obtained detection result into the prediction model trained by the training method according to claim 5 to obtain a prediction result;
preferably, when the prediction model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, and SGB6013.
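The "at least 4 of" condition on the colorectal cancer markers can be checked mechanically, as in the sketch below. The identifiers are copied from the claim with whitespace removed; the function names and the example inputs are hypothetical.

```python
# Marker identifiers listed in the claim for colorectal cancer.
CRC_MARKERS = frozenset({
    "SGB6629", "SGB6012", "MGYG-HGUT-04562", "MGYG-HGUT-01613",
    "SGB6006", "SGB6017", "SGB5997", "MGYG-HGUT-04629",
    "MGYG-HGUT-01464", "MGYG-HGUT-01459", "MGYG-HGUT-01347",
    "MGYG-HGUT-01326", "SGB6013",
})


def normalize(marker_id):
    """Drop any whitespace inside an identifier before comparison."""
    return "".join(marker_id.split())


def has_minimum_panel(detected_ids, minimum=4):
    """True when at least `minimum` of the listed markers were detected."""
    detected = {normalize(marker) for marker in detected_ids}
    return len(detected & CRC_MARKERS) >= minimum


enough = has_minimum_panel(["SGB6629", "SGB6012", "MGYG-HGUT-04562", "SGB 6013"])
too_few = has_minimum_panel(["SGB6629", "SGB6012", "unlisted_strain"])
print(enough, too_few)
```

Identifiers that are not in the list are ignored rather than rejected, so a detection result may contain additional strains.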
7. An electronic device, characterized in that the electronic device comprises: a processor and a memory; the memory being configured to store a program that, when executed by the processor, causes the processor to implement: the biomarker screening method according to any one of claims 1 to 3;
or the training method of the prediction model according to claim 5: obtaining the biomarkers screened by the screening method according to any one of claims 1 to 3, and inputting the detection results of the biomarkers into the prediction model trained by the training method according to claim 5 to obtain prediction results;
preferably, when the prediction model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, and SGB6013.
8. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements: the biomarker screening method according to any one of claims 1 to 3; or the training method according to claim 5: obtaining a detection result of a biomarker in a sample to be tested, wherein the biomarker is obtained by screening with the screening method according to any one of claims 1 to 3 or the screening device for biomarkers according to claim 4; and inputting the obtained detection result into the prediction model trained by the training method according to claim 5 to obtain a prediction result;
preferably, when the prediction model is used to predict the risk of colorectal cancer and/or the prognostic efficacy, the biomarkers include at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326, and SGB6013.
9. Use of a reagent for detecting a biomarker for the manufacture of a product for treating a disease or a product for predicting the risk of developing a disease and/or the prognostic efficacy, wherein the biomarker is selected by the screening method according to any one of claims 1 to 3, or by the screening device for biomarkers according to claim 4;
preferably, when the disease is colorectal cancer, the biomarkers comprise at least 4 of SGB6629, SGB6012, MGYG-HGUT-04562, MGYG-HGUT-01613, SGB6006, SGB6017, SGB5997, MGYG-HGUT-04629, MGYG-HGUT-01464, MGYG-HGUT-01459, MGYG-HGUT-01347, MGYG-HGUT-01326 and SGB6013.
10. A product for predicting the risk of colorectal cancer and/or the prognostic efficacy, characterized in that it comprises a reagent for detecting biomarkers according to claim 9.
CN202210770641.XA 2022-06-30 2022-06-30 Screening method of biomarker and related application thereof Pending CN114974432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210770641.XA CN114974432A (en) 2022-06-30 2022-06-30 Screening method of biomarker and related application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210770641.XA CN114974432A (en) 2022-06-30 2022-06-30 Screening method of biomarker and related application thereof

Publications (1)

Publication Number Publication Date
CN114974432A true CN114974432A (en) 2022-08-30

Family

ID=82968295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210770641.XA Pending CN114974432A (en) 2022-06-30 2022-06-30 Screening method of biomarker and related application thereof

Country Status (1)

Country Link
CN (1) CN114974432A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828093A (en) * 2022-11-02 2023-03-21 四川帕诺米克生物科技有限公司 Omics sample analysis method and device, electronic device and storage medium
CN115828093B (en) * 2022-11-02 2024-04-05 四川帕诺米克生物科技有限公司 Method and device for analyzing histology sample, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination