CN117551760A - Biomarkers for predicting advanced tuberculosis and non-advanced tuberculosis and uses thereof - Google Patents

Publication number
CN117551760A
CN117551760A CN202410038422.1A CN202410038422A CN117551760A CN 117551760 A CN117551760 A CN 117551760A CN 202410038422 A CN202410038422 A CN 202410038422A CN 117551760 A CN117551760 A CN 117551760A
Authority
CN
China
Prior art keywords
tuberculosis
progressive
predicting
population
biomarker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202410038422.1A
Other languages
Chinese (zh)
Inventor
陈心春
李兆东
蔡毅
冯思婉
史琛彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202410038422.1A
Publication of CN117551760A
Current legal status: Withdrawn

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6893 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids related to diseases not provided for elsewhere
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2800/00 Detection or diagnosis of diseases
    • G01N2800/26 Infectious diseases, e.g. generalised sepsis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • Organic Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Urology & Nephrology (AREA)
  • Mathematical Physics (AREA)
  • Biochemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Microbiology (AREA)
  • Hematology (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)

Abstract

The application belongs to the technical field of biomedicine and relates in particular to a biomarker for predicting progressive and non-progressive tuberculosis and uses thereof. The biomarker comprises the genes KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2. The biomarker can specifically predict progressive and non-progressive tuberculosis in a latent tuberculosis infection cohort, so that individuals progressing to tuberculosis can be identified more quickly and accurately in clinical diagnosis. It is expected to be useful for screening and diagnosis of progressive tuberculosis, health examinations of healthy people, and prediction and evaluation of tuberculosis treatment efficacy, and it provides technical support for tuberculosis epidemic control.

Description

Biomarkers for predicting advanced tuberculosis and non-advanced tuberculosis and uses thereof
Technical Field
The application belongs to the technical field of biomedicine and relates in particular to a biomarker for predicting progressive and non-progressive tuberculosis and uses thereof.
Background
Tuberculosis (TB) is a chronic infectious disease caused by Mycobacterium tuberculosis (the tubercle bacillus). It can invade many organs of the human body, most commonly the lungs (pulmonary tuberculosis). Latent tuberculosis infection (LTBI) refers to the presence of the tubercle bacillus in an individual without obvious symptoms and without development of active tuberculosis.
Progressive tuberculosis generally refers to a state in which pulmonary tuberculosis lesions continue to develop and spread in the body. It usually manifests as a gradual enlargement of pulmonary lesions, and patients may develop symptoms such as cough, expectoration, and fever. Treatment of progressive tuberculosis generally requires antitubercular drugs to prevent further deterioration. In contrast, non-progressive tuberculosis generally refers to a relatively stable state without significant spread or progression, in which the patient may show no obvious symptoms or only mild ones. Antitubercular drugs are still needed to treat non-progressive tuberculosis, but the course of treatment may be shorter. Tuberculosis is challenging to control and manage in part because of its latency and its potential for reactivation. In some patients the tubercle bacillus remains latent for a long period and is then activated at some point, resulting in active tuberculosis. Early identification of people who may develop active tuberculosis is therefore critical, and establishing a marker that can identify progressive tuberculosis is of great strategic importance.
In recent years, genes have attracted considerable attention as markers for tuberculosis diagnostic models. This emerging approach offers opportunities for early diagnosis of tuberculosis and for epidemic monitoring. Traditional tuberculosis diagnosis often relies on methods such as bacterial culture and acid-fast staining, which are time-consuming and cumbersome to perform and therefore limit early diagnosis. Gene-based diagnostic methods, by contrast, are rapid, sensitive, and specific. By detecting the expression levels of host-cell genes related to tubercle bacillus infection, the patient's immune response can be detected in advance, making early diagnosis possible. At present, however, no marker is available that can predict progressive tuberculosis within a latent tuberculosis infection cohort.
Disclosure of Invention
The purpose of the application is to provide a biomarker for predicting progressive and non-progressive tuberculosis and uses thereof, with the aim of solving the technical problem of how to rapidly and accurately predict progressive and non-progressive tuberculosis in a latent tuberculosis infection cohort.
To achieve this purpose, the application adopts the following technical solutions:
In a first aspect, the application provides a biomarker for predicting progressive tuberculosis, the biomarker comprising the following genes: KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2.
In a second aspect, the application provides the use of a reagent for detecting the biomarker in the preparation of a product for predicting progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort.
The embodiments of the application obtain a set of biomarkers comprising 15 genes based on a prospective cohort study, the elastic net machine learning algorithm, and random forest modeling. This marker combination can specifically predict progressive and non-progressive tuberculosis in a latent tuberculosis infection cohort. The biomarker therefore helps identify individuals progressing to tuberculosis more quickly and accurately in clinical diagnosis. It is expected to be useful for screening and diagnosis of progressive tuberculosis, health examinations of healthy people, and prediction and evaluation of tuberculosis treatment efficacy, and it provides technical support for tuberculosis epidemic control.
Because the biomarker comprising the 15 genes can specifically predict progressive and non-progressive tuberculosis in a latent tuberculosis infection cohort, a reagent for detecting the biomarker can be used to prepare a product for predicting progressive and non-progressive tuberculosis populations in such a cohort.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows the prediction results for progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort, analyzed using the GSE79362 prospective study cohort dataset;
FIG. 2 shows the screening of characteristic genes by the elastic net machine learning algorithm, where A is the differential analysis of sequencing results between the progressive and non-progressive tuberculosis populations, yielding 108 differential genes, and B is the selection of 15 characteristic genes from the 108 differential genes by the elastic net algorithm;
FIG. 3 shows the random forest model of the 15-gene combination for predicting progressive and non-progressive tuberculosis, where A and B show the predictive power of the 15-gene combination marker in the training dataset and C and D show its predictive power in the test dataset;
FIG. 4 is a validation graph of the 15-gene combination marker for predicting progressive tuberculosis in the GSE79362 dataset, in which all prospective cohort samples collected before tuberculosis diagnosis are selected and grouped into 180-day intervals;
FIG. 5 is a validation graph of the 15-gene combination marker for predicting progressive tuberculosis in the GSE79362 dataset, in which all prospective cohort samples collected before tuberculosis diagnosis are selected and grouped into 360-day intervals;
FIG. 6 is a validation graph of the 15-gene combination marker for predicting progressive tuberculosis in the independent GSE112104 and GSE94438 datasets, where A shows the accuracy of the 15-gene combination marker in the GSE112104 dataset and B-D show its accuracy in the GSE94438 dataset.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present application clearer, the application is described in further detail below with reference to the embodiments. It should be understood that the specific embodiments described herein are for illustration only and are not intended to limit the application.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single items or plural items.
It should be understood that, in various embodiments of the present application, the sequence number of each process does not mean that the sequence of execution is sequential, and some or all of the steps may be executed in parallel or sequentially, where the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Markers that can efficiently identify progressive tuberculosis are of great strategic importance. They can help clinicians identify tuberculosis patients more accurately so that intervention and treatment can begin early, reducing disease severity and improving the success rate of treatment. This benefits patients' health, helps reduce the transmission of tuberculosis in communities, and ultimately supports tuberculosis prevention and control worldwide. Establishing markers that can identify progressive tuberculosis is therefore one of the important problems to be solved in public health today.
Gene-based diagnostic methods are faster, more sensitive, and more specific. By detecting the expression levels of host-cell genes associated with tuberculosis infection, the patient's immune response can be detected in advance, making earlier diagnosis possible. By analyzing large-scale genetic data, the spread of tubercle bacillus infection across regions and populations can be tracked and preventive control measures can be taken in time. Although this approach is promising, it also faces challenges such as standardization, data analysis, and cost. Further research and development are therefore required before genes can serve as markers in tuberculosis diagnostic models with assured effectiveness and feasibility in clinical practice.
Machine learning algorithms play an increasingly important role in tuberculosis diagnosis. These algorithms can analyze large-scale medical data, including clinical information, imaging data, and molecular biomarkers, to help doctors diagnose tuberculosis more accurately. Machine learning models can identify tuberculosis-specific patterns and trends, assist medical decision making, and improve the accuracy of early diagnosis. They can also be tailored to the data of individual patients to help select the optimal treatment regimen. Among machine learning algorithms, the random forest occupies an important place. It is an ensemble model consisting of many decision trees, each of which is itself a complex nonlinear model. After the decision trees are trained, the prediction of the random forest is obtained by voting over these trees. The random forest introduces randomness into the construction of each decision tree, which increases the diversity of the model and reduces the risk of overfitting. For regression problems, the embodiments of the application combine the predicted values of the individual trees into a final result by averaging or weighted averaging. Consequently, no single mathematical formula represents the entire random forest model.
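The ensemble idea described above can be illustrated with a minimal sketch. It is not the model of the application: each "tree" is reduced to a single-split stump, the two features and all expression values are invented for illustration, and the function names (`bootstrap`, `fit_stump`, `fit_forest`, `forest_score`) are hypothetical. It only shows the two sources of randomness (resampling the data and picking a random feature) and the averaging of per-tree predictions.

```python
import random


def bootstrap(data, rng):
    """Draw len(data) samples from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample, n_features, rng):
    """Fit a one-split 'tree' on one randomly chosen feature (a simplification)."""
    j = rng.randrange(n_features)
    pos = [x[j] for x, y in sample if y == 1]
    neg = [x[j] for x, y in sample if y == 0]
    if not pos or not neg:
        # The bootstrap sample contains a single class: predict it constantly.
        constant = 1 if pos else 0
        return lambda x: constant
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    pos_is_high = mean_pos > mean_neg
    return lambda x: int((x[j] > threshold) == pos_is_high)


def fit_forest(data, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    return [fit_stump(bootstrap(data, rng), n_features, rng) for _ in range(n_trees)]


def forest_score(trees, x):
    """Average the 0/1 predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)


# Invented toy data: (two-feature expression vector, label); 1 = progressive.
toy_data = [([7, 1], 1), ([8, 2], 1), ([9, 3], 1),
            ([1, 7], 0), ([2, 8], 0), ([3, 9], 0)]
forest = fit_forest(toy_data, n_trees=50)
print(forest_score(forest, [8.5, 1.5]), forest_score(forest, [1.5, 8.5]))
```

A real random forest grows full decision trees and samples a feature subset at every split; the stump keeps the sketch short while preserving the bootstrap-and-average structure.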
It should be noted that the course of tuberculosis varies between individuals, and that the definitions of progressive and non-progressive tuberculosis differ across the medical literature. In the examples of the present application, which are based on prospective cohort studies, progressive tuberculosis (TB progress) refers to individuals with latent tuberculosis infection who converted to active tuberculosis during the follow-up period, and non-progressive tuberculosis (TB non-progress) refers to individuals with latent tuberculosis infection who did not convert to active tuberculosis during the entire follow-up period. Because no marker is currently available that can predict progressive tuberculosis within a latent tuberculosis infection cohort, the application provides the following solutions.
In a first aspect, embodiments of the present application provide a biomarker for predicting progressive tuberculosis, the biomarker comprising the following genes: KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2. The names of the 15 genes, as given by NCBI (National Center for Biotechnology Information, nih.gov), are as follows:
KREMEN1: kringle containing transmembrane protein 1
DYSF: dysferlin
ALPK1: alpha kinase 1
ZNF438: zinc finger protein 438
ANKRD22: ankyrin repeat domain 22
C1QB: complement C1q B chain
WDFY3: WD repeat and FYVE domain containing 3
HIST1H3D: H3 clustered histone 4
BST1: bone marrow stromal cell antigen 1
SORT1: sortilin 1
GBP6: guanylate binding protein family member 6
OAS1: 2'-5'-oligoadenylate synthetase 1
TRIM25: tripartite motif containing 25
FBXO6: F-box protein 6
BATF2: basic leucine zipper ATF-like transcription factor 2
The biomarker provided in the embodiments of the present application is a gene combination comprising the above KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes. The 15-gene combination was obtained based on a prospective cohort study, the elastic net machine learning algorithm, and random forest modeling. It can specifically predict progressive and non-progressive tuberculosis in a latent tuberculosis infection cohort. In clinical diagnosis, the 15-gene combination helps identify individuals progressing to tuberculosis faster and more accurately. It is expected to be useful for screening and diagnosis of progressive tuberculosis, health examinations of healthy people, and prediction and evaluation of tuberculosis treatment efficacy, and it provides technical support for tuberculosis epidemic control.
In a second aspect, embodiments of the present application provide the use of a reagent for detecting the above biomarker in the preparation of a product for predicting progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort.
Specifically, the reagent includes a substance that detects any one of the following detection objects (1) to (3):
(1) KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes;
(2) the mRNA encoded by the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes;
(3) KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6 and BATF2 genes.
In some embodiments, the product for predicting progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort comprises a kit for that purpose. The kit accordingly comprises fluorescence quantitative PCR detection reagents.
In some embodiments, the product for predicting progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort comprises a system for that purpose. The system may include the reagents and/or instruments required to detect the above biomarkers, or any reagent and/or instrument capable of quantitatively detecting them.
In some embodiments, the system includes:
a data acquisition unit, which performs gene detection on samples from a latent tuberculosis infection cohort to obtain expression level data of the biomarkers in each sample;
a data analysis unit, which uses a random forest ensemble model to combine the expression level data of the biomarkers into a disease risk score for each sample;
a data prediction unit, which predicts from the disease risk score whether the sample belongs to the progressive tuberculosis population or the non-progressive tuberculosis population.
Specifically, in the data analysis unit, the random forest ensemble model is as follows:
input data set: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is a feature vector and $y_i$ is the corresponding outcome label;
training data set of each decision tree: $D_{sub}$, where $D_{sub}$ is formed by random sampling from $D$;
prediction of each decision tree: $C(x)$, the prediction for a given input feature vector $x$ by the tree trained on $D_{sub}$;
disease risk score of the random forest: the average (or weighted average) of the predictions $C(x)$ of all $T$ decision trees, where $T$ is the number of decision trees;
the feature vector $x$ comprises the expression levels of the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes; the outcome labels $y$ are progressive tuberculosis and non-progressive tuberculosis; the number of decision trees is $T = 500$;
in the data prediction unit: if the disease risk score is greater than or equal to 0.5, the sample belongs to the progressive tuberculosis population; if the disease risk score is less than 0.5, the sample belongs to the non-progressive tuberculosis population.
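The scoring and cutoff steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the trained model of the application: the score is taken as the unweighted mean of per-tree 0/1 predictions, the four "trees" are hand-written placeholders, and the gene thresholds, their direction, and the sample values are invented. Only the 0.5 cutoff and the gene names come from the text.

```python
def disease_risk_score(trees, sample):
    """Unweighted mean of the per-tree predictions (assumed form of the score)."""
    votes = [tree(sample) for tree in trees]
    return sum(votes) / len(votes)


def predict_group(score, cutoff=0.5):
    """Apply the cutoff from the text: score >= 0.5 means progressive."""
    return "progressive" if score >= cutoff else "non-progressive"


# Placeholder trees with invented thresholds; a real forest would hold T = 500
# trained decision trees over all 15 genes.
toy_trees = [
    lambda s: 1 if s["BATF2"] > 6.0 else 0,
    lambda s: 1 if s["GBP6"] > 5.0 else 0,
    lambda s: 1 if s["ANKRD22"] > 4.0 else 0,
    lambda s: 1 if s["SORT1"] > 7.0 else 0,
]

# Invented expression values for one hypothetical sample.
sample = {"BATF2": 7.2, "GBP6": 5.5, "ANKRD22": 3.1, "SORT1": 6.0}

score = disease_risk_score(toy_trees, sample)
group = predict_group(score)
print(score, group)
```

Because the rule uses "greater than or equal to", a sample whose score lands exactly on 0.5 is assigned to the progressive population, as in this toy case where two of four trees vote 1.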
Specifically, in the data acquisition unit, the gene detection includes detection using an Illumina HiSeq 2000 sequencing platform. Further, the expression level data is subjected to sample correction by variance stabilization transformation.
In the above application, the product comprises reagents and/or instruments required for detecting the detection object as described in any one of (1) to (3) below:
(1) KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes;
(2) the mRNA encoded by the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes;
(3) KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6 and BATF2 genes.
In particular, a system for identifying progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort includes the above reagents and/or instruments.
The above system includes the reagents and/or instruments required to measure the expression levels of the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes by quantitative PCR or by gene chip.
The gene expression level is obtained by measuring a test sample by an Illumina HiSeq 2000 sequencing platform and correcting the sample by variance stabilizing transformation.
Further, the above products further include products for detecting or diagnosing tuberculosis, products for detecting occurrence and/or development of tuberculosis.
Data analysis shows that the expression of the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes differs significantly between the progressive tuberculosis population (TB progress) and the non-progressive tuberculosis population (TB non-progress). The analysis used the GSE79362 prospective study cohort: the GSE79362 dataset was divided into a training dataset and a test dataset to construct and validate a prediction model, and the validated model was then applied to the GSE79362 dataset. The resulting accuracy of the 15-gene combination marker for predicting progressive and non-progressive tuberculosis in the latent tuberculosis infection cohort (LTBIs) is shown in FIG. 1: the sensitivity is 89.1%, the specificity is 92.2%, the positive predictive value (PPV) is 91.7%, and the negative predictive value (NPV) is 91.5%. The 15-gene combination can therefore be used as a marker for predicting progressive and non-progressive tuberculosis populations in a latent tuberculosis infection cohort, so as to monitor the occurrence and progression of tuberculosis.
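For reference, the four accuracy measures reported above follow from the counts of a 2x2 confusion matrix. The sketch below uses invented counts chosen only to make the arithmetic easy to follow; they are not the counts behind FIG. 1, and the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard definitions, with 'positive' meaning predicted progressive."""
    return {
        "sensitivity": tp / (tp + fn),  # progressors correctly flagged
        "specificity": tn / (tn + fp),  # non-progressors correctly cleared
        "ppv": tp / (tp + fp),          # flagged samples that truly progress
        "npv": tn / (tn + fn),          # cleared samples that truly do not progress
    }


# Invented counts for illustration only.
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the test itself, whereas PPV and NPV also depend on the proportion of progressors in the cohort being tested.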
The following description is made with reference to specific embodiments. Unless otherwise specified, the data analysis methods used in the following examples refer to conventional data analysis means.
Example 1
1. Screening of characteristic genes
1.1 Data download
The embodiments of the application use the GSE79362 dataset for differential gene analysis, characteristic gene screening, model building, and model validation. The independent datasets GSE94438 and GSE112104 are also used for model validation. All three datasets were obtained from the GEO DataSets database of NCBI (nih.gov).
1.2 Differential gene analysis
The examples of the present application use R and the limma package to perform a statistical differential analysis between the progressive tuberculosis samples and the non-progressive tuberculosis samples in the GSE79362 dataset. Genes meeting the thresholds |log2FC| > 0.5 and adjusted p-value < 0.05 were selected as differential genes. In total, 108 differential genes were obtained between the progressive and non-progressive tuberculosis samples.
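The thresholding step can be sketched in a few lines of Python. This is only an illustration of the two cutoffs stated above, not the limma workflow itself: the `GeneStat` record, the function name and the example values are hypothetical, and the per-gene statistics are assumed to have been computed beforehand.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    gene: str
    log2_fc: float  # log2 fold change, progressive vs non-progressive samples
    adj_p: float    # multiple-testing adjusted p-value


def select_differential_genes(stats, min_abs_log2_fc=0.5, max_adj_p=0.05):
    """Keep genes with |log2FC| > 0.5 and adjusted p-value < 0.05."""
    return [
        s.gene
        for s in stats
        if abs(s.log2_fc) > min_abs_log2_fc and s.adj_p < max_adj_p
    ]


# Hypothetical values for illustration only; not taken from GSE79362.
example_stats = [
    GeneStat("GENE_A", 0.82, 0.001),   # up-regulated, passes both cutoffs
    GeneStat("GENE_B", -0.61, 0.020),  # down-regulated, passes both cutoffs
    GeneStat("GENE_C", 0.30, 0.001),   # fold change too small
    GeneStat("GENE_D", 1.10, 0.200),   # not significant after adjustment
]
selected = select_differential_genes(example_stats)
```

Taking the absolute value of the fold change keeps both up-regulated and down-regulated genes, which matches the two-sided threshold in the text.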
1.3 Screening of characteristic genes
From the 108 differential genes, 15 characteristic genes were selected with the elastic net machine learning algorithm: KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6 and BATF2. These 15 characteristic genes are significantly differentially expressed between the progressive tuberculosis population and the non-progressive tuberculosis population. In Fig. 2, panel A shows the differential analysis of the sequencing results of the progressive tuberculosis population (TB progress) and the non-progressive tuberculosis population (TB non-progress), which yielded 108 differential genes (68 up-regulated and 40 down-regulated); panel B shows the 15 characteristic genes selected from the differential genes by the elastic net algorithm, which are significantly differentially expressed between the two populations.
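For orientation, elastic net selects features by fitting a penalized model in which the coefficients of uninformative genes shrink to exactly zero. The text does not state the loss function, the mixing parameter or the penalty strength that were used, so the expression below is only the generic textbook form; the symbols $\ell$ (loss), $\alpha$ (mixing between the L1 and L2 penalties) and $\lambda$ (penalty strength) are assumptions introduced here.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\ \beta_0 + x_i^{\top}\beta\bigr)
  + \lambda\Bigl(\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr)
```

Under this reading, the characteristic genes would be those whose fitted coefficient is non-zero.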
2. Model building
2.1 Modeling
The examples of the present application use a random forest model to construct a combination marker containing the 15 characteristic genes, and calculate a subject's disease risk score from the 15 genes according to the following definitions:
Input dataset: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is a feature vector and $y_i$ is the corresponding outcome label;
Training dataset for each decision tree: $D_{sub}$, where $D_{sub}$ is formed by random sampling from $D$;
Prediction result of each decision tree: $C(x)$, the prediction for a given input feature vector $x$ by the tree trained on $D_{sub}$;
Disease risk score of the random forest: [formula image missing from this text], where $T$ is the number of decision trees.
In the data analysis of the examples of the present application, the feature vector $x$ comprises the expression levels of the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6 and BATF2 genes; the outcome labels $y$ are progressive tuberculosis and non-progressive tuberculosis; and the number of decision trees is $T = 500$.
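The scoring scheme can be sketched as follows. The score formula itself is not legible in this text, so the sketch assumes the common reading in which the score is the fraction of trees voting for progression, i.e. the mean of the per-tree predictions C(x) over all T trees. Sampling with replacement for D_sub is likewise an assumption. The 0.5 cutoff is the one stated in claim 8. The stump "trees" and all function names are hypothetical stand-ins, not a trained model.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw D_sub: len(D) rows sampled from D (assumed: with replacement)."""
    return [rng.choice(dataset) for _ in dataset]


def risk_score(trees, x):
    """Assumed score: mean of the 0/1 votes C_t(x) over all T trees."""
    return sum(tree(x) for tree in trees) / len(trees)


def classify(score, cutoff=0.5):
    """Apply the cutoff from claim 8: score >= 0.5 means progressive."""
    return "progressive" if score >= cutoff else "non-progressive"


def make_stump(feature_index, threshold):
    """Hypothetical one-split tree voting 1 (progressive) above a threshold."""
    return lambda x: 1 if x[feature_index] > threshold else 0


# Four toy trees instead of T = 500, over a two-feature toy vector.
trees = [
    make_stump(0, 1.0),
    make_stump(1, 2.0),
    make_stump(0, 3.0),
    make_stump(1, 0.5),
]
x = [2.0, 1.0]
score = risk_score(trees, x)  # two of the four trees vote 1
label = classify(score)

toy_dataset = [([0.1, 0.2], 0), ([1.5, 2.5], 1), ([3.2, 0.7], 1)]
d_sub = bootstrap_sample(toy_dataset, random.Random(0))
```

In this toy case the score lands exactly on the cutoff, which shows that the boundary value is assigned to the progressive group.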
2.2 Principal component analysis
Principal component analysis provides preliminary evidence that the 15-gene combination marker of the examples of the present application clearly separates the progressive tuberculosis population from the non-progressive tuberculosis population in the latent tuberculosis infection cohort.
2.3 Receiver operating characteristic (ROC) curves
The ROC curves further confirm the reliability of the marker. As shown in Fig. 3, in the training dataset the marker predicted progressive tuberculosis with an accuracy of 95.2%, a sensitivity of 88.5%, a specificity of 93.8%, a positive predictive value (PPV) of 91.2% and a negative predictive value (NPV) of 91.4%; in the test dataset the accuracy was 97.2%, the sensitivity 87.5%, the specificity 97.1%, the PPV 92.9% and the NPV 91.7%. The training dataset accounts for 70% of the full dataset and the test dataset for 30%.
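The four reported rates follow directly from a confusion matrix, and the 70/30 partition is a plain split of the sample list. The sketch below shows both under stated assumptions: the text does not say whether the split was random or stratified, so a seeded random shuffle is assumed, and the labels and predictions are made-up toy values rather than GSE79362 data.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle and split into 70% training / 30% test (assumed: random split)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV with 1 = progressive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),  # progressors correctly flagged
        "specificity": tn / (tn + fp),  # non-progressors correctly cleared
        "ppv": tp / (tp + fp),          # flagged samples that truly progress
        "npv": tn / (tn + fn),          # cleared samples that truly do not
    }


# Toy labels and predictions for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)

train_ids, test_ids = split_70_30(range(10))
```

PPV and NPV depend on how common progression is in the evaluated samples, so the same model can show different values in cohorts with different proportions of progressors.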
3. Model verification
3.1 Verification of model accuracy using the GSE79362 dataset at 180-day intervals before tuberculosis diagnosis
The examples of the present application verify the 15-gene combination marker in the latent tuberculosis infection cohort, stratified into 180-day intervals before tuberculosis diagnosis. As shown in Fig. 4, the disease risk score calculated from the expression levels of the 15 genes with the random forest model is used to predict the progressive and non-progressive tuberculosis populations, and the resulting ROC curves confirm the reliability of the marker in predicting the progressive tuberculosis population in the latent tuberculosis infection cohort. The specific values are as follows: 1-180 days before diagnosis, the model accuracy is 93.8%, the sensitivity 84.0%, the specificity 94.3%, the positive predictive value 94.1% and the negative predictive value 79.1%; 181-360 days before diagnosis, the accuracy is 99.2%, the sensitivity 100.0%, the specificity 94.1%, the positive predictive value 83.3% and the negative predictive value 97.0%; 361-540 days before diagnosis, the accuracy is 98.1%, the sensitivity 94.4%, the specificity 94.8%, the positive predictive value 96.9% and the negative predictive value 93.8%; 541-720 days before diagnosis, the accuracy is 90.9%, the sensitivity 83.3%, the specificity 92.4%, the positive predictive value 76.5% and the negative predictive value 92.5%.
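The stratification by time before diagnosis amounts to assigning each sample to a fixed-width window. A minimal sketch follows, assuming each sample carries a "days before diagnosis" value; the sample identifiers, the function names and the 720-day horizon parameter are hypothetical and chosen to reproduce the window boundaries listed above.

```python
from collections import defaultdict


def window_label(days_before_diagnosis, width=180, horizon=720):
    """Map a day count to a window label such as '1-180'; None if out of range."""
    if not 1 <= days_before_diagnosis <= horizon:
        return None
    index = (days_before_diagnosis - 1) // width
    start = index * width + 1
    return f"{start}-{start + width - 1}"


def group_by_window(samples, width=180):
    """Group (sample_id, days_before_diagnosis) pairs by window label."""
    groups = defaultdict(list)
    for sample_id, days in samples:
        label = window_label(days, width)
        if label is not None:
            groups[label].append(sample_id)
    return dict(groups)


# Hypothetical samples for illustration only.
samples = [("S1", 30), ("S2", 180), ("S3", 181),
           ("S4", 500), ("S5", 720), ("S6", 900)]
by_180 = group_by_window(samples, width=180)
by_360 = group_by_window(samples, width=360)
```

Calling the same function with `width=360` gives the coarser 1-360 and 361-720 day windows, and the metrics are then computed separately within each group.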
3.2 Verification of model accuracy using the GSE79362 dataset at 360-day intervals before tuberculosis diagnosis
The examples of the present application verify the 15-gene combination marker in the latent tuberculosis infection cohort, stratified into 360-day intervals before tuberculosis diagnosis. As shown in Fig. 5, the disease risk score calculated from the expression levels of the 15 genes with the random forest model is used to predict the progressive and non-progressive tuberculosis populations, and the resulting ROC curves confirm the reliability of the marker in predicting the progressive tuberculosis population in the latent tuberculosis infection cohort. The specific values are as follows: 1-360 days before diagnosis, the model accuracy is 94.7%, the sensitivity 80.6%, the specificity 92.8%, the positive predictive value 89.7% and the negative predictive value 86.8%; 361-720 days before diagnosis, the accuracy is 95.2%, the sensitivity 85.2%, the specificity 95.8%, the positive predictive value 89.8% and the negative predictive value 93.2%.
3.3 Verification of model accuracy using the independent GSE94438 dataset
The examples of the present application use the independent GSE94438 dataset to verify the 15-gene combination marker. As shown in Fig. 6A, the disease risk score calculated from the expression levels of the 15 genes with the random forest model is used to predict the progressive and non-progressive tuberculosis populations, and the resulting ROC curve confirms the reliability of the marker in predicting the progressive tuberculosis population in the latent tuberculosis infection cohort. The specific values are as follows: the model accuracy is 93.5%, the sensitivity 93.8%, the specificity 81.0%, the positive predictive value 78.9% and the negative predictive value 94.4%.
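An ROC curve is commonly summarized by the area under it (AUC), which equals the probability that a randomly chosen progressor receives a higher risk score than a randomly chosen non-progressor. The text reports an "accuracy" read from the ROC curves without defining it, so whether that figure is an AUC is not stated; the sketch below only shows how an AUC would be computed from risk scores on an external dataset, using made-up labels and scores.

```python
def roc_auc(labels, scores):
    """AUC by pairwise comparison: P(score of a positive > score of a negative)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy risk scores for illustration only; 1 = progressive, 0 = non-progressive.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

Unlike sensitivity or specificity, this summary does not depend on the 0.5 cutoff, which makes it convenient for comparing the same frozen model across datasets.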
3.4 Verification of model accuracy using the independent GSE112104 dataset
The examples of the present application use the independent GSE112104 dataset to verify the 15-gene combination marker. As shown in Fig. 6B-D, the disease risk score calculated from the expression levels of the 15 genes with the random forest model is used to predict the progressive and non-progressive tuberculosis populations, and the resulting ROC curves confirm the reliability of the marker in predicting the progressive tuberculosis population in the latent tuberculosis infection cohort. The specific values are as follows: 1-360 days before diagnosis, the model accuracy is 84.6%, the sensitivity 59.0%, the specificity 93.8%, the positive predictive value 76.3% and the negative predictive value 79.6%; 361-720 days before diagnosis, the accuracy is 75.7%, the sensitivity 48.6%, the specificity 88.3%, the positive predictive value 100.0% and the negative predictive value 88.7%.
The verification examples above show that the 15-gene combination marker reliably distinguishes the progressive tuberculosis population from the non-progressive tuberculosis population in the latent tuberculosis infection cohort.
The foregoing are merely preferred embodiments of the present application and are not intended to limit it; any modifications, equivalents and alternatives made within the spirit and principles of the present application fall within its scope.

Claims (10)

1. A biomarker for predicting advanced tuberculosis and non-advanced tuberculosis, the biomarker comprising the following genes: KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2.
2. Use of a reagent for detecting the biomarker of claim 1 in the preparation of a product for predicting a progressive tuberculosis population and a non-progressive tuberculosis population in a latent tuberculosis infection cohort.
3. The use of claim 2, wherein the reagent comprises a reagent that detects at least one of the gene of the biomarker, its mRNA, and its expressed protein.
4. The use of claim 2, wherein the product comprises a kit for predicting a progressive tuberculosis population and a non-progressive tuberculosis population in a latent tuberculosis infection cohort.
5. The use according to claim 4, wherein the kit comprises fluorescent quantitative PCR detection reagents.
6. The use of claim 2, wherein the product comprises a system for predicting a progressive tuberculosis population and a non-progressive tuberculosis population in a latent tuberculosis infection cohort.
7. The use according to claim 6, wherein the system comprises:
a data acquisition unit, configured to perform gene detection on a sample from the latent tuberculosis infection cohort and obtain expression level data of the biomarker in the sample;
a data analysis unit, configured to obtain a disease risk score for the sample from the expression level data using a random forest ensemble model;
a data prediction unit, configured to predict, according to the disease risk score, whether the sample belongs to the progressive tuberculosis population or the non-progressive tuberculosis population.
8. The use according to claim 7, wherein in the data analysis unit, the disease risk score of the random forest ensemble model is defined as follows:
input dataset: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is a feature vector and $y_i$ is the corresponding outcome label;
training dataset for each decision tree: $D_{sub}$, where $D_{sub}$ is formed by random sampling from $D$;
prediction result of each decision tree: $C(x)$, the prediction for a given input feature vector $x$ by the tree trained on $D_{sub}$;
disease risk score of the random forest: [formula image missing from this text], where $T$ is the number of decision trees;
the feature vector $x$ comprises the KREMEN1, DYSF, ALPK1, ZNF438, ANKRD22, C1QB, WDFY3, HIST1H3D, BST1, SORT1, GBP6, OAS1, TRIM25, FBXO6, and BATF2 genes; the outcome labels $y$ are progressive tuberculosis and non-progressive tuberculosis; the number of decision trees is $T = 500$;
in the data prediction unit: if the disease risk score is greater than or equal to 0.5, the sample belongs to the progressive tuberculosis population; if the disease risk score is less than 0.5, the sample belongs to the non-progressive tuberculosis population.
9. The use of claim 7, wherein the gene detection in the data acquisition unit comprises detection using an Illumina HiSeq 2000 sequencing platform.
10. The use according to claim 7, wherein in the data acquisition unit, the expression level data are corrected across samples by a variance-stabilizing transformation.
CN202410038422.1A 2024-01-11 2024-01-11 Biomarkers for predicting advanced tuberculosis and non-advanced tuberculosis and uses thereof Withdrawn CN117551760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410038422.1A CN117551760A (en) 2024-01-11 2024-01-11 Biomarkers for predicting advanced tuberculosis and non-advanced tuberculosis and uses thereof


Publications (1)

Publication Number Publication Date
CN117551760A true CN117551760A (en) 2024-02-13

Family

ID=89811450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410038422.1A Withdrawn CN117551760A (en) 2024-01-11 2024-01-11 Biomarkers for predicting advanced tuberculosis and non-advanced tuberculosis and uses thereof

Country Status (1)

Country Link
CN (1) CN117551760A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994646A (en) * 2023-08-01 2023-11-03 东莞市滨海湾中心医院(东莞市太平人民医院、东莞市第五人民医院) Construction method and application of a bacteriologically positive active tuberculosis risk assessment model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150315643A1 (en) * 2012-12-13 2015-11-05 Baylor Research Institute Blood transcriptional signatures of active pulmonary tuberculosis and sarcoidosis
CN110499364A (en) * 2019-07-30 2019-11-26 北京凯昂医学诊断技术有限公司 A kind of probe groups and its kit and application for detecting the full exon of extended pattern hereditary disease
US20210140977A1 (en) * 2018-04-16 2021-05-13 University Of Cape Town A three-protein proteomic biomarker for prospective determination of risk for development of active tuberculosis
US20220050119A1 (en) * 2019-01-17 2022-02-17 Proteinlogic Limited Biomarkers
CN114231612A (en) * 2021-12-27 2022-03-25 深圳大学 MiRNA marker related to active tuberculosis and application thereof
CN114381509A (en) * 2021-12-27 2022-04-22 深圳大学 Plasma miRNA marker related to non-tuberculous pneumonia and application thereof
CN117129678A (en) * 2023-08-18 2023-11-28 深圳大学 Use of biomarkers in connection with assessment of tuberculous pleural effusion


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994646A (en) * 2023-08-01 2023-11-03 东莞市滨海湾中心医院(东莞市太平人民医院、东莞市第五人民医院) Construction method and application of a bacteriologically positive active tuberculosis risk assessment model
CN116994646B (en) * 2023-08-01 2024-06-11 东莞市滨海湾中心医院(东莞市太平人民医院、东莞市第五人民医院) Construction method and application of a bacteriologically positive active tuberculosis risk assessment model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20240213
