CN114854847A - Method for constructing machine learning model for identifying infectious diseases and non-infectious diseases - Google Patents


Info

Publication number
CN114854847A
Authority
CN
China
Prior art keywords
infectious
microorganism
infectious diseases
value
host gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210566863.XA
Other languages
Chinese (zh)
Other versions
CN114854847B (en)
Inventor
祝中一
麻锦敏
陈唯军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huada Medical Laboratory
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Priority to CN202210566863.XA
Publication of CN114854847A
Application granted
Publication of CN114854847B
Active
Anticipated expiration

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00 TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/30 Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention proposes a method of constructing a machine learning model for identifying infectious and non-infectious diseases, the method comprising: determining the host gene expression value and/or the microorganism relative abundance value in biological samples from subjects known to have an infectious disease and from subjects known to have a non-infectious disease; and inputting the host gene expression values and/or the microorganism relative abundance values into a trainer, with infectious disease and non-infectious disease as the labels, and training to obtain a machine learning model for identifying infectious and non-infectious diseases. The machine learning model can be used to accurately identify infectious and non-infectious diseases and has important value for scientific research and clinical diagnosis.

Description

Method for constructing machine learning model for identifying infectious diseases and non-infectious diseases
Technical Field
The present invention relates to the field of medical diagnostics. In particular, the invention relates to methods of constructing machine learning models for identifying infectious and non-infectious diseases.
Background
Infectious diseases are diseases caused by infection, that is, by the invasion of body tissues by pathogens, the proliferation of those pathogens, and the reaction of host tissues to the pathogens and the toxins they produce. Non-infectious diseases are diseases that are not caused by pathogen infection but can likewise activate the body's immune system, such as cancer, Alzheimer's disease, and epilepsy. Cases that require differentiation between infectious and non-infectious disease, including respiratory infectious diseases (such as pneumonia), central nervous system infectious diseases (such as encephalitis and meningitis), and especially systemic bacterial infectious diseases, account for a considerable proportion of patients in intensive care units. Early diagnosis is particularly important because the clinical symptoms are atypical, the disease progresses rapidly, and the harm is serious. Finding highly specific laboratory indicators that allow early diagnosis and monitoring of treatment response is therefore of great significance for timely and accurate clinical diagnosis and effective early treatment, for reducing mortality, for avoiding the abuse of antibiotics, and for reducing the emergence of bacterial drug resistance.
Although the final clinical judgment of infectious versus non-infectious disease usually rests on a combination of factors (including clinical symptoms), diagnosis often depends on identifying the etiology. At present, conventional clinical pathogen detection mainly comprises bacterial/fungal culture, viral PCR, and immune markers. These methods are limited by the low positive rate of isolation culture (15-20%) and its long turnaround (3-5 days), while molecular and serological tests can detect only a fixed set of targets. For samples that test negative for microorganisms, distinguishing infectious from non-infectious disease becomes more difficult. For samples that test positive, detection of a microorganism does not mean that the detected microorganism is the causative pathogen, so the problem of distinguishing colonization from infection remains.
Microbial colonization means that microorganisms (e.g., bacteria) are present on a surface of the human body (e.g., skin, oral cavity, intestinal tract, or airway) without causing disease. Pathogen infection refers to the invasion of host body tissues by disease-causing pathogens. In colonization, microorganisms from various environments settle on the human body, establish themselves at a particular site, and continue to grow and reproduce there. That is, the microorganism is isolated from clinical specimens, but the patient has no corresponding clinical symptoms or signs of infection and generally does not need antibacterial treatment. Colonization requires adhesion, a suitable environment, and a sufficient number of microorganisms. In infection, the pathogen invades the human body, then grows and reproduces, releasing toxic substances and causing a pathological reaction in the host. That is, the pathogen is isolated from clinical specimens, and the patient also has the corresponding clinical symptoms and signs of infection and needs antibacterial treatment.
Microbial colonization can develop into pathogen infection under certain conditions. The pathogenicity of the microorganism and the resistance of the host oppose each other; when the host's immunity is low, or after immunosuppressive agents have been taken, the microorganism can take the opportunity to cause infection. Because the pathogenicity of colonizing microorganisms is relative, the genus or species cannot be diagnosed from the patient's clinical presentation alone, and laboratories cannot rely on culture identification alone to determine whether a detected microorganism is an infectious pathogen or a colonizer. If a treatment regimen is based solely on a finding that reflects colonization, antibiotics are likely to be overused and microbial resistance may be induced. If a pathogen is misjudged as a colonizing microorganism and antibiotics are withheld or underdosed, the disease may progress rapidly and may even become life-threatening. Determining whether a detected microorganism is a colonizer or an infecting pathogen is therefore one of the current difficulties in the clinical treatment of infection.
Thus, methods for identifying infectious and non-infectious diseases remain to be investigated.
Disclosure of Invention
The present invention aims to solve, at least to some extent, the technical problems of the prior art. To this end, the invention proposes methods of constructing machine-learned models for identifying infectious and non-infectious diseases, the use of reagents for identifying infectious and non-infectious diseases, apparatuses for constructing machine-learned models for identifying infectious and non-infectious diseases, systems for identifying infectious and non-infectious diseases, electronic devices, executable storage media, and methods of identifying infectious and colonizing microorganisms, with important scientific research and clinical diagnostic value.
In a first aspect of the invention, the invention proposes a method of constructing a machine learning model for identifying infectious and non-infectious diseases. According to an embodiment of the invention, the method comprises: determining a host gene expression value and/or a microorganism relative abundance value in biological samples from a subject known to have an infectious disease and a subject known to have a non-infectious disease; and inputting the obtained host gene expression values and/or microorganism relative abundance values into a trainer and, with infectious disease and non-infectious disease as the labels, constructing a machine learning model for identifying infectious and non-infectious diseases.
The inventors of the present application found that host gene expression, microbial abundance, or a combination of the two in a biological sample can accurately indicate whether the sample contains a colonizing microorganism or an infectious pathogen, and can in turn be used to diagnose whether the subject from whom the sample was taken has an infectious disease or a non-infectious disease. Therefore, a machine learning model constructed from the host gene expression values and/or microorganism relative abundance values in biological samples of subjects known to have infectious diseases and subjects known to have non-infectious diseases can be used to accurately identify infectious and non-infectious diseases.
In a second aspect of the invention, the invention proposes the use of a reagent for identifying infectious and non-infectious diseases. According to an embodiment of the present invention, the reagent is used to detect the host gene expression level and/or the microorganism abundance level in the biological sample to be tested.
In a third aspect of the invention, an apparatus is presented for constructing a machine learning model for identifying infectious and non-infectious diseases. According to an embodiment of the invention, the apparatus comprises: an assay unit for determining a host gene expression value and/or a microorganism relative abundance value in biological samples of a subject known to have an infectious disease and a subject known to have a non-infectious disease; and a training unit for constructing, from the host gene expression values and/or microorganism relative abundance values obtained by the assay unit and input into a trainer, a machine learning model for identifying infectious and non-infectious diseases, with infectious disease and non-infectious disease as the labels.
In a fourth aspect of the invention, a system for identifying infectious and non-infectious diseases is presented. According to an embodiment of the invention, the system comprises: a detection unit for detecting a host gene expression value and/or a microorganism relative abundance value in a biological sample; and an analysis unit adapted to input the host gene expression value and/or the microorganism relative abundance value into the machine learning model obtained by the method of the first aspect, perform an analysis to obtain an analysis result, and determine, based on the analysis result, whether the subject who provided the biological sample has an infectious disease or a non-infectious disease.
In a fifth aspect of the invention, a method of distinguishing an infectious pathogen from a colonizing microorganism is provided. According to an embodiment of the invention, the method comprises: detecting the host gene expression value and/or the microorganism relative abundance value in a biological sample to be tested; inputting the host gene expression value and/or the microorganism relative abundance value into a machine learning model for analysis to obtain an analysis result; and determining, based on the analysis result, whether the biological sample to be tested contains an infectious pathogen or a colonizing microorganism.
In a sixth aspect of the invention, an electronic device is presented. According to an embodiment of the present invention, the electronic apparatus includes: a memory and a processor; the memory for storing a computer program; the processor for executing the computer program to carry out the method for identifying infectious and non-infectious diseases as described in the method of the first aspect or the use of the second aspect.
In a seventh aspect of the invention, an executable storage medium is presented. According to an embodiment of the invention, the storage medium stores computer program instructions which, when run on a processor, cause the processor to perform a method for identifying infectious and non-infectious diseases as described in the method of the first aspect or in the use of the second aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a schematic flow diagram of a method for constructing a machine learning model for identifying infectious and non-infectious diseases according to one embodiment of the invention;
FIG. 2 is a diagram showing comparison of the discrimination effect between groups for the discrimination of infectious and non-infectious encephalitis based on the expression level of host genes according to the present invention;
FIG. 3 is a graph comparing the discrimination between the infectious and non-infectious encephalitis groups based on the relative abundance of microorganisms according to one embodiment of the present invention;
FIG. 4 is a graph showing comparison of discrimination between infectious and non-infectious encephalitis groups based on combinations of host gene expression levels and relative abundance of microorganisms according to one embodiment of the present invention;
FIG. 5 is a graph showing comparison of the results of identification between groups of infectious pneumonia and non-infectious pneumonia based on the expression level of host genes according to an embodiment of the present invention;
FIG. 6 is a graph showing comparison of the results of group-wise identification of infectious and non-infectious pneumonia based on microbial abundance according to one embodiment of the present invention;
FIG. 7 is a graph showing the comparison of the discrimination effect between infectious pneumonia and non-infectious pneumonia based on the expression level of host genes according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail. The following examples are illustrative only and are not to be construed as limiting the invention.
Method for constructing machine learning model for identifying infectious diseases and non-infectious diseases
In one aspect of the invention, the invention features a method of constructing a machine learning model for identifying infectious and non-infectious diseases. According to an embodiment of the invention, the method comprises: determining a host gene expression value and/or a microorganism relative abundance value in biological samples from a subject known to have an infectious disease and a subject known to have a non-infectious disease; and inputting the obtained host gene expression values and/or microorganism relative abundance values into a trainer and, with infectious disease and non-infectious disease as the labels, constructing a machine learning model for identifying infectious and non-infectious diseases.
In an infectious disease, pathogens invade body tissues and proliferate, and host tissues react to the pathogens and the toxins they produce. Pathogen invasion provokes a series of host immune responses, including innate and adaptive immune responses, which may be reflected in changes in host gene expression. Host gene expression therefore differs between infectious and non-infectious diseases. Whether infection has occurred can be judged by measuring host gene expression levels in the sample to be tested, thereby distinguishing infectious from non-infectious diseases.
With regard to the detection of pathogenic microorganisms, an infectious disease implies the presence of a pathogen, so detecting whether pathogenic microorganisms are present can help, to some extent, to determine whether the subject has an infectious disease. By measuring the abundance of microorganisms in the sample to be tested, adding these values as supplementary variables for machine learning modeling, and combining them with host gene expression levels, the method further improves the accuracy of distinguishing infectious from non-infectious diseases and determines whether a detected microorganism is an infecting pathogen or a colonizing microorganism.
Therefore, in the present application, the host gene expression values and/or the microorganism relative abundance values, together with the corresponding infection result information, are input into the trainer as the sample data set. Training yields a target trainer that classifies host gene expression values and/or microorganism relative abundance values, that is, a machine learning model for identifying infectious and non-infectious diseases. The infection result information comprises identification information indicating an infectious disease (for example, the value "1") or identification information indicating a non-infectious disease (for example, the value "0"). Specifically, when the training data are host gene expression values and the corresponding infection result information, the resulting model is a host gene expression value model; when the training data are microorganism relative abundance values and the corresponding infection result information, the resulting model is a microorganism relative abundance value model; when the training data combine the two, the resulting model is a combined host gene expression value and microorganism relative abundance value model.
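The sample data set described above can be pictured with a minimal Python sketch. It only illustrates the "1"/"0" labelling and the three feature sets (host genes only, microorganisms only, combined). The function name `build_dataset`, the dictionary keys, and all numeric values are hypothetical and are not taken from the patent.

```python
# Label encoding taken from the text: 1 = infectious, 0 = non-infectious.
INFECTIOUS, NON_INFECTIOUS = 1, 0

def build_dataset(samples, feature_names):
    """Return (X, y): one row of feature values per sample, plus its label."""
    X, y = [], []
    for s in samples:
        # Merge whichever measurements are present; absent features default to 0.0
        # (an assumption made only to keep this sketch short).
        values = {**s.get("host_expr", {}), **s.get("microbe_abund", {})}
        X.append([float(values.get(name, 0.0)) for name in feature_names])
        y.append(INFECTIOUS if s["infectious"] else NON_INFECTIOUS)
    return X, y

# Invented measurements for two hypothetical subjects.
samples = [
    {"host_expr": {"MS4A4E": 12.5, "OPHN1": 3.1},
     "microbe_abund": {"Enterobacter": 0.40}, "infectious": True},
    {"host_expr": {"MS4A4E": 2.0, "OPHN1": 7.8},
     "microbe_abund": {"Enterobacter": 0.01}, "infectious": False},
]

# The three model types differ only in which columns are fed to the trainer.
host_X, y = build_dataset(samples, ["MS4A4E", "OPHN1"])
microbe_X, _ = build_dataset(samples, ["Enterobacter"])
combined_X, _ = build_dataset(samples, ["MS4A4E", "OPHN1", "Enterobacter"])
```

Keeping the label vector separate from the feature matrix means the same labels can be reused for all three model types.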
For ease of understanding, this embodiment describes the training process using the host gene expression value model as an example; the other models can be trained in the same way. The obtained host gene expression values are input into a trainer, infectious disease and non-infectious disease are used as the labels, a machine learning model for identifying infectious and non-infectious diseases is constructed, and the modeling result is verified by ten-fold cross-validation. The specific steps are as follows:
randomly splitting the host gene expression values into a training set and a test set according to a preset ratio, and repeating the random split 10 times to obtain 10 pairs of training and test sets; the preset ratio is not specifically limited, and in this embodiment of the invention it is preferably 9:1;
with infectious disease and non-infectious disease as the labels, training the trainer on the gene expression values in each training set to obtain candidate machine learning models for identifying infectious and non-infectious diseases;
with infectious disease and non-infectious disease as the labels, validating each candidate machine learning model on the gene expression values in the corresponding test set to obtain the predicted values for each candidate machine learning model;
and performing ROC (receiver operating characteristic) curve analysis on the predicted values and the actual classification results to obtain the area under the curve (AUC). The genes used by the model with the highest AUC and the fewest genes are designated the core identification genes, and that model is the machine learning model that identifies infectious and non-infectious diseases using those genes. An AUC range that is clearly separated from the lower-AUC candidate models is then selected manually; the genes required by the models whose AUC falls within this range are the candidate identification genes, and the corresponding models are the machine learning models that identify infectious and non-infectious diseases using those genes.
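The steps above can be sketched in plain Python under several stated assumptions. The trainer here (`fit_centroid`) is a toy stand-in and not the trainer used in the patent. Held-out predictions from the 10 random 9:1 splits are pooled before the AUC is computed, because a small test set may contain only one class. The tie-break in `pick_core_model` (highest AUC first, then fewest genes) is one reading of the selection rule. All data are synthetic.

```python
import random

def split_indices(n, rng, test_fraction=0.1):
    """Randomly split range(n) into (train, test) at the stated 9:1 ratio."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid(X_train, y_train):
    """Toy trainer (assumption): score = projection onto the class-mean difference."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu1 = mean([x for x, l in zip(X_train, y_train) if l == 1])
    mu0 = mean([x for x, l in zip(X_train, y_train) if l == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    return lambda row: sum(wi * xi for wi, xi in zip(w, row))

def repeated_split_auc(X, y, fit, n_repeats=10, seed=0):
    """Repeat the random split, train on each training set, pool test predictions."""
    rng = random.Random(seed)
    held_labels, held_scores = [], []
    for _ in range(n_repeats):
        train, test = split_indices(len(y), rng)
        predict = fit([X[i] for i in train], [y[i] for i in train])
        held_labels += [y[i] for i in test]
        held_scores += [predict(X[i]) for i in test]
    return roc_auc(held_labels, held_scores)

def pick_core_model(candidates):
    """candidates: (gene_list, auc) pairs; highest AUC wins, fewest genes breaks ties."""
    return max(candidates, key=lambda c: (c[1], -len(c[0])))

# Synthetic, perfectly separable demo data (invented values, not patent data).
X = [[10.0 + 0.1 * i] for i in range(10)] + [[0.1 * i] for i in range(10)]
y = [1] * 10 + [0] * 10
auc = repeated_split_auc(X, y, fit_centroid)
best = pick_core_model([(["A", "B"], 0.9), (["A"], 0.9), (["A", "B", "C"], 0.8)])
```

Any function that takes a training matrix and labels and returns a scoring function can be passed as `fit`, so a regularized logistic regression could replace the toy trainer without changing the evaluation loop.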
During model training, the glmnet package for the R language is used to fit a regularized generalized linear model to the gene expression values in the training set. Because the dependent variable is binary (autoimmune encephalitis (AE) or non-AE), the family parameter is set to "binomial". The lambda value of the glmnet generalized linear model can be set according to actual needs, for example to 50; this embodiment does not specifically limit it.
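To illustrate what a regularized binomial model does, the following Python sketch fits an L1-penalized logistic regression by proximal gradient descent. This is a stand-in for illustration only and is not glmnet: glmnet uses coordinate descent over a path of lambda values, and the source states neither the penalty mixing parameter nor how the quoted lambda setting is applied. The pure lasso penalty, the step size, the iteration count, and the toy data are all assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Minimise mean binomial deviance/2 + lam * sum(|b_j|); intercept unpenalised."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residual = predicted probability minus observed 0/1 label.
        resid = [sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row))) - yi
                 for row, yi in zip(X, y)]
        b0 -= step * sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        # Gradient step followed by shrinkage; this is what zeroes out weak genes.
        b = [soft_threshold(bj - step * gj, step * lam) for bj, gj in zip(b, grad)]
    return b0, b

# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.1], [0.7, -0.2],
     [-1.0, 0.1], [-0.8, -0.2], [-0.9, 0.2], [-0.7, -0.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

_, b_strong = fit_l1_logistic(X, y, lam=1.0)   # heavy penalty: every coefficient is zero
_, b_weak = fit_l1_logistic(X, y, lam=0.01)    # light penalty: informative gene is kept
```

A stronger penalty leaves fewer genes with nonzero coefficients, which is how a regularized fit can yield models that differ in the number of genes they use.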
In the present invention, the method of obtaining "a subject known to have an infectious disease" or "a subject known to have a non-infectious disease" is not particularly limited as long as it is known that the subject to which a biological sample is supplied has an infectious disease or a non-infectious disease, and the method can be determined by, for example, clinical presentation, an auxiliary test, a confirmed diagnosis experiment, or a method of excluding other causes.
The term "subject" refers to any animal being examined, studied, or treated, and is not intended to limit the invention to any particular type of subject. In some embodiments of the invention, humans are preferred subjects, while in other embodiments, non-human animals are preferred subjects, including but not limited to mice, monkeys, ferrets, cows, sheep, goats, pigs, chickens, turkeys, dogs, cats, horses, and reptiles.
As used herein, the term "biological sample" includes any sample obtained from a subject that contains genetic material and can be used in the methods provided herein. According to an embodiment of the invention, the biological sample is selected from at least one of peripheral blood, cerebrospinal fluid, alveolar lavage, sputum, a reagent and solid tissue.
According to an embodiment of the invention, before the host gene expression values and/or the microorganism relative abundance values are input into the trainer, they are normalized (homogenized), and the resulting normalized host gene expression values and/or normalized microorganism relative abundance values are input into the trainer.
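The source does not say which homogenization (normalization) procedure is used. As one possible illustration, the sketch below applies per-feature z-scoring, with the mean and standard deviation estimated on the training rows and reused for new samples. The choice of z-scoring and the names `fit_scaler` and `apply_scaler` are assumptions.

```python
import statistics

def fit_scaler(rows):
    """Per-feature mean and population standard deviation from training rows."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # A constant feature has zero spread; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def apply_scaler(rows, scaler):
    """Centre and scale each feature with statistics learned by fit_scaler."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Invented values: column 0 varies, column 1 is constant.
train = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaler = fit_scaler(train)
train_scaled = apply_scaler(train, scaler)
new_scaled = apply_scaler([[4.0, 5.0]], scaler)  # a new sample reuses training statistics
```

Estimating the statistics on the training set only keeps information from the test set out of the fitted model.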
According to an embodiment of the present invention, the reagent is used for detecting the expression level of the host gene and/or the abundance level of the microorganism in the biological sample to be tested.
According to embodiments of the invention, the infectious disease is infectious encephalitis and the non-infectious disease is non-infectious encephalitis; the host genes include at least one of: MS4A4E, OPHN1, PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9; the microorganisms include at least one of: Enterobacter, Apibacter, Caedibacter, Alternaria, Gudongella, Drancourtella, Nitrincola, Cyphellophora, Acinetobacter, Propionicella, Zhengella, and Halorientalis.
According to a preferred embodiment of the present invention, the host genes include at least one of MS4A4E and OPHN1; or the host genes include at least one of MS4A4E and OPHN1 and/or at least one of the following: PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9; the microorganisms include at least Enterobacter; or the microorganisms include at least Enterobacter and/or at least one of the following: Apibacter, Caedibacter, Alternaria, Gudongella, Drancourtella, Nitrincola, Cyphellophora, Acinetobacter, Propionicella, Zhengella, and Halorientalis. Whether a case of encephalitis is infectious can therefore be accurately determined using host gene expression levels and microbial abundance, alone or in combination.
According to an embodiment of the invention, the infectious disease is infectious pneumonia and the non-infectious disease is non-infectious pneumonia; the host genes include at least one of: ABCB6, FNIP1, CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-6856856854-6856 ATAC37 6, TVP23 6; the microorganism comprises at least one of: Scedosporium, Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus, and Occidentaia.
According to a preferred embodiment of the present invention, the host genes include at least one of ABCB6 and FNIP1; or the host genes comprise at least one of ABCB6 and FNIP1 and/or at least one of the following: CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-6856856854-6856 ATAC37 6, TVP23 6; the microorganism comprises at least Scedosporium; or the microorganism comprises at least Scedosporium and/or at least one of the following: Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus, and Occidentaia.
It should be noted that the microorganisms described above are classified and defined at the genus level, and any species-level microorganism belonging to one of these genera is included in the scope of the present invention.
According to an embodiment of the present invention, referring to fig. 1, a method of constructing a machine learning model for identifying infectious and non-infectious diseases comprises:
a) obtaining biological samples from a plurality of subjects who have been clinically and unambiguously diagnosed with an infectious disease;
b) obtaining biological samples from a plurality of subjects who have been clinically and unambiguously diagnosed with a non-infectious disease;
c) determining gene expression levels of a plurality of genes (e.g., all expressed genes or transcriptomes, or a subset thereof) in each of the biological samples from steps (a) and (b);
d) determining the relative abundance of a plurality of microorganisms in each of the biological samples from steps (a) and (b);
e) normalizing the gene expression levels obtained in step (c) to generate normalized gene expression values;
f) normalizing the relative abundance of the microorganisms obtained in step (d) to generate normalized microorganism relative abundance values;
g) inputting the normalized gene expression values and the normalized microorganism relative abundance values into a trainer, and constructing a machine learning model for identifying infectious diseases and non-infectious diseases by taking infectious disease and non-infectious disease as labels.
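By way of illustration only, steps (e) to (g) can be sketched as follows. The sketch assumes z-score normalization per feature and a plain logistic regression trained by gradient descent as the trainer; neither choice is prescribed by the steps above, and all sample values and function names are hypothetical.

```python
import math

def zscore_columns(rows):
    """Normalize each feature (column) to zero mean and unit variance (assumed method)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_logistic(features, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical toy data: 4 samples, 2 host genes, 1 microorganism genus.
gene_expr = [[8.0, 1.0], [7.5, 1.2], [2.0, 3.0], [1.5, 3.5]]   # step (c)
microbe_abund = [[0.30], [0.25], [0.02], [0.01]]               # step (d)
labels = [1, 1, 0, 0]  # 1 = infectious, 0 = non-infectious

genes_norm = zscore_columns(gene_expr)          # step (e)
microbes_norm = zscore_columns(microbe_abund)   # step (f)
features = [g + m for g, m in zip(genes_norm, microbes_norm)]
model = train_logistic(features, labels)        # step (g)
probs = [predict_proba(model, x) for x in features]
```

Normalizing the two feature blocks separately before concatenation mirrors the separation of steps (e) and (f); a different normalization or trainer could be substituted without changing the overall flow.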
Use of reagents for identifying infectious and non-infectious diseases
In a second aspect of the invention, the invention proposes the use of a reagent for identifying infectious and non-infectious diseases. According to an embodiment of the present invention, the reagent is used for detecting the expression level of a host gene and/or the abundance level of a microorganism in a biological sample to be tested. As described above, detecting the expression level of the host gene makes it possible to determine whether or not the subject is infected. Combining the expression level of the host gene with the abundance level of the microorganism makes it possible to judge whether the microorganism is an infectious pathogen or a colonizing microorganism. Thus, infectious diseases and non-infectious diseases can be accurately identified.
According to an embodiment of the invention, the method of identifying infectious and non-infectious diseases comprises: detecting the host gene expression quantity value and/or the microorganism relative abundance value in a biological sample to be detected by using the reagent; inputting the host gene expression quantity value and/or the microorganism relative abundance value into a machine learning model for analysis to obtain an analysis result; determining, based on the results of the analysis, whether the subject providing the biological sample to be tested has an infectious disease or a non-infectious disease; wherein the machine learning model is obtained by the method of constructing a machine learning model for identifying infectious and non-infectious diseases according to the first aspect. Thus, infectious diseases or non-infectious diseases can be accurately identified by using the method.
According to the embodiment of the invention, before the host gene expression quantity value and/or the microorganism relative abundance value is input into the machine learning model, the host gene expression quantity value and/or the microorganism relative abundance value is subjected to homogenization treatment, and the obtained homogenized host gene expression quantity value and/or the homogenized microorganism relative abundance value are input into the machine learning model.
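As a minimal sketch of this identification step, the example below homogenizes the measured values of one new sample and passes them to a trained model. It assumes a z-score homogenization that reuses the means and standard deviations of the training samples, a logistic model, and a decision threshold of 0.5; these choices and every number shown are hypothetical and are not specified above.

```python
import math

def classify_sample(raw_values, means, stds, weights, bias, threshold=0.5):
    """Homogenize one sample with stored training statistics, then score it."""
    normalized = [(v - m) / s for v, m, s in zip(raw_values, means, stds)]
    score = sum(w * x for w, x in zip(weights, normalized)) + bias
    prob = 1.0 / (1.0 + math.exp(-score))
    label = "infectious" if prob >= threshold else "non-infectious"
    return label, prob

# Hypothetical statistics and coefficients for two features:
# one host gene expression value and one microorganism relative abundance value.
means, stds = [4.0, 0.10], [2.0, 0.05]
weights, bias = [1.5, 0.8], 0.0

label, prob = classify_sample([7.0, 0.20], means, stds, weights, bias)
```

Reusing the training statistics keeps the new sample on the same scale as the data the model was fitted on.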
It should be noted that the present invention is not limited to a specific type of reagent, as long as the expression level of the host gene and the abundance level of the microorganism can be measured; for example, the reagent is at least one selected from the group consisting of reagents used in multiplex PCR, real-time quantitative PCR, hybridization, fluorescence detection, nucleic acid mass spectrometry detection, and nucleic acid sequencing. The reagent for measuring the expression level of the host gene may be the same as or different from the reagent for measuring the abundance level of the microorganism; the invention is not limited in this respect.
According to an embodiment of the invention, the biological sample is selected from at least one of peripheral blood, cerebrospinal fluid, alveolar lavage, sputum, a reagent and solid tissue.
According to embodiments of the invention, the infectious disease is infectious encephalitis and the non-infectious disease is non-infectious encephalitis; the host gene includes at least one of: MS4A4E, OPHN1, PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9; the microorganism comprises at least one of: Enterobacter, Apibacter, Caedibacter, Alternaria, Gudonella, Drancourtella, Nitrincola, Cypherophora, Acinetobacter, Propionicella, Zhengella, and Haloritalis.
According to a preferred embodiment of the present invention, the host genes include at least one of MS4A4E and OPHN1; or the host genes comprise at least one of MS4A4E and OPHN1 and/or at least one of the following: PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9; the microorganism comprises at least Enterobacter; or the microorganism comprises at least Enterobacter and/or at least one of the following: Apibacter, Caedibacter, Alternaria, Gudonella, Drancourtella, Nitrincola, Cypherophora, Acinetobacter, Propionicella, Zhengella, and Haloritalis. Therefore, whether a case of encephalitis is infectious encephalitis can be accurately identified by using the expression level of the host genes and the abundance of the microorganisms, alone or in combination.
According to an embodiment of the invention, the infectious disease is infectious pneumonia and the non-infectious disease is non-infectious pneumonia; the host gene includes at least one of: ABCB6, FNIP1, CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-6856856854-6856 ATAC37 6, TVP23 6; the microorganism comprises at least one of: Scedosporium, Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus, and Occidentaia.
According to a preferred embodiment of the present invention, the host genes include at least one of ABCB6 and FNIP1; or the host genes comprise at least one of ABCB6 and FNIP1 and/or at least one of the following: CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-6856856854-6856 ATAC37 6, TVP23 6; the microorganism comprises at least Scedosporium; or the microorganism comprises at least Scedosporium and/or at least one of the following: Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus, and Occidentaia.
It is to be noted that the features and advantages described for the method of constructing a machine learning model for identifying infectious and non-infectious diseases according to the first aspect above (in particular the features and advantages described for the host genes and microorganisms) are equally applicable for this purpose and will not be described in further detail here.
Device for constructing machine learning model for identifying infectious diseases and non-infectious diseases
In a third aspect of the invention, an apparatus is presented for constructing a machine learning model for identifying infectious and non-infectious diseases. According to an embodiment of the invention, the apparatus comprises: an assay unit for determining a host gene expression quantity value and/or a microorganism relative abundance value in biological samples from subjects known to have an infectious disease and subjects known to have a non-infectious disease; and a training unit for inputting the host gene expression quantity values and/or the microorganism relative abundance values obtained by the assay unit into a trainer and constructing, with infectious disease and non-infectious disease as labels, a machine learning model for identifying infectious diseases and non-infectious diseases.
Therefore, the device according to the embodiment of the present invention can be used to construct a machine learning model for accurately identifying infectious diseases and non-infectious diseases, which is helpful for implementing the method for constructing a machine learning model for identifying infectious diseases and non-infectious diseases described in the first aspect.
System for identifying infectious and non-infectious diseases
In a fourth aspect of the invention, a system for identifying infectious and non-infectious diseases is presented. According to an embodiment of the invention, the system comprises: a detection unit for detecting a host gene expression quantity value and/or a microorganism relative abundance value in a biological sample; and an analysis unit adapted to input the host gene expression quantity value and/or the microorganism relative abundance value into the machine learning model obtained by the method of the first aspect for analysis, to obtain an analysis result, and to determine, based on the analysis result, whether the subject providing the biological sample has an infectious disease or a non-infectious disease. Thus, infectious diseases and non-infectious diseases can be accurately identified by using the system.
Before the host gene expression quantity value and/or the microorganism relative abundance value are input into the machine learning model, the host gene expression quantity value and/or the microorganism relative abundance value are subjected to homogenization treatment.
It should be noted that the features and advantages described above for the method of constructing a machine learning model for identifying infectious and non-infectious diseases are equally applicable to this system and will not be described in detail here.
Method for identifying infectious pathogens and colonizing microorganisms
In a fifth aspect of the invention, a method of identifying an infectious pathogen and a colonizing microorganism is provided. According to an embodiment of the invention, the method comprises: detecting the expression quantity value of the host gene and/or the relative abundance value of the microorganism in a biological sample to be detected; inputting the host gene expression quantity value and/or the microorganism relative abundance value into a machine learning model for analysis to obtain an analysis result; and determining whether the biological sample to be tested contains infectious pathogens or colonizing microorganisms or not based on the analysis result. Thus, infectious pathogens and colonizing microorganisms can be accurately identified using this method.
According to an embodiment of the invention, the method of constructing the machine learning model comprises: obtaining biological samples known to contain infectious pathogens and biological samples known to contain colonizing microorganisms; determining the host gene expression quantity value and/or the microorganism relative abundance value in these biological samples; homogenizing the measured host gene expression quantity value and/or the measured microorganism relative abundance value to obtain a homogenized host gene expression quantity value and/or a homogenized microorganism relative abundance value; and inputting the homogenized host gene expression quantity value and/or the homogenized microorganism relative abundance value into a trainer, and training with infectious pathogens and colonizing microorganisms as labels to obtain a machine learning model for identifying infectious pathogens and colonizing microorganisms.
It should be noted that the features and advantages described in the method for constructing a machine learning model for identifying infectious diseases and non-infectious diseases in the first aspect are also applicable to the method for constructing a machine learning model, and are not described herein again.
According to an embodiment of the present invention, the pathogen is a pathogen that causes an organism to suffer from encephalitis, and the host gene and the microorganism are the host gene and the microorganism defined in the method of the first aspect or the use of the second aspect as "the infectious disease is an infectious encephalitis, and the non-infectious disease is a non-infectious encephalitis", respectively.
According to an embodiment of the present invention, the pathogen is a pathogen causing pneumonia, and the host genes and microorganisms are the host genes and microorganisms defined in the method of the first aspect or the use of the second aspect as to "the infectious disease is infectious pneumonia, and the non-infectious disease is non-infectious pneumonia", respectively.
Electronic device
In a sixth aspect of the invention, an electronic device is presented. According to an embodiment of the present invention, the electronic apparatus includes: a memory and a processor; the memory for storing a computer program; the processor for executing the computer program to carry out a method for identifying infectious and non-infectious diseases as described in the method of the first aspect or the use of the second aspect as described above. Thus, by implementing the electronic device, infectious diseases and non-infectious diseases can be accurately identified.
It should be noted that the features and advantages described above for the method for constructing a machine learning model for identifying infectious diseases and non-infectious diseases, and the use of the reagent for identifying infectious diseases and non-infectious diseases, are also applicable to the electronic device, and will not be described herein again.
Executable storage medium
In a seventh aspect of the invention, an executable storage medium is presented. According to an embodiment of the invention, the storage medium stores computer program instructions which, when run on a processor, cause the processor to perform the method of constructing a machine learning model described above, or the method of identifying infectious and non-infectious diseases using the reagent described above.
The term "executable storage medium" refers to any device or system for storing and providing information (e.g., data and instructions) to a computer processor. Examples of such computer-readable media include, but are not limited to, DVDs, CDs, hard drives, tapes, servers for streaming media over a network, and applications such as those found on smartphones and tablets. In various embodiments, aspects of the invention, including data structures and methods, may be stored on computer-readable media. Data processing may also be performed on many types of devices including, but not limited to, desktop and notebook computers, tablets, smartphones, and the like.
It should be noted that the features and advantages described above for the method for constructing a machine learning model for identifying infectious diseases and non-infectious diseases according to the first aspect and the use of the agent according to the second aspect for identifying infectious diseases and non-infectious diseases are also applicable to the executable storage medium, and will not be described herein again.
The scheme of the invention will be explained with reference to the examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the art or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Example 1 identification of infectious and non-infectious encephalitis
1. Grouping and clinical differential diagnosis of samples:
Autoimmune encephalitis (AE) is an immune-mediated inflammation that mostly involves the white matter of the brain and manifests as demyelination of the white matter. Most cases of autoimmune encephalitis are caused by viral infections or tumors. The diagnosis of autoimmune encephalitis covers 4 aspects: clinical manifestations, auxiliary examinations, confirmatory tests, and exclusion of other causes. A confirmed diagnosis of AE meets items 1 to 4 of the following diagnostic conditions. Diagnosing AE requires combining the patient's clinical manifestations with the results of cerebrospinal fluid examination, neuroimaging, and electroencephalography; a positive anti-neuronal antibody test is the main basis for diagnosis.
1.1 Clinical presentation: acute or subacute onset (<3 months) with 1 or more of the following neurological and psychiatric symptoms or clinical syndromes: (1) Limbic system symptoms: 1 or more of the 3 symptoms of hypomnesia, seizures, and abnormal mental behavior. (2) Encephalitis syndrome: clinical manifestations of diffuse or multifocal brain lesions. (3) Clinical manifestations of basal ganglia and/or diencephalon/hypothalamic involvement. (4) Psychiatric disorders that the psychiatry specialty judges to be inconsistent with non-organic disease.
1.2 Auxiliary examination: 1 or more of the following auxiliary examination findings, or a concurrent associated tumor: (1) Cerebrospinal fluid abnormality: elevated white blood cell count in the cerebrospinal fluid (>5×10^6/L); or lymphocytic inflammation on cerebrospinal fluid cytology; or positive cerebrospinal fluid oligoclonal bands. (2) Neuroimaging or electrophysiological abnormalities: abnormal T2 or FLAIR signals in the limbic system on MRI, unilateral or bilateral, or abnormal T2 or FLAIR signals in other regions (except for non-specific white matter changes and stroke); or hypermetabolic changes in the limbic system on PET, or hypermetabolism of multiple cortical regions and/or the basal ganglia; or an electroencephalogram abnormality: focal epilepsy or epileptiform discharges (within or outside the temporal lobe), or diffuse or multifocal slow-wave rhythms. (3) Specific types of tumors associated with AE, such as limbic encephalitis combined with small cell lung cancer, or anti-NMDAR encephalitis combined with teratoma.
1.3 Confirmatory test: positive autoantibodies against neuronal surface antigens. Antibody detection is mainly by indirect immunofluorescence assay (IIF). Based on the antigen substrate, the assays are divided into cell-based assays (CBA) and tissue-based assays (TBA). CBA uses transfected cells expressing neuronal cell surface antigens, and TBA uses animal brain tissue slices as the antigen substrate. CBA has high specificity and sensitivity. Paired cerebrospinal fluid and serum samples from the patient should be tested whenever possible; the initial dilution titers of the cerebrospinal fluid and the serum are 1:1 and 1:10, respectively.
The following experiments were performed using 18 cases of samples clinically identified as autoimmune encephalitis and 41 cases of samples of infectious encephalitis.
2. RNA extraction, sample library construction and sequencing:
DNA was extracted directly with the TIANamp Micro DNA kit (DP316; Tiangen Biotechnology, Beijing) according to the manufacturer's manual. For each patient, 300 μl of cerebrospinal fluid (CSF), or a negative "no template" control (NTC), was transferred to a new sterile tube. After addition of 10 μl proteinase K and 300 μl buffer GB (with carrier RNA), the samples were incubated at 56 ℃ for 10 min. Then 300 μl of cold absolute ethanol was added and the tube was incubated for 5 minutes at room temperature. The liquid was transferred to a new adsorption column and washed with buffer GD and buffer PW. The DNA was then dissolved in 40 μl of Tris-ethylenediaminetetraacetic acid (TE) buffer. The extracted DNA was used to construct a DNA library. The DNA extracted in the previous step was sonicated with a Bioruptor Pico device according to the manufacturer's protocol to generate fragments of 200-300 bp. The DNA library was constructed according to the standard protocol of the BGISEQ-100 sequencing platform (BGI-Tianjin, China). The DNA was end-repaired and adapters were ligated overnight. After ligation, the DNA was amplified by PCR and purified using AMPure XP beads (Beckman Coulter, Pasadena, Calif., USA). Before sequencing, quality control was performed using an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, Calif., USA) in combination with quantitative PCR. Emulsion PCR was performed on the quantified library using the OneTouch system. The DNA was then sequenced using the BGISEQ-100 platform.
3. Data processing, quality control and expression profile quantification:
Samples for which libraries were successfully built were sequenced, rRNA reads were removed, and the data were filtered with fastp to obtain the basic statistical data: on average 13M reads were obtained per sample, the GC content was about 50%, the average genome alignment rate was 83.67%, and the gene set alignment rate was 20.21%.
4. Analysis of differences in host and microbial expression profiles:
Based on the basic statistical data, the expression of 22989 genes was quantified and 22710 reliably expressed genes were identified. Differentially expressed genes were identified between the non-infectious encephalitis group (18 samples, identified as autoimmune encephalitis according to the clinical differential diagnosis) and the infectious encephalitis group (41 samples). Finally, 824 significantly up-regulated genes and 9551 significantly down-regulated genes were found in the non-infectious encephalitis group relative to the infectious encephalitis group.
The basic statistical data were also used to quantify microbial abundance, and 3422 reliably detected microorganisms were identified at the genus level. Differentially abundant microorganisms were identified between the non-infectious encephalitis group (18 samples, identified as autoimmune encephalitis according to the clinical differential diagnosis) and the infectious encephalitis group (41 samples). Finally, 334 significantly up-regulated and 114 significantly down-regulated microorganisms were found in the non-infectious encephalitis group relative to the infectious encephalitis group.
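The quantities used in such a difference analysis can be illustrated with the sketch below. It assumes that the fold change is computed as the log2 ratio of group means with a pseudocount, and that the adjusted p-value (Padj) is obtained by the Benjamini-Hochberg procedure; the text does not state which statistical test or correction was actually used, and all values shown are hypothetical.

```python
import math

def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of the mean abundance in group_a relative to group_b."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical abundance values for one gene in each group.
non_infectious = [7.0, 7.0]
infectious = [1.0, 1.0]
lfc = log2_fold_change(non_infectious, infectious)  # log2((7+1)/(1+1))

# Hypothetical raw p-values for four genes.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A positive value of `lfc` corresponds to up-regulation in the non-infectious group relative to the infectious group, matching the direction used in the counts above.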
5. Constructing and verifying a host gene expression quantity model:
Based on the differential gene analysis result, genes were screened whose log2 abundance difference between the two groups (infectious encephalitis and non-infectious encephalitis) was more than 2-fold and whose Padj value in the multiple hypothesis test was less than or equal to 0.05, and these were pre-screened using the quantitative expression information of the samples (namely, average corrected gene abundance value FPKM > 1, and FPKM > 1 in at least 50% of the samples of each group) to obtain candidate genes.
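The pre-screening criteria can be expressed as a simple filter, sketched below under stated assumptions. The fold-change cut-off is exposed as a parameter because the wording admits more than one reading (|log2 fold change| > 1, i.e. 2-fold, is used as the default here), and the FPKM criterion is read as an overall mean above 1 together with FPKM above 1 in at least half of the samples of every group. The gene names and values are hypothetical.

```python
def passes_prescreen(log2fc, padj, fpkm_by_group,
                     min_abs_log2fc=1.0, max_padj=0.05,
                     min_fpkm=1.0, min_fraction=0.5):
    """Return True if a gene meets the fold-change, Padj and FPKM criteria."""
    all_values = [v for values in fpkm_by_group.values() for v in values]
    mean_ok = sum(all_values) / len(all_values) > min_fpkm
    # In every group, at least min_fraction of the samples must exceed min_fpkm.
    fraction_ok = all(
        sum(v > min_fpkm for v in values) / len(values) >= min_fraction
        for values in fpkm_by_group.values()
    )
    return (abs(log2fc) > min_abs_log2fc and padj <= max_padj
            and mean_ok and fraction_ok)

# Hypothetical genes: (log2 fold change, Padj, FPKM values per group).
genes = {
    "GENE_A": (2.5, 0.01, {"infectious": [5.0, 6.0, 0.5], "non_infectious": [1.5, 2.0, 3.0]}),
    "GENE_B": (0.4, 0.01, {"infectious": [5.0, 6.0, 4.0], "non_infectious": [4.0, 5.0, 6.0]}),
    "GENE_C": (-3.0, 0.20, {"infectious": [9.0, 8.0, 7.0], "non_infectious": [1.5, 1.2, 1.1]}),
    "GENE_D": (3.0, 0.001, {"infectious": [9.0, 0.2, 0.1], "non_infectious": [2.0, 2.0, 2.0]}),
}

candidates = [name for name, (lfc, padj, fpkm) in genes.items()
              if passes_prescreen(lfc, padj, fpkm)]
```

In this toy input only GENE_A is retained: GENE_B fails the fold-change cut-off, GENE_C fails the Padj cut-off, and GENE_D is expressed above FPKM 1 in fewer than half of the infectious samples.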
From the autoimmune encephalitis samples and the infectious encephalitis samples differentially diagnosed in step 1, a portion of each was taken as the training set and the remainder as the test set. The candidate genes were modeled on the training set using a generalized linear model fitted with the glmnet package of the R language, with the model parameter family set to "binomial", the dependent variable set as a binary variable (autoimmune encephalitis (AE) or non-AE), and the lambda value set to 50, to obtain a plurality of models. Cross-validation was performed 10 times on each model using the test set, yielding an AUC value for each model. The model with the highest AUC and the fewest genes was selected as the best model.
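The selection rule in the last sentence can be sketched as follows. The example computes the AUC from predicted scores by pairwise comparison of positive and negative samples (equivalent to the Mann-Whitney statistic) and then picks the model with the highest AUC, breaking ties in favor of fewer genes. It does not reproduce the glmnet fitting itself; the scores, model names, and gene counts are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best(models):
    """models: list of (name, auc, n_genes); highest AUC first, then fewest genes."""
    return max(models, key=lambda m: (m[1], -m[2]))

# Hypothetical test-set labels (1 = AE, 0 = non-AE) and predicted scores.
auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical candidate models with their AUC values and gene counts.
best = select_best([
    ("model_all", 0.90, 25),
    ("model_small", 0.90, 5),
    ("model_single", 0.80, 1),
])
```

Here `model_small` is chosen because it ties `model_all` on AUC while using fewer genes.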
The results are shown in fig. 2. Models constructed with MS4A4E or OPHN1 as the identifying gene both have high AUC values: the AUC value of the model built with the single MS4A4E gene reaches 0.82, and the AUC value of the model built with the single OPHN1 gene reaches 0.73. Therefore, MS4A4E and OPHN1 were used as core identifying genes. Next to these two are the following candidate identifying genes: PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1 and RNA5S9. Models constructed with these candidate identifying genes also have relatively high AUC values, and the AUC value of the model built with the combination of the core identifying genes MS4A4E and OPHN1 and all of the candidate identifying genes mentioned above reaches 0.91. In contrast, the AUC value of the model built with the following combination of known autoimmune encephalitis-related genes was 0.55: GRIN1, GRIN2A, GRIN2B, GRIN2C, GRIN2D, GRIN3A, GRIN3B, AQP4, LGI1, GABRA1, GRIA1, GRIA2, IgLON5, Caspr2, MOG.
6. Constructing and verifying a microorganism relative abundance value model:
Based on the microbial difference analysis result in step 4, microorganisms were screened whose log2 abundance difference between the two groups (infectious encephalitis and non-infectious encephalitis) was more than 2-fold and whose Padj value in the multiple hypothesis test was less than or equal to 0.05, and these were pre-screened using the quantitative information on the microbial abundance of the samples (average corrected abundance value FPKM > 1, and FPKM > 1 in at least 50% of the samples of each group) to obtain candidate microorganisms.
The candidate microorganisms were modeled following the method of step 5, combined with 10-times cross-validation, to obtain the core identifying microorganism Enterobacter and the candidate identifying microorganisms: Apibacter, Caedibacter, Alternaria, Gudonella, Drancourtella, Nitrincola, Cypherophora, Acinetobacter, Propionicella, Zhengella, and Haloritalis.
As shown in FIG. 3, the AUC value obtained by using the single Enterobacter microorganism was 0.63, the AUC value obtained by using the above core microorganism Enterobacter and all the candidate differential microorganism combinations was 0.73, and the AUC value obtained by using the combination of non-target microorganisms was 0.49.
The combination of non-target microorganisms used in this example were Ambidenovirus, Lachnoaerobacterium, Parabactoides, Paracyndimonas, Disoconium, Flavonifractor, Porphyromonas, Aparavir, Pontiacum, Fluviii spira, Belllilina.
7. Constructing and verifying a combined host gene expression and microbial relative abundance model:
The screened host gene expression quantity values and the screened microbial relative abundance values were normalized separately to obtain normalized host gene expression quantity values and normalized microbial relative abundance values, and a machine learning model was constructed following the method in step 5. Using the combination of microbial abundance and host genes, the AUC value reaches 0.93 (as shown in fig. 4), a significant improvement (Wilcoxon test, p < 0.05) over the target host gene combination alone (AUC value 0.91) or the target microorganism combination alone (AUC value 0.73), so that infectious encephalitis and non-infectious encephalitis can be accurately identified.
Example 2 identification of infectious and non-infectious pneumonia
1. Grouping and clinical differential diagnosis of samples:
the infectious pneumonia samples in this example were all samples of community-acquired pneumonia.
Community-acquired pneumonia is pneumonia contracted outside the hospital and caused by various microorganisms such as bacteria, viruses, chlamydia and mycoplasma. The main clinical symptoms are cough, with or without expectoration and chest pain; the prodromal symptoms are mainly rhinitis-like symptoms or symptoms of upper respiratory tract infection, such as nasal obstruction, rhinorrhea, sneezing, dry throat, sore throat, foreign body sensation in the pharynx, hoarseness, headache, dizziness, hot and swollen eyes, lacrimation, and mild cough. Not every patient with community-acquired pneumonia has prodromal symptoms; their incidence is generally between 30% and 65%, depending on the pathogen.
As with other pneumonias, the diagnosis of community-acquired pneumonia generally considers that patients have fever, cough, purulent sputum, and an elevated or reduced white blood cell count; the chest X-ray shows patchy, lobar, or alveolar high-density infiltrative lesions. More than half of patients older than 65 years have symptoms outside the respiratory tract, and more than 1/3 of patients have no signs of systemic infection. An initial clinical diagnosis can be made by examining body temperature, pulse, breath sounds and rales at onset.
The following experiments were performed on 64 samples of infectious pneumonia and 32 samples of non-infectious pneumonia that were differentially diagnosed by the above clinical methods.
2. Following steps 2-5 of example 1, RNA was extracted from sputum, libraries were constructed and sequenced, and the sequencing results were analyzed; the core identifying genes ABCB6 and FNIP1 and the following candidate identifying genes were screened: CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-685938 6ATAC37 6 and TVP23 6.
A machine learning model constructed from the host gene expression quantity values can identify infectious pneumonia and non-infectious pneumonia. As shown in fig. 5, the AUC value using the single ABCB6 gene was up to 0.78, the AUC value using the single FNIP1 gene was up to 0.73, the AUC value using the above-described combination of the core identifying genes and all candidate identifying genes was up to 0.79, and the AUC value using the combination of known pneumonia-associated genes was 0.67.
Known pneumonia-associated genes used in this example were IFNA17, IGHD3-3, TRNS1, TRNM, OXPHOS, COX3, ND4L, and HMOX 1.
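The AUC values quoted in step 2 summarize how well a score separates the two groups. As a minimal sketch, assuming a single gene's expression value is used directly as the score, the AUC can be computed from pairwise rank comparisons. The function name `roc_auc` and the expression values below are hypothetical illustrations, not data or code from this example.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based ROC AUC: the fraction of (positive, negative) pairs in which
    the positive sample has the higher score; ties count as half a pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical expression values of one gene (NOT measurements from this example).
expression = [8.1, 7.4, 6.9, 6.2, 5.0, 6.5, 4.8, 4.1]
# 1 = infectious pneumonia, 0 = non-infectious pneumonia (assumed label encoding).
is_infectious = [1, 1, 1, 1, 1, 0, 0, 0]

auc = roc_auc(expression, is_infectious)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is the scale on which the single-gene and combined-gene results are compared.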
3. Following steps 4 and 6 of Example 1, the core identifying microorganism Scedosporium was screened, together with the following candidate identifying microorganisms: Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus and Occidentaia. The relative abundance values of these microorganisms were homogenized and then used to construct a machine learning model that can distinguish infectious pneumonia from non-infectious pneumonia. As shown in Fig. 6, the AUC reached 0.56 using Scedosporium alone and 0.62 using the combination of the core identifying microorganism and all candidate identifying microorganisms, whereas the AUC using the combination of non-target microorganisms was 0.50.
The non-target microorganisms used in this example were Streptococcus pneumoniae, Staphylococcus aureus, alpha-hemolytic Streptococcus, Klebsiella pneumoniae, Haemophilus influenzae and Pseudomonas aeruginosa.
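Step 3 relies on relative abundance values that are homogenized before modeling. The exact homogenization procedure is the one referenced from Example 1 and is not reproduced here, so the sketch below makes two assumptions: relative abundance is taken as a genus's read count divided by the total microbial read count of the sample, and homogenization is approximated by a z-score across samples. The read counts and function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def relative_abundance(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each genus among all microbial reads of one sample (assumed definition)."""
    total = sum(read_counts.values())
    if total == 0:
        return {genus: 0.0 for genus in read_counts}
    return {genus: count / total for genus, count in read_counts.items()}


def zscore(values: List[float]) -> List[float]:
    """Center to mean 0 and scale to unit population standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Hypothetical read counts for three samples (NOT data from this example).
samples = [
    {"Scedosporium": 30, "Lachnospira": 70},
    {"Scedosporium": 10, "Lachnospira": 90},
    {"Scedosporium": 50, "Lachnospira": 50},
]

abundance = [relative_abundance(s)["Scedosporium"] for s in samples]
normalized = zscore(abundance)
print(abundance, normalized)
```

Any other scaling, for example min-max or log transformation, could be substituted in `zscore` without changing the rest of the workflow.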
4. The screened host gene expression values and the screened microorganism relative abundance values were each normalized to obtain normalized host gene expression values and normalized microorganism relative abundance values, which were then used to construct a machine learning model. Using the combination of microorganism abundance and host gene features, the AUC reached 0.79 (as shown in Fig. 7).
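Step 4 combines the two normalized feature sets into one model input. The sketch below illustrates the idea by concatenating one host gene feature and one microbial feature per sample and fitting a small logistic regression by gradient descent. Logistic regression is used here only as a stand-in; this passage does not state which learning algorithm was used, and all feature values, labels and function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x: List[float], w: List[float], b: float) -> float:
    """Predicted probability of the positive (infectious) class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.1, epochs: int = 500) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical normalized features (NOT data from this example).
host = [[1.2], [0.8], [0.5], [-0.4], [-0.9], [-1.2]]      # one host gene per sample
microbe = [[0.3], [1.0], [-0.2], [0.1], [-0.7], [-0.5]]   # one genus per sample
labels = [1, 1, 1, 0, 0, 0]  # 1 = infectious, 0 = non-infectious (assumed encoding)

# Combine the two feature sets by concatenating them sample by sample.
X = [h + m for h, m in zip(host, microbe)]
w, b = train_logistic(X, labels)
probs = [predict_proba(x, w, b) for x in X]
print(w, b, probs)
```

In practice the model would be evaluated on samples held out from training; the toy data above are fitted and scored on the same six samples only to keep the sketch short.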
In this specification, a description referring to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict each other.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (16)

1. A method of constructing a machine learning model for identifying infectious and non-infectious diseases, comprising:
determining a host gene expression magnitude and/or a microorganism relative abundance value in a biological sample from a subject known to have an infectious disease and a subject known to have a non-infectious disease;
inputting the obtained host gene expression quantity value and/or microorganism relative abundance value into a trainer, and constructing a machine learning model for identifying infectious diseases and non-infectious diseases by taking the infectious diseases and the non-infectious diseases as markers.
2. The method of claim 1, wherein the biological sample is selected from at least one of peripheral blood, cerebrospinal fluid, alveolar lavage, sputum, a probe, and solid tissue.
3. The method according to claim 1, wherein before the host gene expression quantity value and/or the microorganism relative abundance value are inputted into the trainer, the host gene expression quantity value and/or the microorganism relative abundance value are subjected to homogenization treatment, and the obtained homogenized host gene expression quantity value and/or the homogenized microorganism relative abundance value are inputted into the trainer.
4. Use of an agent for identifying infectious and non-infectious diseases, wherein said agent is used for detecting the expression level of a host gene and/or the abundance level of a microorganism in a biological sample to be tested.
5. The use according to claim 4, wherein the method for identifying infectious and non-infectious diseases comprises:
detecting the host gene expression quantity value and/or the microorganism relative abundance value in a biological sample to be detected by using the reagent;
inputting the host gene expression quantity value and/or the microorganism relative abundance value into a machine learning model for analysis to obtain an analysis result;
determining, based on the results of the analysis, whether the subject providing the biological sample to be tested has an infectious disease or a non-infectious disease;
wherein the machine learning model is obtained by the method for constructing a machine learning model for identifying infectious diseases and non-infectious diseases according to any one of claims 1 to 3.
6. The use according to claim 5, wherein before the host gene expression quantity value and/or the microorganism relative abundance value is input into the machine learning model, the host gene expression quantity value and/or the microorganism relative abundance value is subjected to homogenization treatment, and the obtained homogenized host gene expression quantity value and/or the homogenized microorganism relative abundance value is input into the machine learning model;
the reagent is selected from at least one of reagents used in multiplex PCR, real-time quantitative PCR, hybridization, fluorescence detection, nucleic acid mass spectrometry detection and nucleic acid sequencing;
the biological sample is selected from at least one of peripheral blood, cerebrospinal fluid, alveolar lavage fluid, sputum, a probe, and solid tissue.
7. The method according to any one of claims 1 to 3 or the use according to any one of claims 4 to 6, wherein the infectious disease is an infectious encephalitis and the non-infectious disease is a non-infectious encephalitis;
the host gene includes at least one of: MS4A4E, OPHN1, PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9;
the microorganism comprises at least one of: Enterobacter, Apibacter, Caedibacter, Alternaria, Gudonglella, Drancourtella, Nitrincola, Cypherophora, Acinetobacter, Propionicella, Zhengella, and Haloritalis;
preferably,
the host genes include at least one of MS4A4E and OPHN1;
or the host genes comprise at least one of MS4A4E and OPHN1 and/or at least one of the following: PLK3, ITGB7, NINJ2, STAT3, VRK1, TMEM147, REXO2, LCMT1, VPS54, POLE3, TESC, EMC3, RNF123, MZT2B, MALSU1, DHRS1, LPCAT4, EAPP, RASAL3, TGS1, and RNA5S9;
the microorganism comprises at least Enterobacter;
or the microorganism comprises at least Enterobacter and/or at least one of the following: Apibacter, Caedibacter, Alternaria, Gudonella, Drancourtella, Nitrincola, Cypherophora, Acinetobacter, Propionicella, Zhengella, and Haloritalis.
8. The method according to any one of claims 1 to 3 or the use according to any one of claims 4 to 6, wherein the infectious disease is infectious pneumonia and the non-infectious disease is non-infectious pneumonia;
the host gene includes at least one of: ABCB6, FNIP1, CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 1-1029-1-1044 1-1144-1183 1-685120-1-6856856856856856854-1314-1-685685685685685685685685 685 1-6858-1-685832-1-685839-1-6856 ATAC37 1, TVP23 1;
the microorganism comprises at least one of: Scedosporium, Komagataibacterium, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniella, Lachnospira, Nicoletella, Abyssicococcus, and Occidentaia;
preferably,
the host genes include at least one of ABCB6 and FNIP1;
or the host genes comprise at least one of ABCB6 and FNIP1 and/or at least one of the following: CFTRP2, COX6CP2, DNAL4, GDF5-AS1, KLHL41, KRTAP9-2, MIR1283-2, MIR3689D2, MIR654, MRPS6, MTND4LP24, MTNR1A, PTGES3P4, RNA5SP360, RNU 6-1029-6-1044 6-1144-6-1183-6-1312 6-1314 6-187 6-493 6-698 6-685743-832 4-839-6-6856856854-6856 ATAC37 6, TVP23 6;
the microorganism comprises at least Scedosporium;
or the microorganism comprises at least Scedosporium and/or at least one of the following: Komagataibacteria, Alphabaculovir, Cyclobacterium, Libanicoccus, Serpentiniicella, Lachnospira, Nicoletella, Abyssicoccus and Occidentaia.
9. An apparatus for constructing a machine learning model for identifying infectious and non-infectious diseases, comprising:
an assay unit for determining a host gene expression magnitude and/or a microorganism relative abundance value in a biological sample of a subject known to have an infectious disease and a subject known to have a non-infectious disease;
and a training unit for constructing a machine learning model for identifying infectious diseases and non-infectious diseases, with the infectious diseases and the non-infectious diseases as markers, based on the host gene expression quantity values and/or the microorganism relative abundance values obtained by the assay unit and input into a trainer.
10. A system for identifying infectious and non-infectious diseases, comprising:
a detection unit for detecting an expression quantity value of a host gene and/or an abundance value of a microorganism in a biological sample;
an analysis unit adapted to input the host gene expression magnitude and/or microorganism relative abundance value into the machine learning model obtained by the method of claims 1-3, 7 or 8, perform analysis to obtain an analysis result, and determine whether the subject providing the biological sample has an infectious disease or a non-infectious disease based on the analysis result.
11. A method for identifying infectious pathogens and colonizing microorganisms comprising:
detecting the expression quantity value of the host gene and/or the relative abundance value of the microorganism in a biological sample to be detected;
inputting the host gene expression quantity value and/or the microorganism relative abundance value into a machine learning model for analysis to obtain an analysis result;
and determining whether the biological sample to be tested contains infectious pathogens or colonizing microorganisms or not based on the analysis result.
12. The method of claim 11, wherein the method of constructing the machine learning model comprises:
obtaining a biological sample known to contain infectious pathogens and colonizing microorganisms;
determining a host gene expression level and/or a microorganism relative abundance value in the biological sample known to contain infectious pathogens and colonizing microorganisms;
homogenizing the measured host gene expression quantity value and/or the measured microorganism relative abundance value to obtain a homogenized host gene expression quantity value and/or a homogenized microorganism abundance expression quantity value;
and inputting the host gene expression quantity value and/or the microorganism relative abundance value into a trainer, and training by taking infectious pathogens and the colonizing microorganisms as markers to obtain a machine learning model for identifying the infectious pathogens and the colonizing microorganisms.
13. The method of claim 12, wherein the pathogen is an encephalitis-causing pathogen, and the host gene and the microorganism are respectively as defined in the method or use of claim 7.
14. The method of claim 12, wherein the pathogen is a pathogen causing pneumonia in the body, and the host genes and microorganisms are the host genes and microorganisms defined in the method or use of claim 8, respectively.
15. An electronic device, comprising: a memory and a processor;
the memory for storing a computer program;
the processor for executing the computer program to implement the method of claim 1 to 3, 7 or 8 or the method of identifying infectious and non-infectious diseases in the use of any one of claims 4 to 6.
16. An executable storage medium, wherein the storage medium stores computer program instructions which, when executed on a processor, cause the processor to perform a method of identifying infectious and non-infectious diseases as claimed in claims 1 to 3, 7 or 8 or in any one of the uses of claims 4 to 6.
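The identification workflow recited in the use claims (measure the values, homogenize them, input them into the trained model, and decide between the two classes) can be sketched as follows. This is an illustrative sketch only: the linear-logistic scoring, the 0.5 decision threshold, the feature names chosen, and every numeric parameter are assumptions made for the example and do not reflect the trained model or any measured effect direction.

```python
import math
from typing import Dict, Tuple


def classify_sample(raw: Dict[str, float],
                    norm_params: Dict[str, Tuple[float, float]],
                    weights: Dict[str, float],
                    bias: float,
                    threshold: float = 0.5) -> str:
    """Normalize each measured feature with stored (mean, sd) values,
    score the sample with a linear-logistic model, and return the class call."""
    z = bias
    for feature, weight in weights.items():
        mu, sd = norm_params[feature]
        value = (raw[feature] - mu) / sd if sd else 0.0
        z += weight * value
    prob = 1.0 / (1.0 + math.exp(-z))
    return "infectious" if prob >= threshold else "non-infectious"


# Hypothetical stored parameters (NOT values from the patent).
norm_params = {"ABCB6": (5.0, 2.0), "Scedosporium": (0.10, 0.05)}
weights = {"ABCB6": 1.5, "Scedosporium": 0.5}
bias = 0.0

call_a = classify_sample({"ABCB6": 8.0, "Scedosporium": 0.20}, norm_params, weights, bias)
call_b = classify_sample({"ABCB6": 3.0, "Scedosporium": 0.05}, norm_params, weights, bias)
print(call_a, call_b)
```

Reusing the normalization parameters estimated on the training samples, rather than re-estimating them on the sample under test, keeps the input on the same scale the model was trained on.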
CN202210566863.XA 2022-05-23 2022-05-23 Method for constructing machine learning model for identifying infectious diseases and non-infectious diseases Active CN114854847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210566863.XA CN114854847B (en) 2022-05-23 2022-05-23 Method for constructing machine learning model for identifying infectious diseases and non-infectious diseases

Publications (2)

Publication Number Publication Date
CN114854847A true CN114854847A (en) 2022-08-05
CN114854847B CN114854847B (en) 2023-08-04

Family

ID=82639302

Country Status (1)

Country Link
CN (1) CN114854847B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116598005A (en) * 2023-07-17 2023-08-15 中日友好医院(中日友好临床医学研究所) Lower respiratory tract infection probability prediction system and device based on host sequence information
CN116598005B (en) * 2023-07-17 2023-10-03 中日友好医院(中日友好临床医学研究所) Lower respiratory tract infection probability prediction system and device based on host sequence information

Also Published As

Publication number Publication date
CN114854847B (en) 2023-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221010

Address after: 518083 Huada Complex Park, 21 Hongan Third Street, Yantian District, Shenzhen City, Guangdong Province, 7 buildings, 7 floors-14 floors

Applicant after: BGI SHENZHEN Co.,Ltd.

Applicant after: Shenzhen Huada Medical Laboratory

Address before: 518083 Huada Complex Park, 21 Hongan Third Street, Yantian District, Shenzhen City, Guangdong Province, 7 buildings, 7 floors-14 floors

Applicant before: BGI SHENZHEN Co.,Ltd.

GR01 Patent grant