WO2017156739A1 - Isolated nucleic acid application thereof - Google Patents

Publication number
WO2017156739A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
sequence
acid sequence
abundance
individual
Application number
PCT/CN2016/076577
Other languages
French (fr)
Chinese (zh)
Inventor
仲文迪
张林爽
郑智俊
Original Assignee
上海锐翌生物科技有限公司
Application filed by 上海锐翌生物科技有限公司
Priority to CN201680083629.4A (CN109072306A)
Priority to PCT/CN2016/076577 (WO2017156739A1)
Publication of WO2017156739A1

Classifications

    • C - CHEMISTRY; METALLURGY
    • C07 - ORGANIC CHEMISTRY
    • C07H - SUGARS; DERIVATIVES THEREOF; NUCLEOSIDES; NUCLEOTIDES; NUCLEIC ACIDS
    • C07H21/00 - Compounds containing two or more mononucleotide units having separate phosphate or polyphosphate groups linked by saccharide radicals of nucleoside groups, e.g. nucleic acids
    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The present invention relates to the field of biomarkers, in particular to isolated nucleic acids and uses thereof, and more particularly to a set of isolated nucleic acids, uses of the isolated nucleic acids, and a method for determining the state of an individual using the isolated nucleic acids.
  • Colorectal cancer, also known as colon cancer, is a cancer that originates in the colon or rectum (parts of the large intestine) when cells grow abnormally; it may invade or metastasize to other parts of the body. Patients with colorectal cancer often have symptoms such as blood in the stool, changes in bowel habits, weight loss, and fatigue. Colorectal cancer is the third most common cancer, accounting for about 10% of all cancer cases. In 2012, there were 1.4 million newly diagnosed cases of colorectal cancer worldwide and 694,000 deaths. Colorectal cancer is more common in developed countries, which account for 65% of the total number of cases worldwide. It is less common in women than in men. In recent years, the incidence of colorectal cancer in China has shown a clear upward trend.
  • Intestinal microorganisms play an important role in intestinal epithelial cells, including the formation of microbial barriers to prevent colonization of pathogenic bacteria, and the implementation of immune regulation and metabolic functions.
  • An imbalance of the intestinal flora can lead to colorectal cancer in several ways: pathogenic microorganisms cause intestinal inflammation by activating receptors, adhering to the epithelium, secreting enterotoxins, or invading tissue.
  • Changes in the number, structure, and stability of intestinal microbes, especially imbalances in the flora can alter normal physiological functions and cause intestinal diseases, including colorectal cancer.
  • There are three main types of colorectal cancer diagnostic methods: X-ray examination; sigmoidoscopy and fiberoptic colonoscopy; and the carcinoembryonic antigen (CEA) test. The CEA test is of little value in the diagnosis of early cases. Endoscopy is highly sensitive, but many people in high-risk groups refuse regular screening every 1-2 years because they do not want to undergo a painful colonoscopy. Studies have reported that measuring the level of miR-92 in the blood can indicate whether the person tested has colorectal cancer, but the false-positive rate in healthy people reaches 30%. With the completion of human genome sequencing and the rapid development of high-throughput sequencing technology, genetic screening has become the direction of colorectal cancer diagnosis.
  • Colorectal cancer has no obvious symptoms at an early stage, and there is currently no effective non-invasive method for the early diagnosis of colorectal cancer.
  • Metagenomic sequencing studies have revealed similar substantial changes in the gut microbiota of colorectal cancer patients (Zeller G et al., 2014; Feng Q et al., 2015). However, the specific biomarkers related to colorectal cancer among intestinal microbes are still unclear.
  • The present invention is directed to solving at least one of the above problems or to providing at least one commercial alternative.
  • A set of isolated nucleic acids comprising at least one of the following two nucleic acid sequence clusters: a first nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 1-151, and each nucleic acid sequence has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 1-151; and a second nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 152-233, and each nucleic acid sequence has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 152-233.
  • the invention provides the use of the above isolated nucleic acid for detecting colorectal cancer, and/or for treating colorectal cancer, and/or for preparing a medicament for treating colorectal cancer and/or for preparing a functional food.
  • A method for obtaining an isolated nucleic acid, comprising: (1) obtaining a first sequencing result and a second sequencing result, the first sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of colorectal cancer patients and comprising a plurality of first reads, the second sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of healthy individuals and comprising a plurality of second reads; (2) assembling the first reads and the second reads separately to obtain, correspondingly, a plurality of first assembly sequences and a plurality of second assembly sequences; (3) determining the abundance of the first assembly sequences and the abundance of the second assembly sequences based, respectively, on the support of the first reads for the first assembly sequences and of the second reads for the second assembly sequences; (4) based on the abundance of the first assembly sequences and the abundance of the second assembly sequences determined in (3), clustering the first assembly sequences and the second assembly sequences to obtain
  • A method for determining the state of an individual using the nucleic acid of the above aspect of the invention, comprising: determining the abundance of a nucleic acid sequence cluster of the nucleic acid in a stool sample of the individual and its abundance in a control group; and comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group, and determining the state of the individual based on whether the difference is statistically significant, wherein the control group consists of stool samples of one or more groups of individuals in the same state, the state being either having colorectal cancer or not having colorectal cancer.
  • All or part of the steps of the above method for determining the state of an individual may be performed using a device/system comprising separable functional unit modules corresponding to those steps, or the method may be written as a program, stored on a machine-readable medium, and carried out by a machine that runs the program from that medium.
  • A device for determining the state of an individual using the nucleic acid of the above aspect of the invention, the device being for carrying out the above method for determining the state of an individual and comprising: an abundance determining unit for determining the abundance of a nucleic acid sequence cluster of the nucleic acid in the fecal sample of the individual and its abundance in a control group; and an individual state determining unit for comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group and determining the state of the individual based on whether the difference is statistically significant, the control group being the stool of one or more groups of individuals in the same state
  • A system for determining the state of an individual using the nucleic acid of the above aspect of the invention, the system being for performing all or part of the steps of the above method for determining the state of an individual and comprising: a data input module for inputting data; a data output module for outputting data; a processor for executing an executable program, where executing the executable program includes performing all or part of the steps of the above method for determining the state of an individual; and a storage module, coupled to the data input module, the data output module, and the processor, for storing data including the executable program.
  • A method of classifying a plurality of individuals using the nucleic acid of the above aspect of the invention, comprising: determining the state of each individual using the above method for determining the state of an individual; and classifying the individuals according to the state obtained for each individual.
  • the method can distinguish a plurality of individuals according to different states of the individual or distinguish a plurality of unknown stool samples, which is convenient for classification and label management.
  • the present invention provides a medicament for treating colorectal cancer, characterized in that the medicament promotes an increase in the abundance of the nucleic acid of the above aspect of the invention in the intestinal tract of a patient.
  • The present invention provides a method for producing or screening the medicament for treating colorectal cancer of the above aspect of the invention, comprising the step of screening for a substance that promotes an increase in the abundance of the nucleic acid of the above aspect of the invention and using that substance as the medicament.
  • The isolated nucleic acid of the above aspect of the invention was determined by the inventors by analyzing sequencing data of intestinal microbial samples, comparing the abundance of intestinal microbial sequences between the colorectal cancer patient population and the healthy population, and then verifying the differences by testing a large number of samples.
  • This set of isolated nucleic acids can be used as a marker for colorectal cancer. Compared with the group of colorectal cancer patients, the isolated nucleic acid is significantly enriched in the healthy population. Significant enrichment means that the abundance of the nucleic acid sequence clusters contained in the colorectal cancer markers is statistically higher, or significantly higher, in the healthy group than in the disease group.
  • The set of isolated nucleic acids can be used to determine whether an individual has a high or low probability of having colorectal cancer, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer.
  • A substance capable of increasing the abundance of the isolated nucleic acid can be used for treating colorectal cancer or for administration to patients with colorectal cancer; such substances include, but are not limited to, drugs for treating colorectal cancer and functional foods that benefit the balance of the intestinal flora.
  • the nucleic acid identified by the method of one aspect of the present invention that is, a colorectal cancer marker, can be used for the preparation of a medicament for treating colorectal cancer and/or for the preparation of a functional food, a health care drug or the like which is advantageous for balancing the intestinal flora.
  • The method, apparatus and/or system for determining the state of an individual in the above aspects of the invention detects the abundance of the nucleic acid sequence clusters in a stool sample of the individual and compares the detected abundance with the abundance of the same nucleic acid sequence clusters in the control group; based on the comparison result, the relative probability of the individual being a colorectal cancer individual or a healthy individual can be determined.
  • Fig. 1 is a flow chart showing the experimental analysis of screening and identifying colorectal cancer markers in an embodiment of the present invention.
  • Fig. 2 is an abundance heat map of the cluster CAG16610 in an embodiment of the present invention; the left side is the abundance heat map in the discovery data set, and the right side is the abundance heat map in the validation data set.
  • "a plurality" means two or more unless otherwise stated.
  • Unless otherwise explicitly specified and defined, "connection" shall be understood broadly: it may be, for example, a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; and it may be a direct connection, an indirect connection through an intermediate medium, or internal communication between two elements.
  • a biological marker is a cellular, biochemical, or molecular alteration that can be detected from a biological medium.
  • Biological media include various body fluids, tissues, cells, feces, hair, exhaled breath and the like.
  • Abundance refers to the degree of richness of a microorganism or sequence in a population of microorganisms or a set of nucleic acid sequences. For example, the abundance of a microorganism in the intestinal microbial population can be expressed as the content of that microorganism in the population; likewise, the abundance of a nucleic acid sequence in a set of nucleic acid sequences can be expressed as the content of that nucleic acid sequence in the set.
  • Sequence identity and sequence similarity refer to the degree of identity or similarity between sequences, respectively.
  • A set of isolated nucleic acids comprising at least one of the following clusters of nucleic acid sequences: a first nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 1-151, and the sequence similarity between each nucleic acid sequence and its corresponding sequence in SEQ ID NO: 1-151 is not less than 90%; and a second nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 152-233, and the sequence similarity between each nucleic acid sequence and its corresponding sequence in SEQ ID NO: 152-233 is not less than 90%.
  • The isolated nucleic acid was determined by the inventors by analyzing sequencing data of intestinal microbial samples, comparing the abundance of intestinal microbial sequences between the colorectal cancer patient population and the healthy population, and then verifying the differences by testing a large number of samples.
  • the nucleic acid sequence contained in each nucleic acid sequence cluster is a non-redundant sequence.
  • The isolated nucleic acid can be used as a colorectal cancer marker. Compared with the colorectal cancer patient group, the marker is significantly enriched in the healthy population; significant enrichment means that its abundance in the healthy group is statistically higher, or significantly higher, than its abundance in the disease group.
  • The set of isolated nucleic acids can be used to determine whether an individual has a high or low probability of having colorectal cancer, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer.
  • SEQ ID NO: 1-151 and SEQ ID NO: 152-233 are sequence clusters determined by the inventors from sample nucleic acid sequencing data and assembly cluster analysis, and the sequences in each sequence cluster come from the same species. It is generally believed that the success rate of DNA/DNA hybridization within the same species is 80% or more. The inventors therefore define sequences having a sequence similarity of not less than 90% to the corresponding sequences in SEQ ID NO: 1-151 or SEQ ID NO: 152-233 as belonging to the same sequence cluster and as capable of functioning as colorectal cancer sequence markers.
  • The inventors repeatedly altered nucleotides of the sequences in the nucleic acid sequence clusters such that each sequence retained a similarity of not less than 90% to its corresponding sequence in the SEQ ID NOs, and verified the results by testing. The threshold of not less than 90% sequence similarity is thus supported by experiments, and the resulting sequence clusters can also be used as markers for colorectal cancer.
  • the set of isolated nucleic acids consists of one or both of the first nucleic acid sequence cluster and the second nucleic acid sequence cluster.
  • the nucleic acid comprises the first nucleic acid sequence cluster.
  • the sequence similarity between the nucleic acid sequence in each of the first nucleic acid sequence clusters and the corresponding sequence in SEQ ID NO: 1-151 is not less than 95%.
  • the nucleic acid further comprises the second sequence cluster.
  • the sequence identity of the nucleic acid sequence in each of said second nucleic acid sequence clusters to its corresponding SEQ ID NO: 152-233 is not less than 95%.
  • the nucleic acid comprises the second nucleic acid sequence cluster.
  • the sequence identity of the nucleic acid sequence in each of said second nucleic acid sequence clusters to its corresponding SEQ ID NO: 152-233 is not less than 95%.
  • the nucleic acid further comprises the first sequence cluster.
  • The sequence similarity between the nucleic acid sequence in each of the first nucleic acid sequence clusters and its corresponding sequence in SEQ ID NO: 1-151 is not less than 95%.
  • The isolated nucleic acid was determined by the inventors by analyzing sequencing data of intestinal microbial samples, comparing the abundance of intestinal microbial sequences between the colorectal cancer patient population and the healthy population, and then verifying the differences by testing a large number of samples.
  • The isolated nucleic acid can be used as a colorectal cancer marker. Compared with the group of colorectal cancer patients, the marker is significantly enriched in the healthy population. Significant enrichment means that the abundance of the nucleic acid sequence clusters contained in the colorectal cancer markers is statistically higher, or significantly higher, in the healthy group than in the disease group.
  • The set of isolated nucleic acids can be used to determine whether an individual has a high or low probability of having colorectal cancer, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer. A substance capable of increasing the abundance of the isolated nucleic acid can be used for treating colorectal cancer or for administration to patients with colorectal cancer; such substances include, but are not limited to, drugs for treating colorectal cancer and functional foods that benefit the intestinal flora.
  • The set of isolated nucleic acids provided by this embodiment can be used for the preparation of a medicament for the treatment of colorectal cancer and/or for the preparation of functional foods, health care drugs and the like that help balance the intestinal flora.
  • A method for obtaining the isolated nucleic acid of any of the above embodiments of the present invention, comprising: (1) obtaining a first sequencing result and a second sequencing result, the first sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of colorectal cancer patients and comprising a plurality of first reads, the second sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of healthy individuals and comprising a plurality of second reads; (2) assembling the first reads and the second reads separately to obtain, correspondingly, a plurality of first assembly sequences and a plurality of second assembly sequences; (3) determining the abundance of the first assembly sequences and the abundance of the second assembly sequences based, respectively, on the support of the first reads for the first assembly sequences and of the second reads for the second assembly sequences; (4) based on the abundance of the first assembly sequences and the abundance of the second assembly sequences determined in (3), cluster
  • (2) includes: after assembling the first reads and the second reads separately, removing redundancy from the assembly result of the first reads and from the assembly result of the second reads to obtain the plurality of first assembly sequences and the plurality of second assembly sequences, where assembly sequences whose pairwise alignment identity is not less than 95% and whose sequence coverage exceeds 90% are defined as the same sequence.
  • Pairwise alignment of two or more sequences is performed using BLAT.
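The de-redundancy rule above can be sketched as follows. This is a minimal illustration, not the inventors' implementation: it assumes that pairwise identity and coverage have already been computed by an aligner, and the names `PairwiseHit` and `remove_redundant`, as well as the choice to keep the longest sequence of each redundant group, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseHit:
    """One pairwise alignment between two assembly sequences (hypothetical record)."""
    seq_a: str
    seq_b: str
    identity: float  # fraction of identical bases in the aligned region
    coverage: float  # fraction of the sequence covered by the alignment


def remove_redundant(lengths, hits, min_identity=0.95, min_coverage=0.90):
    """Greedy de-redundancy: keep the longest sequence of each redundant group.

    Two sequences count as the same sequence when identity is not less than
    min_identity and coverage is more than min_coverage.
    """
    same_as = {}
    for h in hits:
        if h.identity >= min_identity and h.coverage > min_coverage:
            same_as.setdefault(h.seq_a, set()).add(h.seq_b)
            same_as.setdefault(h.seq_b, set()).add(h.seq_a)

    kept = []
    # Visit longest sequences first (ties broken by ID) so that the
    # representative of each group is its longest member.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if same_as.get(seq_id, set()) & set(kept):
            continue  # redundant with a sequence that was already kept
        kept.append(seq_id)
    return kept


lengths = {"a": 1000, "b": 900, "c": 800}
hits = [
    PairwiseHit("b", "a", identity=0.97, coverage=0.95),  # redundant with "a"
    PairwiseHit("c", "a", identity=0.93, coverage=0.99),  # identity too low
]
non_redundant = remove_redundant(lengths, hits)
```

Keeping the longest member is only one possible convention; the text specifies the thresholds but not which member of a redundant pair is retained.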
  • where Co_i is the abundance coefficient of the i-th non-uniquely aligned read of the assembly sequence S, N1 is the total number of assembly sequences to which that read aligns, and j indexes those assembly sequences.
  • First assembly sequences and second assembly sequences that are present in fewer than one-fifth of the total number of stool samples are removed.
  • In this way, the nucleic acid sequences finally obtained as markers have practical significance and can be used for detection and determination of unknown samples.
  • (4) includes: performing a first clustering on the first assembly sequences and the second assembly sequences to obtain a first clustering result; and, after removing the clusters that include only one assembly sequence from the first clustering result, performing a second clustering to obtain the plurality of gene clusters. Performing the two-stage clustering based on differences in abundance, and eliminating the clusters that contain only one assembly sequence from the first clustering result, helps the finally obtained nucleic acid sequence clusters serve effectively as colorectal cancer sequence markers.
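The one-fifth prevalence filter described above can be sketched as follows. This is an illustration under stated assumptions: a sequence is treated as "present" in a sample when its abundance there is greater than zero, and the function name `filter_by_prevalence` and the data layout (one list of per-sample abundances per sequence) are hypothetical.

```python
def filter_by_prevalence(abundance, denominator=5):
    """Drop sequences present in fewer than 1/denominator of the samples.

    abundance maps a sequence ID to its list of per-sample abundances;
    a sequence counts as present in a sample when its abundance is > 0.
    """
    kept = {}
    for seq_id, values in abundance.items():
        present = sum(1 for v in values if v > 0)
        # Integer comparison avoids floating-point rounding at the boundary.
        if present * denominator >= len(values):
            kept[seq_id] = values
    return kept


abundance = {
    "s1": [0, 0, 0, 0, 1e-6, 0, 0, 0, 0, 0],     # present in 1 of 10 samples
    "s2": [1e-6, 2e-6, 0, 0, 0, 0, 0, 0, 0, 0],  # present in 2 of 10 samples
}
filtered = filter_by_prevalence(abundance)
```

With ten samples, a sequence must appear in at least two of them to survive the filter, so only `s2` is retained here.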
  • the clustering can utilize known methods, which are not limited in the present invention.
  • sequences are clustered using the canopy algorithm based on the abundance of the sequences.
  • Unlike traditional clustering algorithms such as K-means, the biggest feature of Canopy clustering is that the k value, that is, the number of clusters, does not need to be specified in advance, so it has great practical value. Canopy clustering has lower precision but a great advantage in speed. Therefore, Canopy clustering can be used first to perform a "rough" clustering of the data, and Canopy, K-means, or a similar algorithm can then be used for further "fine" clustering.
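A rough Canopy pass over abundance profiles might look like the sketch below. It is a generic illustration of the Canopy idea, not the clustering actually used in the patent: the distance measure (one minus the Pearson correlation between per-sample abundance profiles), the thresholds `t1` and `t2`, and the function names are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def canopy_cluster(profiles, t1=0.2, t2=0.1):
    """Canopy clustering with loose threshold t1 and tight threshold t2 (t1 > t2).

    profiles maps a sequence ID to its abundance profile across samples.
    No number of clusters is given in advance; canopies emerge from the data.
    """
    remaining = list(profiles)
    canopies = []
    while remaining:
        center = remaining[0]
        members = [center]
        still_candidates = []
        for other in remaining[1:]:
            distance = 1.0 - pearson(profiles[center], profiles[other])
            if distance <= t1:
                members.append(other)           # joins this canopy
            if distance > t2:
                still_candidates.append(other)  # may still seed or join another canopy
        canopies.append(members)
        remaining = still_candidates
    return canopies


profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # co-varies with "a"
    "c": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with "a"
}
canopies = canopy_cluster(profiles)
multi_member = [c for c in canopies if len(c) > 1]
```

The last line mirrors the step in (4) of dropping clusters that contain only one assembly sequence before a second, finer clustering is applied.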
  • the abundance of the cluster of nucleic acid sequences contained in the isolated nucleic acid is determined.
  • The abundance of the nucleic acid sequence cluster of the nucleic acid in the fecal sample of the individual and its abundance in the control group are determined.
  • The abundance of the nucleic acid sequence cluster in the fecal sample of the individual and/or in the control group is determined by: obtaining sequencing data of the nucleic acid in the fecal sample of the individual and of the nucleic acid in the control group, the sequencing data from each source including a plurality of reads; determining the abundance of each nucleic acid sequence in the stool sample of a source based on the support of the reads in that source's sequencing data for each nucleic acid sequence in the nucleic acid sequence cluster; and determining the abundance of the nucleic acid sequence cluster in the same stool sample based on the abundance of its nucleic acid sequences in that stool sample.
  • the so-called sequencing data is obtained by sequencing the nucleic acid in the sample.
  • The sequencing method depends on the sequencing platform selected. Available platforms include, but are not limited to, semiconductor sequencing platforms such as the PGM, Ion Proton, and BGISEQ-100 platforms, sequencing-by-synthesis platforms such as Illumina's HiSeq and MiSeq platforms, and single-molecule real-time sequencing platforms such as the PacBio platform.
  • The sequencing can be either single-end or paired-end, and the raw output data consists of the sequenced fragments, which are called reads.
  • The abundance of each nucleic acid sequence in the stool sample of a source is determined based on the support of the reads in the sequencing data from that source for each nucleic acid sequence in the nucleic acid sequence cluster.
  • The alignment can be performed using known alignment software, such as SOAP, BWA, and TeraMap.
  • During alignment, the parameters are generally set so that one read or one pair of reads is allowed at most k base mismatches, for example k ≤ 2; if more than k bases in the reads are mismatched, the reads are considered unable to align to the nucleic acid sequence.
  • The alignment results obtained include the alignment of each read against each nucleic acid sequence, including information such as whether the read aligns to one or more nucleic acid sequences, whether it aligns uniquely to a single nucleic acid sequence or to multiple nucleic acid sequences, the position at which it aligns, and whether it aligns to a unique position or to multiple positions of the nucleic acid sequence.
  • The alignment is performed using SOAPalign 2.21 with the parameters set to -r 2 -m 100 -x 1000.
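The mismatch rule above can be illustrated with a toy check at a fixed position. This is only a sketch of the "at most k mismatches" criterion, using 2 as the example threshold; real aligners such as those named above also search for the position and handle read pairs, and the function name `aligns_at` is hypothetical.

```python
def aligns_at(read, reference, offset, max_mismatches=2):
    """Return True if read matches reference at offset with at most max_mismatches."""
    window = reference[offset:offset + len(read)]
    if len(window) < len(read):
        return False  # the read would run past the end of the reference
    mismatches = sum(1 for a, b in zip(read, window) if a != b)
    return mismatches <= max_mismatches


one_mismatch = aligns_at("ACGT", "TTACGATT", offset=2)    # window is "ACGA"
four_mismatches = aligns_at("ACGT", "TTTGCATT", offset=2)  # window is "TGCA"
```

Here the first read differs from its window at one base and is accepted, while the second differs at all four bases and is rejected.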
  • The reads are aligned with the nucleic acid sequences in the nucleic acid sequence cluster, and the aligned reads can be divided into two groups: a) reads that align uniquely to one nucleic acid sequence, referred to as Unique reads (U); and b) reads that align to multiple nucleic acid sequences, referred to as Multiple reads (M).
  • The abundance of an assembly fragment G is Ab(G), which is related to its Unique reads and Multiple reads.
  • Ab(U) and Ab(M) in the above formula are the abundances of the Unique reads and of the Multiple reads of the assembly fragment G, respectively. For each Multiple read, there is a specific abundance coefficient Co.
  • The Co of a Multiple read can be calculated using the following formula; that is, for such a Multiple read, the sum of the Unique-read abundances of the N sequences to which it aligns is used as the denominator.
  • The abundance of a cluster of nucleic acid sequences is related to the abundance of the nucleic acid sequences contained therein; according to one embodiment of the invention, the abundance of the cluster of nucleic acid sequences is the mean or median of the abundances of the nucleic acid sequences contained therein.
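The abundance calculation described above can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch rests on labeled assumptions: Ab(G) = Ab(U) + Ab(M); Ab(U) is the Unique read count divided by the sequence length; each Multiple read contributes its coefficient Co divided by the sequence length; and Co is the Unique-read abundance of the sequence divided by the sum of the Unique-read abundances of the N sequences the read aligns to (only the denominator is stated in the text). Function names and data layout are hypothetical.

```python
from statistics import mean, median


def sequence_abundances(lengths, unique_counts, multi_reads):
    """Per-sequence abundance Ab(G) = Ab(U) + Ab(M) under the stated assumptions.

    lengths:       sequence ID -> sequence length
    unique_counts: sequence ID -> number of Unique reads
    multi_reads:   list of reads, each given as the list of sequence IDs it aligns to
    """
    ab_u = {s: unique_counts.get(s, 0) / lengths[s] for s in lengths}
    ab_m = {s: 0.0 for s in lengths}
    for targets in multi_reads:
        denom = sum(ab_u[t] for t in targets)
        if denom == 0:
            continue  # no Unique-read evidence to apportion this read (assumed handling)
        for t in targets:
            co = ab_u[t] / denom       # abundance coefficient Co of this read for t
            ab_m[t] += co / lengths[t]
    return {s: ab_u[s] + ab_m[s] for s in lengths}


def cluster_abundance(seq_abundance, members, use_median=False):
    """Cluster abundance as the mean (or median) of its member sequence abundances."""
    values = [seq_abundance[m] for m in members]
    return median(values) if use_median else mean(values)


lengths = {"g1": 100, "g2": 100}
unique_counts = {"g1": 30, "g2": 10}
multi_reads = [["g1", "g2"]]  # one read aligned to both sequences
ab = sequence_abundances(lengths, unique_counts, multi_reads)
cluster_ab = cluster_abundance(ab, ["g1", "g2"])
```

In this toy input the single Multiple read is split 3:1 between `g1` and `g2`, in proportion to their Unique-read abundances.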
  • When the control group consists of stool samples of a plurality of individuals suffering from colorectal cancer, and the abundance of the nucleic acid sequence cluster in the stool sample of the individual shows no statistical difference from its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  • When the control group consists of stool samples of a plurality of healthy individuals, and the abundance of the nucleic acid sequence cluster in the individual's stool sample is statistically lower than its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  • Whether the difference is statistically significant is determined by whether the nucleic acid sequence cluster abundance of the individual falls outside or within a predetermined confidence interval of its abundance in the control group.
  • "Statistically lower" means that the nucleic acid sequence cluster abundance determined for the individual is less than the lower bound of a predetermined confidence interval of its abundance in the control group.
  • A confidence interval is an estimated range for a population parameter constructed from sample statistics.
  • the Confidence interval of a probability sample is an interval estimate of a population parameter for this sample.
  • the confidence interval shows the extent to which the true value of this parameter has a certain probability of falling around the measurement.
  • the confidence interval gives the degree of confidence in the measured value of the measured parameter, ie the "certain probability" required previously. This probability is called the confidence level.
  • The predetermined confidence interval is a 95% confidence interval, i.e., a determination made with this embodiment that the individual is in the determined state is made at a 95% confidence level.
  • No statistical difference means that the determined abundance of the nucleic acid sequence cluster falls within the first predetermined confidence interval of its abundance in the control group, the first predetermined confidence interval being a 95% confidence interval.
  • the first predetermined confidence interval of the abundance of the first nucleic acid sequence cluster in the control group is 2.72E-08 to 7.02E-07
  • the first predetermined confidence interval for the abundance of the second nucleic acid sequence cluster in the control group is 4.84E-07 to 1.61E-06.
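The interval-based decision rules above can be sketched as follows. This is a simplified illustration that encodes only the two rules stated in the text (no statistical difference from a colorectal cancer control group; statistically lower than a healthy control group) and returns "undetermined" otherwise. The function name, the state labels, and the pairing of the example interval with a particular control group are assumptions made for illustration.

```python
def determine_state(abundance, ci_low, ci_high, control_state):
    """Apply the two stated rules; any other case is left undetermined.

    control_state is "colorectal cancer" or "healthy" (labels are hypothetical).
    """
    if control_state not in ("colorectal cancer", "healthy"):
        raise ValueError("unknown control_state: %r" % (control_state,))
    inside = ci_low <= abundance <= ci_high
    if control_state == "colorectal cancer":
        # No statistical difference from a colorectal cancer control group.
        return "colorectal cancer" if inside else "undetermined"
    # Statistically lower than a healthy control group.
    return "colorectal cancer" if abundance < ci_low else "undetermined"


# Interval quoted above for the first nucleic acid sequence cluster; treating it
# as belonging to a colorectal cancer control group is an assumption.
FIRST_CLUSTER_CI = (2.72e-08, 7.02e-07)

state_inside = determine_state(1.0e-07, *FIRST_CLUSTER_CI, control_state="colorectal cancer")
state_outside = determine_state(1.0e-06, *FIRST_CLUSTER_CI, control_state="colorectal cancer")
```

The same function covers the healthy-control rule: an abundance below the lower bound of a healthy control interval returns "colorectal cancer".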
  • All or part of the steps of the method for determining the state of an individual using the isolated nucleic acid in any of the above embodiments of the present invention may be performed using a device/system comprising separable functional unit modules corresponding to those steps, or the method may be written as a program, stored on a machine-readable medium, and carried out by a machine that runs the program from that medium.
  • An apparatus for determining the state of an individual using the isolated nucleic acid of any of the above embodiments, the apparatus being for performing all or part of the steps of any of the methods for determining the state of the individual and comprising: an abundance determining unit, configured to determine the abundance of a nucleic acid sequence cluster of the nucleic acid in the fecal sample of the individual and its abundance in a control group; and an individual state determining unit, for comparing the abundance of the nucleic acid sequence cluster in the individual's stool sample with its abundance in the control group and determining the state of the individual based on whether the difference is statistically significant, the control group consisting of stool samples of one or more groups of individuals in the same state, the state being either having colorectal cancer or not having colorectal cancer.
  • The following is performed in the abundance determining unit: obtaining sequencing data of the nucleic acid in the fecal sample of the individual and of the nucleic acid in the control group, the sequencing data from each source including a plurality of reads; determining the abundance of each nucleic acid sequence in the fecal sample of a source according to the support of the reads in that source's sequencing data for the nucleic acid sequences in the nucleic acid sequence cluster; and determining, from the abundance of the nucleic acid sequences in the stool sample of the source, the abundance of the nucleic acid sequence cluster to which they belong in the same stool sample.
  • the abundance of the nucleic acid sequence in the stool sample of the source is determined according to the support of the read sequence in the sequencing data of any source to the nucleic acid sequence in the nucleic acid sequence cluster, comprising:
  • the abundance coefficient of x, N2 is the non-unique
  • the abundance of the cluster of nucleic acid sequences is the mean or median of the abundance of the nucleic acid sequences contained therein.
  • the control group is composed of stool samples of a plurality of individuals suffering from colorectal cancer, and when the abundance of the nucleic acid sequence cluster in the stool sample of the individual shows no statistically significant difference from its abundance in the control group, the state of the individual is determined to be suffering from colorectal cancer.
  • the control group consists of stool samples of a plurality of healthy individuals, and when the abundance of the nucleic acid sequence cluster in the individual's stool sample is statistically lower than its abundance in the control group, the state of the individual is determined to be suffering from colorectal cancer.
  • whether the nucleic acid sequence cluster abundance determined for the individual is or is not statistically significant corresponds to whether it falls outside or within a predetermined confidence interval of its abundance in the control group.
  • an isolated nucleic acid according to any of the above embodiments of the present invention
  • a system for determining the state of an individual, the system being for performing all or part of the steps of the method for determining the state of an individual using the isolated nucleic acid of any of the above embodiments of the invention, the system comprising: a data input module for inputting data; a data output module for outputting data; a processor for executing an executable program, wherein executing the executable program includes carrying out the method of determining the state of the individual in any of the above embodiments; and a storage unit, coupled to the data input module, the data output module, and the processor, for storing data, including the executable program.
  • a medicament for treating colorectal cancer which promotes an increase in the abundance of the isolated nucleic acid in any of the above embodiments in the intestinal tract of a patient.
  • the isolated nucleic acid was determined by the inventors by analyzing sequencing data of intestinal microbial samples, comparing the abundance of intestinal microbial sequences between the colorectal cancer patient population and the healthy population, and then verifying on a large number of samples. Each nucleic acid sequence cluster it contains consists of non-redundant sequences.
  • the isolated nucleic acid can be used as a colorectal cancer marker; compared with the colorectal cancer patient group, the marker is significantly enriched in the healthy population, where significant enrichment means that the abundance of the nucleic acid sequence clusters contained in the marker is, in the healthy group, statistically higher than, or clearly and substantially higher than, its abundance in the disease group.
  • the set of isolated nucleic acids can be used to determine whether an individual has a high or low probability of being in a colorectal cancer state or in a healthy state, and can be used for non-invasive early detection or assisted detection of colorectal cancer.
  • a substance that increases the abundance of the isolated nucleic acid can be used for treating colorectal cancer or for patients suffering from colorectal cancer; such substances include, but are not limited to, drugs for treating colorectal cancer and functional foods that benefit the balance of the intestinal flora.
  • the isolated nucleic acids provided by this embodiment can be used for the preparation of a medicament for the treatment of colorectal cancer and/or for the preparation of functional foods, health care drugs and the like which benefit the balance of the intestinal flora.
  • by reasonably and effectively applying the determined colorectal cancer sequence markers, the growth of beneficial intestinal bacteria or sequences can be supported and/or intestinal pathogenic bacteria or sequences can be inhibited, defects in the intestinal barrier can be prevented, and the intestinal micro-ecological structure can be improved and restored, which is important for helping to reduce blood endotoxin levels and/or alleviate the clinical symptoms of colorectal cancer.
  • a method of producing or screening the above-described medicament which comprises the step of screening for a substance which causes an increase in the abundance of the isolated nucleic acid in any of the above embodiments as the medicament.
  • with the method for producing or screening drugs for colorectal cancer in the embodiment of the present invention, by reasonably and effectively applying the determined colorectal cancer biomarkers for screening, it is possible to obtain drugs capable of supporting the growth of beneficial intestinal bacteria and/or inhibiting potential intestinal pathogens; such drugs can prevent intestinal barrier defects, improve and restore the intestinal micro-ecological structure, and are important for helping to reduce blood endotoxin levels and/or alleviate the clinical symptoms of colorectal cancer.
  • a method of classifying a plurality of individuals using the isolated nucleic acid of any of the above embodiments of the present invention, comprising: determining the state of each individual using the method of determining the state of an individual in any of the above embodiments; and classifying the individuals according to the states obtained.
  • the method can distinguish a plurality of individuals according to different states of the individual or distinguish a plurality of unknown stool samples, which is convenient for classification and label management.
  • reagents, sequences (linkers, tags, and primers), software, and instruments not specifically described in the following examples are conventional commercially available or open-source products, such as the Illumina transcriptome library construction kit.
  • the following embodiments include a first phase and a second phase, namely a corresponding discovery phase and a verification phase.
  • the discovery phase consisted of analyzing and comparing data sets of 53 colorectal cancer patients, 42 colon adenoma patients, and 61 healthy controls to determine intestinal microbial composition and functional changes to identify species markers;
  • the validation phase used a data set of colorectal cancer patients and 19 healthy controls to verify the accuracy of the first-stage predictions.
  • the inventors performed a cohort analysis of the entire gut microbiota from stool samples of 53 colorectal cancer patients, 42 colon adenoma patients, and 61 healthy controls to describe the characteristics of the fecal microbial community and its functional components.
  • the inventors downloaded high-quality sequencing data of approximately 1084.87 Gb to construct a colorectal cancer reference gene set.
  • Quantitative metagenomic analysis showed that 2,464,280 genes could be clustered into 258 co-abundance gene groups (CAGs) representing bacterial species across the patients and healthy controls, and the CAGs in colorectal cancer patients were compared with those in healthy controls.
  • the first-stage sequencing data of stool samples from colorectal cancer patients, colon adenoma patients, and healthy people were obtained from the EBI database, accession number ERP005534 (Zeller G et al., 2014), comprising 53 patients with colorectal cancer, 42 patients with colon adenoma, and 61 healthy people, all from France. On average, each sample produced 5 Gb of high-quality sequencing data, for a total of 1084.87 Gb.
  • the sequencing data of the second stage colorectal cancer patients and healthy human stool sample DNA were derived from the EBI database.
  • The EBI raw data had already been through quality control and host-sequence removal, but still contained many short reads. After downloading from EBI to data, pairs of raw data in the original data are paired with less than 60 reads.
  • metagenomic biomarkers are genes and their corresponding functions, so the sequencing reads must be assembled, genes predicted, and redundancy removed in order to construct a non-redundant reference gene set.
  • All sample reads were assembled into contigs (assembled fragments) using SOAPdenovo software. In total, 8.98 million contigs were generated from 64.06% of the reads (minimum fragment length 500 bp). These contigs have a total length of 18.8 Gb; the N50 ranged from 1,253 to 18,741 bp, with an average of 4,773 bp.
  • the MetaGeneMark program predicted 26,039,803 open reading frames (ORFs) longer than 100 bp from the contigs.
  • the predicted total length of ORFs was 16,095,621,987 bp, accounting for 85.61% of the total length of contigs.
  • a non-redundant "CRC gene set" is created by removing redundant ORFs: paired sequences with identity above 95% and coverage of the shorter ORF above 90% are defined as the same sequence, and de-redundancy means that one sequence from each such group is randomly retained.
  • the final non-redundant colorectal cancer gut gene set contains 6,585,575 ORFs with an average length of 609.70 bp.
  • the paired-end reads processed in step 2.1 were aligned to the non-redundant reference gene set from 2.2 using SOAPalign 2.21 with the parameters -r 2 -m 100 -x 1000.
  • the alignment of reads against the non-redundant reference gene set falls into two categories: a) Unique reads (U): reads that align to only one gene in the non-redundant gene set. b) Multiple reads (M): reads that align to more than one gene in the non-redundant gene set.
  • Ab (U) and Ab (M) are the abundances of the unique reads and multiple reads of the gene G, respectively, and l represents the length of the gene G.
  • for each multiple read there is a gene-specific abundance coefficient Co; assuming that a multiple read aligns to N genes, the Co of that read is calculated as follows:
  • the inventors use the sum of the unique-read abundances of the N aligned genes as the denominator.
  • genes detected in at least 10 samples are first selected from the gene abundance table as input to the clustering.
  • the first clustering was performed using the canopy algorithm.
  • the T1 threshold was Pearson correlation coefficient > 0.95 and Spearman correlation coefficient > 0.6.
  • the T2 threshold was Pearson correlation coefficient > 0.9.
  • canopies containing only one gene were removed, and the second clustering was performed.
  • the input used in the second clustering is the average abundance of each canopy after the first clustering.
  • the algorithm used is a canopy-like algorithm.
  • the condition for adding a new element (gene sequence) to an existing class is that more than 70% of the elements in the class satisfy the threshold Pearson correlation coefficient > 0.97.
  • the partial nucleic acid sequences of the markers significantly enriched in the colorectal cancer patients or the healthy population determined in Example 1 represent the CAGs in which they are located; for some of these markers, the difference in abundance between the healthy group and the disease group of the validation set was also significant (P < 0.05). The verification results are shown in Table 2 and Figure 2.
  • the abundance of at least one CAG (strain) of Table 2 in each stool sample was determined according to the method of Example 2, and it was judged whether the abundance of the CAG(s) in each sample was significantly lower than that in the control group (healthy group); for example, if the abundance of CAG16610 and/or CAG28666 fell within the 95% confidence interval of the colorectal cancer patient group in Table 3 above, the individual was judged to be a colorectal cancer patient, and if it fell within the 95% confidence interval of the healthy group in Table 3 above, the individual was judged to be a non-colorectal-cancer individual.
  • the results show that, when judging with only one of the CAGs in Table 2, the individual state could be determined for about 55 samples, and for more than 80% of those 55 samples the determined state was consistent with the recorded state of the sample's source individual; when judging with both CAGs in Table 2, the individual state could be determined for 48 samples, and for more than 90% of those 48 samples the determined state was consistent with the recorded state. That is, the inventors found that joint detection of all the species in Table 2, i.e., finding that neither of the two markers in Table 2 is enriched in the sample to be tested, identifies colorectal cancer patients or susceptible individuals more accurately.
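
The abundance calculation from unique and multiple reads can be illustrated with a short Python sketch. The formulas themselves are not reproduced in this extract, so the sketch rests on assumptions: Ab(U) is taken as the number of unique reads divided by the gene length l, the coefficient Co of a gene for one multiple read is taken as that gene's Ab(U) divided by the sum of Ab(U) over the N genes the read aligns to, and Ab(M) is the sum of those coefficients divided by l. The function and argument names are hypothetical.

```python
from collections import defaultdict


def gene_abundance(unique_counts, multi_reads, gene_lengths):
    """Return {gene: Ab(U) + Ab(M)} under the assumptions stated above.

    unique_counts: {gene: number of reads aligned only to that gene}
    multi_reads:   one tuple of gene names per multi-mapped read
    gene_lengths:  {gene: length l in bp}
    """
    # Ab(U): unique-read count normalised by gene length (assumed form).
    ab_u = {g: unique_counts.get(g, 0) / gene_lengths[g] for g in gene_lengths}

    ab_m = defaultdict(float)
    for genes in multi_reads:
        # Denominator: sum of the unique-read abundances of the N aligned genes.
        denom = sum(ab_u[g] for g in genes)
        if denom == 0:
            continue  # assumption: a read with no unique support is skipped
        for g in genes:
            co = ab_u[g] / denom             # gene-specific coefficient Co
            ab_m[g] += co / gene_lengths[g]  # this read's share, per base

    return {g: ab_u[g] + ab_m[g] for g in gene_lengths}


example = gene_abundance(
    unique_counts={"geneA": 30, "geneB": 10},
    multi_reads=[("geneA", "geneB")],
    gene_lengths={"geneA": 1000, "geneB": 1000},
)
```

In the example, the shared read is split three to one in favour of geneA because geneA has three times the unique-read abundance of geneB.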

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Zoology (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

An isolated nucleic acid, comprising at least one of the following nucleic acid sequence clusters: a first nucleic acid sequence cluster, the nucleic acid sequences in the cluster corresponding on a one-to-one basis to the sequences shown by SEQ ID NOS: 1-151, the sequence similarity between the nucleic acid sequences in the cluster and the sequences in SEQ ID NOS: 1-151 corresponding thereto being no less than 90%; and a second nucleic acid sequence cluster, the nucleic acid sequences in the cluster corresponding on a one-to-one basis to the sequences shown by SEQ ID NOS: 152-233, the sequence similarity between the nucleic acid sequences in the cluster and the sequences in SEQ ID NOS: 152-233 corresponding thereto being no less than 90%. Compared to a group of colorectal cancer patients, the present isolated nucleic acid is significantly enriched in a healthy group, and can be used as a marker to distinguish between a healthy group and a group of colorectal cancer patients.

Description

Isolated nucleic acids and applications
Technical field
The present invention relates to the field of biomarkers, in particular to isolated nucleic acids and their applications, and more particularly to a set of isolated nucleic acids, the use of the isolated nucleic acids, a method for determining the state of an individual using the isolated nucleic acids, a device for determining the state of an individual using the isolated nucleic acids, a method for classifying a plurality of individuals using the isolated nucleic acids, a drug for treating colorectal cancer, and a method for preparing a drug for treating colorectal cancer.
Background
Colorectal cancer (CRC), also known as colon cancer, is a cancer originating in the colon or rectum (parts of the large intestine); because of abnormal cell growth, it may invade or metastasize to other parts of the body. Patients with colorectal cancer often present with symptoms such as blood in the stool, changes in bowel habits, weight loss, and fatigue. Colorectal cancer is the third most common cancer, accounting for about 10% of cases. In 2012, there were 1.4 million newly diagnosed cases of colorectal cancer in the United States, causing 694,000 deaths. Colorectal cancer is more common in developed countries, which account for 65% of all cases worldwide, and it is less common in women than in men. In recent years, the incidence of colorectal cancer in China has risen markedly. According to statistics, the incidence of colon cancer in China was 7% in 2002 and was projected to be 13% in 2015. The 2015 Chinese cancer statistics, recently published in CA: A Cancer Journal for Clinicians (impact factor 144.8), show 376,300 new cases of colorectal cancer and 191,000 deaths in China in 2015 (Chen W et al., 2016). Compared with Western populations, rectal cancer in China is more frequent than colon cancer, at a ratio of about 1.5:1, and the proportion of young patients (<30 years old) is relatively high, at about 15%.
In 75-95% of colorectal cancer cases there is little or no genetic risk factor; other risk factors include older age, male sex, high intake of fat, alcohol, or red meat, overweight, smoking, and lack of physical exercise. About 10% of cases are linked to insufficient physical activity (Watson A J et al., 2010; Cunningham D et al., 2010). The risk from alcohol rises progressively above one drink per day.
Intestinal microorganisms play an important role for intestinal epithelial cells, including forming a microbial barrier that prevents colonization by pathogens and performing immunoregulatory and metabolic functions. Studies have shown that an imbalance of the intestinal flora can lead to colorectal cancer in various ways: pathogenic microorganisms trigger intestinal inflammation by activating recognition receptors, adhering, secreting enterotoxins, or invading. Changes in the number, structure, and stability of intestinal microorganisms, especially an imbalance of the flora, alter normal physiological function and thereby cause intestinal diseases, including colorectal cancer.
Colorectal cancer diagnosis mainly relies on three kinds of examination: X-ray examination; sigmoidoscopy and fiberoptic colonoscopy; and the carcinoembryonic antigen (CEA) test. The CEA test has little diagnostic value for early cases, and although colonoscopy is highly accurate, many high-risk individuals refuse regular screening every 1-2 years because they are unwilling to undergo a painful colonoscopy. One study tested for colorectal cancer by measuring the level of the Mir-92 factor in blood, but the false detection rate among healthy people reached 30%. With the completion of human genome sequencing and the rapid development of high-throughput sequencing technology, genetic screening has become a direction for colorectal cancer diagnosis. Genetic screening is well suited to identifying people at risk of colorectal cancer, but colonoscopy is still needed within 1-2 years after a genetic defect is found. Colorectal cancer has no obvious symptoms in its early stage, and effective non-invasive methods for early diagnosis are currently lacking. Metagenomic sequencing studies of colorectal cancer patients have revealed substantial changes in the gut microbiota (Zeller G et al., 2014; Feng Q et al., 2015), but the specific colorectal-cancer-related biomarkers in the gut microbiota remain unclear.
Summary of the invention
The present invention aims to solve at least one of the above problems or to provide at least one alternative commercial means.
According to a first aspect of the invention, there is provided a set of isolated nucleic acids comprising at least one of the following two nucleic acid sequence clusters: a first nucleic acid sequence cluster, whose nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 1-151, each nucleic acid sequence in the first cluster having a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 1-151; and a second nucleic acid sequence cluster, whose nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 152-233, each nucleic acid sequence in the second cluster having a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 152-233.
According to a second aspect of the invention, the invention provides the use of the above isolated nucleic acids in detecting colorectal cancer, and/or treating colorectal cancer, and/or preparing a medicament for treating colorectal cancer, and/or preparing a functional food.
According to a third aspect of the invention, the invention provides a method for obtaining the isolated nucleic acids of the above aspect of the invention, the method comprising: (1) obtaining a first sequencing result and a second sequencing result, the first sequencing result being the sequencing result of nucleic acids from stool samples of a plurality of colorectal cancer patients and comprising a plurality of first reads, and the second sequencing result being the sequencing result of nucleic acids from stool samples of a plurality of healthy individuals and comprising a plurality of second reads; (2) assembling the first reads and the second reads separately to obtain a plurality of first assembled sequences and a plurality of second assembled sequences, respectively; (3) determining the abundance of the first assembled sequences and the abundance of the second assembled sequences based, respectively, on the support of the first reads for the first assembled sequences and the support of the second reads for the second assembled sequences; (4) clustering the first assembled sequences and the second assembled sequences according to the abundances determined in (3) to obtain a plurality of gene clusters, each gene cluster comprising a plurality of first assembled sequences and/or second assembled sequences; and (5) performing a statistical test to identify the gene clusters that are significantly enriched in the stool samples of the plurality of colorectal cancer patients and/or the plurality of healthy individuals, thereby obtaining the isolated nucleic acids.
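
Step (4), grouping assembled sequences whose abundances rise and fall together across samples, can be sketched in Python as follows. This is a deliberately simplified greedy grouping by Pearson correlation, not the canopy-based procedure used in the examples; the function names and the 0.9 default threshold are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as uncorrelated.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cluster_by_coabundance(profiles, threshold=0.9):
    """Greedily group sequences whose profiles correlate with a seed sequence.

    profiles: {sequence_id: [abundance in sample 1, sample 2, ...]}
    Returns a list of clusters (lists of ids); singleton clusters are dropped.
    """
    remaining = list(profiles)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed] + [
            s for s in remaining
            if pearson(profiles[seed], profiles[s]) > threshold
        ]
        remaining = [s for s in remaining if s not in members]
        if len(members) > 1:
            clusters.append(members)
    return clusters


profiles = {
    "seq1": [1, 2, 3, 4],
    "seq2": [2, 4, 6, 8],
    "seq3": [4, 1, 3, 2],
    "seq4": [8, 2, 6, 4],
    "seq5": [1, 1, 5, 1],
}
clusters = cluster_by_coabundance(profiles)
```

A seed-based grouping like this depends on the order in which sequences are visited, which is one reason a production pipeline would use a more careful clustering scheme.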
According to a fourth aspect of the invention, the invention provides a method for determining the state of an individual using the nucleic acids of the above aspect of the invention, the method comprising: determining the abundance of a nucleic acid sequence cluster of the nucleic acids in a stool sample of the individual and its abundance in a control group; and comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group and determining the state of the individual according to whether the difference is statistically significant, the control group consisting of stool samples of one or more groups of individuals in the same state, the state including having colorectal cancer and not having colorectal cancer.
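
The comparison in this method can be sketched in Python under a few assumptions: the cluster abundance is the mean or median of its member sequence abundances, the control group is summarised by a predetermined confidence interval, and a sample that matches neither rule is reported as "undetermined". The function names, the state labels, and the "undetermined" outcome are hypothetical; the interval bounds in the usage lines are the ones this document lists for the first nucleic acid sequence cluster in its control group.

```python
from statistics import median


def cluster_abundance(sequence_abundances, use_median=False):
    """Cluster abundance as the mean (default) or median of its sequences."""
    if use_median:
        return median(sequence_abundances)
    return sum(sequence_abundances) / len(sequence_abundances)


def determine_state(abundance, ci_low, ci_high, control_state):
    """Compare one cluster abundance with the control group's interval.

    control_state is "crc" when the controls are colorectal cancer patients
    and "healthy" when the controls are healthy individuals.
    """
    inside = ci_low <= abundance <= ci_high
    if control_state == "crc":
        # No significant difference from the patient controls.
        return "crc" if inside else "undetermined"
    if control_state == "healthy":
        if abundance < ci_low:
            # Statistically lower than the healthy controls.
            return "crc"
        return "healthy" if inside else "undetermined"
    raise ValueError("control_state must be 'crc' or 'healthy'")


# Bounds listed in this document for the first cluster in its control group.
CI_LOW, CI_HIGH = 2.72e-08, 7.02e-07
state_low = determine_state(1.0e-08, CI_LOW, CI_HIGH, "healthy")
state_mid = determine_state(3.0e-07, CI_LOW, CI_HIGH, "healthy")
```

Returning "undetermined" rather than forcing a call mirrors the observation in the examples that a state could be judged for only part of the samples.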
All or part of the steps of the above method for determining the state of an individual using the nucleic acids of the above aspect of the invention may be performed by a device/system comprising separable functional unit modules, or the method may be programmed, stored on a machine-readable medium, and carried out by a machine running that medium.
According to a fifth aspect of the invention, the invention provides a device for determining the state of an individual using the nucleic acids of the above aspect of the invention, the device being for carrying out the above method for determining the state of an individual and comprising: an abundance determining unit for determining the abundance of a nucleic acid sequence cluster of the nucleic acids in a stool sample of the individual and its abundance in a control group; and an individual state determining unit for comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group and determining the state of the individual according to whether the difference is statistically significant, the control group consisting of stool samples of one or more groups of individuals in the same state, the state including having colorectal cancer and not having colorectal cancer.
According to a sixth aspect of the invention, the invention provides a system for determining the state of an individual using the nucleic acids of the above aspect of the invention, the system being for carrying out all or part of the steps of the above method for determining the state of an individual and comprising: a data input module for inputting data; a data output module for outputting data; a processor for executing an executable program, wherein executing the executable program includes completing all or part of the steps of the above method for determining the state of an individual; and a storage module, connected to the data input module, the data output module, and the processor, for storing data, including the executable program.
According to a seventh aspect of the invention, there is provided a method for classifying a plurality of individuals using the nucleic acids of the above aspect of the invention, the method comprising: determining the state of each individual using the above method for determining the state of an individual; and classifying the individuals according to the states obtained. The method can separate a plurality of individuals, or a plurality of unknown stool samples, according to the state of the individual, which facilitates sorting and label management.
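
Once a state has been determined for each individual, the classification step amounts to grouping sample identifiers by state. A minimal Python sketch, in which the function name, the sample identifiers, and the state labels are all hypothetical:

```python
from collections import defaultdict


def classify_individuals(states):
    """Group sample identifiers by their determined state.

    states: {sample_id: state label}, e.g. the output of a per-sample
    state determination run beforehand.
    """
    groups = defaultdict(list)
    for sample_id, state in states.items():
        groups[state].append(sample_id)
    return dict(groups)


groups = classify_individuals(
    {"sample01": "crc", "sample02": "healthy", "sample03": "crc"}
)
```

Keeping the grouping separate from the state determination means the same function works whichever rule or control group produced the labels.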
According to an eighth aspect of the invention, there is provided a drug for treating colorectal cancer, characterized in that the drug increases the abundance of the nucleic acid of the above aspect of the invention in the intestinal tract of a patient.
According to a ninth aspect of the invention, there is provided a method for producing or screening the above drug for treating colorectal cancer, the method comprising the step of screening for a substance that increases the abundance of the nucleic acid of the above aspect of the invention and using that substance as the drug.
The isolated nucleic acid of the above aspect of the invention was identified by the inventors by processing and analyzing sequencing data from gut microbial samples, comparing the abundance of gut microbial sequences between a population of colorectal cancer patients and a healthy population, and validating the result on a large number of samples. This set of isolated nucleic acids can serve as a colorectal cancer marker: compared with the colorectal cancer patient group, the isolated nucleic acids are significantly enriched in the healthy population. "Significantly enriched" means that the abundance of every nucleic acid sequence cluster in the marker is higher in the healthy group than in the disease group, either with statistical significance or clearly and substantially. The set of isolated nucleic acids can be used to determine how likely an individual is to have colorectal cancer or to be healthy, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer.
Furthermore, a substance that increases the abundance of this set of isolated nucleic acids can be used to treat colorectal cancer or to benefit colorectal cancer patients. Such substances include, but are not limited to, drugs for treating colorectal cancer and functional foods that help balance the intestinal flora. The nucleic acid determined by the method of one aspect of the invention, that is, the colorectal cancer marker, can be used to prepare drugs for treating colorectal cancer and/or functional foods, health products and the like that help balance the intestinal flora.
In addition, the method, device and/or system of the above aspect of the invention for determining the state of an individual detects the abundance of the nucleic acid sequence clusters in a stool sample of the individual and compares it with the abundance of the same clusters in a control group. From the comparison, the relative probability that the individual has colorectal cancer or is healthy can be determined. This provides a non-invasive auxiliary test for the early detection of colorectal cancer.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of the experimental analysis used to screen for and identify colorectal cancer markers in an embodiment of the invention.
Fig. 2 is an abundance heat map of the clustered CAG16610 in an embodiment of the invention; the left panel shows abundance in the discovery data set and the right panel shows abundance in the validation data set.
Detailed Description
Embodiments of the invention are described in detail below, and examples of the embodiments are illustrated in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it. The terms "first", "second" and the like are used herein only for convenience of description; they neither indicate or imply relative importance nor imply any order between the items so labeled.
In the description of the invention, unless otherwise stated, "a plurality" means two or more. Unless otherwise expressly specified and limited, the terms "connected", "coupled" and the like are to be understood broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements.
A biological marker (biomarker) is a cellular, biochemical or molecular change that can be detected in a biological medium. Biological media include body fluids, tissues, cells, feces, hair, exhaled breath and the like.
Abundance refers to how plentiful a given microorganism or sequence is within a population of microorganisms or nucleic acid sequences. For example, the abundance of a microorganism in a gut microbial population can be expressed as the content of that microorganism in the population; likewise, the abundance of a nucleic acid sequence in a set of nucleic acid sequences can be expressed as the number of copies of that sequence as a proportion of the total number of sequences in the set.
Sequence identity and sequence similarity refer to the degree to which sequences are identical or similar, respectively.
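The proportion-based definition of abundance above can be illustrated with a minimal Python sketch. The function name and the sequence labels are hypothetical and are not taken from the source; the sketch only shows "count of one sequence divided by the total count of the set".

```python
from collections import Counter


def relative_abundance(observations):
    """Return each item's share of all observations (shares sum to 1)."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: n / total for item, n in counts.items()}


# Hypothetical sequence labels, for illustration only.
example = relative_abundance(["seqA", "seqA", "seqB", "seqC"])
```

Here `example["seqA"]` is the fraction of observations labeled `seqA`.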
According to one embodiment of the invention, there is provided a set of isolated nucleic acids comprising at least one of the following nucleic acid sequence clusters: a first nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 1-151, and each nucleic acid sequence has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 1-151; and a second nucleic acid sequence cluster, in which the nucleic acid sequences correspond one-to-one to the sequences shown in SEQ ID NO: 152-233, and each nucleic acid sequence has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 152-233.
This set of isolated nucleic acids was identified by the inventors by processing and analyzing sequencing data from gut microbial samples, comparing the abundance of gut microbial sequences between a population of colorectal cancer patients and a healthy population, and validating the result on a large number of samples. The nucleic acid sequences in each cluster are non-redundant. The set can serve as a colorectal cancer marker: compared with the colorectal cancer patient group, the marker is significantly enriched in the healthy population. "Significantly enriched" means that the abundance of every nucleic acid sequence cluster in the marker is higher in the healthy group than in the disease group, either with statistical significance or clearly and substantially. The set can be used to determine how likely an individual is to have colorectal cancer or to be healthy, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer.
It should be noted that SEQ ID NO: 1-151 and SEQ ID NO: 152-233 are sequence clusters that the inventors determined from sample sequencing data by assembly and cluster analysis, and that the sequences in each cluster come from the same species. Because the DNA/DNA hybridization rate within the same species is generally considered to be 80% or higher, the inventors specify that any nucleic acid sequence cluster whose sequences have a sequence similarity of not less than 90% to the corresponding sequences in SEQ ID NO: 1-151 or SEQ ID NO: 152-233 belongs to the same sequence cluster and can serve as a colorectal cancer sequence marker. The inventors repeatedly altered nucleotides in the sequences of the clusters while keeping the similarity to the corresponding SEQ ID NO sequences at not less than 90%, and tested the results; the 90% threshold is supported by these tests, and the resulting sequence clusters can likewise serve as colorectal cancer markers.
According to an embodiment of the invention, the set of isolated nucleic acids consists of one or both of the first nucleic acid sequence cluster and the second nucleic acid sequence cluster.
According to one embodiment of the invention, the nucleic acid comprises the first nucleic acid sequence cluster. According to one embodiment, each nucleic acid sequence in the first nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 1-151. According to one embodiment, the nucleic acid further comprises the second nucleic acid sequence cluster. According to one embodiment, each nucleic acid sequence in the second nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 152-233.
According to one embodiment of the invention, the nucleic acid comprises the second nucleic acid sequence cluster. According to one embodiment, each nucleic acid sequence in the second nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 152-233. According to one embodiment, the nucleic acid further comprises the first nucleic acid sequence cluster. According to one embodiment, each nucleic acid sequence in the first nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 1-151.
According to another embodiment of the invention, there is provided the use of the isolated nucleic acid of any of the above embodiments in detecting colorectal cancer, and/or treating colorectal cancer, and/or preparing a drug for treating colorectal cancer, and/or preparing a functional food.
The isolated nucleic acid was identified by the inventors by processing and analyzing sequencing data from gut microbial samples, comparing the abundance of gut microbial sequences between a population of colorectal cancer patients and a healthy population, and validating the result on a large number of samples. It can serve as a colorectal cancer marker: compared with the colorectal cancer patient group, the marker is significantly enriched in the healthy population. "Significantly enriched" means that the abundance of every nucleic acid sequence cluster in the marker is higher in the healthy group than in the disease group, either with statistical significance or clearly and substantially. The isolated nucleic acid can be used to determine how likely an individual is to have colorectal cancer or to be healthy, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer. A substance that increases the abundance of the isolated nucleic acid can be used to treat colorectal cancer or can be taken by colorectal cancer patients to their benefit; such substances include, but are not limited to, drugs for treating colorectal cancer and functional foods that help balance the intestinal flora. The isolated nucleic acid provided by this embodiment can be used to prepare drugs for treating colorectal cancer and/or functional foods, health products and the like that help balance the intestinal flora.
According to another embodiment of the invention, there is provided a method for obtaining the isolated nucleic acid of any of the above embodiments, the method comprising: (1) obtaining a first sequencing result and a second sequencing result, where the first sequencing result is obtained by sequencing nucleic acids in stool samples from a plurality of colorectal cancer patients and comprises a plurality of first reads, and the second sequencing result is obtained by sequencing nucleic acids in stool samples from a plurality of healthy individuals and comprises a plurality of second reads; (2) assembling the first reads and the second reads separately to obtain a plurality of first assembled sequences and a plurality of second assembled sequences; (3) determining the abundance of the first assembled sequences from the support they receive from the first reads, and the abundance of the second assembled sequences from the support they receive from the second reads; (4) clustering the first and second assembled sequences according to the abundances determined in (3) to obtain a plurality of gene clusters, each comprising a plurality of first assembled sequences and/or second assembled sequences; and (5) performing a statistical test to identify gene clusters that are significantly enriched in the stool samples of the colorectal cancer patients and/or of the healthy individuals, thereby obtaining the nucleic acid. This method can efficiently identify nucleic acid sequence clusters that can serve as colorectal cancer markers.
According to an embodiment of the invention, step (2) comprises: after assembling the first reads and the second reads separately, removing redundancy from each assembly result to obtain the first assembled sequences and the second assembled sequences. Two assembled sequences are treated as the same sequence if, after pairwise alignment, their sequence identity is not less than 95% and the sequence coverage exceeds 90%. According to one embodiment, BLAT is used to align two or more sequences pairwise.
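The redundancy rule (identity not less than 95% and coverage above 90%) can be sketched as a greedy filter over pairwise alignment hits. This is a minimal sketch under stated assumptions, not the procedure used in the source: the `Hit` record, the choice to keep the longest sequence as the representative, and the definition of coverage as aligned length divided by the query length are all illustrative, and the alignment hits are assumed to have been produced beforehand by an aligner.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise alignment hit (hypothetical record layout)."""
    query: str
    target: str
    identity: float   # fraction of identical bases in the aligned region
    aligned_len: int  # number of query bases covered by the alignment


def remove_redundant(lengths, hits, min_identity=0.95, min_coverage=0.90):
    """Greedily keep the longest sequences and drop those redundant to a kept one.

    Assumption: coverage is aligned_len / len(query); the source does not
    state which sequence the coverage is measured against.
    """
    redundant_to = {}
    for h in hits:
        if h.query == h.target:
            continue
        coverage = h.aligned_len / lengths[h.query]
        if h.identity >= min_identity and coverage > min_coverage:
            redundant_to.setdefault(h.query, set()).add(h.target)

    kept = []
    # Longest first; ties broken by name so the result is deterministic.
    for name in sorted(lengths, key=lambda n: (-lengths[n], n)):
        if redundant_to.get(name, set()).isdisjoint(kept):
            kept.append(name)
    return kept
```

With lengths `{"a": 1000, "b": 950, "c": 500}`, a hit of `b` on `a` at 97% identity over 900 bases makes `b` redundant, while a hit of `c` on `a` covering only 300 of 500 bases does not.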
According to an embodiment of the invention, step (3) comprises: aligning the first reads to the first assembled sequences and the second reads to the second assembled sequences to obtain a first alignment result and a second alignment result, respectively; and, based on these alignment results, determining the abundance of the first and second assembled sequences with the following formula. The abundance of an assembled sequence S is Ab(S) = Ab(U_S) + Ab(M_S), where Ab(U_S) = U_S / l_S, U_S is the number of reads that align uniquely to S, and l_S is the length of S,
[Equation image: Figure PCTCN2016076577-appb-000001]
M_S is the number of reads that align non-uniquely to S, i indexes those reads, and Co_i is the abundance coefficient of read i,
[Equation image: Figure PCTCN2016076577-appb-000002]
N1 is the total number of assembled sequences to which a read that aligns non-uniquely to S is aligned, j indexes those assembled sequences, and U_j is the number of reads that align uniquely to assembled sequence j. Because this formula accounts for the contribution of both uniquely and non-uniquely aligned reads to the abundance of an assembled sequence, it makes full use of the sequencing data and determines abundance accurately.
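The abundance calculation can be sketched in Python as follows. The two equation images are not reproduced in this text, so the forms used here are assumptions consistent with the surrounding definitions: Ab(M_S) is taken as the sum of the coefficients Co_i of the non-uniquely aligned reads divided by l_S, and Co_i is taken as Ab(U_S) divided by the sum of Ab(U_j) over the N1 sequences the read aligns to. The function name and input layout are hypothetical.

```python
def sequence_abundances(lengths, alignments):
    """Compute Ab(S) = Ab(U_S) + Ab(M_S) for every sequence S.

    lengths:    {sequence_name: sequence_length}
    alignments: {read_id: iterable of sequence names the read aligns to}

    Assumed forms (the source's equation images are not available):
      Ab(U_S) = U_S / l_S
      Co_i    = Ab(U_S) / sum_j Ab(U_j)   over the sequences read i aligns to
      Ab(M_S) = (sum_i Co_i) / l_S
    """
    unique_counts = {s: 0 for s in lengths}
    multi_reads = []
    for targets in alignments.values():
        targets = sorted(set(targets))
        if len(targets) == 1:
            unique_counts[targets[0]] += 1
        elif len(targets) > 1:
            multi_reads.append(targets)

    ab_unique = {s: unique_counts[s] / lengths[s] for s in lengths}

    co_sums = {s: 0.0 for s in lengths}
    for targets in multi_reads:
        denom = sum(ab_unique[t] for t in targets)
        if denom == 0:
            # No unique-read evidence for any target; the source does not say
            # how to treat this case, so the read is skipped here.
            continue
        for t in targets:
            co_sums[t] += ab_unique[t] / denom

    return {s: ab_unique[s] + co_sums[s] / lengths[s] for s in lengths}
```

Under these assumptions a read shared between two sequences is split in proportion to their unique-read abundances, so the coefficients of one read sum to 1 across its targets.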
According to an embodiment of the invention, before step (4), first and second assembled sequences that are present in fewer than one-fifteenth of the total number of stool samples are removed. This ensures that the nucleic acid sequences finally obtained as markers are of practical use and can be applied to the testing of unknown samples.
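The prevalence filter can be sketched as below. This is an illustrative sketch only: the input layout is hypothetical, and "present in a sample" is assumed to mean an abundance greater than zero, which the source does not spell out.

```python
from fractions import Fraction


def filter_by_prevalence(abundance, n_samples, min_fraction=Fraction(1, 15)):
    """Keep sequences detected in at least `min_fraction` of all samples.

    abundance: {sequence_name: {sample_id: abundance}}
    Assumption: a sequence is "present" in a sample when its abundance is > 0.
    """
    kept = {}
    for seq, per_sample in abundance.items():
        present = sum(1 for value in per_sample.values() if value > 0)
        if Fraction(present, n_samples) >= min_fraction:
            kept[seq] = per_sample
    return kept
```

Using `Fraction` keeps the comparison exact, so a sequence found in 2 of 30 samples sits exactly on the one-fifteenth threshold and is retained.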
According to an embodiment of the invention, step (4) comprises: performing a first clustering of the first and second assembled sequences to obtain a first clustering result; removing from the first clustering result every cluster that contains only one assembled sequence; and performing a second clustering on the remainder to obtain the plurality of gene clusters. Clustering twice on the basis of abundance differences, and discarding single-sequence clusters after the first pass, helps ensure that the nucleic acid sequence clusters finally obtained are effective colorectal cancer sequence markers.
Clustering may be performed with any known method; the invention is not limited in this respect. According to one embodiment, sequences are clustered by abundance with the canopy algorithm. Unlike traditional clustering algorithms such as K-means, canopy clustering does not require the number of clusters k to be specified in advance, which makes it of considerable practical value. Compared with other clustering algorithms, canopy clustering is less precise but much faster, so it can be used for an initial "coarse" clustering of the data, followed by a "fine" clustering with canopy, K-means or a similar method.
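A generic canopy clustering pass can be sketched as follows. This is the textbook two-threshold procedure, not necessarily the variant used in the source: Euclidean distance between abundance profiles, the threshold values, and always taking the first remaining point as the next center are assumptions made for illustration.

```python
import math


def canopy_clusters(profiles, t1, t2):
    """Group abundance profiles into canopies using a loose and a tight threshold.

    profiles: {name: tuple of abundances across samples}
    t1: loose threshold; points within t1 of a center join its canopy
    t2: tight threshold (t2 <= t1); points within t2 stop being candidates

    Assumption: Euclidean distance; the source does not name a distance measure.
    """
    if t2 > t1:
        raise ValueError("t2 must not exceed t1")
    remaining = list(profiles)
    canopies = []
    while remaining:
        center = remaining[0]
        distances = {
            name: math.dist(profiles[center], profiles[name]) for name in remaining
        }
        members = [name for name in remaining if distances[name] <= t1]
        canopies.append((center, members))
        # The center is at distance 0, so it is always removed and the loop ends.
        remaining = [name for name in remaining if distances[name] > t2]
    return canopies
```

No cluster count is passed in; the number of canopies follows from the data and the two thresholds, which is the property the paragraph above highlights.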
According to yet another embodiment of the invention, there is provided a method for determining the state of an individual using the isolated nucleic acid of any of the above embodiments, the method comprising the following steps:
Determining the abundance of the nucleic acid sequence clusters contained in the isolated nucleic acid.
The abundance of the nucleic acid sequence clusters of the nucleic acid is determined in a stool sample of the individual and in a control group.
According to an embodiment of the invention, the abundance of a nucleic acid sequence cluster in the stool sample of the individual and/or in the control group is determined as follows: sequencing data are obtained for the nucleic acids in the stool sample of the individual and for the nucleic acids in the control group, the sequencing data from either source comprising a plurality of reads; the abundance of each nucleic acid sequence of the cluster in a stool sample from a given source is determined from the support that sequence receives from the reads in the sequencing data from that source; and the abundance of the cluster in that stool sample is determined from the abundances of its nucleic acid sequences in the same sample.
The sequencing data are obtained by sequencing the nucleic acids in the sample. Depending on the platform chosen, sequencing may be performed on, but is not limited to, semiconductor sequencing platforms such as PGM, Ion Proton and BGISEQ-100; sequencing-by-synthesis platforms such as Illumina's HiSeq and MiSeq; and single-molecule real-time sequencing platforms such as PacBio. Sequencing may be single-end or paired-end. The raw output consists of the sequenced fragments, which are called reads.
According to one embodiment of the invention, determining the abundance of each nucleic acid sequence of the cluster in a stool sample from a given source, from the support that sequence receives from the reads in the sequencing data from that source, comprises: aligning the reads to the nucleic acid sequence and, based on the alignment result, determining the abundance of the nucleic acid sequence with the following formula. The abundance of a nucleic acid sequence G is Ab(G) = Ab(U_G) + Ab(M_G), where Ab(U_G) = U_G / l_G, U_G is the number of reads that align uniquely to G, and l_G is the length of G,
[Equation image: Figure PCTCN2016076577-appb-000003]
M_G is the number of reads that align non-uniquely to G, x indexes those reads, and Co_x is the abundance coefficient of read x,
[Equation image: Figure PCTCN2016076577-appb-000004]
N2 is the total number of nucleic acid sequences to which a read that aligns non-uniquely to G is aligned, y indexes those nucleic acid sequences, and U_y is the number of reads that align uniquely to nucleic acid sequence y.
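Because the two equation images are not reproduced in this text, the following LaTeX block gives one reading that is consistent with the definitions of M_G, Co_x, N2 and U_y. It is an assumed reconstruction, not the source's own typesetting; in particular, the numerator of Co_x and the use of the length-normalized quantity Ab(U_y) = U_y / l_y in the denominator are assumptions.

```latex
\begin{align}
\mathrm{Ab}(G)   &= \mathrm{Ab}(U_G) + \mathrm{Ab}(M_G), \\
\mathrm{Ab}(U_G) &= \frac{U_G}{l_G}, \\
\mathrm{Ab}(M_G) &= \frac{1}{l_G} \sum_{x=1}^{M_G} \mathrm{Co}_x, \\
\mathrm{Co}_x    &= \frac{\mathrm{Ab}(U_G)}{\sum_{y=1}^{N_2} \mathrm{Ab}(U_y)},
\qquad \mathrm{Ab}(U_y) = \frac{U_y}{l_y}.
\end{align}
% Worked instance under these assumptions: U_G = 2, l_G = 100, and one
% non-unique read that also aligns to a sequence with U = 1 and l = 200:
%   Co = 0.02 / (0.02 + 0.005) = 0.8
%   Ab(G) = 0.02 + 0.8 / 100 = 0.028
```

Under this reading, the coefficients of a single non-unique read sum to 1 over the N2 sequences it aligns to, so each read contributes exactly one read's worth of support in total.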
Alignment can be performed with known alignment software such as SOAP, BWA or TeraMap. Alignment parameters are usually set so that a read, or a pair of reads, may contain at most k mismatched bases, for example k ≤ 2; a read with more than k mismatches is regarded as unable to align to the nucleic acid sequence. The alignment result describes how each read aligns to each nucleic acid sequence: whether the read aligns to one or more nucleic acid sequences, whether it aligns uniquely to one sequence or to several, the position on the sequence to which it aligns, and whether it aligns to a single position or to multiple positions. According to one embodiment of the invention, alignment is performed with SOAPalign 2.21 with the parameters -r 2 -m 100 -x 1000. Reads that align to nucleic acid sequences in the cluster fall into two classes: a) reads that align to exactly one nucleic acid sequence, called unique reads (U); and b) reads that align to more than one nucleic acid sequence, called multiple reads (M). For a given nucleic acid sequence G, the abundance Ab(G) depends on both classes; Ab(U) and Ab(M) in the formula above are the abundances contributed to G by its unique reads and its multiple reads, respectively. Each multiple read has its own abundance coefficient Co. If a multiple read aligns to N nucleic acid sequences, its Co is calculated with the following formula:
[Equation image: Figure PCTCN2016076577-appb-000005]
That is, for such a multiple read, the denominator is the sum of the unique-read abundances of the N sequences to which it aligns.
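The split into unique and multiple reads can be sketched as a post-processing step over alignment hits. The input layout (a mismatch count per read and candidate sequence) and the function name are hypothetical; in practice the mismatch limit is normally enforced by the aligner itself, so this sketch only illustrates the classification logic.

```python
def classify_reads(candidate_hits, max_mismatches=2):
    """Split reads into unique reads, multiple reads and unaligned reads.

    candidate_hits: {read_id: {sequence_name: mismatch_count}}
    A hit counts only if it has at most `max_mismatches` mismatched bases.
    """
    unique, multiple, unaligned = {}, {}, []
    for read_id, hits in candidate_hits.items():
        accepted = sorted(seq for seq, mm in hits.items() if mm <= max_mismatches)
        if not accepted:
            unaligned.append(read_id)      # too many mismatches everywhere
        elif len(accepted) == 1:
            unique[read_id] = accepted[0]  # unique read (U)
        else:
            multiple[read_id] = accepted   # multiple read (M)
    return unique, multiple, unaligned
```

A read whose second hit exceeds the mismatch limit is therefore counted as unique, because only accepted hits are considered.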
The abundance of a nucleic acid sequence cluster depends on the abundances of the nucleic acid sequences it contains. According to one embodiment of the invention, the abundance of a nucleic acid sequence cluster is the mean or the median of the abundances of its nucleic acid sequences.
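Summarizing a cluster by the mean or median of its member abundances can be sketched with the standard library. The function and parameter names are hypothetical.

```python
from statistics import mean, median


def cluster_abundance(sequence_abundances, cluster_members, how="median"):
    """Summarize a cluster's abundance from the abundances of its sequences."""
    values = [sequence_abundances[name] for name in cluster_members]
    if not values:
        raise ValueError("cluster has no member sequences")
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    raise ValueError("how must be 'mean' or 'median'")
```

The median is less sensitive than the mean to a single member sequence with an unusually high abundance; the source allows either choice.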
Determining the state of the individual.
The abundance of the nucleic acid sequence cluster in the stool sample of the individual is compared with its abundance in the control group, and the state of the individual is determined according to whether the difference is statistically significant. The control group consists of stool samples from one or more groups of individuals in the same state, and the state is either having colorectal cancer or not having colorectal cancer.
According to an embodiment of the invention, the control group consists of stool samples from a plurality of individuals with colorectal cancer; when the abundance of the nucleic acid sequence cluster in the stool sample of the individual does not differ statistically from its abundance in the control group, the individual is determined to have colorectal cancer.
According to an embodiment of the invention, the control group consists of stool samples from a plurality of healthy individuals; when the abundance of the nucleic acid sequence cluster in the stool sample of the individual is statistically significantly lower than its abundance in the control group, the individual is determined to have colorectal cancer.
A difference is said to be statistically significant, or not, when the nucleic acid sequence cluster abundance determined for the individual falls outside, or inside, a predetermined confidence interval of the cluster's abundance in the control group. "Statistically significantly lower" means that the abundance determined for the individual is below the lower limit of that predetermined confidence interval. A confidence interval is an interval estimate of a population parameter constructed from a sample statistic. It expresses the degree to which the true value of the parameter lies, with a certain probability, around the measured result, and thus how far the measured value can be trusted; this "certain probability" is called the confidence level. According to an embodiment of the invention, the predetermined confidence interval is the 95% confidence interval, meaning that the state determined for the individual using this embodiment is 95% reliable. Depending on the purpose or requirements, different degrees of confidence in the determined state may be needed; those skilled in the art may choose a different significance level (α), that is, a different probability of making an error, in which case the confidence in the determined state of the individual is 1-α.
According to an embodiment of the invention, when the control group consists of samples from colorectal cancer patients, "no statistically significant difference" means that the determined abundance of the nucleic acid sequence cluster falls within a first predetermined confidence interval of its abundance in the control group. According to one embodiment of the invention, the first predetermined confidence interval is the 95% confidence interval: 2.72E-08 to 7.02E-07 for the first nucleic acid sequence cluster, and 4.84E-07 to 1.61E-06 for the second nucleic acid sequence cluster.
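The interval check described here can be sketched as below. The interval bounds are the values stated above; the cluster keys, the function names, and the choice to treat the bounds as inclusive are assumptions made for illustration. The text does not say how results for the two clusters are combined, so the sketch reports them separately.

```python
# 95% confidence intervals of cluster abundance in the colorectal cancer
# control group, as stated in the text. The dictionary keys are
# hypothetical labels.
CRC_CONTROL_CI = {
    "cluster_1": (2.72e-08, 7.02e-07),
    "cluster_2": (4.84e-07, 1.61e-06),
}


def within_interval(abundance, interval):
    """Return True if the abundance lies inside the interval (inclusive)."""
    low, high = interval
    return low <= abundance <= high


def consistent_with_crc_control(abundances, intervals=CRC_CONTROL_CI):
    """For each cluster, report whether the individual's abundance shows no
    statistically significant difference from the CRC control group."""
    return {
        name: within_interval(abundances[name], interval)
        for name, interval in intervals.items()
    }


# Hypothetical abundances measured in one individual's stool sample.
individual = {"cluster_1": 1.0e-07, "cluster_2": 2.0e-06}
result = consistent_with_crc_control(individual)
```

For this hypothetical individual, the first cluster falls inside its interval and the second falls outside.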
All or some of the steps of the method for determining the state of an individual using the isolated nucleic acid in any of the above embodiments may be carried out by an apparatus or system comprising separable functional modules corresponding to those steps, or the method may be implemented as a program stored on a machine-readable medium and run by a machine.
According to one embodiment of the invention, an apparatus is provided for determining the state of an individual using the isolated nucleic acid of any of the above embodiments. The apparatus carries out all or some of the steps of any of the above methods for determining the state of an individual and comprises: an abundance determining unit for determining the abundance of a nucleic acid sequence cluster of the nucleic acid in the stool sample of the individual and in a control group; and an individual state determining unit for comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group and determining the state of the individual according to whether the difference is statistically significant. The control group consists of stool samples from one or more groups of individuals in the same state, and the state is either having colorectal cancer or not having colorectal cancer. The above description of the technical features and advantages of the method for determining the state of an individual using the isolated nucleic acid applies equally to the apparatus of this aspect of the invention and is not repeated here.
According to an embodiment of the invention, the abundance determining unit performs the following: obtaining sequencing data of the nucleic acid in the stool sample of the individual and of the nucleic acid in the control group, the sequencing data from either source comprising a plurality of reads; determining the abundance of a nucleic acid sequence in the stool sample of a given source according to how the reads in the sequencing data from that source support the nucleic acid sequences in the nucleic acid sequence cluster; and determining, from the abundance of the nucleic acid sequences in that stool sample, the abundance of the nucleic acid sequence cluster to which they belong in the same stool sample.
According to an embodiment of the invention, determining the abundance of a nucleic acid sequence in the stool sample of a given source according to how the reads in the sequencing data from that source support the nucleic acid sequences in the nucleic acid sequence cluster comprises: aligning the reads to the nucleic acid sequence and, based on the alignment result, determining the abundance of the nucleic acid sequence using the following formulas: the abundance of nucleic acid sequence G is Ab(G) = Ab(U_G) + Ab(M_G), where Ab(U_G) = U_G / l_G, U_G is the number of reads uniquely aligned to the nucleic acid sequence G, and l_G is the length of the nucleic acid sequence G,
Figure PCTCN2016076577-appb-000006
M_G is the number of reads non-uniquely aligned to the nucleic acid sequence G, x is the index of a read non-uniquely aligned to the nucleic acid sequence G, and Co_x is the abundance coefficient of read x,
Figure PCTCN2016076577-appb-000007
N2 is the total number of nucleic acid sequences to which a read non-uniquely aligned to the nucleic acid sequence G aligns, y is the index of such a nucleic acid sequence, and U_y is the number of reads uniquely aligned to nucleic acid sequence y.
According to an embodiment of the invention, the abundance of the nucleic acid sequence cluster is the mean or median of the abundances of the nucleic acid sequences it contains.
According to an embodiment of the invention, the control group consists of stool samples from a plurality of individuals with colorectal cancer. When the abundance of the nucleic acid sequence cluster in the stool sample of the individual does not differ statistically from its abundance in the control group, the individual is determined to have colorectal cancer.
According to an embodiment of the invention, the control group consists of stool samples from a plurality of healthy individuals. When the abundance of the nucleic acid sequence cluster in the stool sample of the individual is statistically significantly lower than its abundance in the control group, the individual is determined to have colorectal cancer.
As stated above, a difference is said to be statistically significant, or not, when the nucleic acid sequence cluster abundance determined for the individual falls outside, or inside, a predetermined confidence interval of the cluster's abundance in the control group. "Statistically significantly lower" means that the abundance determined for the individual is below the lower limit of that predetermined confidence interval. A confidence interval is an interval estimate of a population parameter constructed from a sample statistic. It expresses the degree to which the true value of the parameter lies, with a certain probability, around the measured result, and thus how far the measured value can be trusted; this "certain probability" is called the confidence level. According to an embodiment of the invention, the predetermined confidence interval is the 95% confidence interval, meaning that the state determined for the individual using this embodiment is 95% reliable. Depending on the purpose or requirements, different degrees of confidence in the determined state may be needed; those skilled in the art may choose a different significance level (α), that is, a different probability of making an error, in which case the confidence in the determined state of the individual is 1-α.
According to one embodiment of the invention, a system is provided for determining the state of an individual using the isolated nucleic acid of any of the above embodiments. The system carries out all or some of the steps of the method for determining the state of an individual using the isolated nucleic acid in any of the above embodiments and comprises: a data input module for inputting data; a data output module for outputting data; a processor for executing an executable program, execution of which carries out the method for determining the state of an individual in any of the above embodiments; and a storage unit, connected to the data input module, the data output module, and the processor, for storing data including the executable program. The above description of the technical features and advantages of the method for determining the state of an individual using the isolated nucleic acid applies equally to the system of this aspect of the invention and is not repeated here.
According to one embodiment of the invention, a medicament for treating colorectal cancer is provided, which increases the abundance of the isolated nucleic acid of any of the above embodiments in the patient's intestinal tract. The isolated nucleic acid was identified by the inventors by processing and analyzing sequencing data from gut microbial samples, comparing the abundance of gut microbial sequences between colorectal cancer patients and healthy individuals, and validating the result on a large number of samples; each nucleic acid sequence cluster it contains consists of non-redundant sequences. The isolated nucleic acid can serve as a colorectal cancer marker that is significantly enriched in healthy individuals relative to colorectal cancer patients. "Significantly enriched" means that the abundance of the nucleic acid sequence clusters contained in the marker is statistically significantly higher, or clearly and substantially higher, in the healthy group than in the disease group. The isolated nucleic acid can be used to determine how likely an individual is to have colorectal cancer or to be healthy, and can be used for non-invasive early detection or auxiliary detection of colorectal cancer. A substance that increases the abundance of the isolated nucleic acid can be used to treat colorectal cancer or can benefit colorectal cancer patients; such substances are not limited to medicaments for treating colorectal cancer and functional foods that benefit the balance of the intestinal flora. The isolated nucleic acid provided by this embodiment can be used to prepare medicaments for treating colorectal cancer and/or functional foods, health products, and the like that help balance the intestinal flora.
With the medicament or functional food of this embodiment, the identified colorectal cancer sequence markers are applied rationally and effectively to support the growth of beneficial gut bacteria or sequences and/or to suppress potentially pathogenic gut bacteria or sequences. This can prevent damage to the intestinal barrier and improve and restore the intestinal microecological structure, which is important for helping to lower blood endotoxin levels and/or relieve the clinical symptoms of colorectal cancer.
According to one embodiment of the invention, a method of producing or screening the above medicament is also provided, comprising the step of screening for a substance that increases the abundance of the isolated nucleic acid of any of the above embodiments and using it as the medicament.
With the method of this embodiment for producing or screening a medicament for treating colorectal cancer, screening based on rational and effective use of the identified colorectal cancer biomarkers can yield medicaments that support the growth of beneficial gut bacteria and/or suppress potentially pathogenic gut bacteria. Such medicaments can prevent damage to the intestinal barrier and improve and restore the intestinal microecological structure, which is important for helping to lower blood endotoxin levels and/or relieve the clinical symptoms of colorectal cancer.
According to a final embodiment of the invention, a method is provided for classifying a plurality of individuals using the isolated nucleic acid of any of the above embodiments, comprising: determining the state of each individual using the method for determining the state of an individual in any of the above embodiments; and classifying the individuals according to the states obtained. The method can separate a plurality of individuals, or a plurality of unknown stool samples, according to their states, which facilitates sorting and labeling. The above description of the technical features and advantages of the method for determining the state of an individual using the isolated nucleic acid applies equally to the method of this aspect of the invention and is not repeated here.
The method and/or apparatus of the invention is described in detail below with reference to specific examples. Unless otherwise stated, the reagents, sequences (adapters, tags, and primers), software, and instruments used in the following examples and not specifically described are conventional commercially available products or open source, for example Illumina's transcriptome library construction kit.
The following examples comprise a first stage and a second stage, corresponding to a discovery stage and a validation stage. In the discovery stage, data sets from 53 colorectal cancer patients, 42 colon adenoma patients, and 61 healthy controls are analyzed and compared to determine changes in gut microbial composition and function and thereby identify species markers. In the validation stage, data sets from 38 colorectal cancer patients and 19 healthy controls are used to verify the accuracy of the first-stage predictions.
Example 1
In this example, the inventors carried out an association analysis of the whole gut microbiota using stool samples from 53 colorectal cancer patients, 42 colon adenoma patients, and 61 healthy controls, to characterize the fecal microbial community and its functional components. In total, the inventors downloaded about 1084.87 Gb of high-quality sequencing data and constructed a colorectal cancer reference gene set. Quantitative metagenomic analysis showed that, across the patients and healthy controls, 2,464,280 genes could be clustered into 258 co-abundance gene groups (CAGs) representing bacterial species. Of these, 26 differed between the colorectal cancer group and the healthy control group: 17 CAGs were enriched in the colorectal cancer group and 9 in the healthy group, as shown in Table 1. Only 9 of these 26 CAGs could be annotated to a species; the other 17 could not be annotated and represent unidentified new species.
1. Acquisition of sequencing data
1.1 Sample collection and DNA extraction
The first-stage sequencing data of stool sample DNA from colorectal cancer patients, colon adenoma patients, and healthy individuals were obtained from the EBI database under accession number ERP005534 (Zeller G et al., 2014): 53 colorectal cancer patients, 42 colon adenoma patients, and 61 healthy individuals, all from France. Each sample yielded on average 5 Gb of high-quality sequencing data, for a total of 1084.87 Gb.
The second-stage sequencing data of stool sample DNA from colorectal cancer patients and healthy individuals were also obtained from the EBI database: accession number ERP005534 (Zeller G et al., 2014), comprising 38 colorectal cancer patients and 5 healthy individuals from Germany; and accession number ERA000116 (Qin et al., 2010), comprising 14 healthy individuals from Spain.
Microbial markers associated with colorectal cancer were identified following the experimental workflow in Figure 1. Omitted steps and details are well known to those skilled in the art; the key steps are described below.
2. Identification of biomarkers
2.1 Basic processing of sequencing data
The raw EBI data had already undergone quality control and host-sequence removal, but still contained many short reads. After the data were downloaded from EBI, reads shorter than 60 bases were filtered out in pairs.
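A minimal sketch of the paired length filter follows. It assumes that a pair is discarded when either mate is shorter than 60 bases, which is one reading of "filtered out in pairs"; the function name and the in-memory representation of read pairs are illustrative only.

```python
def filter_short_pairs(pairs, min_length=60):
    """Keep only read pairs in which both mates have at least min_length bases.

    pairs: iterable of (read1_sequence, read2_sequence) strings.
    Assumption: if either mate is too short, the whole pair is dropped.
    """
    return [
        (read1, read2)
        for read1, read2 in pairs
        if len(read1) >= min_length and len(read2) >= min_length
    ]


# Hypothetical read pairs of different lengths.
pairs = [
    ("A" * 100, "C" * 100),  # kept
    ("A" * 59, "C" * 100),   # dropped: first mate too short
    ("A" * 60, "C" * 60),    # kept: exactly at the threshold
]
kept = filter_short_pairs(pairs)
```

Dropping both mates together keeps the two read files in sync, which paired-end aligners generally require.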
2.2 Obtaining the colorectal cancer microbiome gene set
Metagenomic biomarkers consist mainly of genes and their corresponding functions, so the sequencing reads must be assembled, genes predicted, and redundancy removed to build a non-redundant reference gene set. The reads of all samples were assembled into contigs with SOAPdenovo. In total, 64.06% of all reads were assembled into 8.98 million contigs (minimum contig length 500 bp). These contigs have a total length of 18.8 Gb, with N50 lengths ranging from 1,253 to 18,741 bp and an average length of 4,773 bp.
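For readers unfamiliar with the N50 statistic quoted above, the following sketch computes it under the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig where the running sum covers half the total.
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths, total 32; the two longest already cover half.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```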
To predict the microbial genes in each of the 156 samples, the inventors used the method of the MetaHIT human gut genome study. MetaGeneMark predicted 26,039,803 open reading frames (ORFs) longer than 100 bp. The predicted ORFs have a total length of 16,095,621,987 bp, or 85.61% of the total contig length. A non-redundant "CRC gene set" was built by removing redundant ORFs: two ORFs were treated as the same sequence when, after pairwise alignment, their identity exceeded 95% and the alignment covered more than 90% of the shorter ORF, and for each group of identical sequences one was retained at random. The final non-redundant colorectal cancer gut gene set contains 6,585,575 ORFs with an average length of 609.70 bp.
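The redundancy criterion can be written as a small predicate. This is a sketch under the assumption that coverage is measured as aligned length divided by the length of the shorter ORF; the function name and argument layout are hypothetical and do not correspond to any particular alignment tool's output.

```python
def is_redundant(identity, aligned_length, length_a, length_b,
                 min_identity=0.95, min_coverage=0.90):
    """Decide whether two ORFs count as the same sequence.

    identity:       fraction of identical bases in the pairwise alignment.
    aligned_length: number of aligned bases.
    length_a/b:     lengths of the two ORFs.
    Assumption: coverage is computed relative to the shorter ORF.
    """
    coverage = aligned_length / min(length_a, length_b)
    return identity > min_identity and coverage > min_coverage


# Hypothetical pair: a 300 bp ORF aligned over 280 bp to a 900 bp ORF.
same = is_redundant(identity=0.97, aligned_length=280, length_a=300, length_b=900)
```

Both thresholds are strict inequalities here, following the wording "exceeded 95%" and "more than 90%".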
2.3 Gene abundance analysis
The paired-end reads processed in step 2.1 were aligned to the non-redundant reference gene set from step 2.2 using SOAPalign 2.21 with the parameters -r 2 -m 100 -x 1000. The aligned reads fall into two classes: a) unique reads (U), which align to only one gene in the non-redundant gene set; and b) multiple reads (M), which align to more than one gene in the non-redundant gene set.
For a given gene G, the abundance Ab(G) depends on its unique reads and multiple reads and is calculated as follows:
Ab(G)=Ab(U)+Ab(M)
Ab(U)=U/l
Figure PCTCN2016076577-appb-000008
Ab(U) and Ab(M) are the abundances of the unique reads and the multiple reads of gene G, respectively, and l is the length of gene G. Each multiple read has its own gene abundance coefficient Co; if a multiple read aligns to N genes, its Co is calculated as follows:
Figure PCTCN2016076577-appb-000009
That is, for a multiple read, the inventors use the sum of the unique-read abundances of the N genes to which it aligns as the denominator.
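The abundance calculation can be sketched as follows. The formula images are not reproduced in this text, so two points are assumptions rather than statements of the patent's exact formulas: that Ab(M) is the sum of the Co values of a gene's multiple reads divided by the gene length l, and that the numerator of Co is the gene's own unique-read abundance Ab(U). Only the denominator of Co (the sum of the unique-read abundances of the N genes hit) is stated explicitly. All names are illustrative.

```python
def gene_abundances(unique_counts, lengths, multiple_reads):
    """Estimate Ab(G) = Ab(U) + Ab(M) for every gene.

    unique_counts:  dict gene -> number of unique reads U.
    lengths:        dict gene -> gene length l.
    multiple_reads: list of lists; each inner list holds the genes that one
                    multiple read aligns to.
    """
    # Ab(U) = U / l for each gene.
    ab_unique = {g: unique_counts.get(g, 0) / lengths[g] for g in lengths}

    co_sum = {g: 0.0 for g in lengths}
    for genes in multiple_reads:
        denominator = sum(ab_unique[g] for g in genes)
        if denominator == 0:
            continue  # assumption: a read with no unique support is skipped
        for g in genes:
            # Assumed Co: this gene's share of the unique-read abundance.
            co_sum[g] += ab_unique[g] / denominator

    # Assumed Ab(M) = (sum of Co) / l.
    return {g: ab_unique[g] + co_sum[g] / lengths[g] for g in lengths}


# Hypothetical data: two genes sharing one multiple read.
abundance = gene_abundances(
    unique_counts={"G1": 10, "G2": 20},
    lengths={"G1": 100, "G2": 200},
    multiple_reads=[["G1", "G2"]],
)
```

In this example both genes have Ab(U) = 0.1, so the shared read is split equally and contributes 0.5 / l to each gene.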
2.4 Cluster analysis / screening of colorectal cancer biomarkers
To study the gut metagenomes of healthy individuals (88 cases, including patients with small colon adenomas) and colorectal cancer patients (53 cases), unknown species were identified in the de-redundant gene set. To explore and understand the known and unknown species associated with colorectal cancer, genes of differing abundance were first identified in the gene set of the 156 samples and then clustered according to the abundance table. Genes of the same species have similar abundance within an individual, while their abundance varies between individuals, so genes of the same species can be clustered by co-abundance. The resulting gene clusters represent metagenomic species (CAGs). Known or unknown species associated with colorectal cancer were then identified among the CAGs found.
Briefly, genes detected in at least 10 samples were first selected from the gene abundance table as the clustering input. The first round of clustering used the canopy algorithm, with a T1 threshold of Pearson correlation coefficient > 0.95 and Spearman correlation coefficient > 0.6, and a T2 threshold of Pearson correlation coefficient > 0.9.
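The input filter and the T1 test can be sketched in plain Python. This is not a full canopy clustering implementation; it only shows the two checks named above. It assumes that "detected" means a non-zero abundance, that a gene's abundance profile is a list with one value per sample, and that a constant profile has correlation 0. All function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: constant profiles count as uncorrelated
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def detected_in_enough_samples(profile, min_samples=10):
    """Input filter: gene detected (abundance > 0) in at least min_samples."""
    return sum(1 for value in profile if value > 0) >= min_samples


def within_t1(profile, seed, min_pearson=0.95, min_spearman=0.6):
    """T1 test: profile is close enough to the canopy seed profile."""
    return (pearson(profile, seed) > min_pearson
            and spearman(profile, seed) > min_spearman)


# Hypothetical profiles across 12 samples.
seed_profile = [float(v) for v in range(1, 13)]
gene_profile = [2.0 * v for v in seed_profile]
joins_canopy = within_t1(gene_profile, seed_profile)
```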
After the first round, canopies containing only one gene were removed to correct for fragmentation, and a second round of clustering was performed. Its input was the mean abundance of each canopy obtained in the first round, and it used a canopy-like algorithm in which a new element (gene sequence) joins an existing cluster if its Pearson correlation coefficient with at least 70% of the elements in that cluster exceeds 0.97. After the second round, overlaps between clusters were removed: each gene present in more than one cluster was assigned to the cluster with which it has the highest Pearson correlation coefficient.
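The two rules of the second round can be sketched as below. The sketch assumes that "70% or more" is inclusive, and that when overlaps are resolved each cluster is represented by a single profile (for example its mean abundance profile); the text does not specify how the correlation with a cluster is computed. Function names and data are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: constant profiles count as uncorrelated
    return sxy / sqrt(sxx * syy)


def joins_cluster(candidate, members, min_fraction=0.70, min_pearson=0.97):
    """A candidate joins if it correlates above min_pearson with at least
    min_fraction of the cluster's current members."""
    hits = sum(1 for member in members if pearson(candidate, member) > min_pearson)
    return hits >= min_fraction * len(members)


def best_cluster(gene_profile, cluster_profiles):
    """Resolve overlap: pick the cluster whose representative profile has
    the highest Pearson correlation with the gene."""
    return max(
        cluster_profiles,
        key=lambda name: pearson(gene_profile, cluster_profiles[name]),
    )


# Hypothetical profiles over four samples.
candidate = [1.0, 2.0, 3.0, 4.0]
members = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [3.0, 6.0, 9.0, 12.5]]
accepted = joins_cluster(candidate, members)
assigned = best_cluster(candidate, {"A": [1.0, 2.0, 3.0, 5.0], "B": [4.0, 3.0, 2.0, 1.0]})
```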
Run and results for this example: the 2,464,280 genes selected from the gene abundance table were clustered. The first round yielded 1,110,070 canopies, of which 254,018 contained more than one gene and entered the second round. The second round yielded 170,203 CAGs, of which 1,546 contained more than 100 genes and 285 contained more than 700 genes. Based on the mean abundance of the 285 CAGs with more than 700 genes, a rank-sum test with Benjamini-Hochberg multiple-testing correction (threshold FDR < 0.05) identified 9 CAGs enriched in healthy individuals and 17 CAGs enriched in colorectal cancer patients, containing 51,472 and 8,374 genes, respectively.
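The Benjamini-Hochberg step can be sketched as follows. The sketch takes the per-CAG p-values as given; they would come from a rank-sum test computed elsewhere, and the p-values shown are hypothetical rather than taken from the patent.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never
    # increase as the rank decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant(p_values, fdr=0.05):
    """Indices whose adjusted p-value is below the FDR threshold."""
    return [i for i, q in enumerate(benjamini_hochberg(p_values)) if q < fdr]


# Hypothetical rank-sum p-values for five CAGs.
p_values = [0.01, 0.04, 0.03, 0.005, 0.5]
selected = significant(p_values)
```

With these hypothetical values the fifth CAG is not selected, while the first and fourth pass the FDR threshold.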
为了证明基因簇中的基因属于基因组且与CAG分类注释一致,对6006个基因组进行了blat分析,参照截止2012年8月的第三版的NCBI中的有效参考基因组和HMP、MetaHIT的DACC肠道基因组。当存在唯一一个物种时,有CAG中多于90%基因比对到其基因组上、且比对上的部分占较短ORF的90%、相似度达到95%,将CAG注释上该已知基因组。基因数目大于700的26个CAG中,有9个被归类到物种各级水平,另外17个无法注释到物种,说明这些均为未发表的新物种,如表1所示。标记基因均匀注释验证了聚类质量,适用于整个CAG基因。In order to prove that the genes in the gene cluster belong to the genome and are consistent with the CAG classification annotation, blat analysis was performed on 6006 genomes, with reference to the effective reference genome in the third edition of NCBI as of August 2012 and the DACC intestinal tract of HMP and MetaHIT. Genome. When there is only one species, more than 90% of the genes in the CAG are aligned to the genome, and the alignment is 90% of the shorter ORF, and the similarity is 95%. The known genome is noted by CAG. . Of the 26 CAGs with a gene number greater than 700, 9 were classified at the species level, and 17 were unable to annotate the species, indicating that these were unpublished new species, as shown in Table 1. Marker gene uniform annotation validates cluster quality and applies to the entire CAG gene.
Table 1. Species annotation of the differential CAGs
Figure PCTCN2016076577-appb-000010
Figure PCTCN2016076577-appb-000011
In summary, part of the 2,464,280 genes clustered into 285 CAGs, and the abundance of 26 of these CAGs differed significantly between healthy individuals and colorectal cancer patients. Nine CAGs contained 51,472 genes enriched in healthy individuals, and 17 CAGs contained 8,374 genes enriched in colorectal cancer patients. Seventeen of the 26 CAGs could not be annotated to a species and represent unidentified new species.
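The multiple-testing step used in Example 1 to select the differential CAGs (Benjamini-Hochberg correction with FDR < 0.05) can be sketched as below. The sketch covers only the correction; the per-CAG rank-sum p-values are assumed to have been computed beforehand, and the function name and toy p-values are illustrative, not data from the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy p-values for five hypothetical CAGs.
toy_pvalues = [0.001, 0.008, 0.02, 0.3, 0.9]
toy_fdr = benjamini_hochberg(toy_pvalues)
toy_significant = [i for i, q in enumerate(toy_fdr) if q < 0.05]
```

Each CAG whose adjusted value falls below 0.05 would then be assigned to the healthy-enriched or patient-enriched set according to the direction of its abundance difference.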
Example 2
At a significance level of α = 0.05, some of the nucleic acid sequences in the markers identified in Example 1 as significantly enriched in colorectal cancer patients or in healthy individuals were taken to represent the CAGs they belong to. For some of the markers, the difference in abundance between the healthy group and the disease group of the validation cohort was also significant (P < 0.05). The validation results are shown in Table 2 and Figure 2.
In the validation set, all 258 CAGs retained consistent abundance, so these gene clusters can be regarded as representing all or part of the genes of a particular strain. Of the 26 CAGs associated with colorectal cancer in the discovery set, 2 still differed significantly (p < 0.05), as shown in Table 2 and Figure 2. The abundance of each of these two CAGs is the mean of the abundances of all the genes it contains, as shown in Table 3.
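Computing a CAG's abundance in one sample from the abundances of its member genes is a one-line aggregation. The sketch below uses the mean, as stated above, and also offers the median, which claim 16 names as an alternative; the function name and the toy values are assumptions for illustration.

```python
from statistics import mean, median


def cag_abundance(gene_abundance, cag_genes, how="mean"):
    """Aggregate per-gene abundances of one sample into a CAG abundance."""
    values = [gene_abundance[g] for g in cag_genes]
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    raise ValueError("how must be 'mean' or 'median'")


# Toy per-gene abundances for a single sample (illustrative values only).
toy_gene_abundance = {"gene_a": 1.0, "gene_b": 2.0, "gene_c": 6.0}
toy_mean = cag_abundance(toy_gene_abundance, ["gene_a", "gene_b", "gene_c"])
toy_median = cag_abundance(toy_gene_abundance, ["gene_a", "gene_b", "gene_c"], how="median")
```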
Table 2
Figure PCTCN2016076577-appb-000012
Table 3
Figure PCTCN2016076577-appb-000013
Example 3
Fifty-seven stool samples were used to test the state of the individuals from whom the samples were taken.
Following the method of Example 2, the abundance of at least one CAG (bacterial species) from Table 2 was determined in each stool sample, and it was assessed whether the abundance of that CAG (or those CAGs) in the sample was significantly lower than its abundance in the control (healthy) group. For example, an individual whose abundance of CAG16610 and/or CAG28666 fell within the 95% confidence interval of the colorectal cancer patient group in Table 3 above was classified as a colorectal cancer patient, and an individual whose abundance fell within the 95% confidence interval of the healthy group was classified as a non-colorectal-cancer individual.
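The interval-based decision described above can be sketched as follows. This is one possible reading, under stated assumptions: a sample that falls in neither interval, or in both, is left uncalled, and when several markers are used together they must all give the same call. The marker names `cag_a` and `cag_b`, the interval bounds, and the function names are hypothetical placeholders and are not the values of Table 3.

```python
def call_marker(abundance, crc_interval, healthy_interval):
    """Call one marker from the two groups' 95% confidence intervals."""
    in_crc = crc_interval[0] <= abundance <= crc_interval[1]
    in_healthy = healthy_interval[0] <= abundance <= healthy_interval[1]
    if in_crc and not in_healthy:
        return "colorectal cancer"
    if in_healthy and not in_crc:
        return "non-colorectal cancer"
    return None  # outside both intervals, or inside both: no call


def call_sample(sample, intervals):
    """Joint call: every marker must agree, otherwise no call is made."""
    calls = {call_marker(sample[m], crc, healthy)
             for m, (crc, healthy) in intervals.items()}
    return calls.pop() if len(calls) == 1 else None


# Placeholder intervals: marker -> (patient-group CI, healthy-group CI).
toy_intervals = {
    "cag_a": ((0.0, 1.0), (2.0, 4.0)),
    "cag_b": ((0.0, 0.5), (1.0, 3.0)),
}
call_patient = call_sample({"cag_a": 0.5, "cag_b": 0.2}, toy_intervals)
call_healthy = call_sample({"cag_a": 3.0, "cag_b": 2.0}, toy_intervals)
call_conflict = call_sample({"cag_a": 0.5, "cag_b": 2.0}, toy_intervals)
```

Requiring agreement between markers leaves more samples uncalled but makes each call stricter, which is one way a joint test could call fewer samples than a single-marker test.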
The results show that when any single CAG from Table 2 was used, a state could be called for about 55 of the samples, and for more than 80% of those 55 samples the call agreed with the recorded state of the individual from whom the sample was taken. When both CAGs from Table 2 were used, a state could be called for 48 of the samples, and for more than 90% of those 48 samples the call agreed with the recorded state. That is, the inventors found that joint detection of all the species in Table 2, i.e. finding that neither of the two markers in Table 2 is enriched in the test sample, identifies colorectal cancer patients or susceptible individuals more accurately.
In regimens that use the markers to treat colorectal cancer, the inventors found that enriching both markers in Table 2 gave an excellent therapeutic effect.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (28)

  1. A set of isolated nucleic acids, comprising at least one of the following two nucleic acid sequence clusters:
    a first nucleic acid sequence cluster, wherein the nucleic acid sequences in the first nucleic acid sequence cluster correspond one-to-one to the sequences shown in SEQ ID NO: 1-151, and each nucleic acid sequence in the first nucleic acid sequence cluster has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 1-151;
    a second nucleic acid sequence cluster, wherein the nucleic acid sequences in the second nucleic acid sequence cluster correspond one-to-one to the sequences shown in SEQ ID NO: 152-233, and each nucleic acid sequence in the second nucleic acid sequence cluster has a sequence similarity of not less than 90% to its corresponding sequence in SEQ ID NO: 152-233.
  2. The nucleic acid of claim 1, wherein the nucleic acid comprises the first nucleic acid sequence cluster.
  3. The nucleic acid of claim 2, wherein each nucleic acid sequence in the first nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 1-151.
  4. The nucleic acid of claim 1, wherein the nucleic acid comprises the second nucleic acid sequence cluster.
  5. The nucleic acid of claim 4, wherein each nucleic acid sequence in the second nucleic acid sequence cluster has a sequence similarity of not less than 95% to its corresponding sequence in SEQ ID NO: 152-233.
  6. The nucleic acid of claim 2 or 4, wherein the nucleic acid further comprises the other of the two nucleic acid sequence clusters.
  7. Use of the nucleic acid of any one of claims 1-6 in detecting colorectal cancer, and/or treating colorectal cancer, and/or preparing a medicament for treating colorectal cancer, and/or preparing a functional food.
  8. A method of obtaining the nucleic acid of any one of claims 1-6, comprising:
    (1) obtaining a first sequencing result and a second sequencing result,
    the first sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of colorectal cancer patients and comprising a plurality of first reads,
    the second sequencing result being a sequencing result of nucleic acids from stool samples of a plurality of healthy individuals and comprising a plurality of second reads;
    (2) assembling the first reads and the second reads separately, to obtain a plurality of first assembled sequences and a plurality of second assembled sequences respectively;
    (3) determining the abundance of the first assembled sequences and the abundance of the second assembled sequences based, respectively, on the support of the first reads for the first assembled sequences and the support of the second reads for the second assembled sequences;
    (4) clustering the first assembled sequences and the second assembled sequences according to the abundances determined in (3), to obtain a plurality of gene clusters, each gene cluster comprising a plurality of first assembled sequences and/or second assembled sequences;
    (5) performing a statistical test to identify gene clusters significantly enriched in the stool samples of the plurality of colorectal cancer patients and/or the plurality of healthy individuals, thereby obtaining the nucleic acid.
  9. The method of claim 8, wherein (2) comprises:
    after the first reads and the second reads are assembled separately, removing redundancy from the assembly result of the first reads and from the assembly result of the second reads, to obtain the plurality of first assembled sequences and second assembled sequences, wherein assembled sequences that, when aligned pairwise, have an identity of not less than 95% and a sequence coverage of more than 90% are defined as the same sequence.
  10. The method of claim 8, wherein (3) comprises:
    aligning the first reads and the second reads to the first assembled sequences and the second assembled sequences respectively, to obtain a first alignment result and a second alignment result;
    determining, based on the first alignment result and the second alignment result respectively, the abundance of the first assembled sequences and the second assembled sequences using the following formulas:
    the abundance of an assembled sequence $S$ is $\mathrm{Ab}(S) = \mathrm{Ab}(U_S) + \mathrm{Ab}(M_S)$, wherein
    $\mathrm{Ab}(U_S) = U_S / l_S$,
    $U_S$ is the number of reads uniquely aligned to the assembled sequence $S$,
    $l_S$ is the length of the assembled sequence $S$,
    Figure PCTCN2016076577-appb-100001
    $M_S$ is the number of reads non-uniquely aligned to the assembled sequence $S$,
    $i$ is the index of a read non-uniquely aligned to the assembled sequence $S$,
    $\mathrm{Co}_i$ is the abundance coefficient of read $i$ non-uniquely aligned to the assembled sequence $S$,
    Figure PCTCN2016076577-appb-100002
    $N1$ is the total number of assembled sequences to which a read non-uniquely aligned to the assembled sequence $S$ is aligned,
    $j$ is the index of an assembled sequence to which a read non-uniquely aligned to the assembled sequence $S$ is aligned,
    $U_j$ is the number of reads uniquely aligned to assembled sequence $j$.
  11. The method of claim 8, wherein, before (4), first assembled sequences and second assembled sequences that are present in fewer than one fifteenth of the total number of stool samples are removed.
  12. The method of claim 8, wherein (4) comprises:
    performing a first clustering on the first assembled sequences and the second assembled sequences, to obtain a first clustering result;
    performing a second clustering on the first clustering result after removing clusters that contain only one assembled sequence, to obtain the plurality of gene clusters.
  13. A method for determining the state of an individual using the nucleic acid of any one of claims 1-6, comprising:
    determining the abundance of a nucleic acid sequence cluster of the nucleic acid in a stool sample of the individual and its abundance in a control group;
    comparing the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group, and determining the state of the individual according to whether the difference is statistically significant,
    wherein the control group consists of stool samples from one or more groups of individuals in the same state, and
    the state includes having colorectal cancer and not having colorectal cancer.
  14. The method of claim 13, wherein the abundance of the nucleic acid sequence cluster in the stool sample of the individual and/or in the control group is determined by:
    obtaining sequencing data of the nucleic acid sequences in the stool sample of the individual and in the control group, the sequencing data from each source comprising a plurality of reads;
    determining, from the support that the reads in the sequencing data from a given source provide for each nucleic acid sequence in the nucleic acid sequence cluster, the abundance of that nucleic acid sequence in the stool sample from that source;
    determining, from the abundance of the nucleic acid sequences in the stool sample from that source, the abundance in the same stool sample of the nucleic acid sequence cluster to which they belong.
  15. The method of claim 14, wherein determining the abundance of the nucleic acid sequence from the support that the reads in the sequencing data from a given source provide for each nucleic acid sequence in the nucleic acid sequence cluster comprises:
    aligning the reads to the nucleic acid sequence,
    determining, based on the alignment result obtained, the abundance of the nucleic acid sequence using the following formulas:
    the abundance of a nucleic acid sequence $G$ is $\mathrm{Ab}(G) = \mathrm{Ab}(U_G) + \mathrm{Ab}(M_G)$, wherein
    $\mathrm{Ab}(U_G) = U_G / l_G$,
    $U_G$ is the number of reads uniquely aligned to the nucleic acid sequence $G$,
    $l_G$ is the length of the nucleic acid sequence $G$,
    Figure PCTCN2016076577-appb-100003
    $M_G$ is the number of reads non-uniquely aligned to the nucleic acid sequence $G$,
    $x$ is the index of a read non-uniquely aligned to the nucleic acid sequence $G$,
    $\mathrm{Co}_x$ is the abundance coefficient of read $x$ non-uniquely aligned to the nucleic acid sequence $G$,
    Figure PCTCN2016076577-appb-100004
    $N2$ is the total number of nucleic acid sequences in the nucleic acid sequence cluster to which a read non-uniquely aligned to the nucleic acid sequence $G$ is aligned,
    $y$ is the index of a nucleic acid sequence to which a read non-uniquely aligned to the nucleic acid sequence $G$ is aligned,
    $U_y$ is the number of reads uniquely aligned to nucleic acid sequence $y$.
  16. The method of claim 14, wherein the abundance of the nucleic acid sequence cluster is the mean or the median of the abundances of the nucleic acid sequences it contains.
  17. The method of claim 13, wherein the control group consists of stool samples from a plurality of individuals having colorectal cancer, and
    when the abundance of the nucleic acid sequence cluster in the stool sample of the individual does not differ in a statistically significant way from its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  18. The method of claim 13, wherein the control group consists of stool samples from a plurality of healthy individuals, and
    when the abundance of the nucleic acid sequence cluster in the stool sample of the individual is statistically significantly lower than its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  19. A device for determining the state of an individual using the nucleic acid of any one of claims 1-6, comprising:
    an abundance determining unit, configured to determine the abundance of a nucleic acid sequence cluster of the nucleic acid in a stool sample of the individual and its abundance in a control group;
    an individual state determining unit, configured to compare the abundance of the nucleic acid sequence cluster in the stool sample of the individual with its abundance in the control group, and to determine the state of the individual according to whether the difference is statistically significant,
    wherein the control group consists of stool samples from one or more groups of individuals in the same state, and
    the state includes having colorectal cancer and not having colorectal cancer.
  20. The device of claim 19, wherein the following is performed in the abundance determining unit:
    obtaining sequencing data of the nucleic acid sequences in the stool sample of the individual and in the control group, the sequencing data from each source comprising a plurality of reads;
    determining, from the support that the reads in the sequencing data from a given source provide for the nucleic acid sequences in the nucleic acid sequence cluster, the abundance of each nucleic acid sequence in the stool sample from that source;
    determining, from the abundance of the nucleic acid sequences in the stool sample from that source, the abundance in the same stool sample of the nucleic acid sequence cluster to which they belong.
  21. The device of claim 20, wherein determining the abundance of the nucleic acid sequence in the stool sample from a given source, from the support that the reads in the sequencing data from that source provide for the nucleic acid sequences in the nucleic acid sequence cluster, comprises:
    aligning the reads to the nucleic acid sequence,
    determining, based on the alignment result obtained, the abundance of the nucleic acid sequence using the following formulas:
    the abundance of a nucleic acid sequence $G$ is $\mathrm{Ab}(G) = \mathrm{Ab}(U_G) + \mathrm{Ab}(M_G)$, wherein
    $\mathrm{Ab}(U_G) = U_G / l_G$,
    $U_G$ is the number of reads uniquely aligned to the nucleic acid sequence $G$,
    $l_G$ is the length of the nucleic acid sequence $G$,
    Figure PCTCN2016076577-appb-100005
    $M_G$ is the number of reads non-uniquely aligned to the nucleic acid sequence $G$,
    $x$ is the index of a read non-uniquely aligned to the nucleic acid sequence $G$,
    $\mathrm{Co}_x$ is the abundance coefficient of read $x$ non-uniquely aligned to the nucleic acid sequence $G$,
    Figure PCTCN2016076577-appb-100006
    $N2$ is the total number of nucleic acid sequences in the nucleic acid sequence cluster to which a read non-uniquely aligned to the nucleic acid sequence $G$ is aligned,
    $y$ is the index of a nucleic acid sequence to which a read non-uniquely aligned to the nucleic acid sequence $G$ is aligned,
    $U_y$ is the number of reads uniquely aligned to nucleic acid sequence $y$.
  22. The device of claim 18, wherein the abundance of the nucleic acid sequence cluster is the mean or the median of the abundances of the nucleic acid sequences it contains.
  23. The device of claim 17, wherein the control group consists of stool samples from a plurality of individuals having colorectal cancer, and
    when the abundance of the nucleic acid sequence cluster in the stool sample of the individual does not differ in a statistically significant way from its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  24. The device of claim 17, wherein the control group consists of stool samples from a plurality of healthy individuals, and
    when the abundance of the nucleic acid sequence cluster in the stool sample of the individual is statistically significantly lower than its abundance in the control group, the state of the individual is determined to be having colorectal cancer.
  25. A system for determining the state of an individual using the nucleic acid of any one of claims 1-6, comprising:
    a data input module for inputting data;
    a data output module for outputting data;
    a processor for executing an executable program, wherein executing the executable program comprises performing the method of any one of claims 13-18;
    a storage module, connected to the data input module, the data output module and the processor, for storing data, including the executable program.
  26. A medicament for treating colorectal cancer, wherein the medicament causes the abundance of the nucleic acid of any one of claims 1-6 in the intestinal tract of a patient to increase.
  27. A method of producing or screening for the medicament of claim 26, comprising a step of screening for a substance that causes the abundance of the nucleic acid of any one of claims 1-6 to increase, as the medicament.
  28. A method of classifying a plurality of individuals using the nucleic acid of any one of claims 1-6, comprising:
    determining the state of each individual using the method of any one of claims 13-18;
    classifying the individuals according to the states obtained.
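
The abundance calculation recited in claims 10, 15 and 21 can be illustrated with a short sketch. The two formulas for $\mathrm{Ab}(M_S)$ and $\mathrm{Co}_i$ appear only as images in this text, so their exact form is an assumption here: the sketch takes $\mathrm{Ab}(M_S) = \left(\sum_{i=1}^{M_S} \mathrm{Co}_i\right) / l_S$ and $\mathrm{Co}_i = \mathrm{Ab}(U_S) / \sum_{j=1}^{N1} \mathrm{Ab}(U_j)$, a form commonly used for sharing multi-mapped reads in metagenomic gene profiling. The function name, the input layout, and the handling of a zero denominator are likewise illustrative choices.

```python
def sequence_abundance(target, lengths, unique_counts, multi_reads):
    """Abundance of `target` from unique and multi-mapped read support.

    lengths:       dict, sequence id -> sequence length
    unique_counts: dict, sequence id -> number of uniquely aligned reads
    multi_reads:   list of sets; each set holds the sequence ids that one
                   non-uniquely aligned read maps to
    """
    def ab_unique(seq):
        return unique_counts.get(seq, 0) / lengths[seq]

    coefficient_sum = 0.0
    for hit_set in multi_reads:
        if target not in hit_set:
            continue  # this read does not align to the target
        denominator = sum(ab_unique(s) for s in hit_set)
        # Assumption: a read whose hits have no unique support contributes nothing.
        if denominator > 0:
            coefficient_sum += ab_unique(target) / denominator
    return ab_unique(target) + coefficient_sum / lengths[target]


# Toy input: two sequences sharing two multi-mapped reads (illustrative values).
toy_lengths = {"S": 100, "T": 200}
toy_unique = {"S": 10, "T": 20}
toy_multi = [{"S", "T"}, {"S", "T"}]
toy_ab_s = sequence_abundance("S", toy_lengths, toy_unique, toy_multi)
toy_ab_t = sequence_abundance("T", toy_lengths, toy_unique, toy_multi)
```

In the toy input both sequences have the same unique-read abundance (0.1), so each multi-mapped read is shared equally between them with a coefficient of 0.5.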
PCT/CN2016/076577 2016-03-17 2016-03-17 Isolated nucleic acid application thereof WO2017156739A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201680083629.4A CN109072306A (en) 2016-03-17 2016-03-17 Isolated nucleic acid and application
PCT/CN2016/076577 WO2017156739A1 (en) 2016-03-17 2016-03-17 Isolated nucleic acid application thereof

Publications (1)

Publication Number Publication Date
WO2017156739A1 true WO2017156739A1 (en) 2017-09-21

Family

ID=59850049

Country Status (2)

Country Link
CN (1) CN109072306A (en)
WO (1) WO2017156739A1 (en)

Also Published As

Publication number Publication date
CN109072306A (en) 2018-12-21

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16893904

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/01/2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16893904

Country of ref document: EP

Kind code of ref document: A1