WO2016112539A1 - Method and device for determining fetal nucleic acid content

Method and device for determining fetal nucleic acid content

Info

Publication number
WO2016112539A1
Authority
WO
WIPO (PCT)
Prior art keywords
region
relationship
snp site
nucleic acid
depth
Prior art date
Application number
PCT/CN2015/070900
Other languages
English (en)
Chinese (zh)
Inventor
康雄斌
陈芳
刘萍
徐惠欣
芦静
蒋浩君
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司
Priority to CN201580072546.0A (CN107109324B)
Priority to PCT/CN2015/070900 (WO2016112539A1)
Publication of WO2016112539A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00 Apparatus for enzymology or microbiology
    • C12M1/34 Measuring or testing with condition measuring or sensing means, e.g. colony counters
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The present invention relates to the field of bioinformatics. Specifically, it relates to a device for constructing distribution regions of SNP sites of different combined genotypes, a method for constructing such distribution regions, a method for distinguishing SNP sites of different combined genotypes, a method for determining the fetal nucleic acid content in a sample from a pregnant woman, a device for determining the fetal nucleic acid content in such a sample, and a computer-readable medium.
  • Since the discovery of cell-free fetal DNA in maternal plasma, prenatal testing techniques have undergone major innovation. With the falling price of second-generation sequencing and continued technical improvement, non-invasive prenatal testing has developed rapidly and found wide application, for example in the prenatal diagnosis of genetic diseases such as hemophilia, gender confusion and monogenic diseases. In these diagnoses, the fetal nucleic acid concentration (fetal concentration) is an important parameter. Abnormal fetal concentrations can also help predict certain disease risks: a higher fetal concentration may be associated with preterm birth, and a lower fetal concentration may help identify moderate or severe eclampsia.
  • The present invention provides a device for constructing distribution regions of SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid.
  • The device comprises a first region-second region building unit for constructing a first region and a second region. The first region is the distribution region of SNP sites of the first combined genotype, and the second region is the distribution region of SNP sites of the second combined genotype. The two regions are divided based on the difference between a first relationship and a second relationship.
  • The first relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype; the second relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype.
  • The first combined genotypes are ABaa and ABab.
  • The second combined genotypes are AAaa and AAab.
  • The third relationship is the relationship between the depth of the second highest base and the depth of the highest base at a first predetermined ratio of the AAaa SNP sites, and the fourth relationship is the relationship between the depth of the second highest base and the depth of the highest base at the AAab SNP sites among the SNP sites of the second combined genotype.
  • A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • A SNP is generally considered dimorphic, that is, biallelic: it consists of two bases, the highest base and the second highest base.
  • A and B respectively represent the highest base and the second highest base of a SNP site in the nucleic acid of one source, and a and b respectively represent the highest base and the second highest base of the same SNP site in the nucleic acid of the other source.
  • The device is based on the four possible combined genotypes that the same SNP site can form in the two source nucleic acids. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the distribution regions of SNP sites of the four combined genotypes are constructed from the differences, within the same sequencing result, in the relationship between the second highest base depth and the highest base depth of the SNP sites of each combined genotype.
  • The distribution regions constructed using the device can be used to determine the combined genotype of SNP sites in data to be tested, and/or to distinguish SNP sites of different combined genotypes in that data and obtain the data for SNP sites of a given combined genotype.
  • The present invention provides a device for constructing distribution regions of SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid.
  • The device comprises a first region-second region building unit for constructing a first region and a second region. The first region is the distribution region of SNP sites of the first combined genotype, and the second region is the distribution region of SNP sites of the second combined genotype. The two regions are divided based on the difference between a first relationship and a second relationship.
  • The first relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype; the second relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype.
  • The first combined genotypes are ABaa and ABab.
  • The second combined genotypes are AAaa and AAab.
  • The fourth relationship is the relationship between the depth of the second highest base and the depth of the highest base at the AAab SNP sites among the SNP sites of the second combined genotype. The device also comprises a closed fourth region building unit for constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the SNP sites of the second combined genotype.
  • The closed fourth region is constructed from the fourth region on the basis that the second highest base depth and the highest base depth of the AAab SNP sites both follow a normal distribution, with the second predetermined ratio being set. AA and AB respectively represent the homozygous and heterozygous genotypes of the SNP site in the first source nucleic acid, and aa and ab respectively represent the homozygous and heterozygous genotypes of the same SNP site in the second source nucleic acid. A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • The present invention provides a method for constructing distribution regions of SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between a first relationship and a second relationship, the first region being the distribution region of SNP sites of the first combined genotype and the second region being the distribution region of SNP sites of the second combined genotype.
  • The first relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, and the second relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype.
  • The first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; based on the third relationship and the third…
  • The present invention provides a method for distinguishing SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid.
  • The method comprises: sequencing at least a portion of the nucleic acid in a mixed nucleic acid sample to obtain sequencing data consisting of a plurality of reads, the mixed nucleic acid sample comprising the first source nucleic acid and the second source nucleic acid; aligning the sequencing data with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result and determining the distribution region in which each SNP site is located, the distribution regions being constructed by the method of one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; and determining the combined genotype of each SNP site based on the distribution region in which it is located.
  • The second source nucleic acid concentration is estimated from each SNP site falling into the closed fourth region, giving a set of second source nucleic acid concentration values; the median of these values is taken as the second source nucleic acid concentration.
  • For a mixed nucleic acid sample containing nucleic acids from only two sources, determining the nucleic acid content of one source also determines that of the other.
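The median-based estimate described above can be sketched as follows. The per-site formula is an assumption, since this passage does not state it: at an AAab site the second highest base is contributed only by the heterozygous second source, so its share of the reads is taken to be half the second-source fraction, giving f = 2y / (x + y). The function names are illustrative only.

```python
from statistics import median


def site_fraction(x, y):
    """Per-site second-source fraction at an AAab site.

    x is the depth of the highest base, y the depth of the second highest base.
    Assumption (not stated in the text): f = 2 * y / (x + y).
    """
    return 2.0 * y / (x + y)


def second_source_fraction(sites):
    """Median of the per-site estimates over (x, y) pairs in the closed fourth region."""
    values = [site_fraction(x, y) for x, y in sites if x + y > 0]
    if not values:
        raise ValueError("no usable SNP sites")
    return median(values)


# Hypothetical (x, y) depths for three AAab sites.
example_sites = [(95, 5), (90, 10), (92, 8)]
fraction = second_source_fraction(example_sites)
# With only two sources, the first-source fraction follows directly.
first_source = 1.0 - fraction
```

Taking the median rather than the mean keeps a few misclassified sites from dominating the estimate.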
  • The present invention provides a method for determining the fetal nucleic acid content in a sample from a pregnant woman. The method comprises: obtaining a sequencing result, which includes sequencing at least a portion of the nucleic acid in the sample, the sequencing result consisting of multiple reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method of one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; and determining the fetal nucleic acid content in the sample based on the SNP sites in the fourth region of the distribution regions.
  • The sample to be tested is a body fluid sample from the pregnant woman, for example peripheral blood or urine.
  • The present invention provides a method for determining the fetal nucleic acid content in a sample from a pregnant woman. The method comprises: obtaining a sequencing result, which includes sequencing at least a portion of the nucleic acid in the sample, the sequencing result consisting of multiple reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method of one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; determining the fetal nucleic acid content of the sample based on the SNP sites in the fourth region or the closed fourth region of the distribution regions; and, when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, correcting the fetal nucleic acid content with the bias correction model to obtain the corrected fetal nucleic acid content.
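The condition under which the correction is applied can be sketched as below. Only the two thresholds (65X and 10%) come from the text; the bias correction model itself is not described in this passage, so it is passed in as a caller-supplied function, and all names here are hypothetical.

```python
def needs_bias_correction(mean_depth, fetal_fraction,
                          depth_cutoff=65.0, fraction_cutoff=0.10):
    """True when the data amount is below 65X and/or the estimate is below 10%."""
    return mean_depth < depth_cutoff or fetal_fraction < fraction_cutoff


def corrected_fraction(mean_depth, fetal_fraction, correction_model):
    """Apply correction_model only when the thresholds call for it.

    correction_model is a placeholder for the bias correction model,
    which this passage does not specify.
    """
    if needs_bias_correction(mean_depth, fetal_fraction):
        return correction_model(fetal_fraction)
    return fetal_fraction
```

Keeping the model as a parameter avoids guessing its form while still making the gating rule explicit.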
  • The present invention provides an apparatus for determining the fetal nucleic acid content in a sample from a pregnant woman, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor connected to the data input unit, the data output unit and the storage unit for executing the executable program, execution of which includes carrying out the method for determining the fetal nucleic acid content in the sample.
  • A computer-readable medium stores a program for execution by a computer, execution of which includes carrying out any of the above methods for determining the fetal nucleic acid content in a sample from a pregnant woman.
  • The storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disk, and the like.
  • The distribution regions of SNP sites of different combined genotypes can be obtained, so that SNP sites of different combined genotypes can be distinguished, the combined genotype of a SNP site can be determined, and the SNP sites of a specific combined genotype can be obtained.
  • The information on SNP sites of a specific combined genotype in a mixed nucleic acid sample is then used to determine the content of nucleic acid from each source in the sample, including the fetal nucleic acid content in a sample from a pregnant woman and the content of tumor-cell nucleic acid in a circulating blood sample from a tumor patient.
  • With the bias correction model of the present invention, it is possible to correct the deviation in the calculated concentration of the target source nucleic acid that arises when the amount of data acquired is small, or the concentration of the target source nucleic acid in the mixed nucleic acid sample is relatively low, and the delineated distribution region is relatively strict, so that the nucleic acid content of the second source can be determined.
  • Figure 1 shows the ratio of the standard deviation to the probability of a standard normal distribution in one embodiment of the invention.
  • Figure 2 shows a schematic representation of a distribution region constructed in a specific embodiment of the invention.
  • Figure 3 shows the relationship between predicted and real fetal nucleic acid concentration in one embodiment of the invention. The predicted fetal nucleic acid concentration in Figure 3a is not corrected using the bias correction model; the predicted fetal nucleic acid concentration in Figure 3b is corrected using the bias correction model.
  • Figure 4 shows the effect of different sequencing depths on determining fetal nucleic acid content in one embodiment of the invention.
  • Figure 5 shows the effect of the number of sites in the AAab SNP distribution region on determining fetal nucleic acid content in one embodiment of the invention.
  • Figure 6 shows the difference between the fetal nucleic acid concentration of male fetuses predicted using Y chromosome depth and the fetal nucleic acid concentration predicted by the method of the present invention, in one embodiment of the present invention.
  • A device for constructing distribution regions of SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid.
  • The device comprises a first region-second region building unit for constructing a first region and a second region. The first region is the distribution region of SNP sites of the first combined genotype, and the second region is the distribution region of SNP sites of the second combined genotype. The two regions are divided based on the difference between a first relationship and a second relationship. The first relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, and the second relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype.
  • The first combined genotypes are ABaa and ABab.
  • The second combined genotypes are AAaa and AAab.
  • The third region-fourth region building unit is configured to construct a third region and a fourth region from the second region. The third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the SNP sites of the second combined genotype, and the fourth region is the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype.
  • The third region and the fourth region are divided based on the difference between a third relationship and a fourth relationship. The third relationship is the relationship between the depth of the second highest base and the depth of the highest base at the first predetermined ratio of AAaa SNP sites, and the fourth relationship is the relationship between the depth of the second highest base and the depth of the highest base at the AAab SNP sites.
  • AA and AB respectively represent the homozygous and heterozygous genotypes of the SNP site in the first source nucleic acid, and aa and ab respectively represent the homozygous and heterozygous genotypes of the same SNP site in the second source nucleic acid. A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • A SNP is generally considered dimorphic, that is, biallelic, composed of two bases: the highest base and the second highest base. Within the same sequencing data, the base at a SNP site supported by the most data is called the highest base, and the base supported by the second most data is called the second highest base. For example, among the reads covering a SNP site, the base supported by the largest number of reads is the highest base.
  • A and B respectively represent the highest base and the second highest base of a SNP site in the nucleic acid of one source; correspondingly, a and b respectively represent the highest base and the second highest base of the same SNP site in the nucleic acid of the other source.
  • The depth of the highest base and the depth of the second highest base refer to the amount of data supporting the highest base and the second highest base, respectively, in the same sequencing data. For example, when the reads obtained by sequencing are aligned to a reference sequence, the number of reads aligned to (covering) a SNP site of the reference sequence is the depth of the site, also known as the sequencing depth. Among the reads aligned to the site, the number of reads whose base at the corresponding position is the highest base of the site is the depth of the highest base, and the number of reads whose base at that position is the second highest base of the site is the depth of the second highest base.
  • A read aligned to the site whose base at the corresponding position matches a given base of the site is referred to as a read supporting that base.
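The definitions of the highest base, the second highest base and their depths can be illustrated with a small sketch. It assumes the bases observed at one SNP site across all covering reads have already been collected into a sequence (a hypothetical input; the text does not prescribe a data format), and simply counts the reads supporting each base.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def base_depths(pileup_bases):
    """Return ((highest_base, depth), (second_highest_base, depth)) for one site.

    pileup_bases holds one base call per read covering the site.
    A missing second base is reported as (None, 0).
    """
    counts = Counter(b.upper() for b in pileup_bases if b.upper() in VALID_BASES)
    ranked = counts.most_common(2)
    while len(ranked) < 2:
        ranked.append((None, 0))
    return ranked[0], ranked[1]


# Twelve reads cover the site: ten support A, two support G.
highest, second = base_depths("AAAAAAGAAGAA")
site_depth = highest[1] + second[1]
```

Here x = 10 (depth of the highest base) and y = 2 (depth of the second highest base), the two quantities used by the relationships that define the regions.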
  • When the device is used to construct the distribution regions, specific SNP site data need not be obtained in advance. That is, the construction of the distribution regions does not depend on specific SNP site data, and in particular is independent of the highest base depth and second highest base depth of any particular SNP site.
  • The device is based on the four possible combined genotypes that the same SNP site can form in the two source nucleic acids. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the distribution regions of SNP sites of the four combined genotypes are constructed from the differences, within the same sequencing result, in the relationship between the second highest base depth and the highest base depth of the SNP sites of each combined genotype.
  • The four combined genotypes are shown in Table 1.
  • To visualize the distribution regions, a two-dimensional coordinate system is established in which the y-axis represents the depth of the second highest base of the SNP site and the x-axis represents the depth of the highest base of the SNP site.
  • The first relationship can be expressed as x/2 ≥ y ≥ x/3, the second relationship as 0 ≤ y < x/3, the third relationship as 0 ≤ y ≤ (x+y)*e + m*σ, and the fourth relationship as (x+y)*e + m*σ < y < x/3, where e is the sequencing error rate, generally e ≤ 1%, and σ is the standard deviation, σ = ((x+y)*e)^0.5.
  • m depends on the first predetermined ratio and is a non-negative number; the relationship between m*σ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution.
  • The probability density function curve of the standard normal distribution is bell-shaped. As those skilled in the art will understand, the relationship between standard deviation and probability in the standard normal distribution is as shown in FIG. 1: the interval within plus or minus 1 standard deviation contains 68.26% of the total area, the interval within plus or minus 1.96 standard deviations contains 95% of the total area, and the interval within plus or minus 2.58 standard deviations contains 99% of the total area. m is the multiple of the standard deviation corresponding to the first predetermined ratio.
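The relationships above can be turned into a short sketch that (a) converts a predetermined ratio into the multiplier m and (b) assigns a site with depths (x, y) to a region. Two points are assumptions rather than statements of the text: the ratio is read as a two-sided central area of the standard normal distribution, matching the 68.26% / 95% / 99% examples, and only the y ≥ x/3 boundary of the first relationship is applied, without an upper bound. Function names and return labels are illustrative.

```python
import math
from statistics import NormalDist


def m_for_ratio(ratio):
    """Number of standard deviations whose central area equals ratio (two-sided)."""
    return NormalDist().inv_cdf((1.0 + ratio) / 2.0)


def classify_site(x, y, e=0.01, m=1.96):
    """Assign a SNP site to a region from its highest (x) and second highest (y) depths."""
    total = x + y
    sigma = math.sqrt(total * e)           # sigma = ((x + y) * e) ^ 0.5
    noise_ceiling = total * e + m * sigma  # upper bound of the third relationship
    if y >= x / 3:
        return "first"   # ABaa or ABab
    if y <= noise_ceiling:
        return "third"   # AAaa: second base explained by sequencing error
    return "fourth"      # AAab


m95 = m_for_ratio(0.95)
regions = [classify_site(100, 40), classify_site(200, 1), classify_site(190, 10)]
```

With e = 1% and m = 1.96, a site with depths (190, 10) has a noise ceiling of about 4.8 reads, so ten supporting reads place it in the fourth region.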
  • The apparatus further comprises a closed fourth region building unit for constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the SNP sites of the second combined genotype. It is constructed from the fourth region on the basis that the second highest base depth and the highest base depth of the AAab SNP sites follow a normal distribution, with the second predetermined ratio being set.
  • The second predetermined ratio is set to be not less than 95%.
  • In a two-dimensional coordinate system in which the x-axis represents the depth of the highest base of the SNP site and the y-axis represents the depth of the second highest base, the closed fourth region is bounded by y = x/3, y = (x+y)*e + m*σ, and y = D0 - n*σ - x, where:
  • e is the sequencing error rate, generally e ≤ 1%;
  • σ = ((x+y)*e)^0.5;
  • D0 is the average depth of the SNP sites;
  • m and n are non-negative;
  • m depends on the first predetermined ratio, and the relationship between m*σ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • n depends on the second predetermined ratio, and the relationship between n*σ and the second predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • the second predetermined ratio is not less than 95%.
  • The average depth of the SNP sites is the mean of the depths of the SNP sites. The depth of a SNP site is the amount of data supporting that site: for example, when the reads obtained by sequencing are aligned to the reference sequence, the number of reads aligned to the SNP site is the depth of the site, also referred to as the sequencing depth, that is, the number of times the site is covered. Preferably, D0 ≥ 100X.
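A membership test for the closed fourth region can be sketched from the three boundary expressions. The direction of each inequality is an assumption, because the relational operators are not recoverable from the text: the site is taken to lie below y = x/3, above the noise ceiling (x+y)*e + m*σ, and above the line y = D0 - n*σ - x. The function name and default values are illustrative.

```python
import math


def in_closed_fourth_region(x, y, d0, e=0.01, m=1.96, n=1.96):
    """Check whether depths (x, y) fall inside the closed fourth region.

    d0 is the average depth of the SNP sites. Inequality directions are
    assumed, not taken from the text.
    """
    total = x + y
    sigma = math.sqrt(total * e)  # sigma = ((x + y) * e) ^ 0.5
    return (
        y < x / 3                           # below the AB/AA boundary
        and y > total * e + m * sigma       # above the sequencing-error ceiling
        and y > d0 - n * sigma - x          # total depth not far below D0
    )


checks = [
    in_closed_fourth_region(95, 10, d0=100),  # well covered AAab-like site
    in_closed_fourth_region(40, 5, d0=100),   # total depth far below D0
    in_closed_fourth_region(95, 1, d0=100),   # second base within error noise
]
```

The third condition closes the otherwise open wedge between the first two lines by excluding sites whose total depth is much lower than the average.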
  • In this way, the distribution regions of SNP sites of different combined genotypes can be constructed.
  • The combined genotype of a SNP site in the data to be tested can be determined from the distribution region into which the site falls, and the SNP sites of a specific combined genotype can thereby be obtained.
  • The information on SNP sites of a specific combined genotype in a mixed nucleic acid sample is then used to determine the content of nucleic acid from each source in the sample, including the fetal nucleic acid content in a sample from a pregnant woman.
  • A method for constructing distribution regions of SNP sites of different combined genotypes, wherein a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and the genotype of the same site in a second source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between a first relationship and a second relationship, the first region being the distribution region of SNP sites of the first combined genotype and the second region being the distribution region of SNP sites of the second combined genotype. The first relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, and the second relationship is the relationship between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype. The first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; based on the difference between the third relationship…
  • The construction of the distribution regions does not depend on specific SNP site data, and in particular is independent of the highest base depth and second highest base depth of any particular SNP site.
  • The method is based on the four possible combined genotypes that the same SNP site can form in the two source nucleic acids. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the distribution regions of SNP sites of the four combined genotypes are constructed from the differences, within the same sequencing result, in the relationship between the second highest base depth and the highest base depth of the SNP sites of each combined genotype.
  • To visualize the distribution regions, a two-dimensional coordinate system is established in which the y-axis represents the depth of the second highest base of the SNP site and the x-axis represents the depth of the highest base of the SNP site.
  • The first relationship can be expressed as x/2 ≥ y ≥ x/3.
  • The second relationship can be expressed as 0 ≤ y < x/3.
  • The third relationship can be expressed as 0 ≤ y ≤ (x+y)*e + m*σ.
  • m depends on the first predetermined ratio and is a non-negative number; the relationship between m*σ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution.
  • The probability density function curve of the standard normal distribution is bell-shaped. As those skilled in the art will understand, the relationship between standard deviation and probability in the standard normal distribution is as shown in FIG. 1, where σ refers to the standard deviation.
  • The first predetermined ratio corresponds to the percentage of the total area therein.
  • The method further comprises constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the SNP sites of the second combined genotype. Its construction is based on the second highest base depth and the highest base depth of the AAab SNP sites following a normal distribution, with the second predetermined ratio being set.
  • The second predetermined ratio is set to be not less than 95%.
  • In a two-dimensional coordinate system in which the x-axis represents the depth of the highest base of the SNP site and the y-axis represents the depth of the second highest base, the closed fourth region is bounded by y = x/3, y = (x+y)*e + m*σ, and y = D0 - n*σ - x, where:
  • e is the sequencing error rate, generally e ≤ 1%;
  • σ = ((x+y)*e)^0.5;
  • D0 is the average depth of the SNP sites;
  • m and n are non-negative;
  • m depends on the first predetermined ratio, and the relationship between m*σ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • n depends on the second predetermined ratio, and the relationship between n*σ and the second predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • the second predetermined ratio is not less than 95%.
  • The average depth of the SNP sites is the mean of the depths of the SNP sites. The depth of a SNP site is the amount of data supporting that site: for example, when the reads obtained by sequencing are aligned to the reference sequence, the number of reads aligned to the SNP site is the depth of the site, also referred to as the sequencing depth, that is, the number of times the site is covered. Preferably, D0 ≥ 100X.
  • The method comprises: sequencing at least a portion of the nucleic acid in a mixed nucleic acid sample to obtain sequencing data comprising a plurality of reads, the mixed nucleic acid sample comprising the first source nucleic acid and the second source nucleic acid; aligning the sequencing data with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining the distribution region in which each SNP site is located, the distribution regions being constructed by the method for constructing distribution regions of SNP sites of different combined genotypes in any of the above embodiments; and determining the combined genotype of each SNP site based on the distribution region in which it is located.
  • The first source nucleic acid and the second source nucleic acid may be nucleic acids derived from different individuals, or nucleic acids from different tissues or parts of the same individual, such as nucleic acids from tumor cells and from non-tumor cells.
  • Obtaining sequencing data includes preparing a sequencing library from the mixed nucleic acid sample and sequencing the library. Sequencing can be performed on existing sequencing platforms, and the library is prepared according to the selected platform.
  • the optional sequencing platforms include, but are not limited to, CG (Complete Genomics) CGA, Illumina/Solexa, Life Technologies/Ion Torrent, and Roche 454. Single-end or paired-end sequencing libraries are prepared according to the selected sequencing platform.
  • the alignment can be performed with software such as SOAP (Short Oligonucleotide Analysis Package) or BWA; this embodiment places no limitation on the choice.
  • at most h base mismatches are allowed per read, where h is preferably 1 or 2. If more than h bases in a read are mismatched, the read is considered not to align to the reference sequence.
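The mismatch rule can be illustrated with a minimal Python sketch. It is an ungapped comparison at a known position, which is an assumption made purely for illustration; aligners such as SOAP and BWA search for positions and handle far more cases. The names `count_mismatches` and `read_aligns` are hypothetical.

```python
def count_mismatches(read: str, ref_segment: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(read) != len(ref_segment):
        raise ValueError("read and reference segment must have equal length")
    return sum(1 for a, b in zip(read.upper(), ref_segment.upper()) if a != b)


def read_aligns(read: str, reference: str, pos: int, h: int = 2) -> bool:
    """Accept the read at `pos` if it has at most h mismatches (no gaps)."""
    segment = reference[pos:pos + len(read)]
    if len(segment) < len(read):
        # The read would run past the end of the reference.
        return False
    return count_mismatches(read, segment) <= h


# Example: a read with two mismatches is rejected for h = 1, accepted for h = 2.
reference = "TTACGTTT"
strict = read_aligns("AGCT", reference, 2, h=1)
relaxed = read_aligns("AGCT", reference, 2, h=2)
```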
  • SNP sites can be identified with software such as SOAPsnp and GATK, using the software's default parameter settings.
  • the reference sequence is a known sequence and may be any reference template of the biological category to which the target individual belongs, such as a published genome assembly of the same biological category. If the mixed nucleic acid sample is from a human, the genome reference sequence (also referred to as the reference genome) can be HG19 as provided by the NCBI database.
  • the alignment result includes, for each read, whether the read aligns to the reference sequence and at which position; it also includes how many reads align at a given site, the base type at the corresponding position of each read aligned to that site, and the like.
  • the distribution region into which each SNP site falls is determined. Each distribution region corresponds to particular combined genotypes among the four: the first region is the distribution region of SNP sites of combined genotypes ABaa and ABab, the third region is the distribution region of the first predetermined ratio of AAaa SNP sites, and the fourth region or the closed fourth region is the distribution region of AAab SNP sites. The combined genotype of the SNP can therefore be determined, that is, the SNP is typed.
  • the proportion of the second source nucleic acid in the mixed nucleic acid is estimated from the SNP sites falling into the fourth region. Since the combined genotype of SNP sites falling into the fourth region or the closed fourth region is AAab, the second highest base comes only from the second source nucleic acid, so the number of reads supporting the second highest base, relative to the number of reads covering the site, reflects the proportion of the second source nucleic acid.
  • the second source nucleic acid concentration is estimated from each SNP site falling into the closed fourth region to obtain a set of concentration values, and the median of the set is taken as the second source nucleic acid concentration.
  • a method for determining fetal nucleic acid content in a pregnant woman sample, comprising: obtaining a sequencing result by sequencing at least a portion of the nucleic acid in the pregnant woman sample, the sequencing result consisting of multiple reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with the reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method for constructing distribution regions of SNP sites of different combined genotypes in any of the foregoing embodiments; and determining the fetal nucleic acid content of the pregnant woman sample based on the SNP sites in the fourth region or the closed fourth region.
  • the pregnant woman sample to be tested is a body fluid sample, for example peripheral blood or urine of the pregnant woman.
  • the cell-free DNA in a pregnant woman's body fluid contains genomic information of both mother and fetus, and SNP sites can be divided into four categories: mother homozygous and fetus also homozygous (AAaa), mother homozygous but fetus heterozygous (AAab), mother heterozygous but fetus homozygous (ABaa), and mother heterozygous and fetus also heterozygous (ABab).
  • This embodiment uses the constructed distribution regions to distinguish these four categories, and then selects the SNP sites at which the mother is homozygous and the fetus heterozygous (AAab) as effective sites for calculating the fetal concentration.
  • the description of the advantages and technical features of the constructed distribution area in any of the foregoing embodiments is equally applicable to this embodiment, and details are not described herein again.
  • the fetal nucleic acid content is the median of $2y_4/(x_4 + y_4)$, where $y_4$ is the depth of the second highest base and $x_4$ is the depth of the highest base at each SNP site in the fourth region or the closed fourth region, the depth of a base being the number of reads supporting it.
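The median estimate can be sketched in a few lines of Python. The reasoning behind the factor of 2 is that at an AAab site only one of the two fetal alleles carries the second highest base, so the expected share of reads supporting it is half the fetal fraction. The function name `fetal_fraction` and the depth pairs in the example are hypothetical illustration values, not data from the patent.

```python
from statistics import median


def fetal_fraction(sites):
    """Median of 2*y4/(x4 + y4) over AAab sites.

    `sites` is an iterable of (x4, y4) pairs: highest-base depth and
    second-highest-base depth at each site in the AAab region.
    """
    estimates = [2.0 * y / (x + y) for x, y in sites if x + y > 0]
    if not estimates:
        raise ValueError("no usable AAab sites")
    return median(estimates)


# Hypothetical depth pairs for three AAab sites.
example_sites = [(95, 5), (90, 10), (188, 12)]
estimate = fetal_fraction(example_sites)
```

Taking the median rather than the mean keeps a few extreme sites from dominating the estimate.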
  • the sequencing result contains a data volume of not less than 65X, that is, the sequencing depth is not less than 65X.
  • a deviation correction model is used to correct the calculated fetal nucleic acid content and obtain a corrected fetal nucleic acid content.
  • the deviation correction model corrects the deviation in the calculated fetal nucleic acid concentration that arises when the amount of data is small or the fetal nucleic acid concentration in the pregnant woman sample is too low, and the defined distribution region is therefore relatively strict.
  • the deviation correction model may be established when the data amount of the pregnant woman sample to be tested is less than 65X or the estimated fetal nucleic acid concentration is less than 10%, or it may be established in advance and saved for use.
  • when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, the first predetermined ratio is adjusted to enlarge the fourth region or the closed fourth region, so that more SNP sites fall into it and the result is closer to the theoretical value, which improves the accuracy of the calculated fetal nucleic acid concentration. Lowering the first predetermined ratio narrows the third region and thereby enlarges the fourth region or the closed fourth region.
  • $K_2$ is the number of simulated sites of combined genotype AAab
  • $K_3$ is the number of simulated sites of combined genotype ABaa
  • $K_4$ is the number of simulated sites of combined genotype ABab, with $K_2/K \geq 0.5\%$ and $K_2 \geq 35$
  • different standard fetal nucleic acid contents f are set and, based on the hypothesis, the corresponding fetal nucleic acid content $f_0$ is calculated from the simulated sites of combined genotype AAab, that is, those in the fourth region or the closed fourth region
  • Polynomial regression is performed on the resulting pairs $(f, f_0)$ to establish the deviation correction model; the hypothesis includes that the depths of the highest base and the second highest base of the simulated sites of combined genotype AAaa
  • $N(\mu, \sigma^2)$ denotes a normal distribution with mean (expectation) $\mu$ and variance $\sigma^2$; it is also often written $N(\mu, \sigma)$.
  • the deviation correction model is a normal mixture model. Based on the above assumptions, at a fixed average sequencing depth a series of values of f gives a series of values of $f_0$, each calculated by the above method; fitting an equation to the pairs $(f, f_0)$ yields a formula suitable for correcting $f_0$ at that average sequencing depth.
  • The regression can use existing methods.
  • the fitted curve corrects the calculated value f 0 to the corresponding theoretical value.
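A polynomial correction of this kind can be sketched in plain Python. The sketch fits the standard value f as a polynomial in the calculated value f0 by ordinary least squares (normal equations solved by Gaussian elimination). The degree, the helper names `polyfit` and `polyval`, and the synthetic pairs in the example are all assumptions for illustration; the patent's own fitted equations are not reproduced here.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = degree + 1
    # Build the normal equations A c = b.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Synthetic (f0, f) pairs from a made-up quadratic, for illustration only.
f0_values = [0.02 * i for i in range(1, 16)]
f_values = [0.5 * v * v + 0.9 * v - 0.01 for v in f0_values]
model = polyfit(f0_values, f_values, degree=2)
corrected = polyval(model, 0.05)
```

In use, one such model would be fitted per assumed sequencing depth, and a sample's calculated f0 would be passed through the model whose depth is closest to the sample's.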
  • these fitted polynomial equations are significant. Each equation assumes a sequencing depth (the average sequencing depth of the simulated sites) greater than or equal to 50X; in practice, a sample can be corrected with an equation whose assumed sequencing depth is within ±5X of the sample's own, for example when the amount of sequencing data for the sample to be tested is 55X.
  • an apparatus for determining a fetal concentration in a pregnant woman sample, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor connected to the data input unit, the data output unit, and the storage unit and configured to execute the executable program stored in the storage unit, execution of the program including completing all or part of the steps of the methods in the foregoing embodiments.
  • the method and device of the present invention for detecting fetal nucleic acid concentration in a pregnant woman sample classify SNP sites based on the distribution regions of SNP sites of different combined genotypes and/or in combination with a Gaussian (normal) mixture model.
  • Detecting the fetal DNA concentration directly from massively parallel sequencing reads of the pregnant woman sample eliminates the experimental analysis that existing methods need in order to determine specific types of SNP sites, saving experimental and analysis costs. The depth of each type of SNP is assumed to obey a binomial distribution, which under certain conditions is approximated by a normal distribution, in order to construct the distribution regions and/or the deviation correction model of the present invention.
  • Using the method of the present invention for determining fetal nucleic acid content in a pregnant woman sample, fetal DNA concentrations as low as 2.5% can be detected accurately with only 65X of data.
  • a mixed nucleic acid sample comprising nucleic acids from two different sources, such as a maternal plasma sample.
  • the cell-free DNA in the pregnant woman's plasma contains the genomic information of both mother and fetus.
  • the SNP sites can be divided into four categories: mother homozygous and fetus also homozygous (AAaa); mother homozygous but fetus heterozygous (AAab); mother heterozygous but fetus homozygous (ABaa); mother heterozygous and fetus also heterozygous (ABab).
  • these four types are distinguished by differences in the relationship between the highest base depth and the second highest base depth at each type of site, and the distribution regions of the various SNP sites are constructed accordingly.
  • the key to dividing the distribution regions is to determine the differences between the various types of SNP sites.
  • each difference can be expressed as a boundary line. At least 2 curves (boundary lines) must be determined in order to delineate a distribution region belonging to only one type of SNP site.
  • the distribution of the ABaa and ABab sites is separated out. The fetal concentration has not been reported to exceed 50%, so the expected depth of the second highest base of ABaa is between x/2 and x/3.
  • the next step is to separate the SNP sites of the AAaa and AAab types.
  • an AAaa-type site is not a true SNP; it appears as a SNP site because of sequencing errors and assembly errors. According to previous studies, SNP sites arising from sequencing errors obey the binomial distribution.
  • when the number is large enough, the binomial distribution can be regarded as a normal distribution, so these sites are treated as normally distributed. In addition, by the properties of the binomial distribution, when the number is greater than 20 and p is less than 5%, the binomial distribution can be approximated by a Poisson distribution.
  • the sequencing quality value is set to Q20 (sequencing error rate below 1%). The p of these SNP sites is therefore theoretically at most 1%, which meets the requirement for approximating the binomial distribution by a Poisson distribution. We therefore assume that the variance of the normal distribution is equal to its expectation.
  • the sequencing error is assumed to be 1%, and the expectation (mean) is taken as the sequencing depth of each site multiplied by 1%, i.e. the expectation and the variance are both $(x_i + y_i) \cdot 0.01$, where $x_i$ and $y_i$ are the highest base depth and the second highest base depth of SNP site i, respectively. In a normal distribution, about 99.7% of points fall within three standard deviations of the mean, as shown in Figure 1.
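The resulting boundary between error-like (AAaa) sites and candidate AAab sites can be sketched as follows. The sketch assumes a one-sided upper bound of mean plus m standard deviations on the second highest base depth, with mean and variance both equal to (x + y) * e; the full region definitions in the patent involve further boundaries that are not reproduced here. The function names are hypothetical.

```python
def aaaa_upper_bound(x: float, y: float, e: float = 0.01, m: float = 3.0) -> float:
    """Upper bound on second-highest-base depth expected from errors alone.

    Assumes mean = variance = (x + y) * e, so sigma = sqrt((x + y) * e).
    """
    mean = (x + y) * e
    return mean + m * mean ** 0.5


def looks_like_sequencing_error(x: float, y: float,
                                e: float = 0.01, m: float = 3.0) -> bool:
    """True if y is small enough to be explained by sequencing error."""
    return y <= aaaa_upper_bound(x, y, e, m)


# Example at total depth 100 with e = 1%: the bound is 1 + 3 * 1 = 4 reads.
error_like = looks_like_sequencing_error(98, 2)
candidate_aaab = not looks_like_sequencing_error(90, 10)
```

Lowering m moves the bound toward the x-axis, which admits more sites into the candidate AAab set at the cost of admitting more error-driven sites.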
  • uppercase letters represent bases from the site of the mother's nucleic acid
  • lowercase letters represent bases from the site of the fetal nucleic acid
  • A/a represents the highest base
  • B/b represents the second highest base at the same site
  • the x-axis represents the highest base depth
  • the y-axis represents the next highest base depth.
  • the overall operational procedures include:
  • from the distribution region of AAab-type SNP sites delineated in the first embodiment (hereinafter simply the AAab region), the SNP sites at which the mother is homozygous but the fetus is heterozygous (AAab) are selected as effective sites for calculating the predicted fetal nucleic acid concentration.
  • for each SNP falling within the AAab region, the fetal nucleic acid concentration corresponding to that site is calculated by the formula $2y_4/(x_4 + y_4)$, and the median over all such sites is taken as the fetal nucleic acid concentration of the sample; $y_4$ and $x_4$ are the second highest base depth and the highest base depth of a SNP site in the AAab region, respectively.
  • the boundary defined between AAab and AAaa may be too strict and may cause deviation. For example, the number of sites falling into the AAab region may be too small for estimation; or, because the median of a set of fetal nucleic acid concentrations is taken as the fetal nucleic acid concentration of the sample, removing too many AAab SNP sites near the x-axis biases that median.
  • AAaa sites are homozygous sites that appear heterozygous because of sequencing errors.
  • the sequencing error rate e is set to 0.26%, based on the previously reported sequencing error rate of the HiSeq 2000.
  • the SNP site in the plasma can be regarded as a normal distribution with a variance equal to the mean, which can be expressed as follows:
  • the simulated site data are generated in the R language, the range of the AAab region of the first embodiment is adjusted, and the predicted fetal nucleic acid concentration is calculated by the method of the second embodiment from the simulated sites falling into the adjusted AAab region.
  • for standard fetal concentrations from 0.5% to 25%, the corresponding predicted fetal concentrations at a given sequencing depth are obtained, and a fitting equation is then derived.
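The simulation step can be approximated with a much-simplified Python sketch (the source states that R was used; Python is used here only for illustration). It simulates AAab sites alone, draws the second highest base depth from a normal distribution with variance equal to its mean, keeps the sites above an error boundary, and returns the median estimate f0. The fixed per-site depth, the boundary form, and the name `simulate_f0` are assumptions; the four-genotype mixture described in the text is not reproduced.

```python
import random
from statistics import median


def simulate_f0(f, depth, n_sites, e=0.0026, m=3.0, seed=0):
    """Predicted fetal fraction f0 from simulated AAab sites.

    Assumptions: every site has total depth `depth`; the second highest
    base depth is N(depth*f/2, depth*f/2); sites at or below the error
    bound depth*e + m*sqrt(depth*e) are discarded as AAaa-like.
    """
    rng = random.Random(seed)
    bound = depth * e + m * (depth * e) ** 0.5
    mean_y = depth * f / 2.0
    estimates = []
    for _ in range(n_sites):
        y = rng.gauss(mean_y, mean_y ** 0.5)
        x = depth - y
        # Keep only sites above the error bound where y is the minor base.
        if y > bound and x > y:
            estimates.append(2.0 * y / (x + y))
    return median(estimates) if estimates else None


# Sweep standard concentrations to collect (f, f0) pairs at one depth.
standard_values = [0.005, 0.01, 0.02, 0.05, 0.10, 0.25]
pairs = [(f, simulate_f0(f, depth=100, n_sites=2000)) for f in standard_values]
```

At low f the retained sites are those with unusually high minor-base depth, so f0 overestimates f; this is the kind of deviation the fitted equation is meant to correct.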
  • Table 2 shows the fitted equations at different depths produced for ease of use.
  • the corrected fetal nucleic acid content was 2.6%.
  • the depth of sequencing and the number of SNP loci in the AAab region are the most important factors affecting the accuracy of the calculated nucleic acid content, and the accuracy at different depths is also examined here.
  • the fetal nucleic acid concentration can be accurately obtained at a fetal nucleic acid content of about 3%, and there are relatively many effective SNP sites.
  • the test examines how accuracy changes over depths from 40X to 200X; the change in accuracy is shown in Fig. 4.
  • the relative error is not more than 10%
  • the number of SNP sites in the AAab region is at least 35
  • AAab-type SNP sites account for no more than 10% of all types of SNP sites in maternal plasma samples; the minimum number of SNP sites of all types required to accurately detect different fetal nucleic acid concentrations is shown in Table 3.
  • the example also uses the Y chromosome depth to calculate the fetal nucleic acid concentrations of the 9 male fetuses among the 18 plasma samples, and compares them with the values calculated by the method of the present invention.
  • r = 0.94; p < 0.0001
  • using the Y chromosome depth to calculate the male fetal nucleic acid concentration is a known method; see [Struble C A, Syngelaki A, Oliphant A, et al. Fetal fraction estimate in twin pregnancies using directed cell-free DNA analysis [J].

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Sustainable Development (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a device for constructing distribution regions of SNP (single nucleotide polymorphism) sites of different combined genotypes, comprising: a first-and-second-region construction unit for constructing a first region and a second region, the first region being the distribution region of SNP sites of the first combined genotype and the second region being the distribution region of SNP sites of the second combined genotype; and a third-and-fourth-region construction unit for constructing a third region and a fourth region from the second region, the third region being the distribution region of the first predetermined ratio of AAaa SNP sites among the SNP sites of the second combined genotype, and the fourth region being the distribution region of AAab SNP sites among the SNP sites of the second combined genotype. The present invention also provides a method for constructing distribution regions of SNP sites of different combined genotypes, and a method and a device for determining the fetal nucleic acid content of a sample from a pregnant woman.
PCT/CN2015/070900 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content WO2016112539A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580072546.0A CN107109324B (zh) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content
PCT/CN2015/070900 WO2016112539A1 (fr) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/070900 WO2016112539A1 (fr) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Publications (1)

Publication Number Publication Date
WO2016112539A1 true WO2016112539A1 (fr) 2016-07-21

Family

ID=56405139

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/070900 WO2016112539A1 (fr) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Country Status (2)

Country Link
CN (1) CN107109324B (fr)
WO (1) WO2016112539A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4092130A4 (fr) * 2020-01-17 2023-09-27 BGI Shenzhen Method for determining fetal nucleic acid concentration and fetal genotyping method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
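CN110993024B (zh) * 2019-12-20 2023-08-22 北京科迅生物技术有限公司 Method and device for establishing a fetal concentration correction model, and method and device for quantifying fetal concentration
CN113981062B (zh) * 2021-10-14 2024-02-20 武汉蓝沙医学检验实验室有限公司 Method for estimating fetal DNA concentration using DNA of a non-biological father and the mother, and application thereof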

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
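CN102325901A (zh) * 2008-12-22 2012-01-18 赛卢拉有限公司 Methods and genotyping panels for detecting alleles, genomes and transcriptomes
CN102770558A (zh) * 2009-11-05 2012-11-07 香港中文大学 Fetal genomic analysis from a maternal biological sample
CN104232777A (zh) * 2014-09-19 2014-12-24 天津华大基因科技有限公司 Method and device for simultaneously determining fetal nucleic acid content and chromosomal aneuploidy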

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011041485A1 (fr) * 2009-09-30 2011-04-07 Gene Security Network, Inc. Non-invasive method for determining prenatal ploidy
CN103215350B (zh) * 2013-03-26 2016-11-02 苏州贝康医疗器械有限公司 Method for determining fetal DNA content in maternal plasma based on single nucleotide polymorphism sites

Also Published As

Publication number Publication date
CN107109324B (zh) 2019-11-08
CN107109324A (zh) 2017-08-29

Similar Documents

Publication Publication Date Title
AU2022200046B2 (en) Maternal plasma transcriptome analysis by massively parallel RNA sequencing
AU2022203114A1 (en) Detecting mutations for cancer screening and fetal analysis
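TWI611186B (zh) Molecular testing of multiple pregnancies
CN104232778B (zh) Method and device for simultaneously determining fetal haplotype and chromosomal aneuploidy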
US20190338349A1 (en) Methods and systems for high fidelity sequencing
US20110092763A1 (en) Methods for Embryo Characterization and Comparison
CN110846411B (zh) Method for distinguishing gene mutation types in a tumor-only sample based on next-generation sequencing
US20210130900A1 (en) Multiplexed parallel analysis of targeted genomic regions for non-invasive prenatal testing
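JP2015506684A (ja) Method, system and computer-readable storage medium for determining the presence or absence of genomic copy number variation
KR20190019219A (ko) Non-invasive prenatal molecular karyotyping from maternal plasma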
US20190338350A1 (en) Method, device and kit for detecting fetal genetic mutation
WO2021073604A1 (fr) Method and system for cleaning noisy genetic data, haplotype phasing and reconstructing the progeny genome, and use thereof
CN109461473B (zh) Method and device for obtaining fetal cell-free DNA concentration
WO2016112539A1 (fr) Method and device for determining fetal nucleic acid content
TW201823472A (zh) Universal haplotype-based non-invasive prenatal testing for single-gene diseases
GB2559437A (en) Prenatal screening and diagnostic system and method
CN113308548B (zh) Method, device and storage medium for detecting fetal gene haplotype
US10106836B2 (en) Determining fetal genomes for multiple fetus pregnancies
CA3068198A1 (fr) Enrichment of targeted genomic regions for multiplexed parallel analysis
CN114531916A (zh) Systems and methods for determining genetic relationships between a sperm provider, an oocyte provider and the corresponding conceptus
US11869630B2 (en) Screening system and method for determining a presence and an assessment score of cell-free DNA fragments
US20240287593A1 (en) Single-molecule strand-specific end modalities
CN117925820B (zh) Method for variant detection in preimplantation embryos
Giannico et al. NIPAT as Non-Invasive Prenatal Paternity Testing Using a Panel of 861 SNVs. Genes 2023, 14, 312
KR20200137875A (ko) Method and device for non-invasive prenatal testing based on a two-step Z-score

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15877457

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.12.2017)

122 Ep: pct application non-entry in european phase

Ref document number: 15877457

Country of ref document: EP

Kind code of ref document: A1