WO2016112539A1 - Method and device for determining fetal nucleic acid content - Google Patents

Publication number
WO2016112539A1
WO2016112539A1 PCT/CN2015/070900 CN2015070900W WO2016112539A1 WO 2016112539 A1 WO2016112539 A1 WO 2016112539A1 CN 2015070900 W CN2015070900 W CN 2015070900W WO 2016112539 A1 WO2016112539 A1 WO 2016112539A1
Authority
WO
WIPO (PCT)
Prior art keywords
region
relationship
snp site
nucleic acid
depth
Prior art date
Application number
PCT/CN2015/070900
Other languages
French (fr)
Chinese (zh)
Inventor
康雄斌
陈芳
刘萍
徐惠欣
芦静
蒋浩君
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司
Priority to CN201580072546.0A (CN107109324B)
Priority to PCT/CN2015/070900 (WO2016112539A1)
Publication of WO2016112539A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M: APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00: Apparatus for enzymology or microbiology
    • C12M1/34: Measuring or testing with condition measuring or sensing means, e.g. colony counters
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The present invention relates to the field of bioinformatics. Specifically, it relates to a device for constructing distribution regions of SNP sites of different combined genotypes, a method for constructing such distribution regions, a method for distinguishing SNP sites of different combined genotypes, a method for determining fetal nucleic acid content in a pregnant woman sample, a device for determining fetal nucleic acid content in a pregnant woman sample, and a computer readable medium.
  • Since the discovery of cell-free fetal DNA in maternal plasma, prenatal testing techniques have undergone major innovations. With the continuing fall in the price of second-generation sequencing and ongoing technological innovation, non-invasive prenatal testing has been developed and applied rapidly, for example in the prenatal diagnosis of genetic conditions such as hemophilia, gender confusion and monogenic diseases. In these diagnoses, fetal concentration is an important parameter. In addition, abnormal fetal concentrations can help predict some disease risks: a higher fetal concentration may be associated with preterm birth, and a lower fetal concentration may aid in the identification of moderate or severe eclampsia.
  • The present invention provides a device for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid.
  • The device comprises a first region-second region building unit for constructing a first region and a second region. The first region is the distribution region of the first combined genotype SNP sites, and the second region is the distribution region of the second combined genotype SNP sites. The two regions are divided based on the difference between the first relationship and the second relationship. The first relationship is the relationship between the depth of the second highest base and the depth of the highest base of a first combined genotype SNP site, and the second relationship is the same relationship for a second combined genotype SNP site. The first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab.
  • The device further comprises a third region-fourth region building unit. The third relationship is the relationship between the depth of the second highest base and the depth of the highest base for a first predetermined ratio of the AAaa SNP sites, and the fourth relationship is the same relationship for the AAab SNP sites among the second combined genotype SNP sites.
  • A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • A SNP is generally considered to be dimorphic, i.e. di-allelic: it consists of two bases, the highest base and the second highest base.
  • A and B respectively represent the highest base and the second highest base of a SNP site in the nucleic acid of one source; correspondingly, a and b respectively represent the highest base and the second highest base of the same SNP site in the nucleic acid of the other source.
  • The device is based on the four possible combined genotypes that the same SNP site can form in the nucleic acids of the two sources. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the relationship between the depth of the second highest base and the depth of the highest base differs among SNP sites of the four combined genotypes within the same sequencing data, and the distribution regions of the four combined genotypes are constructed from these differences.
  • The distribution regions constructed with the device can be used to determine the combined genotype of SNP sites in the data to be tested, and/or to distinguish SNP sites of different combined genotypes in that data and obtain the data of the SNP sites of a given combined genotype.
  • The present invention also provides a device for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The device comprises: a first region-second region building unit for constructing a first region and a second region, the first region being the distribution region of the first combined genotype SNP sites and the second region being the distribution region of the second combined genotype SNP sites, the two regions being divided based on the difference between the first relationship and the second relationship, where the first relationship is the relationship between the depth of the second highest base and the depth of the highest base of a first combined genotype SNP site, the second relationship is the same relationship for a second combined genotype SNP site, the first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; a third region-fourth region building unit, where the fourth relationship is the relationship between the depth of the second highest base and the depth of the highest base of the AAab SNP sites among the second combined genotype SNP sites; and a closed fourth region building unit for constructing a closed fourth region from the fourth region.
  • The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the second combined genotype SNP sites. It is constructed from the fourth region by assuming that the depths of the second highest base and of the highest base of the AAab SNP sites both follow a normal distribution and by setting the second predetermined ratio. AA and AB denote, respectively, a homozygous and a heterozygous SNP site in the first source nucleic acid; aa and ab denote, respectively, a homozygous and a heterozygous genotype of the same SNP site in the second source nucleic acid. A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • The present invention provides a method for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between the first relationship and the second relationship, the first region being the distribution region of the first combined genotype SNP sites and the second region being the distribution region of the second combined genotype SNP sites, where the first relationship is the relationship between the depth of the second highest base and the depth of the highest base of a first combined genotype SNP site, the second relationship is the same relationship for a second combined genotype SNP site, the first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; and constructing a third region and a fourth region based on the difference between the third relationship and the fourth relationship.
  • The present invention provides a method for distinguishing SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The method comprises: sequencing at least a portion of the nucleic acid in a mixed nucleic acid sample to obtain sequencing data consisting of a plurality of reads, the mixed nucleic acid sample comprising the first source nucleic acid and the second source nucleic acid; aligning the sequencing data with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result and determining the distribution region in which each SNP site is located, the distribution regions being constructed by the method according to one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; and determining the combined genotype of each SNP site based on the distribution region in which it is located.
  • The second source nucleic acid concentration is estimated from each SNP site falling into the closed fourth region, giving a set of second source nucleic acid concentration values; the median of this set is taken as the second source nucleic acid concentration.
  • For a mixed nucleic acid sample containing nucleic acid from only two sources, determining the nucleic acid content of one source also determines that of the other.
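The median-based estimate described above can be sketched as follows. This is a minimal illustration, not the patent's stated formula: it assumes that at an AAab site the second highest base comes only from the heterozygous second source, so that the per-site concentration is 2y/(x+y), where x and y are the depths of the highest and second highest base. The function names and the example depths are hypothetical.

```python
from statistics import median


def site_fraction(x, y):
    """Second-source fraction implied by one AAab site.

    x: depth of the highest base, y: depth of the second highest base.
    Assumption (not stated in the text): the second highest base is
    contributed only by the heterozygous second source, so y/(x+y) = f/2.
    """
    total = x + y
    if total <= 0:
        raise ValueError("site has no coverage")
    return 2.0 * y / total


def second_source_concentration(sites):
    """Median of the per-site estimates over (x, y) depth pairs."""
    sites = list(sites)
    if not sites:
        raise ValueError("no sites in the closed fourth region")
    return median(site_fraction(x, y) for x, y in sites)


# Hypothetical depths for three sites in the closed fourth region.
example_sites = [(95, 5), (90, 10), (92, 8)]
estimate = second_source_concentration(example_sites)
first_source = 1.0 - estimate  # two-source sample: the remainder
```

Taking the median rather than the mean keeps a few misassigned sites from dominating the estimate.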
  • The present invention provides a method for determining the fetal nucleic acid content in a pregnant woman sample, the method comprising: obtaining a sequencing result by sequencing at least a portion of the nucleic acid in the pregnant woman sample, the sequencing result consisting of multiple reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method according to one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; and determining the fetal nucleic acid content in the pregnant woman sample based on the SNP sites in the fourth region or in the closed fourth region of the distribution regions.
  • the sample of the pregnant woman to be tested is a sample of the pregnant woman's body fluid, for example, from the peripheral blood of the pregnant woman, the urine of the pregnant woman, and the like.
  • The present invention provides a method for determining the fetal nucleic acid content in a pregnant woman sample, the method comprising: obtaining a sequencing result by sequencing at least a portion of the nucleic acid in the pregnant woman sample, the sequencing result consisting of multiple reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method according to one aspect of the present invention for constructing distribution regions of SNP sites of different combined genotypes; determining the fetal nucleic acid content of the pregnant woman sample based on the SNP sites in the fourth region or in the closed fourth region of the distribution regions; and, when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, correcting the fetal nucleic acid content with the bias correction model to obtain the corrected fetal nucleic acid content.
  • The present invention provides an apparatus for determining the fetal nucleic acid content in a pregnant woman sample, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, coupled to the data input unit, the data output unit and the storage unit, for executing the executable program, where executing the program includes performing the method for determining the fetal nucleic acid content in the pregnant woman sample.
  • The present invention also provides a computer readable medium for storing a program for execution by a computer, where executing the program includes performing any of the above methods for determining the fetal nucleic acid content in a pregnant woman sample.
  • the storage medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • With the device and/or method of the present invention, the distribution regions of SNP sites of different combined genotypes can be obtained, so that SNP sites of different combined genotypes can be distinguished, the combined genotype of a SNP site can be determined, and the SNP sites of a specific combined genotype can be obtained. The information of those SNP sites in a mixed nucleic acid sample can then be used to determine the content of nucleic acid from each source in the sample, including the fetal nucleic acid content in a pregnant woman sample and the content of tumor-cell-derived nucleic acid in a tumor circulating blood sample.
  • With the bias correction model of the present invention, it is possible to correct the deviation in the calculated concentration of the target source nucleic acid that arises when the amount of data acquired is small, when the concentration of the target source nucleic acid in the mixed nucleic acid sample is relatively low, or when the delineated distribution region is relatively strict, so that the nucleic acid content of the second source can be determined.
  • Figure 1 shows the relationship between standard deviation and probability in a standard normal distribution in one embodiment of the invention.
  • Figure 2 shows a schematic representation of a distribution region constructed in a specific embodiment of the invention.
  • Figure 3 shows the relationship between predicted fetal nucleic acid concentration and real fetal nucleic acid concentration in one embodiment of the invention, wherein the predicted fetal nucleic acid concentration in Figure 3a is not corrected using the bias correction model and the predicted fetal nucleic acid concentration in Figure 3b is corrected using the bias correction model.
  • Figure 4 shows the effect of different sequencing depths on determining fetal nucleic acid content in one embodiment of the invention.
  • Figure 5 shows the effect of the number of sites in the AAab SNP distribution region on determining fetal nucleic acid content in one embodiment of the invention.
  • Figure 6 shows the difference between the fetal nucleic acid concentration of male fetuses predicted using Y chromosome depth and the fetal nucleic acid concentration predicted by the method of the present invention, in one embodiment of the present invention.
  • According to one embodiment, a device is provided for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in the nucleic acid of a first source and its genotype in the nucleic acid of a second source. The device comprises:
  • a first region-second region building unit for constructing a first region and a second region. The first region is the distribution region of the first combined genotype SNP sites, and the second region is the distribution region of the second combined genotype SNP sites. The two regions are divided based on the difference between the first relationship and the second relationship. The first relationship is the relationship between the depth of the second highest base and the depth of the highest base of a first combined genotype SNP site, and the second relationship is the same relationship for a second combined genotype SNP site. The first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab;
  • a third region-fourth region building unit for constructing a third region and a fourth region from the second region. The third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the second combined genotype SNP sites, and the fourth region is the distribution region of the AAab SNP sites among the second combined genotype SNP sites. The two regions are divided based on the difference between the third relationship and the fourth relationship. The third relationship is the relationship between the depth of the second highest base and the depth of the highest base for the first predetermined ratio of AAaa SNP sites, and the fourth relationship is the same relationship for the AAab SNP sites.
  • AA and AB denote, respectively, a homozygous and a heterozygous SNP site in the first source nucleic acid; aa and ab denote, respectively, a homozygous and a heterozygous genotype of the same SNP site in the second source nucleic acid. A or a represents the highest base of the SNP site, and B or b represents the second highest base of the same SNP site.
  • A SNP is generally considered to be dimorphic, i.e. di-allelic, consisting of two bases: the highest base and the second highest base. In the same sequencing data, the base at a SNP site that is supported by the most data is called the highest base, and the base supported by the second most data is called the second highest base. For example, among the reads covering a SNP site, a read whose base at the corresponding position is the same as a given base of the site is a read supporting that base, and the base supported by the most reads is the highest base.
  • A and B respectively represent the highest base and the second highest base of a SNP site in the nucleic acid of one source; correspondingly, a and b respectively represent the highest base and the second highest base of the same SNP site in the nucleic acid of the other source.
  • The depth of the highest base and the depth of the second highest base are the amounts of data supporting the highest base and the second highest base, respectively, in the same sequencing data. For example, when the reads obtained by sequencing are aligned to the reference sequence, the number of reads covering the SNP site is the depth of the site, also known as the sequencing depth. Among those reads, the number whose base at the corresponding position is the highest base of the site is the depth of the highest base, and the number whose base at the corresponding position is the second highest base of the site is the depth of the second highest base.
  • A read covering the site whose base at the corresponding position is the same as a given base of the site is referred to as a read supporting that base.
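The definitions of highest base, second highest base and their depths can be illustrated with a short sketch. It assumes the bases observed at one SNP site have already been collected from the reads covering it; the function name and the example pileup are hypothetical, and the order in which tied bases are returned is not specified.

```python
from collections import Counter


def base_depths(observed_bases):
    """Return (highest base, its depth, second highest base, its depth).

    observed_bases: the base each covering read shows at the SNP site.
    If only one base is observed, the second highest base is None
    with depth 0.
    """
    ranked = Counter(observed_bases).most_common(2)
    if not ranked:
        raise ValueError("no reads cover this site")
    highest, x = ranked[0]
    second, y = ranked[1] if len(ranked) > 1 else (None, 0)
    return highest, x, second, y


# Hypothetical pileup: eight reads cover the site.
highest, x, second, y = base_depths("AAAAAGGA")
site_depth = x + y  # depth of the site when only two bases are seen
```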
  • The device can construct the distribution regions without obtaining specific SNP site data in advance; that is, the construction of the distribution regions does not depend on the data of any particular SNP site, including the depths of its highest base and second highest base.
  • The device is based on the four possible combined genotypes that the same SNP site can form in the nucleic acids of the two sources. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the relationship between the depth of the second highest base and the depth of the highest base differs among SNP sites of the four combined genotypes within the same sequencing result, and the distribution regions of the four combined genotypes are constructed from these differences.
  • The four combined genotypes are shown in Table 1.
  • To visualize the distribution regions, a two-dimensional coordinate system is established in which the y-axis represents the depth of the second highest base of the SNP site and the x-axis represents the depth of the highest base. The first relationship can be expressed as $x/2 \ge y \ge x/3$, the second relationship as $0 \le y < x/3$, the third relationship as $0 \le y \le (x+y)e + m\sigma$, and the fourth relationship as $(x+y)e + m\sigma < y < x/3$, where $e$ is the sequencing error rate (generally $e \le 1\%$), $\sigma = \sqrt{(x+y)e}$ is the standard deviation, and $m$ is a non-negative number that depends on the first predetermined ratio. The relationship between $m\sigma$ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution.
  • The probability density curve of the standard normal distribution is bell-shaped. As those skilled in the art will understand, the relationship between standard deviation and probability in the standard normal distribution is as shown in FIG. 1: the interval within plus or minus one standard deviation contains 68.26% of the total area, the interval within plus or minus 1.96 standard deviations contains 95% of the total area, and the interval within plus or minus 2.58 standard deviations contains 99% of the total area. Here m is the number of standard deviations corresponding to the first predetermined ratio, taken as the percentage of the total area.
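The correspondence between a predetermined ratio and the multiplier m (or n) can be sketched with the Python standard library. This assumes the two-sided convention of the examples above, in which the ratio is the area between plus and minus m standard deviations; the function name is hypothetical.

```python
from statistics import NormalDist


def sd_multiplier(ratio):
    """Number of standard deviations m such that the interval
    [-m, +m] of a standard normal distribution contains `ratio`
    of the total area (two-sided convention, an assumption)."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    return NormalDist().inv_cdf((1.0 + ratio) / 2.0)


m_95 = sd_multiplier(0.95)  # close to 1.96
m_99 = sd_multiplier(0.99)  # close to 2.58
```

Under this convention a ratio of 68.26% gives a multiplier close to 1, matching the figures quoted in the text.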
  • The apparatus further comprises a closed fourth region building unit for constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the second combined genotype SNP sites. It is constructed from the fourth region by assuming that the depths of the second highest base and of the highest base of the AAab SNP sites follow a normal distribution and by setting the second predetermined ratio.
  • The second predetermined ratio is set to be not less than 95%.
  • In a two-dimensional coordinate system in which the x-axis represents the depth of the highest base of the SNP site and the y-axis the depth of the second highest base, the closed fourth region is delimited by the lines $y = x/3$, $y = (x+y)e + m\sigma$ and $y = D_0 - n\sigma - x$, where:
  • $e$ is the sequencing error rate, generally $e \le 1\%$;
  • $\sigma = \sqrt{(x+y)e}$;
  • $D_0$ is the average depth of the SNP sites;
  • $m$ and $n$ are non-negative;
  • $m$ depends on the first predetermined ratio, and the relationship between $m\sigma$ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • $n$ depends on the second predetermined ratio, and the relationship between $n\sigma$ and the second predetermined ratio is likewise the relationship between standard deviation and probability in the standard normal distribution;
  • the second predetermined ratio is not less than 95%.
  • The average depth of the SNP sites is the mean of the depths of the individual SNP sites, and the depth of a SNP site is the amount of data supporting that site. For example, when the reads obtained by sequencing are aligned to the reference sequence, the number of reads covering the SNP site in the reference sequence is the depth of the site, also referred to as the sequencing depth, i.e. the number of times the site is covered. Preferably, $D_0 \ge 100\text{X}$.
  • In this way, the distribution regions of SNP sites of different combined genotypes can be constructed.
  • The combined genotype of a SNP site in the data to be tested can then be determined from the distribution region into which the site falls, the SNP sites of a specific combined genotype can be obtained, and the information of those SNP sites in the mixed nucleic acid sample can be used to determine the content of nucleic acid from each source in the sample, including the fetal nucleic acid content in a pregnant woman sample.
  • According to another embodiment, a method is provided for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between the first relationship and the second relationship, the first region being the distribution region of the first combined genotype SNP sites and the second region being the distribution region of the second combined genotype SNP sites, where the first relationship is the relationship between the depth of the second highest base and the depth of the highest base of a first combined genotype SNP site, the second relationship is the same relationship for a second combined genotype SNP site, the first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; and constructing a third region and a fourth region from the second region based on the difference between the third relationship and the fourth relationship.
  • The construction of the distribution regions does not depend on the data of any particular SNP site, including the depths of its highest base and second highest base.
  • The method is based on the four possible combined genotypes that the same SNP site can form in the nucleic acids of the two sources. Assuming that the content of the first source nucleic acid is greater than that of the second source nucleic acid, the relationship between the depth of the second highest base and the depth of the highest base differs among SNP sites of the four combined genotypes within the same sequencing result, and the distribution regions of the four combined genotypes are constructed from these differences.
  • To visualize the distribution regions, a two-dimensional coordinate system is established in which the y-axis represents the depth of the second highest base of the SNP site and the x-axis represents the depth of the highest base. The first relationship can be expressed as $x/2 \ge y \ge x/3$, the second relationship as $0 \le y < x/3$, the third relationship as $0 \le y \le (x+y)e + m\sigma$, and the fourth relationship as $(x+y)e + m\sigma < y < x/3$, where $e$ is the sequencing error rate and $\sigma = \sqrt{(x+y)e}$ is the standard deviation.
  • $m$ is a non-negative number that depends on the first predetermined ratio. The relationship between $m\sigma$ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution.
  • The probability density curve of the standard normal distribution is bell-shaped. As those skilled in the art will understand, the relationship between standard deviation and probability in the standard normal distribution is as shown in FIG. 1, where $\sigma$ denotes the standard deviation.
  • The first predetermined ratio corresponds to the percentage of the total area in that distribution.
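How the third and fourth relationships split the second region can be sketched as below. This is an illustrative reading, not the patented implementation: it takes the second region as y < x/3, the error ceiling as (x+y)e + mσ with σ = sqrt((x+y)e), and assigns sites at or below the ceiling to the third region. Whether each boundary is inclusive is an assumption, as are the function name and the default values of e and m.

```python
import math


def split_second_region(x, y, e=0.01, m=1.96):
    """Assign a site inside the second region to 'third' (AAaa-like)
    or 'fourth' (AAab-like); return None outside the second region.

    x: depth of the highest base, y: depth of the second highest base.
    e: sequencing error rate; m: standard deviation multiplier.
    Boundary inclusiveness is an assumption of this sketch.
    """
    if x <= 0 or y < 0 or y > x:
        raise ValueError("expected x > 0 and 0 <= y <= x")
    if y >= x / 3:
        return None  # not in the second region
    expected_errors = (x + y) * e
    sigma = math.sqrt(expected_errors)
    ceiling = expected_errors + m * sigma
    return "third" if y <= ceiling else "fourth"


# Hypothetical sites at roughly 100X depth.
labels = [split_second_region(100, y) for y in (0, 2, 10, 50)]
```

A second highest base seen only a couple of times at 100X is consistent with sequencing error and stays in the third region, while ten supporting reads exceed the error ceiling and fall in the fourth region.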
  • The method further comprises constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the second combined genotype SNP sites. Its construction assumes that the depths of the second highest base and of the highest base of the AAab SNP sites follow a normal distribution, and includes setting the second predetermined ratio.
  • The second predetermined ratio is set to be not less than 95%.
  • In a two-dimensional coordinate system in which the x-axis represents the depth of the highest base of the SNP site and the y-axis the depth of the second highest base, the closed fourth region is delimited by the lines $y = x/3$, $y = (x+y)e + m\sigma$ and $y = D_0 - n\sigma - x$, where:
  • $e$ is the sequencing error rate, generally $e \le 1\%$;
  • $\sigma = \sqrt{(x+y)e}$;
  • $D_0$ is the average depth of the SNP sites;
  • $m$ and $n$ are non-negative;
  • $m$ depends on the first predetermined ratio, and the relationship between $m\sigma$ and the first predetermined ratio is the relationship between standard deviation and probability in the standard normal distribution;
  • $n$ depends on the second predetermined ratio, and the relationship between $n\sigma$ and the second predetermined ratio is likewise the relationship between standard deviation and probability in the standard normal distribution;
  • the second predetermined ratio is not less than 95%.
  • The average depth of the SNP sites is the mean of the depths of the individual SNP sites, and the depth of a SNP site is the amount of data supporting that site. For example, when the reads obtained by sequencing are aligned to the reference sequence, the number of reads covering the SNP site in the reference sequence is the depth of the site, also referred to as the sequencing depth, i.e. the number of times the site is covered. Preferably, $D_0 \ge 100\text{X}$.
  • The method comprises: sequencing at least a portion of the nucleic acid in the mixed nucleic acid sample to obtain sequencing data comprising a plurality of reads, the mixed nucleic acid sample comprising the first source nucleic acid and the second source nucleic acid; aligning the sequencing data with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining the distribution region in which each SNP site is located, the distribution regions being constructed by the method for constructing distribution regions of SNP sites of different combined genotypes in any of the above embodiments; and determining the combined genotype of each SNP site based on the distribution region in which it is located.
  • the first source nucleic acid and the second source nucleic acid may be nucleic acids derived from different individuals, or may be nucleic acids of different tissues or parts of the same individual, such as nucleic acids from tumor cells and from non-tumor cells.
  • Obtaining sequencing data includes sequencing library preparation of mixed nucleic acid samples, and sequencing of the libraries. Sequencing can be performed using existing sequencing platforms, and library preparation can be performed according to the selected sequencing platform.
  • the optional sequencing platforms include, but are not limited to, CG (Complete Genomics) CGA, Illumina/Solexa, Life Technologies/Ion Torrent, and Roche 454. Single-end or paired-end sequencing libraries are prepared according to the selected sequencing platform.
  • the alignment can be performed with software such as SOAP (Short Oligonucleotide Analysis Package) or BWA; this embodiment places no limit on the choice.
  • during alignment, at most h base mismatches are allowed per read; h is preferably 1 or 2. If more than h bases in a read are mismatched, the read is considered unable to align to the reference sequence.
  • the identification of SNP sites can be performed according to software default parameter settings using software such as SOAPsnp and GATK.
  • the reference sequence is a known sequence and may be any reference template of the biological category to which the target individual belongs, such as a published genome assembly of the same category. If the mixed nucleic acid sample is from a human, the genomic reference sequence (also called the reference genome) can be HG19 as provided by the NCBI database.
  • the alignment result includes, for each read, whether it aligns to the reference sequence and the position on the reference sequence to which it aligns, as well as how many reads align to a given site, the base type at the corresponding position of each read aligned to that site, and the like.
  • determining which distribution region a SNP site falls into determines its combined genotype, that is, types the SNP, because the distribution regions correspond to the four combined genotypes:
  • the first region is the distribution region of ABaa and ABab SNP sites;
  • the third region is the distribution region of the first predetermined ratio of AAaa SNP sites;
  • the fourth region, or the closed fourth region, is the distribution region of AAab SNP sites.
  • the proportion of the second source nucleic acid in the mixed nucleic acid is estimated from the SNP sites falling into the fourth region. Because the combined genotype of a SNP site falling into the fourth region or the closed fourth region is AAab, the second highest base comes only from the second source nucleic acid, so the proportion can be estimated from the number of reads supporting the second highest base relative to the number of reads covering the site.
  • the second source nucleic acid concentration is estimated at each SNP site falling into the closed fourth region, giving a set of concentration values whose median is taken as the second source nucleic acid concentration.
  • a method for determining fetal nucleic acid content in a pregnant woman sample, comprising: obtaining a sequencing result by sequencing at least a portion of the nucleic acid in the pregnant woman sample, the sequencing result consisting of a plurality of reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result with a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed according to the method for constructing distribution regions of SNP sites of different combined genotypes in any of the foregoing embodiments; and determining the fetal nucleic acid content of the sample based on the SNP sites in the fourth region or the closed fourth region.
  • the sample from the pregnant woman is a body-fluid sample, for example peripheral blood or urine.
  • the cell-free DNA in a pregnant woman's body fluid contains genomic information from both mother and fetus, and its SNP sites can be divided into four categories: mother homozygous and fetus homozygous (AAaa), mother homozygous but fetus heterozygous (AAab), mother heterozygous but fetus homozygous (ABaa), and mother heterozygous and fetus heterozygous (ABab).
  • This embodiment uses the constructed distribution regions to distinguish these four categories, and then selects the SNP sites at which the mother is homozygous and the fetus heterozygous (AAab) as effective sites for calculating the fetal concentration.
  • the description of the advantages and technical features of the constructed distribution area in any of the foregoing embodiments is equally applicable to this embodiment, and details are not described herein again.
  • the fetal nucleic acid content is the median of 2*y_4/(x_4 + y_4) over the SNP sites in the fourth region or the closed fourth region, where y_4 is the depth of the second highest base and x_4 is the depth of the highest base at each such site; the depth of a base is the number of reads supporting it.
  • the sequencing result contains a data volume of not less than 65X, that is, the sequencing depth is not less than 65X.
  • a deviation correction model is used to correct the calculated fetal nucleic acid content, giving a corrected fetal nucleic acid content.
  • the deviation correction model corrects the deviation in the calculated fetal nucleic acid concentration that arises when the amount of data is small, or the fetal nucleic acid concentration in the pregnant woman sample is very low, and the defined distribution region is relatively strict.
  • the deviation correction model may be established when the amount of data for the pregnant woman sample under test is less than 65X or the estimated fetal nucleic acid concentration is less than 10%, or it may be established in advance and saved for use.
  • when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, the first predetermined ratio is adjusted to enlarge the fourth region or the closed fourth region, so that the SNP sites falling into it are more numerous and closer to the theoretical expectation, which helps improve the accuracy of the calculated fetal nucleic acid concentration. Lowering the first predetermined ratio narrows the third region and thereby enlarges the fourth region or the closed fourth region.
  • K_2 is the number of simulated sites of combined genotype AAab
  • K_3 is the number of simulated sites of combined genotype ABaa
  • K_4 is the number of simulated sites of combined genotype ABab; K_2/K ≥ 0.5%, K_2 ≥ 35
  • based on the assumptions, different standard fetal nucleic acid contents f are set and, using the simulated sites of combined genotype AAab, that is, those in the fourth region or the closed fourth region, the corresponding fetal nucleic acid content f_0 is calculated
  • polynomial regression is performed on the resulting pairs (f, f_0) to establish the deviation correction model; the assumptions include that the depths of the highest base and the second highest base of the simulated sites of combined genotype AAaa
  • N(μ, σ^2) represents a normal distribution with mean (expectation) μ and variance σ^2; N(μ, σ^2) is also often written as N(μ, σ).
  • the deviation correction model is a normal mixture model. Based on the above assumptions, at a fixed average sequencing depth, taking a series of values of f yields a corresponding series of f_0, calculated by the above method; fitting an equation to the resulting pairs (f, f_0) gives a formula suitable for correcting f_0 at that average sequencing depth.
  • Equation regression can utilize existing methods.
  • the fitted curve corrects the calculated value f 0 to the corresponding theoretical value.
  • these fitted polynomial equations are significant. Since the sequencing depth (the average sequencing depth of the simulated sites) assumed by each equation is greater than or equal to 50X, in practice a sample whose sequencing depth lies within ±5X of an equation's assumed depth can be corrected with that equation, for example when the amount of sequencing data for the sample under test is 55X.
  • an apparatus for determining the fetal concentration in a pregnant woman sample, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor connected to the data input unit, the data output unit, and the storage unit and configured to execute the executable program stored in the storage unit, where executing the program completes all or part of the steps of the methods in the foregoing embodiments.
  • the method and device for detecting fetal nucleic acid concentration in a pregnant woman sample according to the present invention classify SNP sites based on the distribution regions of SNP sites of different combined genotypes and/or a Gaussian (normal) mixture model.
  • Fetal DNA concentration is detected directly from massively parallel sequencing data of the pregnant woman sample, which removes the need, present in existing methods, for separate experimental analysis to determine specific types of SNP sites, and so saves experimental and analysis costs. The distribution regions and/or the deviation correction model of the present invention are constructed on the assumption that the depth of each type of SNP follows a binomial distribution, which under certain conditions can be treated as a normal distribution.
  • Using the method of the present invention for determining fetal nucleic acid content in a pregnant woman sample it is possible to accurately detect fetal DNA concentrations as low as 2.5% with only 65X data.
  • a mixed nucleic acid sample comprising two different source nucleic acids, such as a maternal plasma sample.
  • the genomic information of the mother and the fetus is contained in the free DNA of the pregnant woman's plasma.
  • the SNP sites can be divided into four categories: mother homozygous and fetus homozygous (AAaa); mother homozygous but fetus heterozygous (AAab); mother heterozygous but fetus homozygous (ABaa); and mother heterozygous and fetus heterozygous (ABab).
  • the following four types are distinguished by the difference in the relationship between the highest base depth and the second highest base depth of various types of sites, and the distribution regions of various SNP sites are constructed.
  • the key to the division of the distribution area is to determine the difference between the various SNP loci.
  • the difference can be expressed as a boundary line. It is necessary to determine at least 2 curves (demarcation lines) in order to delineate the distribution area belonging to only one type of SNP site.
  • the distribution region of the ABaa and ABab sites is delimited first. Fetal concentration has not been reported to exceed 50%, so the expected depth of the second highest base of ABaa is between x/2 and x/3.
  • the next step is to separate the SNP sites of the AAaa and AAab types.
  • an AAaa-type site is not a true SNP; it appears as a SNP site because of sequencing errors and assembly errors. According to previous studies, SNP sites arising from sequencing errors follow a binomial distribution.
  • when the number is large enough, the binomial distribution can be regarded as a normal distribution, so these sites are treated as normally distributed. In addition, by a property of the binomial distribution, when the number is greater than 20 and the probability p is less than 5%, the binomial distribution can be approximated by a Poisson distribution.
  • with the sequencing quality value at Q20 (sequencing error rate below 1%), the p of these SNP sites is theoretically less than or equal to 1%, which satisfies the requirement for approximating the binomial distribution by a Poisson distribution. We therefore assume that the variance of the normal distribution is equal to its expectation.
  • the sequencing error rate is assumed to be 1%, and the expectation (mean) is assumed to be the sequencing depth of each site multiplied by 1%, i.e. the expectation and the variance are both (x_i + y_i) * 0.01, where x_i and y_i are the highest base depth and the second highest base depth of SNP site i. In a normal distribution, about 99.7% of points fall within three standard deviations of the mean (and about 99.9% lie below the mean plus three standard deviations), as shown in Figure 1.
  • uppercase letters represent bases from the site of the mother's nucleic acid
  • lowercase letters represent bases from the site of the fetal nucleic acid
  • A/a represents the highest base
  • B/b represents the second highest base of the same site
  • the x-axis represents the highest base depth
  • the y-axis represents the next highest base depth.
  • the overall operational procedures include:
  • using the distribution region of AAab-type SNP sites delineated in the first embodiment (hereinafter simply the AAab region), the SNP sites at which the mother is homozygous but the fetus is heterozygous (AAab) are selected as effective sites for calculating the predicted fetal nucleic acid concentration.
  • for each SNP site falling within the AAab region, the fetal nucleic acid concentration is calculated according to the formula 2*y_4/(x_4 + y_4), and the median over all such sites is taken as the fetal nucleic acid concentration of the sample; y_4 and x_4 are the second highest base depth and the highest base depth, respectively, of a SNP site in the AAab region.
  • the defined boundary between AAab and AAaa may be too strict and can cause deviation: for example, too few sites may fall into the AAab region for a reliable estimate, or, because the median of a set of fetal nucleic acid concentrations is taken as the fetal nucleic acid concentration of the sample, the estimate deviates when too many AAab SNP sites near the x-axis are removed.
  • AAaa sites are homozygous sites that appear heterozygous because of sequencing errors.
  • the sequencing error rate e is taken as 0.26%, based on the previously reported sequencing error rate of the HiSeq 2000.
  • the SNP site in the plasma can be regarded as a normal distribution with a variance equal to the mean, which can be expressed as follows:
  • the simulated site data is generated by the R language, the range of the AAab region of the first embodiment is adjusted, and the predicted fetal nucleic acid concentration is calculated according to the method of the second embodiment according to the simulated site falling into the adjusted AAab region.
  • a set of values of 0.5 to 25% of the standard fetal concentration and the corresponding predicted fetal concentration at a sequencing depth can be obtained, and then a fitting equation is obtained.
  • Table 2 shows the fitted equations at different depths produced for ease of use.
  • the corrected fetal nucleic acid content was 2.6%.
  • the depth of sequencing and the number of SNP loci in the AAab region are the most important factors affecting the accuracy of the calculated nucleic acid content, and the accuracy at different depths is also examined here.
  • the fetal nucleic acid concentration can be accurately obtained at a fetal nucleic acid content of about 3%, and there are relatively many effective SNP sites.
  • the test examines how accuracy changes with sequencing depth over the range 40~200X; the change in accuracy is shown in Fig. 4.
  • the relative error is not more than 10%
  • the number of SNP sites in the AAab region is at least 35
  • AAab-type SNP sites account for no more than 10% of all types of SNP sites in maternal plasma samples; the minimum number of SNP sites of all types required to accurately detect different fetal nucleic acid concentrations is shown in Table 3.
  • the example also uses the Y chromosome depth to calculate the fetal nucleic acid concentrations of the 9 male fetuses among the 18 plasma samples, and compares them with the fetal nucleic acid concentrations calculated for the same male fetuses by the method of the present invention.
  • r = 0.94; p < 0.0001
  • the use of the Y chromosome depth to calculate the male fetal nucleic acid concentration is a known method, and can be referred to [Struble C A, Syngelaki A, Oliphant A, et al. Fetal fraction estimate in twin pregnancies using directed cell-free DNA analysis [J].
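
The fetal nucleic acid content estimate described in the list above, the median of 2*y_4/(x_4 + y_4) over the SNP sites in the AAab region, can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `fetal_fraction` and the input format (a list of `(x, y)` depth pairs already assigned to the AAab region) are assumptions made for this example.

```python
from statistics import median


def fetal_fraction(aaab_sites):
    """Estimate fetal nucleic acid content from AAab-region SNP sites.

    aaab_sites: iterable of (x, y) pairs, where x is the depth of the
    highest base and y is the depth of the second highest base.
    (Hypothetical input format chosen for this sketch.)
    """
    per_site = []
    for x, y in aaab_sites:
        if x + y <= 0:
            raise ValueError("site has no supporting reads")
        # At an AAab site the second highest base comes only from the
        # heterozygous fetus, hence the factor of 2.
        per_site.append(2 * y / (x + y))
    if not per_site:
        raise ValueError("no AAab sites supplied")
    # The median over sites is taken as the sample-level estimate.
    return median(per_site)


# Three made-up sites, for illustration only.
example_sites = [(95, 5), (90, 10), (188, 12)]
estimate = fetal_fraction(example_sites)
```

Taking the median rather than the mean keeps a few mis-assigned sites from dominating the estimate, which is consistent with the text's choice of the median.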

Abstract

The present invention discloses a device for constructing distribution regions of SNP sites of different combined genotypes, comprising: a first region-second region construction unit for constructing a first region and a second region, where the first region is the distribution region of SNP sites of the first combined genotype and the second region is the distribution region of SNP sites of the second combined genotype; and a third region-fourth region construction unit for constructing a third region and a fourth region from the second region, where the third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the SNP sites of the second combined genotype and the fourth region is the distribution region of the AAab SNP sites among them. The invention also discloses a method for constructing distribution regions of SNP sites of different combined genotypes, and a method and a device for determining the fetal nucleic acid content of a sample from a pregnant woman.

Description

Method and apparatus for determining fetal nucleic acid content
Technical field
The present invention relates to the field of bioinformatics. Specifically, it relates to a device for constructing distribution regions of SNP sites of different combined genotypes, a method for constructing distribution regions of SNP sites of different combined genotypes, a method for distinguishing SNP sites of different combined genotypes, a method for determining the fetal nucleic acid content of a sample from a pregnant woman, a device for determining the fetal nucleic acid content of a sample from a pregnant woman, and a computer-readable medium.
Background art
Since the discovery of cell-free fetal DNA in maternal plasma, prenatal testing has changed substantially. With the falling price of second-generation sequencing and continued technical innovation, non-invasive prenatal testing has developed rapidly and is now widely applied, for example in the prenatal diagnosis of genetic conditions such as hemophilia, sex ambiguity, and monogenic diseases. In these diagnoses, fetal concentration is an important parameter. Abnormal fetal concentration can also help predict certain disease risks: a higher fetal concentration may be associated with preterm birth, and a lower fetal concentration can be used to help identify moderate or severe eclampsia.
Several methods already exist for measuring the fetal DNA concentration in maternal plasma. One directly counts the proportions of Y-chromosome and autosomal sequences in the cell-free DNA of maternal plasma, but this is not feasible when the fetus is female. Some studies calculate fetal concentration from differences in epigenetic markers, such as methylated versus unmethylated sites, between the maternal and fetal genomes, but this approach is affected by bisulfite conversion or methylation-sensitive restriction enzyme digestion and is often imprecise. Other studies use next-generation sequencing to analyze sites that differ between the fetal and maternal genomes; in some non-invasive prenatal tests, however, the fetal genomic information cannot be obtained in advance. Still others use the genomic information of the father and mother to find fetus-specific loci, but the father's genomic information is often unavailable, and analyzing the parents' genomes separately adds cost.
Summary of the invention
According to one aspect, the present invention provides a device for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The device comprises: a first region-second region construction unit for constructing a first region and a second region, where the first region is the distribution region of SNP sites of the first combined genotype, the second region is the distribution region of SNP sites of the second combined genotype, and the two regions are separated based on the difference between a first relationship and a second relationship; the first relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, the second relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype, the first combined genotype is ABaa and ABab, and the second combined genotype is AAaa and AAab; and a third region-fourth region construction unit for constructing a third region and a fourth region from the second region, where the third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the SNP sites of the second combined genotype, the fourth region is the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype, and the two regions are separated based on the difference between a third relationship and a fourth relationship; the third relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the first predetermined ratio of AAaa SNP sites, and the fourth relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the AAab SNP sites. Here AA and AB denote a SNP site that is homozygous or heterozygous, respectively, in the first source nucleic acid; aa and ab denote the same SNP site being homozygous or heterozygous, respectively, in the second source nucleic acid; A or a denotes the highest base of the SNP site, and B or b denotes the second highest base of the same site. Because SNPs are generally regarded as dimorphic, that is, biallelic, a site consists of two bases, the highest base and the second highest base; A and B denote the highest and second highest base of a SNP site in one source nucleic acid, and a and b denote the highest and second highest base of the same site in the other source nucleic acid. In this aspect, the device constructs the distribution regions of the four combined genotypes from the four possible combinations of genotypes of a SNP site in the two source nucleic acids, on the assumption that the content of the first source nucleic acid exceeds that of the second, and from the differences, within a single sequencing result, among the relationships that the second highest and highest base depths satisfy for each of the four combinations. The distribution regions so constructed can be used to determine the combined genotype of SNP sites in the data under test, and/or to separate SNP sites of different combined genotypes in that data and extract the data for sites of a particular combined genotype.
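
The relationships above are all stated in terms of the depth of the highest base and the depth of the second highest base at a SNP site. As a minimal sketch of how those two depths might be obtained from per-base read counts at one site, consider the Python function below. The input format (a dictionary from base to supporting-read count) and the name `base_depths` are hypothetical, chosen only for illustration.

```python
def base_depths(counts):
    """Return (x, y): the highest and second highest base depth at a site.

    counts: dict mapping a base ("A", "C", "G", "T") to the number of
    reads supporting that base at the site (hypothetical input format).
    """
    # Sort the supporting-read counts from largest to smallest.
    ranked = sorted(counts.values(), reverse=True)
    # Pad with zeros so a site with fewer than two observed bases
    # still yields a second highest depth of 0.
    ranked += [0, 0]
    return ranked[0], ranked[1]


# Made-up counts: 90 reads support A, 8 support G, 1 supports C.
x, y = base_depths({"A": 90, "G": 8, "C": 1, "T": 0})
```

Only the two largest counts are kept, in line with the text's treatment of SNPs as biallelic.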
According to one aspect, the present invention provides a device for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The device comprises: a first region-second region construction unit for constructing a first region and a second region, where the first region is the distribution region of SNP sites of the first combined genotype, the second region is the distribution region of SNP sites of the second combined genotype, and the two regions are separated based on the difference between a first relationship and a second relationship; the first relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, the second relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype, the first combined genotype is ABaa and ABab, and the second combined genotype is AAaa and AAab; a third region-fourth region construction unit for constructing a third region and a fourth region from the second region, where the third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the SNP sites of the second combined genotype, the fourth region is the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype, and the two regions are separated based on the difference between a third relationship and a fourth relationship; the third relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the first predetermined ratio of AAaa SNP sites, and the fourth relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the AAab SNP sites; and a closed fourth region construction unit for constructing a closed fourth region from the fourth region, where the closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the SNP sites of the second combined genotype, and is constructed from the fourth region on the basis that the depths of both the second highest base and the highest base of AAab SNP sites follow a normal distribution, and by setting the second predetermined ratio. Here AA and AB denote a SNP site that is homozygous or heterozygous, respectively, in the first source nucleic acid; aa and ab denote the same SNP site being homozygous or heterozygous, respectively, in the second source nucleic acid; A or a denotes the highest base of the SNP site, and B or b denotes the second highest base of the same site.
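
The closed fourth region is obtained by setting a second predetermined ratio under a normal-distribution assumption. As a hedged illustration of how such a ratio relates to a multiple of the standard deviation, the sketch below uses the standard normal distribution from Python's standard library. Whether the ratio is meant as two-sided or one-sided coverage is not stated in this passage, so both are exposed through a `two_sided` flag; the function names `coverage` and `sigma_multiplier` are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def coverage(k, two_sided=True):
    """Probability mass of a normal distribution within k standard deviations.

    two_sided=True  -> P(|Z| <= k)
    two_sided=False -> P(Z <= k)
    """
    cdf = _STD_NORMAL.cdf(k)
    return 2 * cdf - 1 if two_sided else cdf


def sigma_multiplier(ratio, two_sided=True):
    """Number of standard deviations needed to cover the given ratio."""
    # For two-sided coverage, half of the excluded mass lies in each tail.
    p = (1 + ratio) / 2 if two_sided else ratio
    return _STD_NORMAL.inv_cdf(p)


# A ratio of 95% (the stated lower bound for the second predetermined
# ratio) under the two-sided reading assumed here.
n = sigma_multiplier(0.95)
```

Under the two-sided reading, a ratio of 95% corresponds to roughly 1.96 standard deviations, and three standard deviations cover about 99.7% of the distribution.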
According to one aspect, the present invention provides a method for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of a SNP site in a first source nucleic acid and its genotype in a second source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between a first relationship and a second relationship, where the first region is the distribution region of SNP sites of the first combined genotype, the second region is the distribution region of SNP sites of the second combined genotype, the first relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the first combined genotype, the second relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at SNP sites of the second combined genotype, the first combined genotype is ABaa and ABab, and the second combined genotype is AAaa and AAab; and constructing a third region and a fourth region from the second region based on the difference between a third relationship and a fourth relationship, where the third region is the distribution region of a first predetermined ratio of the AAaa SNP sites among the SNP sites of the second combined genotype, the fourth region is the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype, the third relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the first predetermined ratio of AAaa SNP sites, and the fourth relationship is the relationship satisfied between the depth of the second highest base and the depth of the highest base at the AAab SNP sites. Here AA and AB denote a SNP site that is homozygous or heterozygous, respectively, in the first source nucleic acid; aa and ab denote the same SNP site being homozygous or heterozygous, respectively, in the second source nucleic acid; A or a denotes the highest base of the SNP site, and B or b denotes the second highest base of the same site.
According to another aspect of the present invention, a method is provided for distinguishing SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of an SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid. The method comprises: sequencing at least a portion of the nucleic acid in a mixed nucleic acid sample to obtain sequencing data, the sequencing data consisting of a plurality of reads and the mixed nucleic acid sample containing the first-source nucleic acid and the second-source nucleic acid; aligning the sequencing data to a reference sequence to obtain an alignment result; based on the alignment result, identifying SNP sites and determining the distribution region in which each SNP site is located, the distribution regions being constructed by the method of constructing distribution regions of SNP sites of different combined genotypes according to one aspect of the present invention; and determining the combined genotype of each SNP site based on the distribution region in which it is located.
Further, according to one aspect of the present invention, the proportion of the second-source nucleic acid in the mixed nucleic acid is estimated from the information of the SNP sites that fall into the fourth region. Because the combined genotype of an SNP site falling into the fourth region, or into the closed fourth region, is AAab, the second-highest base comes only from the second-source nucleic acid. The concentration of the second-source nucleic acid can therefore be estimated from the proportion of reads supporting the second-highest base among all reads covering the site, expressed as: second-source nucleic acid concentration = 2 * second-highest base depth / (highest base depth + second-highest base depth), where the second-highest base and the highest base in the formula belong to the same AAab SNP site. In a specific embodiment of the present invention, the second-source nucleic acid concentration is estimated at each SNP site falling into the closed fourth region to obtain a set of concentration values, and the median of these values is taken as the second-source nucleic acid concentration. For a mixed nucleic acid sample containing nucleic acid from only two sources, once the nucleic acid content of one source is determined, that of the other is determined as well.
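The estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names and the example depths are hypothetical, and the input is assumed to be a list of (highest-base depth, second-highest-base depth) pairs for sites already assigned to the closed fourth region.

```python
from statistics import median


def second_source_fraction(highest_depth, second_depth):
    """Per-site estimate: 2 * second-highest depth / (highest + second-highest depth)."""
    total = highest_depth + second_depth
    if total <= 0:
        raise ValueError("site has no supporting reads")
    return 2.0 * second_depth / total


def estimate_fraction(sites):
    """Median of the per-site estimates over AAab sites.

    sites: iterable of (highest_depth, second_depth) pairs (hypothetical input format).
    """
    sites = list(sites)
    if not sites:
        raise ValueError("no AAab sites supplied")
    return median(second_source_fraction(h, s) for h, s in sites)


# Hypothetical depths at three AAab sites (illustrative values only).
aaab_sites = [(95, 5), (188, 12), (90, 10)]
fraction = estimate_fraction(aaab_sites)  # per-site values 0.10, 0.12, 0.20
```

The factor of 2 reflects that, at an AAab site, only one of the two second-source alleles carries the second-highest base. Taking the median rather than the mean limits the influence of individual sites with unusual depth.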
According to yet another aspect of the present invention, a method is provided for determining the fetal nucleic acid content in a sample from a pregnant woman. The method comprises: obtaining a sequencing result, which includes sequencing at least a portion of the nucleic acid in the sample, the sequencing result consisting of a plurality of reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result to a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method of constructing distribution regions of SNP sites of different combined genotypes according to one aspect of the present invention; and determining the fetal nucleic acid content in the sample based on the SNP sites located in the fourth region or the closed fourth region among the distribution regions. In the method of this aspect, the sample to be tested is a body fluid sample from the pregnant woman, for example peripheral blood or urine.
According to one aspect of the present invention, a method is provided for determining the fetal nucleic acid content in a sample from a pregnant woman. The method comprises: obtaining a sequencing result, which includes sequencing at least a portion of the nucleic acid in the sample, the sequencing result consisting of a plurality of reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result to a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site is located, the distribution regions being constructed by the method of constructing distribution regions of SNP sites of different combined genotypes according to one aspect of the present invention; determining the fetal nucleic acid content in the sample based on the SNP sites located in the fourth region or the closed fourth region among the distribution regions; and, when the amount of data in the sequencing result is less than 65X and/or the determined fetal nucleic acid content is less than 10%, correcting the fetal nucleic acid content with a bias correction model to obtain a corrected fetal nucleic acid content.
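The condition under which the correction is applied can be written as a simple predicate. The sketch below covers only that decision; the bias correction model itself is not described in this passage and is not reproduced here. The function and parameter names are hypothetical, and the amount of data is assumed to be expressed as a sequencing depth in X.

```python
def needs_bias_correction(data_depth_x, fetal_fraction,
                          depth_cutoff_x=65.0, fraction_cutoff=0.10):
    """Return True when either stated condition holds ("and/or" in the text).

    data_depth_x: amount of sequencing data, in X (assumed unit).
    fetal_fraction: uncorrected fetal nucleic acid content as a fraction (0.10 = 10%).
    """
    return data_depth_x < depth_cutoff_x or fetal_fraction < fraction_cutoff


# Hypothetical cases: low depth, low fraction, and neither.
low_depth = needs_bias_correction(50.0, 0.15)
low_fraction = needs_bias_correction(100.0, 0.05)
neither = needs_bias_correction(100.0, 0.15)
```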
According to still another aspect of the present invention, a device is provided for determining the fetal nucleic acid content in a sample from a pregnant woman, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor connected to the data input unit, the data output unit and the storage unit and configured to execute the executable program, wherein executing the executable program includes carrying out the above method for determining the fetal nucleic acid content in a sample from a pregnant woman.
According to a further aspect of the present invention, a computer-readable medium is provided for storing a program to be executed by a computer, wherein executing the program includes carrying out any of the above methods for determining the fetal nucleic acid content in a sample from a pregnant woman. Those of ordinary skill in the art will understand that, when the program is executed, all or part of the steps of the above method can be carried out by instructing the relevant hardware. The storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disc, and the like.
With the device and/or method of the present invention, the distribution regions of SNP sites of different combined genotypes can be obtained, so that SNP sites of different combined genotypes can be distinguished, the combined genotype of an SNP can be determined, SNP sites of a specific combined genotype can be obtained, and the information of SNP sites of a specific combined genotype in a mixed nucleic acid sample can be used to determine the content of nucleic acid from each source in that sample, including the fetal nucleic acid content in a sample from a pregnant woman and the content of tumor-cell-derived nucleic acid in a tumor circulating blood sample. Further, the bias correction model of the present invention can correct the deviation in the calculated concentration of the target-source nucleic acid that arises when the amount of data obtained is small, or the concentration of the target-source nucleic acid in the mixed nucleic acid sample is low, while the delineated distribution region is relatively strict, thereby correcting the determined second-source nucleic acid content.
Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Figure 1 shows the relationship between standard deviation and probability in the standard normal distribution in a specific embodiment of the present invention.
Figure 2 shows a schematic diagram of the distribution regions constructed in a specific embodiment of the present invention.
Figure 3 shows the relationship between the predicted fetal nucleic acid concentration and the true fetal nucleic acid concentration in a specific embodiment of the present invention, where the predicted concentration in Figure 3a is not corrected with the bias correction model and the predicted concentration in Figure 3b is corrected with the bias correction model.
Figure 4 shows the effect of different sequencing depths on the determination of fetal nucleic acid content in a specific embodiment of the present invention.
Figure 5 shows the effect of the number of sites in the AAab SNP distribution region on the determination of fetal nucleic acid content in a specific embodiment of the present invention.
Figure 6 shows the difference between the male-fetus fetal nucleic acid concentration predicted from Y-chromosome depth and that predicted by the method of the present invention in a specific embodiment of the present invention.
Detailed Description
According to one embodiment of the present invention, a device is provided for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of an SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid. The device comprises: a first region-second region building unit configured to construct a first region and a second region, wherein the first region is the distribution region of the first combined genotype SNP sites, the second region is the distribution region of the second combined genotype SNP sites, the first region and the second region are separated based on the difference between a first relationship and a second relationship, the first relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first combined genotype SNP sites, the second relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the second combined genotype SNP sites, the first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; and a third region-fourth region building unit configured to construct a third region and a fourth region from the second region, wherein the third region is the distribution region of a first predetermined proportion of the AAaa SNP sites among the second combined genotype SNP sites, the fourth region is the distribution region of the AAab SNP sites among the second combined genotype SNP sites, the third region and the fourth region are separated based on the difference between a third relationship and a fourth relationship, the third relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first predetermined proportion of AAaa SNP sites, and the fourth relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the AAab SNP sites; wherein AA and AB respectively denote an SNP site that is homozygous and heterozygous in the first-source nucleic acid, aa and ab respectively denote the same SNP site that is homozygous and heterozygous in the second-source nucleic acid, A or a is defined as the highest base of the SNP site, and B or b as the second-highest base of the same SNP site.
An SNP is generally regarded as dimorphic, that is, biallelic, consisting of two bases: the highest base and the second-highest base. Within one set of sequencing data, the base at an SNP site that is supported by the most data is called the highest base, and the base with the second-most support is called the second-highest base. For example, among the reads obtained by sequencing that cover an SNP site, a read whose corresponding position is identical to a given base of that site is a read supporting that base, and the highest base is the base supported by the largest number of the reads covering the site. Here, A and B respectively denote the highest base and the second-highest base of an SNP site in one source of nucleic acid, and a and b respectively denote the highest base and the second-highest base of the same SNP site in the other source of nucleic acid.
The depth of the highest base and the depth of the second-highest base are the amounts of data supporting each of them within the same set of sequencing data. For example, when the reads obtained by sequencing are aligned to a reference sequence, the number of reads aligned to (covering) an SNP site of the reference sequence is the depth of that site, also called the sequencing depth; the number of those reads whose corresponding position is identical to the highest base of the site is the depth of the highest base; and the number of those reads whose corresponding position is identical to the second-highest base of the site is the depth of the second-highest base.
Constructing the distribution regions with this device does not require specific SNP site data to be obtained in advance; that is, the construction does not depend on site-specific data, including the highest-base and second-highest-base depths of any particular SNP site. The device constructs the distribution regions of the four combined genotypes from the four possible combinations of the genotypes of one SNP site in the two sources of nucleic acid, on the assumption that the first-source nucleic acid content is greater than the second-source nucleic acid content, and from the differences among the relationships that the second-highest-base depth and the highest-base depth satisfy for each of the four combinations within one sequencing result. The four combined genotypes are shown in Table 1. It should be noted that, in theory, a site that is homozygous in both the first-source and the second-source nucleic acid is not an SNP site; however, random errors in the actual process of obtaining SNP sites, including sequencing errors, give rise to SNP sites with the combined genotype AAaa. According to previous studies, SNP sites produced by sequencing errors follow a binomial distribution, and by the central limit theorem a binomial distribution can be regarded as a normal distribution when the number is sufficiently large; this embodiment therefore treats these sites as normally distributed.
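The definitions of highest base, second-highest base and their depths can be illustrated with a short Python sketch. It assumes the base calls of all reads covering one site are already available as a string or list (a hypothetical input format); the function name is likewise illustrative and not taken from the patent.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def top_two_base_depths(read_bases):
    """Return (highest_base, highest_depth, second_base, second_depth).

    read_bases: iterable of single-character base calls, one per read
    covering the site. Calls outside A/C/G/T (for example "N") are ignored.
    """
    counts = Counter(b for b in read_bases if b in VALID_BASES)
    ranked = counts.most_common(2)
    if not ranked:
        raise ValueError("no usable base calls at this site")
    highest_base, highest_depth = ranked[0]
    # A site where only one base is observed has no second-highest base.
    second_base, second_depth = ranked[1] if len(ranked) > 1 else (None, 0)
    return highest_base, highest_depth, second_base, second_depth


# Hypothetical pileup column: eight reads support A, two support G, one is N.
column = "AAAAAAAGGAN"
result = top_two_base_depths(column)
```

In the coordinate system used later in the text, the second and fourth elements of the result are the x and y values of the site.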
Table 1
Figure PCTCN2015070900-appb-000001
In a specific embodiment of the present invention, a way of visualizing the distribution regions is provided. A two-dimensional coordinate system is established in which the y-axis represents the depth of the second-highest base of an SNP site and the x-axis represents the depth of the highest base of the SNP site. The first relationship can be expressed as x/2 ≥ y ≥ x/3, the second relationship as 0 < y < x/3, the third relationship as 0 < y < (x+y)*e + m*δ, and the fourth relationship as (x+y)*e + m*δ < y < x/3, where e is the sequencing error rate, generally e ≤ 1%; δ is the standard deviation, δ = ((x+y)*e)^0.5; and m is a non-negative number that depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being the relationship between standard deviation and probability in the standard normal distribution. Correspondingly, the difference between the first and second relationships and the difference between the third and fourth relationships can also be visualized as the boundary lines between regions: the former can be expressed as y = x/3 and the latter as y = (x+y)*e + m*δ. The probability density curve of the standard normal distribution is bell-shaped. As those of ordinary skill in the art will understand, the relationship between standard deviation and probability in the standard normal distribution is as shown in Figure 1: the interval within plus or minus one standard deviation contains 68.26% of the total area, the interval within plus or minus 1.96 standard deviations contains 95%, and the interval within plus or minus 2.58 standard deviations contains 99%. Here m corresponds to the number of standard deviations and the first predetermined proportion corresponds to the percentage of the total area. Preferably, the first predetermined proportion is not less than 95%. In a specific embodiment of the present invention, the first predetermined proportion is 99.9%, corresponding to m = 3.
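The four relationships can be turned into a small classifier. The sketch below is illustrative only: the function name and return labels are hypothetical, the defaults e = 0.01 and m = 3 follow the values mentioned in the text, and the upper bound of the fourth relationship is taken as x/3, consistent with the stated boundary line y = x/3. Points that satisfy none of the stated relationships are returned as None rather than forced into a region.

```python
import math


def classify_site(x, y, e=0.01, m=3.0):
    """Assign a site to a distribution region from its base depths.

    x: depth of the highest base; y: depth of the second-highest base.
    Returns "first" (ABaa/ABab), "third" (AAaa), "fourth" (AAab) or None.
    """
    if x <= 0 or y < 0 or y > x:
        raise ValueError("expected x > 0 and 0 <= y <= x")
    if x / 3 <= y <= x / 2:
        return "first"   # first relationship: x/2 >= y >= x/3
    if not 0 < y < x / 3:
        return None      # outside every relationship stated in the text
    delta = math.sqrt((x + y) * e)        # standard deviation of error-derived depth
    boundary = (x + y) * e + m * delta    # dividing line between regions three and four
    if y < boundary:
        return "third"   # compatible with sequencing error alone
    if y > boundary:
        return "fourth"  # second-highest base attributed to the second source
    return None          # exactly on the dividing line


# Hypothetical sites at a total depth of about 100X.
labels = [classify_site(60, 25), classify_site(97, 3), classify_site(90, 10)]
```

At a total depth of 100 with e = 1% and m = 3, the dividing line sits at y = 1 + 3*1 = 4, so a site with three supporting reads is treated as error-derived and one with ten is treated as AAab.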
According to a specific embodiment of the present invention, the device further comprises a closed fourth region building unit configured to construct a closed fourth region from the fourth region, the closed fourth region being the distribution region of a second predetermined proportion of the AAab SNP sites among the second combined genotype SNP sites, wherein the closed fourth region is constructed from the fourth region on the basis that the depths of both the second-highest base and the highest base of the AAab SNP sites follow a normal distribution and on the set second predetermined proportion. Generally, the second predetermined proportion is set to be not less than 95%. In a two-dimensional coordinate system in which the y-axis represents the depth of the second-highest base of an SNP site and the x-axis represents the depth of the highest base of the SNP site, the lines y = x/3, y = (x+y)*e + m*δ, y = D0 - n*δ - x and y = D0 + n*δ - x together enclose the closed fourth region, where e is the sequencing error rate, generally e ≤ 1%; δ is the standard deviation, δ = ((x+y)*e)^0.5; D0 is the average depth of the SNP sites; and m and n are non-negative numbers. m depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being the relationship between standard deviation and probability in the standard normal distribution; n depends on the second predetermined proportion, the relationship between n*δ and the second predetermined proportion likewise being the relationship between standard deviation and probability in the standard normal distribution. Preferably, the second predetermined proportion is not less than 95%. In a specific embodiment of the present invention, the first predetermined proportion and the second predetermined proportion are both 99.9%, corresponding to m = n = 3. The average depth of the SNP sites is the mean of the depths of the SNP sites, and the depth of an SNP site is the amount of data supporting that site. For example, when the reads obtained by sequencing are aligned to a reference sequence, the number of reads aligned to the SNP site of the reference sequence is the depth of the site, also called the sequencing depth, that is, the number of times the site is covered. Preferably, D0 ≥ 100X.
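A membership test for the closed fourth region follows directly from the four bounding lines. The sketch below is an illustration under stated assumptions, not the patented implementation: the function name is hypothetical, δ is computed from the site's own total depth x + y exactly as the formula in the text is written, and the two lines y = D0 ± n*δ - x are read as a band on the total depth x + y around D0.

```python
import math


def in_closed_fourth_region(x, y, d0, e=0.01, m=3.0, n=3.0):
    """Return True if (x, y) lies strictly inside the closed fourth region.

    x: highest-base depth; y: second-highest-base depth;
    d0: average depth of the SNP sites.
    """
    total = x + y
    delta = math.sqrt(total * e)
    above_error_line = y > total * e + m * delta   # y = (x+y)*e + m*delta
    below_het_line = y < x / 3                     # y = x/3
    # y = D0 - n*delta - x and y = D0 + n*delta - x bound the total depth.
    within_depth_band = d0 - n * delta < total < d0 + n * delta
    return above_error_line and below_het_line and within_depth_band


# Hypothetical sites evaluated against an average depth of 100X.
checks = [
    in_closed_fourth_region(90, 10, 100),   # inside all four bounds
    in_closed_fourth_region(180, 20, 100),  # total depth far above D0
    in_closed_fourth_region(97, 3, 100),    # below the error line
]
```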
With the device of any of the above embodiments, the distribution regions of SNP sites of different combined genotypes can be constructed from the differences among the functional relationships that the second-highest-base depth and the highest-base depth satisfy for each combined genotype. Once the distribution regions have been delineated, the combined genotype of an SNP site can in turn be determined from the distribution region into which the site falls in the data to be tested, SNP sites of a specific combined genotype can be obtained, and the information of SNP sites of a specific combined genotype in a mixed nucleic acid sample can be used to determine the content of nucleic acid from each source in that sample, including the fetal nucleic acid content in a sample from a pregnant woman.
According to one embodiment of the present invention, a method is provided for constructing distribution regions of SNP sites of different combined genotypes, where a combined genotype is the combination of the genotype of an SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid. The method comprises: constructing a first region and a second region based on the difference between a first relationship and a second relationship, wherein the first region is the distribution region of the first combined genotype SNP sites, the second region is the distribution region of the second combined genotype SNP sites, the first relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first combined genotype SNP sites, the second relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the second combined genotype SNP sites, the first combined genotypes are ABaa and ABab, and the second combined genotypes are AAaa and AAab; and constructing a third region and a fourth region from the second region based on the difference between a third relationship and a fourth relationship, wherein the third region is the distribution region of a first predetermined proportion of the AAaa SNP sites among the second combined genotype SNP sites, the fourth region is the distribution region of the AAab SNP sites among the second combined genotype SNP sites, the third relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first predetermined proportion of AAaa SNP sites, and the fourth relationship is the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the AAab SNP sites; wherein AA and AB respectively denote an SNP site that is homozygous and heterozygous in the first-source nucleic acid, aa and ab respectively denote the same SNP site that is homozygous and heterozygous in the second-source nucleic acid, A or a denotes the highest base of the SNP site, and B or b denotes the second-highest base of the same SNP site.
Constructing the distribution regions with this method does not require specific SNP site data to be obtained in advance; that is, the construction does not depend on site-specific data, including the highest-base and second-highest-base depths of any particular SNP site. The method constructs the distribution regions of the four combined genotypes from the four possible combinations of the genotypes of one SNP site in the two sources of nucleic acid, on the assumption that the first-source nucleic acid content is greater than the second-source nucleic acid content, and from the differences among the relationships that the second-highest-base depth and the highest-base depth satisfy for each of the four combinations within one sequencing result. The foregoing description of the technical features and advantages of the device for constructing distribution regions of SNP sites of different combined genotypes in any embodiment of the present invention applies equally to the method of this embodiment.
For example, in one specific embodiment of the invention, the distribution regions are visualized in a two-dimensional coordinate system in which the y-axis is the depth of the second-highest base at a SNP site and the x-axis is the depth of the highest base. The first relationship can be expressed as x/2 ≥ y ≥ x/3, the second relationship as 0 < y < x/3, the third relationship as 0 < y < (x+y)*e+m*δ, and the fourth relationship as (x+y)*e+m*δ < y < x/3, where e is the sequencing error rate (generally e ≤ 1%), δ is the standard deviation, δ = ((x+y)*e)^0.5, and m is a non-negative number determined by the first predetermined ratio: m*δ and the first predetermined ratio are related in the same way as a number of standard deviations and the enclosed probability in the standard normal distribution. The differences between the relationships are likewise visualized as the boundary lines between regions: the difference between the first and second relationships is the line y = x/3, and the difference between the third and fourth relationships is the curve y = (x+y)*e+m*δ. The probability density curve of the standard normal distribution is bell-shaped. As those of ordinary skill in the art will understand, the relationship between standard deviations and probability in the standard normal distribution is as shown in FIG. 1, where σ denotes the standard deviation: the interval within ±1 standard deviation contains 68.26% of the total area under the curve, ±1.96 standard deviations contain 95%, and ±2.58 standard deviations contain 99%. Here m corresponds to the number of standard deviations, and the first predetermined ratio corresponds to the percentage of the total area. Preferably, the first predetermined ratio is not less than 95%. In one specific embodiment of the invention, the first predetermined ratio is 99.9%, corresponding to m = 3.
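The region boundaries above can be sketched as a small classifier. This is a minimal illustration, not the patent's implementation: the function name `classify_site` and the returned labels are hypothetical, the defaults e = 0.01 and m = 3 follow the values quoted in the text, and the sketch uses only the two boundary lines y = x/3 and y = (x+y)*e+m*δ (it does not enforce the upper bound x/2 of the first relationship).

```python
import math


def classify_site(x, y, e=0.01, m=3.0):
    """Assign a SNP site to a distribution region.

    x: depth of the highest base, y: depth of the second-highest base.
    Labels are illustrative: "first" (ABaa/ABab), "third" (AAaa), "fourth" (AAab).
    """
    if x <= 0 or y <= 0 or y > x:
        return "unclassified"
    total = x + y
    delta = math.sqrt(total * e)           # delta = ((x+y)*e)^0.5
    noise_bound = total * e + m * delta    # boundary y = (x+y)*e + m*delta
    if y >= x / 3:                         # on or above the line y = x/3
        return "first"
    if y > noise_bound:                    # between the noise curve and y = x/3
        return "fourth"
    return "third"                         # at or below the noise curve


if __name__ == "__main__":
    for site in [(60, 40), (95, 5), (99, 1)]:
        print(site, classify_site(*site))
```

Sites exactly on the noise curve are assigned to the third region here; the text leaves that edge case open, so this is a choice made for the sketch.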
According to one specific embodiment of the invention, the method further comprises constructing a closed fourth region from the fourth region. The closed fourth region is the distribution region of a second predetermined ratio of the AAab SNP sites among the second combined-genotype SNP sites. It is constructed on the basis that the depths of both the second-highest base and the highest base of an AAab SNP site follow a normal distribution, together with a set value of the second predetermined ratio, which is generally not less than 95%. In a two-dimensional coordinate system whose y-axis is the depth of the second-highest base of a SNP site and whose x-axis is the depth of the highest base, the curves y = x/3, y = (x+y)*e+m*δ, y = D0-n*δ-x and y = D0+n*δ-x enclose the closed fourth region, where e is the sequencing error rate (generally e ≤ 1%), δ is the standard deviation, δ = ((x+y)*e)^0.5, D0 is the average depth of the SNP sites, and m and n are non-negative numbers. m is determined by the first predetermined ratio and n by the second predetermined ratio; in each case the multiple of δ and the corresponding ratio are related in the same way as a number of standard deviations and the enclosed probability in the standard normal distribution. Preferably, the second predetermined ratio is not less than 95%. In one specific embodiment of the invention, the first and second predetermined ratios are both 99.9%, corresponding to m = n = 3. The average depth of the SNP sites is the mean of the depths of the SNP sites. The depth of a SNP site is the amount of data supporting that site: for example, when reads obtained by sequencing are aligned to a reference sequence, the number of reads aligned to the site is its depth, also called the sequencing depth, i.e. the number of times the site is covered. Preferably, D0 ≥ 100X.
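The four curves that enclose the closed fourth region translate directly into a membership test. The sketch below is an illustration under stated assumptions: the function name `in_closed_fourth_region` is hypothetical, δ is taken exactly as defined in the text, and the two lines y = D0 ± n*δ - x are rewritten as bounds on the total depth x + y.

```python
import math


def in_closed_fourth_region(x, y, d0, e=0.01, m=3.0, n=3.0):
    """Return True if the site (x, y) lies inside the closed fourth region.

    x: highest-base depth, y: second-highest-base depth,
    d0: average depth of the SNP sites (D0 in the text).
    """
    total = x + y
    if total <= 0:
        return False
    delta = math.sqrt(total * e)        # delta = ((x+y)*e)^0.5
    lower = total * e + m * delta       # curve y = (x+y)*e + m*delta
    upper = x / 3                       # line  y = x/3
    left = d0 - n * delta               # line  y = D0 - n*delta - x
    right = d0 + n * delta              # line  y = D0 + n*delta - x
    return lower < y < upper and left <= total <= right


if __name__ == "__main__":
    print(in_closed_fourth_region(95, 5, d0=100))    # inside all four bounds
    print(in_closed_fourth_region(150, 10, d0=100))  # total depth too far from D0
```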
According to another embodiment of the invention, a method of distinguishing SNP sites of different combined genotypes is provided, a combined genotype being the combination of the genotype of a SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid. The method comprises: sequencing at least a portion of the nucleic acids in a mixed nucleic acid sample to obtain sequencing data, the sequencing data consisting of a plurality of reads and the mixed nucleic acid sample containing the first-source nucleic acid and the second-source nucleic acid; aligning the sequencing data to a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining the distribution region in which each SNP site lies, the distribution regions being constructed by the method of constructing distribution regions of SNP sites of different combined genotypes in any of the above embodiments; and determining the combined genotype of the SNP site from the distribution region in which it lies. The first-source and second-source nucleic acids may come from different individuals, or from different tissues or parts of the same individual, for example nucleic acids from tumor cells and from non-tumor cells. Obtaining the sequencing data includes preparing a sequencing library from the mixed nucleic acid sample and sequencing the library on an instrument. Sequencing may be performed on an existing sequencing platform, with library preparation matched to the chosen platform; available platforms include, but are not limited to, CG (Complete Genomics) CGA, Illumina/Solexa, Life Technologies/Ion Torrent and Roche 454, and a single-end or paired-end sequencing library is prepared according to the platform. Alignment may be performed with software such as SOAP (Short Oligonucleotide Analysis Package) or BWA; this embodiment places no restriction on the choice. During alignment, the alignment parameters may be set so that, for example, each read in the sequencing data is allowed at most h base mismatches, h preferably being 1 or 2; a read with more than h mismatched bases is treated as unalignable to the reference sequence. SNP sites may be identified with software such as SOAPsnp or GATK using the software's default parameters. The reference sequence is a known sequence and may be any reference template obtained in advance for the biological category to which the target individual belongs, for example a published genome assembly of the same biological category; if the mixed nucleic acid sample is of human origin, the genome reference sequence (also called the reference genome) may be HG19 as provided by the NCBI database. The alignment result describes how each read aligns to the reference sequence, including whether the read aligns, the position at which it aligns, how many reads align at a given site, and the base type at the corresponding position of each read aligned to that site. SNP sites are identified based on the alignment result, and the relationship satisfied by the depths of the highest and second-highest bases of a SNP site determines which distribution region the site falls into. Because the distribution regions correspond to the four combined genotypes (the first region is the distribution region of SNP sites with combined genotypes ABaa and ABab, the third region is the distribution region of the first predetermined ratio of AAaa SNP sites, and the fourth region or closed fourth region is the distribution region of AAab SNP sites), determining the region into which a SNP site falls determines the combined genotype of that SNP, i.e. genotypes it. The advantages and technical features described above for the distribution regions in any specific embodiment of the invention apply equally to the method of this embodiment and are not repeated here.
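The quantities that drive region assignment are the depths of the highest and second-highest bases at a site. As a small illustration of how they could be derived from per-base read counts at one aligned position, here is a sketch; the function name `top_two_depths` and the input format (a mapping from base to supporting-read count) are assumptions made for this example, not the output format of any particular aligner or SNP caller.

```python
def top_two_depths(base_counts):
    """Return ((base, depth), (base, depth)) for the highest and second-highest bases.

    base_counts: mapping such as {"A": 95, "C": 5, "G": 0, "T": 0}, where each
    value is the number of aligned reads supporting that base at the site.
    """
    ranked = sorted(base_counts.items(), key=lambda item: item[1], reverse=True)
    # Pad with an empty entry if fewer than two bases were observed.
    ranked += [(None, 0)] * (2 - len(ranked))
    return ranked[0], ranked[1]


if __name__ == "__main__":
    (top_base, x), (second_base, y) = top_two_depths({"A": 95, "C": 5, "G": 0, "T": 0})
    print(top_base, x, second_base, y)
```

The pair (x, y) obtained this way is what the region relationships in the text are evaluated on.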
Further, according to one specific embodiment of the invention, the proportion of the second-source nucleic acid in the mixed nucleic acid is estimated from the SNP sites falling into the fourth region. Because the combined genotype of SNP sites falling into the fourth region or the closed fourth region is AAab, the second-highest base comes only from the second-source nucleic acid, and the concentration of the second-source nucleic acid can be estimated from the fraction of the reads covering the site that support the second-highest base: second-source nucleic acid concentration = 2 * second-highest base depth / (highest base depth + second-highest base depth), where both depths are taken from the same AAab SNP site. In one specific embodiment of the invention, the concentration is estimated at every SNP site falling into the closed fourth region, giving a set of values, and the median of these values is taken as the second-source nucleic acid concentration.
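The per-site estimate and the median step can be written in a few lines. This is a minimal sketch; the function name `second_source_fraction` is hypothetical, and the input is assumed to be the (highest depth, second-highest depth) pairs of sites already assigned to the AAab region.

```python
from statistics import median


def second_source_fraction(sites):
    """Median of 2*y/(x+y) over AAab sites.

    sites: iterable of (x, y) pairs, x = highest-base depth,
    y = second-highest-base depth at the same site.
    """
    estimates = [2 * y / (x + y) for x, y in sites if x + y > 0]
    if not estimates:
        raise ValueError("no informative AAab sites")
    return median(estimates)


if __name__ == "__main__":
    print(second_source_fraction([(95, 5), (90, 10), (97, 3)]))
```

Taking the median rather than the mean keeps a few misassigned sites from dominating the estimate, which matches the choice described in the text.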
The fetal concentration is an important parameter for prenatal genetic testing, for example in studies of copy number variation, eclampsia, preterm birth and genetic disease. According to yet another embodiment of the invention, a method of determining the fetal nucleic acid content in a sample from a pregnant woman is provided. The method comprises: obtaining a sequencing result by sequencing at least a portion of the nucleic acids in the sample, the sequencing result consisting of a plurality of reads and the sample containing maternal nucleic acid and fetal nucleic acid; aligning the sequencing result to a reference sequence to obtain an alignment result; identifying SNP sites based on the alignment result; determining, based on the alignment result, the distribution region in which each SNP site lies, the distribution regions being constructed by the method of constructing distribution regions of SNP sites of different combined genotypes in any of the foregoing embodiments; and determining the fetal nucleic acid content of the sample from the SNP sites lying in the fourth region or the closed fourth region. In this embodiment, the sample to be tested is a body-fluid sample from the pregnant woman, for example peripheral blood or urine. The cell-free DNA in maternal body fluid carries the genomic information of both mother and fetus, and its SNP sites fall into four classes: mother homozygous and fetus homozygous (AAaa), mother homozygous and fetus heterozygous (AAab), mother heterozygous and fetus homozygous (ABaa), and mother heterozygous and fetus heterozygous (ABab). This embodiment uses the constructed distribution regions to separate these four classes and then selects the SNP sites at which the mother is homozygous and the fetus heterozygous (AAab) as informative sites for calculating the fetal concentration. The advantages and technical features described for the constructed distribution regions in any of the foregoing embodiments apply equally to this embodiment and are not repeated here.
In one specific embodiment of the invention, the fetal nucleic acid content is the median of 2*y4/(x4+y4), where y4 is the depth of the second-highest base at each SNP site in the fourth region or closed fourth region, x4 is the depth of the highest base at the same SNP site, and the depth of a base is the number of reads supporting it.
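The factor 2 in this formula can be read off from the depth model that the text assumes for AAab sites, in which the expected depths at average depth D and fetal fraction f are D(1-f/2) for the highest base and D·f/2 for the second-highest base. The short derivation below is offered as a reading aid under that assumption, working with expected values only:

```latex
\begin{aligned}
\mathbb{E}[x_4] &= D\left(1-\tfrac{f}{2}\right), \qquad
\mathbb{E}[y_4] = D\,\tfrac{f}{2},\\[2pt]
\mathbb{E}[x_4] + \mathbb{E}[y_4] &= D,\\[2pt]
\frac{2\,\mathbb{E}[y_4]}{\mathbb{E}[x_4]+\mathbb{E}[y_4]}
  &= \frac{2\cdot D f/2}{D} = f .
\end{aligned}
```

Only half of the fetal reads at a heterozygous fetal site carry the b allele, which is why the observed minor-allele fraction has to be doubled.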
When the fetal nucleic acid concentration is determined by the method of this embodiment and the amount of data obtained is small or the fetal nucleic acid concentration in the sample is very low, the delineated distribution regions become relatively strict, which tends to bias the calculated fetal nucleic acid concentration. Preferably, the sequencing result contains not less than 65X of data, i.e. the sequencing depth is not less than 65X. In one specific embodiment of the invention, when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, a deviation correction model is used to correct the calculated fetal nucleic acid content and obtain a corrected fetal nucleic acid content, so that the fetal nucleic acid concentration is determined more accurately. The deviation correction model corrects the deviation in the calculated fetal nucleic acid concentration that arises when the amount of data is small or the fetal nucleic acid concentration in the sample is very low and the delineated distribution regions are therefore relatively strict. The deviation correction model may be built when the data amount of the sample under test is less than 65X or the estimated fetal nucleic acid concentration is less than 10%, or it may be built in advance and stored for later use.
In one specific embodiment of the invention, when the sequencing result contains less than 65X of data and/or the determined fetal nucleic acid content is less than 10%, the first predetermined ratio is adjusted before the deviation correction model is applied, in order to determine the fetal nucleic acid concentration more accurately. The adjustment enlarges the fourth region or closed fourth region so that more SNP sites fall into it, closer to the theoretical number, which helps improve the accuracy of the calculated fetal nucleic acid concentration. Lowering the first predetermined ratio shrinks the third region and thereby enlarges the fourth region or closed fourth region.
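The effect of lowering the first predetermined ratio can be made concrete by converting the ratio into a multiplier m and evaluating the AAaa/AAab boundary. The sketch below is illustrative: the function names are hypothetical, the two-sided normal quantile is one natural reading of the relationship between ratio and standard deviations, and e = 0.01 follows the text. Note that the exact two-sided quantile for 99.9% is about 3.29, whereas the text pairs 99.9% with m = 3.

```python
import math
from statistics import NormalDist


def sigma_multiplier(ratio):
    """Number of standard deviations m such that +/- m sigma encloses `ratio`
    of a standard normal distribution (requires 0 < ratio < 1)."""
    return NormalDist().inv_cdf((1 + ratio) / 2)


def aaaa_upper_bound(total_depth, ratio, e=0.01):
    """Boundary y = (x+y)*e + m*delta between the third and fourth regions."""
    m = sigma_multiplier(ratio)
    return total_depth * e + m * math.sqrt(total_depth * e)


if __name__ == "__main__":
    for ratio in (0.999, 0.99, 0.95):
        print(ratio, round(sigma_multiplier(ratio), 2), round(aaaa_upper_bound(100, ratio), 2))
```

A lower ratio gives a smaller m, a lower boundary curve, and therefore a larger fourth region, at the cost of admitting more AAaa sites produced by sequencing error.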
In one specific embodiment of the invention, building the deviation correction model comprises: obtaining K simulated sites, K = K1+K2+K3+K4, where K1 is the number of simulated sites with combined genotype AAaa, K2 the number with combined genotype AAab, K3 the number with combined genotype ABaa, and K4 the number with combined genotype ABab, with K2/K ≥ 0.5% and K2 ≥ 35; setting different standard fetal nucleic acid contents f and, under the assumptions below, using the simulated AAab sites, i.e. the simulated sites lying in the fourth region or closed fourth region, to calculate the corresponding fetal nucleic acid content f0; and performing polynomial regression on the resulting pairs (f, f0) to build the deviation correction model. The assumptions are that the depths of the highest base and the second-highest base follow, respectively, N(D-e*D, D-e*D) and N(e*D, e*D) for simulated AAaa sites; N(D*(1-f/2), D*(1-f/2)) and N(D*f/2, D*f/2) for simulated AAab sites; N(D*(1/2+f/2), D*(1/2+f/2)) and N(D*(1/2-f/2), D*(1/2-f/2)) for simulated ABaa sites; and N(D/2, D/2) and N
Figure PCTCN2015070900-appb-000002
for simulated ABab sites, where e is the sequencing error rate, e = K1/K ≤ 1%, D is the average sequencing depth of the simulated sites, and f is the standard fetal nucleic acid content, 0.5% ≤ f ≤ 25%. N(μ, σ^2) denotes a normal distribution with mean (expectation) μ and variance σ^2; it is also often written N(μ, σ). The deviation correction model is a normal mixture model. Under these assumptions, at a fixed average sequencing depth, a series of values of f yields a corresponding series of values of f0, each calculated as described above, and fitting an equation to the pairs (f, f0) gives a formula for correcting f0 at that average sequencing depth.
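The simulation step can be sketched as follows. This is a simplified illustration, not the patent's procedure: it simulates only AAab sites (the other three genotype classes are omitted), filters them with the open fourth region using e = 0.01 and m = 3, and returns the median estimate f0 for a given true f. The function name `simulate_f0`, the site count, and the fixed random seed are assumptions made for the example.

```python
import math
import random
from statistics import median


def simulate_f0(f, depth, n_sites=2000, e=0.01, m=3.0, seed=0):
    """Simulate AAab sites at true fetal fraction f and return the median
    estimate f0 from the sites that land in the fourth region."""
    rng = random.Random(seed)
    mean_x = depth * (1 - f / 2)   # highest base ~ N(D(1-f/2), D(1-f/2))
    mean_y = depth * f / 2         # second-highest ~ N(D f/2, D f/2)
    estimates = []
    for _ in range(n_sites):
        x = rng.gauss(mean_x, math.sqrt(mean_x))
        y = rng.gauss(mean_y, math.sqrt(mean_y))
        total = x + y
        if y <= 0 or total <= 0:
            continue
        bound = total * e + m * math.sqrt(total * e)
        if bound < y < x / 3:      # site falls in the fourth region
            estimates.append(2 * y / total)
    return median(estimates) if estimates else None


# Pairs (f, f0) at a fixed depth; these would be the input to the regression.
pairs = [(f, simulate_f0(f, depth=100)) for f in (0.05, 0.10, 0.15, 0.20, 0.25)]

if __name__ == "__main__":
    for f, f0 in pairs:
        print(f, None if f0 is None else round(f0, 4))
```

At low f the region boundary removes the sites with the smallest second-highest depth, so f0 overestimates f; this is the bias the correction model is meant to undo. The resulting pairs could then be passed to any cubic least-squares fit (for instance `numpy.polyfit`) with f0 as the predictor and f as the response.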
In one specific embodiment of the invention, equations are provided for correcting the calculated fetal nucleic acid concentration at different sequencing depths. The deviation correction model comprises:
D = 50X: f = 10.981*f0^3 - 8.401*f0^2 + 3.1292*f0 - 0.1883
D = 60X: f = 14.449*f0^3 - 9.757*f0^2 + 3.2212*f0 - 0.1759
D = 70X: f = 18.57*f0^3 - 11.595*f0^2 + 3.4261*f0 - 0.1745
D = 80X: f = 18.693*f0^3 - 11.293*f0^2 + 3.279*f0 - 0.1566
D = 90X: f = 20.076*f0^3 - 11.749*f0^2 + 3.2816*f0 - 0.1494
D = 100X: f = 19.126*f0^3 - 11.025*f0^2 + 3.098*f0 - 0.1337
D = 110X: f = 19.81*f0^3 - 11.159*f0^2 + 3.0725*f0 - 0.1279
D = 120X: f = 20.61*f0^3 - 11.38*f0^2 + 3.0554*f0 - 0.1226
D = 130X: f = 19.808*f0^3 - 10.82*f0^2 + 2.9285*f0 - 0.1128
D = 140X: f = 20.752*f0^3 - 10.892*f0^2 + 2.8731*f0 - 0.1061
D = 150X: f = 16.71*f0^3 - 9.1447*f0^2 + 2.623*f0 - 0.0937
D = 160X: f = 16.878*f0^3 - 9.1543*f0^2 + 2.6011*f0 - 0.0904
D = 170X: f = 15.433*f0^3 - 8.3874*f0^2 + 2.4715*f0 - 0.0831
D = 180X: f = 17*f0^3 - 8.9749*f0^2 + 2.5224*f0 - 0.0828
D = 190X: f = 14.627*f0^3 - 7.8187*f0^2 + 2.3464*f0 - 0.0743
D = 200X: f = 13.3*f0^3 - 7.2048*f0^2 + 2.2491*f0 - 0.0688
The regression may be performed with existing methods. In this specific embodiment, because each pair (f, f0) shows a relatively constant difference, the fitted curve corrects the calculated value f0 to match the corresponding theoretical value f, which yields the equations above; all of these fitted polynomial equations are significant. Because the sequencing depth assumed for every equation (the average sequencing depth of the simulated sites) is at least 50X, in practice the same equation can be used for correction whenever the actual sequencing depth is within ±5X of the assumed depth. For example, for a sample under test with 55X of sequencing data, either f = 10.981*f0^3 - 8.401*f0^2 + 3.1292*f0 - 0.1883 (D = 50X) or f = 14.449*f0^3 - 9.757*f0^2 + 3.2212*f0 - 0.1759 (D = 60X) can be used to correct the deviation in f0. Alternatively, for the same sample with 55X of sequencing data, those of ordinary skill in the art will understand from the foregoing that one may set D = 55X, obtain multiple pairs (f, f0), and fit a correction equation to obtain the best-suited deviation correction equation.
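Applying the correction amounts to picking the equation whose assumed depth is closest to the sample's depth and evaluating the cubic. The sketch below transcribes the coefficients listed in the text; the names `CORRECTION_COEFFS`, `nearest_model_depth` and `correct_fetal_fraction` are hypothetical, f0 is taken as a fraction (0.10 for 10%), and the tie-breaking rule (a depth of 55X maps to the 50X equation) and the accepted depth range of 45X to 205X are choices made for this example.

```python
# Cubic coefficients (a3, a2, a1, a0) for f = a3*f0^3 + a2*f0^2 + a1*f0 + a0,
# keyed by the assumed average sequencing depth D in X.
CORRECTION_COEFFS = {
    50: (10.981, -8.401, 3.1292, -0.1883),
    60: (14.449, -9.757, 3.2212, -0.1759),
    70: (18.57, -11.595, 3.4261, -0.1745),
    80: (18.693, -11.293, 3.279, -0.1566),
    90: (20.076, -11.749, 3.2816, -0.1494),
    100: (19.126, -11.025, 3.098, -0.1337),
    110: (19.81, -11.159, 3.0725, -0.1279),
    120: (20.61, -11.38, 3.0554, -0.1226),
    130: (19.808, -10.82, 2.9285, -0.1128),
    140: (20.752, -10.892, 2.8731, -0.1061),
    150: (16.71, -9.1447, 2.623, -0.0937),
    160: (16.878, -9.1543, 2.6011, -0.0904),
    170: (15.433, -8.3874, 2.4715, -0.0831),
    180: (17.0, -8.9749, 2.5224, -0.0828),
    190: (14.627, -7.8187, 2.3464, -0.0743),
    200: (13.3, -7.2048, 2.2491, -0.0688),
}


def nearest_model_depth(depth):
    """Pick the tabulated depth within 5X of `depth`; ties go to the lower depth."""
    if not 45 <= depth <= 205:
        raise ValueError("no correction equation within 5X of this depth")
    return min(CORRECTION_COEFFS, key=lambda d: (abs(d - depth), d))


def correct_fetal_fraction(f0, depth):
    """Evaluate the correction polynomial (Horner form) for estimate f0."""
    a3, a2, a1, a0 = CORRECTION_COEFFS[nearest_model_depth(depth)]
    return ((a3 * f0 + a2) * f0 + a1) * f0 + a0


if __name__ == "__main__":
    print(nearest_model_depth(103), correct_fetal_fraction(0.10, 103))
```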
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
According to yet another embodiment of the invention, an apparatus for determining the fetal concentration in a sample from a pregnant woman is also provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, execution of the program including carrying out all or part of the steps of the methods in the above embodiments.
The method and apparatus of the invention for detecting the fetal nucleic acid concentration in a sample from a pregnant woman classify SNP sites using the constructed distribution regions of SNP sites of different combined genotypes, and/or a Gaussian (normal) mixture model. They can detect the fetal DNA concentration directly from large-fragment parallel sequencing data of the sample, eliminating the experimental analysis that existing methods need in order to determine SNP sites of a particular type, and saving experimental and analysis costs. The distribution regions and/or the deviation correction model of the invention are constructed on the assumption that the depth of each class of SNP follows a binomial distribution and, under particular conditions, a normal distribution. With the method of the invention for determining the fetal nucleic acid content in a sample from a pregnant woman, fetal DNA concentrations as low as 2.5% can be detected accurately with only 65X of data.
The construction of distribution regions of SNP sites of different combined genotypes by the apparatus and method of the invention is described in detail below with specific parameters, and the determination of the combined genotype of SNP sites, the distinction between different combined genotypes, and the determination of the fetal nucleic acid content in a sample from a pregnant woman by the method of the invention are described in detail with data from specific nucleic acid samples. In the description of the invention, "first", "second", "third", "fourth" and the like are used for convenience of reference or description and are not to be understood as indicating an order or relative importance. Unless otherwise stated, "a plurality of" and "multiple groups" mean two (groups) or more.
Unless otherwise stated, the reagents, software and instruments used in the following examples that are not specifically described are conventional commercial products or are publicly available.
Embodiment 1
Consider a mixed nucleic acid sample containing nucleic acids from two different sources, for example a plasma sample from a pregnant woman. The cell-free DNA in the plasma of a pregnant woman carries the genomic information of both mother and fetus, and its SNP sites fall into four classes: mother homozygous and fetus homozygous (AAaa); mother homozygous and fetus heterozygous (AAab); mother heterozygous and fetus homozygous (ABaa); and mother heterozygous and fetus heterozygous (ABab). Below, these four classes are separated by the differences in the relationships satisfied by the highest base depth and the second-highest base depth of each class of site, and the distribution region of each class of SNP site is constructed.
The key to dividing the distribution regions is to determine the differences between the types of SNP sites. In a two-dimensional coordinate system where y is the depth of the second-highest base and x is the depth of the highest base, each difference can be expressed as a boundary line, so at least two curves (boundary lines) are needed to delimit a region that contains only one type of SNP site.
First, the regions of the ABaa and ABab sites are separated out. No fetal concentration above 50% has been reported, so the expected depth of the second-highest base of ABaa lies between x/2 and x/3, and the expected maximum depth of the second-highest base of ABab is x/2; the fitted line y=x/3 can therefore be used to separate out the distribution region of the ABaa and ABab sites.
Next, the AAaa and AAab sites are separated. In theory, a site at which both mother and fetus are homozygous is not a SNP, but sequencing and assembly errors give rise to AAaa-type SNP sites. According to previous studies, SNP sites produced by sequencing error follow a binomial distribution. By the central limit theorem, a binomial distribution can be treated as normal when the number of observations is large enough, so these sites are treated here as normally distributed. In addition, by a property of the binomial distribution, when the number of sites is greater than 20 and the differences in depth among them are significant (p below 5%), the binomial distribution can be approximated by a Poisson distribution. The sequencing quality value is set to Q20 (sequencing error rate no higher than 1%), so in theory the P of these SNP sites is at most 1%, which satisfies the requirement for the Poisson approximation; the variance of the normal distribution is therefore assumed to equal its expectation. Taking the sequencing error rate as 1%, the expectation (mean) is the sequencing depth of each site multiplied by 1%, that is, both the expectation and the variance equal (xi+yi)*0.01, where xi and yi are the highest and second-highest base depths of SNP site i. In a normal distribution, about 99.9% of points lie below the mean plus three standard deviations, as shown in Figure 1. AAab can therefore be distinguished from AAaa by y=(xi+yi)*e+3*δ, that is, y=(xi+yi)*0.01+3*((xi+yi)*0.01)^0.5.
Finally, left and right boundaries are drawn for the AAab region to exclude the influence of other uncertain factors. According to previous studies, sequencing depth tends toward a Poisson distribution, and a Poisson distribution can be approximated by a normal distribution when λ>5, with variance and mean both equal to λ. At a sequencing depth of 200X and a fetal DNA concentration of 5%, the second-highest base depth of AAab is 5 and the highest base depth is 195, so both depths of AAab can be assumed to follow normal distributions. The left and right boundaries of AAab are then drawn using the three-standard-deviation principle above. The overall result of the division is shown in Figure 2. In the figure, uppercase letters (A/B) denote the bases at a site in the maternal nucleic acid and lowercase letters (a/b) the bases at the same site in the fetal nucleic acid; A/a denotes the highest base and B/b the second-highest base at the same site; the x-axis is the highest base depth and the y-axis the second-highest base depth.
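The two boundary lines described above (y=x/3 and y=(x+y)*e+m*δ) can be sketched as a small classifier. This is only an illustration under stated assumptions: the function names are hypothetical, the left and right boundaries of the AAab region are omitted, and sites with y at or above x/3 are simply labelled as the combined ABaa/ABab region.

```python
import math


def error_threshold(x, y, e=0.01, m=3.0):
    """Upper bound on the second-highest base depth expected from sequencing error.

    The mean is (x + y) * e and the variance is assumed equal to the mean,
    so the standard deviation is sqrt((x + y) * e).
    """
    mean = (x + y) * e
    return mean + m * math.sqrt(mean)


def classify_site(x, y, e=0.01, m=3.0):
    """Label a site from its highest (x) and second-highest (y) base depths.

    Hypothetical helper: left/right boundaries of the AAab region are not applied.
    """
    if y >= x / 3:
        return "ABaa/ABab"  # on or above the line y = x/3
    if y > error_threshold(x, y, e, m):
        return "AAab"       # above the error boundary, below y = x/3
    return "AAaa"           # explained by sequencing error alone
```

Passing m=2.0 gives the relaxed two-standard-deviation boundary. Note that with e=0.01 and m=3 a site at depths (195, 5) falls below the error boundary, which illustrates how strict the boundary is at low fetal concentration.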
Embodiment 2
To determine the fetal nucleic acid concentration in a peripheral blood sample from a pregnant woman, the overall procedure is:
1) Extract DNA from the maternal plasma.
2) Construct a library from at least part of the extracted DNA and sequence it, for example by capturing target regions of the genome with chip enrichment before library construction, at a sequencing depth of 100X, to obtain sequencing data.
3) Align the sequencing data to hg19 with the SOAP2 software.
4) Identify SNP sites, for example by using SAMtools to generate a file containing the SNP information.
5) Take the SNP sites that fall in the distribution region of AAab-type SNP sites delimited in Embodiment 1 (hereinafter the AAab region), that is, the sites at which the mother is homozygous and the fetus heterozygous (AAab), as the effective sites, and compute the predicted fetal nucleic acid concentration from them. Specifically, for each SNP site in the AAab region, compute the corresponding fetal nucleic acid concentration by the formula
2*y4/(x4+y4)
(Figure PCTCN2015070900-appb-000003), and take the median of these values as the fetal nucleic acid concentration of the sample, where y4 and x4 are, respectively, the second-highest and highest base depths of a SNP site in the AAab region.
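Step 5 can be sketched as follows, assuming the per-site formula 2*y4/(x4+y4) stated in claim 25. The function name and the input layout (a list of depth pairs already restricted to the AAab region) are illustrative assumptions.

```python
from statistics import median


def fetal_fraction(aaab_sites):
    """Median of the per-site estimates 2*y/(x+y).

    aaab_sites: iterable of (x, y) pairs, where x is the highest base depth
    and y the second-highest base depth of a site in the AAab region.
    """
    estimates = [2 * y / (x + y) for x, y in aaab_sites if x + y > 0]
    if not estimates:
        raise ValueError("no usable sites in the AAab region")
    return median(estimates)
```

The median is used rather than the mean because it is less sensitive to a few sites with unusual depths.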
Embodiment 3
When the sequencing depth of the maternal sample is low (<65X) or the fetal concentration is low (<10%), the boundary drawn between AAab and AAaa is too strict and readily introduces bias. For example, too few sites may fall in the AAab region for an estimate to be made; or, because the sample's fetal nucleic acid concentration is the median of a set of per-site values, removing too many AAab sites close to the x-axis removes many of the sites that support a low fetal concentration and biases the result upward, as shown in Figure 2. Therefore, when the sequencing depth is below 65X or the fetal concentration is below 10%, the boundary between AAab and AAaa is adjusted and a normal model is additionally built to correct the fetal nucleic acid concentration. In the first step, the boundary between AAab and AAaa is relaxed from three standard deviations to two; statistically, 97.7% of the points in a normal distribution lie below the mean plus two standard deviations.
As described in the construction procedure of Embodiment 1, each type of SNP site can be treated as normally distributed, so a mixed normal model can be built on the following assumptions to correct the bias:
1) Generate 10,000 simulated sites, with AAaa, AAab, ABaa and ABab sites in the reported assumed proportions of 7:1:1:1 in maternal plasma. AAaa includes both homozygous sites and heterozygous sites produced by sequencing error; the sequencing error rate e is set to 0.26%, the rate previously reported for the HiSeq 2000.
2) Set the standard fetal DNA concentration from 0.5% to 25%, taking values at intervals of 0.5%.
3) As described in the construction procedure of Embodiment 1, the SNP sites in plasma can be treated as normally distributed with variance equal to the mean. Writing N(mean, standard deviation), with D the sequencing depth and fe the standard fetal DNA concentration:
a) For AAaa sites:
xi~N(D-0.0026*D,(D-0.0026*D)^0.5)
yi~N(0.0026*D,(0.0026*D)^0.5)
b) For AAab sites:
xi~N(D*(1-fe/2),(D*(1-fe/2))^0.5)
yi~N(D*fe/2,(D*fe/2)^0.5)
c) For ABaa sites:
xi~N(D*(1/2-fe/2),(D*(1/2-fe/2))^0.5)
yi~N(D*(1/2+fe/2),(D*(1/2+fe/2))^0.5)
d) For ABab sites:
xi~N(D/2,(D/2)^0.5)
yi~N(D/2,(D/2)^0.5)
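The text states that the simulated data were generated in R; the sketch below uses Python purely for illustration and is not the original script. It draws sites from the normal distributions in items a) to d) at the 7:1:1:1 proportions, copying the means exactly as printed (including item c). The function name, the fixed seed and the clamping of negative depths to zero are assumptions added here.

```python
import math
import random


def simulate_sites(depth, fe, n_sites=10_000, e=0.0026, seed=0):
    """Return a list of (genotype, x, y) simulated sites.

    Each depth is drawn from a normal distribution whose variance equals
    its mean; negative draws are clamped to 0 (an added assumption).
    """
    rng = random.Random(seed)
    # (mean of x, mean of y) per combined genotype, as printed in a) to d)
    means = {
        "AAaa": (depth * (1 - e), depth * e),
        "AAab": (depth * (1 - fe / 2), depth * fe / 2),
        "ABaa": (depth * (0.5 - fe / 2), depth * (0.5 + fe / 2)),
        "ABab": (depth / 2, depth / 2),
    }
    shares = {"AAaa": 7, "AAab": 1, "ABaa": 1, "ABab": 1}
    total_share = sum(shares.values())
    sites = []
    for genotype, share in shares.items():
        mean_x, mean_y = means[genotype]
        for _ in range(n_sites * share // total_share):
            x = rng.gauss(mean_x, math.sqrt(mean_x))
            y = rng.gauss(mean_y, math.sqrt(mean_y))
            sites.append((genotype, max(x, 0.0), max(y, 0.0)))
    return sites
```

Running this for each standard concentration from 0.5% to 25% and feeding the sites that land in the adjusted AAab region into the median estimator of Embodiment 2 would give the (standard, predicted) pairs used for fitting.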
Under the assumptions above, simulated site data are generated in R and the extent of the AAab region of Embodiment 1 is adjusted. The predicted fetal nucleic acid concentration is then computed by the method of Embodiment 2 from the simulated sites that fall in the adjusted AAab region. For a given sequencing depth this yields two sets of values, the standard fetal concentrations from 0.5% to 25% and the corresponding predicted fetal concentrations, from which a fitted equation is obtained. Table 2 lists, for convenience, the fitted equations generated for different depths. When a fetal DNA concentration has been calculated from maternal plasma, first check whether the sequencing depth is greater than 65X and whether the fetal DNA concentration is greater than 10%; if either or both conditions fail, the predicted value can be corrected with the fitted equation to obtain a more accurate result.
Table 2
Sequencing depth    Formula
50X     f=10.981*f0^3-8.401*f0^2+3.1292*f0-0.1883
60X     f=14.449*f0^3-9.757*f0^2+3.2212*f0-0.1759
70X     f=18.57*f0^3-11.595*f0^2+3.4261*f0-0.1745
80X     f=18.693*f0^3-11.293*f0^2+3.279*f0-0.1566
90X     f=20.076*f0^3-11.749*f0^2+3.2816*f0-0.1494
100X    f=19.126*f0^3-11.025*f0^2+3.098*f0-0.1337
110X    f=19.81*f0^3-11.159*f0^2+3.0725*f0-0.1279
120X    f=20.61*f0^3-11.38*f0^2+3.0554*f0-0.1226
130X    f=19.808*f0^3-10.82*f0^2+2.9285*f0-0.1128
140X    f=20.752*f0^3-10.892*f0^2+2.8731*f0-0.1061
150X    f=16.71*f0^3-9.1447*f0^2+2.623*f0-0.0937
160X    f=16.878*f0^3-9.1543*f0^2+2.6011*f0-0.0904
170X    f=15.433*f0^3-8.3874*f0^2+2.4715*f0-0.0831
180X    f=17*f0^3-8.9749*f0^2+2.5224*f0-0.0828
190X    f=14.627*f0^3-7.8187*f0^2+2.3464*f0-0.0743
200X    f=13.3*f0^3-7.2048*f0^2+2.2491*f0-0.0688
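Applying a fitted equation from Table 2 amounts to evaluating a cubic polynomial in the predicted concentration f0. The sketch below is illustrative only: it carries four of the sixteen rows, the names are hypothetical, and it assumes that f0 and f are expressed as fractions (for example 0.05 for 5%), which the table does not state explicitly.

```python
# Coefficients (a3, a2, a1, a0) of f = a3*f0^3 + a2*f0^2 + a1*f0 + a0,
# copied from the 50X, 100X, 150X and 200X rows of Table 2.
TABLE_2 = {
    50: (10.981, -8.401, 3.1292, -0.1883),
    100: (19.126, -11.025, 3.098, -0.1337),
    150: (16.71, -9.1447, 2.623, -0.0937),
    200: (13.3, -7.2048, 2.2491, -0.0688),
}


def correct_fraction(f0, depth):
    """Evaluate the fitted cubic for a tabulated depth using Horner's rule.

    Assumes f0 is a fraction, not a percentage (an assumption of this sketch).
    """
    a3, a2, a1, a0 = TABLE_2[depth]
    return ((a3 * f0 + a2) * f0 + a1) * f0 + a0
```

A depth that is not tabulated raises KeyError; choosing the nearest tabulated depth or interpolating between rows would be a separate design decision that the text does not specify.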
Embodiment 4
For the maternal plasma sample of Embodiment 2, the calculated fetal nucleic acid content was 5.2%. Correcting it with the equation f=19.126*f0^3-11.025*f0^2+3.098*f0-0.1337 from the bias correction model of Embodiment 3 gave a corrected fetal nucleic acid content of 2.6%.
Embodiment 5
Sequencing data from maternal and fetal nucleic acids were mixed to simulate multiple maternal plasma samples with different fetal nucleic acid contents, in order to test the accuracy of the fetal nucleic acid content obtained after correction with the bias correction model; the sequencing depth (data volume) was set to 150X. The overall results are shown in Figure 3, where Figure 3a shows the results before correction and Figure 3b the results after correction. When the true fetal nucleic acid content is below 10%, the calculated content deviates markedly from the true value; for example, where the true value is about 2%, the calculated value exceeds 5%. After correction, the calculated values show no obvious deviation from the true values. This indicates that at low fetal nucleic acid content, correction helps to obtain an accurate fetal nucleic acid concentration.
Sequencing depth and the number of SNP sites in the AAab region are the main factors affecting the accuracy of the calculated nucleic acid content, so accuracy at different depths was also examined. As shown in Figure 3, the fetal nucleic acid concentration can be obtained accurately even at a fetal nucleic acid content of about 3%, with a relatively large number of effective SNP sites. Accuracy was tested at depths from 40X to 200X; the change in accuracy is shown in Figure 4. At a sequencing depth of 65X, the absolute deviation e1 (e1%=|f-f0|) is below 1%.
To determine the minimum number of SNP sites needed to calculate the concentration, SNP sites were sampled at random to find the minimum number of sites required within the AAab region, from which the total number of SNP sites actually required was derived. As shown in Figure 5, at fetal concentrations of 4.8%, 10% and 19.8% alike, once the AAab region contains more than 35 SNP sites, the relative error e2 between the calculated value f0 and the true value f (e2% is defined by the formula in Figure PCTCN2015070900-appb-000004) is below 10%. Requiring a relative error of no more than 10%, at least 35 SNP sites in the AAab region, and AAab-type SNP sites making up no more than 10% of all types of SNP sites in a maternal plasma sample, the minimum total number of SNP sites of all types needed to measure different fetal nucleic acid concentrations accurately was calculated, as shown in Table 3.
Table 3
Figure PCTCN2015070900-appb-000005
Figure PCTCN2015070900-appb-000006
Embodiment 6
Eighteen maternal plasma samples were used to test the feasibility of the method of the invention for determining the fetal nucleic acid content of a maternal sample; 4 of the 18 samples are plasma from the same pregnant woman at different stages of pregnancy. The results are shown in Table 4, where specificity = number of true AAab sites falling in the AAab region / total number of sites falling in the AAab region, and sensitivity = number of true AAab sites falling in the AAab region / total number of AAab sites. The number of true AAab sites and the total number of AAab sites can be obtained by conventional experimental genotyping. The data in Table 4 show that the method is accurate and feasible. For further comparison, the fetal nucleic acid concentration of the 9 male fetuses among these 18 plasma samples was also calculated from Y-chromosome depth and compared with the values calculated by the method of the invention. As shown in Figure 6, the two are strongly correlated (r=0.94; p<0.0001). Calculating male fetal nucleic acid concentration from Y-chromosome depth is a known method; see [Struble C A, Syngelaki A, Oliphant A, et al. Fetal fraction estimate in twin pregnancies using directed cell-free DNA analysis[J]. Fetal diagnosis and therapy, 2013, 35(3): 161-165.]. It involves obtaining the depths of the sample's Y-chromosome SNP sites listed in the dbSNP database, filtering out the data for sites that also appear in female fetuses, taking the median depth of the remaining sites, multiplying it by 2 (because there is only one Y chromosome) and dividing by the median autosomal depth, which gives the fetal nucleic acid concentration calculated from Y-chromosome depth.
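The Y-chromosome comparison described above reduces to a ratio of two medians. The sketch below follows only the arithmetic given in the text; the function name and inputs are hypothetical, and it assumes the Y-chromosome sites have already been filtered to remove those that also appear in female fetuses.

```python
from statistics import median


def fetal_fraction_from_chr_y(y_site_depths, autosomal_depths):
    """2 * median(Y-chromosome site depth) / median(autosomal depth).

    The factor 2 accounts for a male fetus carrying a single Y chromosome.
    """
    return 2 * median(y_site_depths) / median(autosomal_depths)
```

For example, Y-site depths with a median of 5 against an autosomal median depth of 100 correspond to a fetal fraction of 10%.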
Table 4
Figure PCTCN2015070900-appb-000007
Figure PCTCN2015070900-appb-000008

Claims (31)

  1. A device for constructing distribution regions of SNP sites of different combined genotypes, a combined genotype being the combination of the genotype of the SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid, characterized in that the device comprises:
    a first region-second region construction unit for constructing a first region and a second region, the first region being the distribution region of SNP sites of a first combined genotype and the second region being the distribution region of SNP sites of a second combined genotype,
    the first region and the second region being separated on the basis of the difference between a first relationship and a second relationship,
    the first relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the SNP sites of the first combined genotype,
    the second relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the SNP sites of the second combined genotype,
    the first combined genotype being ABaa and ABab, and the second combined genotype being AAaa and AAab;
    a third region-fourth region construction unit for constructing a third region and a fourth region from the second region, the third region being the distribution region of a first predetermined proportion of the AAaa SNP sites among the SNP sites of the second combined genotype, and the fourth region being the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype,
    the third region and the fourth region being separated on the basis of the difference between a third relationship and a fourth relationship,
    the third relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first predetermined proportion of the AAaa SNP sites among the SNP sites of the second combined genotype,
    the fourth relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the AAab SNP sites among the SNP sites of the second combined genotype; wherein
    AA and AB denote, respectively, homozygous and heterozygous SNP sites from the first-source nucleic acid, and aa and ab denote, respectively, the same SNP sites, homozygous and heterozygous, from the second-source nucleic acid,
    A or a denotes the highest base of the SNP site, and B or b denotes the second-highest base of the same SNP site.
  2. The device of claim 1, characterized in that it further comprises:
    a closed fourth region construction unit for constructing a closed fourth region from the fourth region, the closed fourth region being the distribution region of a second predetermined proportion of the AAab SNP sites among the SNP sites of the second combined genotype,
    the closed fourth region being constructed from the fourth region on the basis that the depths of the second-highest base and of the highest base of the AAab SNP sites both follow normal distributions, and on the basis of the second predetermined proportion that is set.
  3. The device of claim 1 or 2, characterized in that the first predetermined proportion is not less than 95%.
  4. The device of claim 1, characterized in that the first relationship is x/2≥y≥x/3, the second relationship is 0<y<x/3, the third relationship is 0<y<(x+y)*e+m*δ, the fourth relationship is (x+y)*e+m*δ<y<x/3, the difference between the first relationship and the second relationship is y=x/3, and the difference between the third relationship and the fourth relationship is y=(x+y)*e+m*δ, wherein
    y is the depth of the second-highest base of the SNP site,
    x is the depth of the highest base of the SNP site,
    e is the sequencing error rate,
    δ is the standard deviation, δ=((x+y)*e)^0.5,
    m depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being that between the standard deviation and the probability in a standard normal distribution.
  5. The device of claim 4, characterized in that e≤1%.
  6. The device of claim 4, characterized in that m=3 when the first predetermined proportion is 99.9%.
  7. The device of claim 2, characterized in that the second predetermined proportion is not less than 95%.
  8. The device of claim 2, characterized in that the closed fourth region is the region bounded by y=x/3, y=(x+y)*e+m*δ, y=D0-n*δ-x and y=D0+n*δ-x, wherein
    y is the depth of the second-highest base of the SNP site,
    x is the depth of the highest base of the SNP site,
    e is the sequencing error rate, e≤1%,
    δ is the standard deviation, δ=((x+y)*e)^0.5,
    D0 is the average depth of the SNP sites,
    m depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being that between the standard deviation and the probability in a standard normal distribution,
    n depends on the second predetermined proportion, the relationship between n*δ and the second predetermined proportion being that between the standard deviation and the probability in a standard normal distribution.
  9. The device of claim 8, characterized in that m=n=3 when the first predetermined proportion and the second predetermined proportion are both 99.9%.
  10. The device of claim 8, characterized in that D0≥100X.
  11. A method for constructing distribution regions of SNP sites of different combined genotypes, a combined genotype being the combination of the genotype of the SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid, characterized in that the method comprises:
    constructing a first region and a second region on the basis of the difference between a first relationship and a second relationship,
    the first region being the distribution region of SNP sites of a first combined genotype,
    the second region being the distribution region of SNP sites of a second combined genotype,
    the first relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the SNP sites of the first combined genotype,
    the second relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the SNP sites of the second combined genotype,
    the first combined genotype being ABaa and ABab, and the second combined genotype being AAaa and AAab; constructing a third region and a fourth region from the second region on the basis of the difference between a third relationship and a fourth relationship,
    the third region being the distribution region of a first predetermined proportion of the AAaa SNP sites among the SNP sites of the second combined genotype,
    the fourth region being the distribution region of the AAab SNP sites among the SNP sites of the second combined genotype,
    the third relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the first predetermined proportion of the AAaa SNP sites among the SNP sites of the second combined genotype,
    the fourth relationship being the relationship satisfied between the depth of the second-highest base and the depth of the highest base of the AAab SNP sites among the SNP sites of the second combined genotype; wherein
    AA and AB denote, respectively, homozygous and heterozygous SNP sites from the first-source nucleic acid, and aa and ab denote, respectively, the same SNP sites, homozygous and heterozygous, from the second-source nucleic acid,
    A or a denotes the highest base of the SNP site, and B or b denotes the second-highest base of the same SNP site.
  12. The method of claim 11, characterized in that it further comprises:
    constructing a closed fourth region from the fourth region, the closed fourth region being the distribution region of a second predetermined proportion of the AAab SNP sites among the SNP sites of the second combined genotype, including
    constructing it from the fourth region on the basis that the depths of the second-highest base and of the highest base of the AAab SNP sites both follow normal distributions, and on the basis of the second predetermined proportion that is set.
  13. The method of claim 11 or 12, characterized in that the first predetermined proportion is not less than 95%.
  14. The method of claim 11, characterized in that the first relationship is x/2≥y≥x/3, the second relationship is 0<y<x/3, the third relationship is 0<y<(x+y)*e+m*δ, the fourth relationship is (x+y)*e+m*δ<y<x/3, the difference between the first relationship and the second relationship is y=x/3, and the difference between the third relationship and the fourth relationship is y=(x+y)*e+m*δ, wherein
    y is the depth of the second-highest base of the SNP site,
    x is the depth of the highest base of the SNP site,
    e is the sequencing error rate,
    δ is the standard deviation, δ=((x+y)*e)^0.5,
    m depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being that between the standard deviation and the probability in a standard normal distribution.
  15. The method of claim 14, characterized in that e≤1%.
  16. The method of claim 14, characterized in that m=3 when the first predetermined proportion is 99.9%.
  17. The method of claim 12, characterized in that the second predetermined proportion is not less than 95%.
  18. The method of claim 12, characterized in that the closed fourth region is the region bounded by y=x/3, y=(x+y)*e+m*δ, y=D0-n*δ-x and y=D0+n*δ-x, wherein
    y is the depth of the second-highest base of the SNP site,
    x is the depth of the highest base of the SNP site,
    e is the sequencing error rate, e≤1%,
    δ is the standard deviation, δ=((x+y)*e)^0.5,
    D0 is the average depth of the SNP sites,
    m depends on the first predetermined proportion, the relationship between m*δ and the first predetermined proportion being that between the standard deviation and the probability in a standard normal distribution,
    n depends on the second predetermined proportion, the relationship between n*δ and the second predetermined proportion being that between the standard deviation and the probability in a standard normal distribution.
  19. The method of claim 18, characterized in that m=n=3 when the first predetermined proportion and the second predetermined proportion are both 99.9%.
  20. The method of claim 18, characterized in that D0≥100X.
  21. A method for distinguishing SNP sites of different combined genotypes, a combined genotype being the combination of the genotype of the SNP site in a first-source nucleic acid and its genotype in a second-source nucleic acid, characterized in that the method comprises:
    sequencing at least part of the nucleic acid in a mixed nucleic acid sample to obtain sequencing data, the sequencing data consisting of a plurality of reads and the mixed nucleic acid sample containing the first-source nucleic acid and the second-source nucleic acid;
    aligning the sequencing data to a reference sequence to obtain an alignment result;
    identifying SNP sites on the basis of the alignment result;
    determining, on the basis of the alignment result, the distribution region in which each SNP site lies, the distribution regions being constructed by the method of any one of claims 11-20;
    determining the combined genotype of the SNP site on the basis of the distribution region in which it lies.
  22. A method for determining the fetal nucleic acid content of a maternal sample, characterized in that it comprises:
    obtaining a sequencing result, which includes sequencing at least part of the nucleic acid in the maternal sample, the sequencing result consisting of a plurality of reads and the maternal sample containing maternal nucleic acid and fetal nucleic acid;
    aligning the sequencing result to a reference sequence to obtain an alignment result;
    identifying SNP sites on the basis of the alignment result;
    determining the distribution region in which each SNP site lies, the distribution regions being constructed by the method of any one of claims 11-20;
    determining the fetal nucleic acid content of the maternal sample on the basis of the SNP sites lying in the fourth region or the closed fourth region of the distribution regions.
  23. The method of claim 22, characterized in that the maternal sample is derived from at least one of maternal peripheral blood and maternal urine.
  24. The method of claim 22, characterized in that the sequencing result contains a data volume of not less than 65X.
  25. The method of claim 22, characterized in that the fetal nucleic acid content is the median of 2*y4/(x4+y4),
    y4 being the depth of the second-highest base of each SNP site in the fourth region or the closed fourth region,
    x4 being the depth of the highest base of the corresponding SNP site in the fourth region or the closed fourth region, wherein
    the depth of a base is the number of reads supporting it.
  26. The method of any one of claims 22-25, further comprising, when the sequencing result comprises a data volume of less than 65X and/or the fetal nucleic acid content is less than 10%, correcting the fetal nucleic acid content with a bias correction model to obtain a corrected fetal nucleic acid content.
  27. The method of claim 26, wherein, when the sequencing result comprises a data volume of less than 65X and/or the fetal nucleic acid content is less than 10%, the first predetermined ratio is adjusted, before the correction with the bias correction model, so as to enlarge the fourth region or the closed fourth region.
  28. The method of claim 26, wherein establishing the bias correction model comprises:
    obtaining K simulated sites, $K = K_1 + K_2 + K_3 + K_4$, where $K_1$ is the number of simulated sites of combined genotype AAaa, $K_2$ is the number of simulated sites of combined genotype AAab, $K_3$ is the number of simulated sites of combined genotype ABaa, and $K_4$ is the number of simulated sites of combined genotype ABab, with $K_2/K \ge 0.5\%$ and $K_2 \ge 35$;
    setting different standard fetal nucleic acid contents $f$ and, based on assumptions, calculating the corresponding fetal nucleic acid content $f_0$ from the simulated sites located in the fourth region or the closed fourth region; and
    performing polynomial regression on the resulting sets of $(f, f_0)$ to establish the bias correction model;
    wherein the assumptions include:
    for simulated sites of combined genotype AAaa, the depths of the highest base and the second-highest base follow $N(D - eD,\ D - eD)$ and $N(eD,\ eD)$, respectively;
    for simulated sites of combined genotype AAab, the depths of the highest base and the second-highest base follow $N(D(1 - f/2),\ D(1 - f/2))$ and $N(Df/2,\ Df/2)$, respectively;
    for simulated sites of combined genotype ABaa, the depths of the highest base and the second-highest base follow $N(D(1/2 - f/2),\ D(1/2 - f/2))$ and $N(D(1/2 + f/2),\ D(1/2 + f/2))$, respectively;
    for simulated sites of combined genotype ABab, the depths of the highest base and the second-highest base follow $N(D/2,\ D/2)$ and the distribution shown in
    Figure PCTCN2015070900-appb-100001
    respectively;
    where
    $e$ is the sequencing error rate, $e = K_1/K \le 1\%$,
    $D$ is the average sequencing depth of the simulated sites, and
    $f$ is the standard fetal nucleic acid content, $0.5\% \le f \le 25\%$.
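The model-building procedure of claim 28 can be sketched as follows. This is a simplified illustration under stated assumptions, not the applicant's implementation. It simulates only AAab sites and skips the selection of sites by the fourth region (claims 11-20 are not reproduced here), so the fitted curve will stay close to the identity rather than reproduce the published coefficients. It reads the second parameter of $N(\cdot,\cdot)$ as a variance, which the claim does not state explicitly. All function names are hypothetical.

```python
import random
import statistics


def observed_fraction(depth, f, n_sites, rng):
    """Simulate AAab sites and return f0 = median of 2*y/(x+y).

    Assumption: N(a, b) is read as mean a and variance b, so the standard
    deviation passed to gauss() is sqrt(b).
    """
    mean_x = depth * (1 - f / 2)   # highest base
    mean_y = depth * f / 2         # second-highest base
    ratios = []
    for _ in range(n_sites):
        x = rng.gauss(mean_x, mean_x ** 0.5)
        y = max(rng.gauss(mean_y, mean_y ** 0.5), 0.0)  # depths cannot be negative
        ratios.append(2 * y / (x + y))
    return statistics.median(ratios)


def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for c0 + c1*x + ...
    Solved with Gauss-Jordan elimination and partial pivoting.
    """
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde design matrix A.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def build_correction_model(depth, n_sites=500, seed=0):
    """Fit f as a cubic in f0 over standard contents f from 0.5% to 25%."""
    rng = random.Random(seed)
    fs = [0.005 * k for k in range(1, 51)]            # 0.5% ... 25%
    f0s = [observed_fraction(depth, f, n_sites, rng) for f in fs]
    return fit_polynomial(f0s, fs, degree=3)


def evaluate(coeffs, f0):
    """Evaluate the fitted polynomial at an observed content f0."""
    return sum(c * f0 ** i for i, c in enumerate(coeffs))
```

A cubic is used here to match the form of the models listed in claim 29; the claim itself only requires polynomial regression.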
  29. The method of claim 28, wherein the bias correction model comprises:
    when D = 50X, $f = 10.981 f_0^3 - 8.401 f_0^2 + 3.1292 f_0 - 0.1883$,
    when D = 60X, $f = 14.449 f_0^3 - 9.757 f_0^2 + 3.2212 f_0 - 0.1759$,
    when D = 70X, $f = 18.57 f_0^3 - 11.595 f_0^2 + 3.4261 f_0 - 0.1745$,
    when D = 80X, $f = 18.693 f_0^3 - 11.293 f_0^2 + 3.279 f_0 - 0.1566$,
    when D = 90X, $f = 20.076 f_0^3 - 11.749 f_0^2 + 3.2816 f_0 - 0.1494$,
    when D = 100X, $f = 19.126 f_0^3 - 11.025 f_0^2 + 3.098 f_0 - 0.1337$,
    when D = 110X, $f = 19.81 f_0^3 - 11.159 f_0^2 + 3.0725 f_0 - 0.1279$,
    when D = 120X, $f = 20.61 f_0^3 - 11.38 f_0^2 + 3.0554 f_0 - 0.1226$,
    when D = 130X, $f = 19.808 f_0^3 - 10.82 f_0^2 + 2.9285 f_0 - 0.1128$,
    when D = 140X, $f = 20.752 f_0^3 - 10.892 f_0^2 + 2.8731 f_0 - 0.1061$,
    when D = 150X, $f = 16.71 f_0^3 - 9.1447 f_0^2 + 2.623 f_0 - 0.0937$,
    when D = 160X, $f = 16.878 f_0^3 - 9.1543 f_0^2 + 2.6011 f_0 - 0.0904$,
    when D = 170X, $f = 15.433 f_0^3 - 8.3874 f_0^2 + 2.4715 f_0 - 0.0831$,
    when D = 180X, $f = 17 f_0^3 - 8.9749 f_0^2 + 2.5224 f_0 - 0.0828$,
    when D = 190X, $f = 14.627 f_0^3 - 7.8187 f_0^2 + 2.3464 f_0 - 0.0743$,
    when D = 200X, $f = 13.3 f_0^3 - 7.2048 f_0^2 + 2.2491 f_0 - 0.0688$.
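The depth-specific polynomials of claim 29, together with the trigger condition of claim 26, could be applied as in the sketch below. The coefficients are copied from the claim. Everything else is an assumption made for illustration: that $f_0$ and $f$ are expressed as fractions (0.10 rather than 10), that a depth between two tabulated values uses the nearest tabulated depth, and the function names `needs_correction` and `correct_fetal_fraction`.

```python
# Cubic coefficients (a, b, c, d) of f = a*f0^3 + b*f0^2 + c*f0 + d,
# keyed by average sequencing depth D (in X), as listed in claim 29.
CORRECTION_COEFFS = {
    50: (10.981, -8.401, 3.1292, -0.1883),
    60: (14.449, -9.757, 3.2212, -0.1759),
    70: (18.57, -11.595, 3.4261, -0.1745),
    80: (18.693, -11.293, 3.279, -0.1566),
    90: (20.076, -11.749, 3.2816, -0.1494),
    100: (19.126, -11.025, 3.098, -0.1337),
    110: (19.81, -11.159, 3.0725, -0.1279),
    120: (20.61, -11.38, 3.0554, -0.1226),
    130: (19.808, -10.82, 2.9285, -0.1128),
    140: (20.752, -10.892, 2.8731, -0.1061),
    150: (16.71, -9.1447, 2.623, -0.0937),
    160: (16.878, -9.1543, 2.6011, -0.0904),
    170: (15.433, -8.3874, 2.4715, -0.0831),
    180: (17.0, -8.9749, 2.5224, -0.0828),
    190: (14.627, -7.8187, 2.3464, -0.0743),
    200: (13.3, -7.2048, 2.2491, -0.0688),
}


def needs_correction(depth, f0):
    """Claim 26 condition: data volume below 65X and/or content below 10%."""
    return depth < 65 or f0 < 0.10


def correct_fetal_fraction(f0, depth):
    """Apply the cubic for the tabulated depth closest to `depth`.

    Assumption: nearest-depth lookup; the claim only lists models at the
    tabulated depths and does not say how to handle values in between.
    """
    nearest = min(CORRECTION_COEFFS, key=lambda d: abs(d - depth))
    a, b, c, d = CORRECTION_COEFFS[nearest]
    return a * f0 ** 3 + b * f0 ** 2 + c * f0 + d


if __name__ == "__main__":
    f0, depth = 0.08, 100
    if needs_correction(depth, f0):
        print(correct_fetal_fraction(f0, depth))
```

Interpolating between the two neighbouring depth models would be an alternative to the nearest-depth lookup; neither choice is prescribed by the claims.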
  30. A device for determining the fetal nucleic acid content of a sample from a pregnant woman, comprising:
    a data input unit for inputting data;
    a data output unit for outputting data;
    a storage unit for storing data, including an executable program; and
    a processor connected to the data input unit, the data output unit, and the storage unit, for executing the executable program, wherein execution of the executable program comprises performing the method of any one of claims 22-29.
  31. A computer-readable medium for storing a program for execution by a computer, wherein execution of the program comprises performing the method of any one of claims 22-29.
PCT/CN2015/070900 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content WO2016112539A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580072546.0A CN107109324B (en) 2015-01-16 2015-01-16 The method and apparatus for determining fetal nucleic acid content
PCT/CN2015/070900 WO2016112539A1 (en) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/070900 WO2016112539A1 (en) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Publications (1)

Publication Number Publication Date
WO2016112539A1 true WO2016112539A1 (en) 2016-07-21

Family

ID=56405139

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/070900 WO2016112539A1 (en) 2015-01-16 2015-01-16 Method and device for determining fetal nucleic acid content

Country Status (2)

Country Link
CN (1) CN107109324B (en)
WO (1) WO2016112539A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4092130A4 (en) * 2020-01-17 2023-09-27 BGI Shenzhen Method for determining foetal nucleic acid concentration and foetal genotyping method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110993024B (en) * 2019-12-20 2023-08-22 北京科迅生物技术有限公司 Method and device for establishing fetal concentration correction model and method and device for quantifying fetal concentration
CN113981062B (en) * 2021-10-14 2024-02-20 武汉蓝沙医学检验实验室有限公司 Method for evaluating fetal DNA concentration by non-maternal and maternal DNA and application

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102325901A (en) * 2008-12-22 2012-01-18 赛卢拉有限公司 Methods and genotyping panels for detecting alleles, genomes, and transcriptomes
CN102770558A (en) * 2009-11-05 2012-11-07 香港中文大学 Fetal genomic analysis from a maternal biological sample
CN104232777A (en) * 2014-09-19 2014-12-24 天津华大基因科技有限公司 Method and device for simultaneously determining fetal nucleic acid content and aneuploidy of chromosome

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2473638B1 (en) * 2009-09-30 2017-08-09 Natera, Inc. Methods for non-invasive prenatal ploidy calling
CN103215350B (en) * 2013-03-26 2016-11-02 苏州贝康医疗器械有限公司 A kind of assay method of the fetal DNA in maternal plasma DNA content based on mononucleotide polymorphism site

Also Published As

Publication number Publication date
CN107109324B (en) 2019-11-08
CN107109324A (en) 2017-08-29

Similar Documents

Publication Publication Date Title
AU2022200046B2 (en) Maternal plasma transcriptome analysis by massively parallel RNA sequencing
US20190153541A1 (en) Detecting mutations for cancer screening
TWI611186B (en) Molecular testing of multiple pregnancies
CN104232778B (en) Determine the method and device of fetus haplotype and chromosomal aneuploidy simultaneously
US20110092763A1 (en) Methods for Embryo Characterization and Comparison
US20190338349A1 (en) Methods and systems for high fidelity sequencing
JP2015506684A (en) Method, system, and computer-readable storage medium for determining presence / absence of genome copy number variation
CN110846411B (en) Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
US20210130900A1 (en) Multiplexed parallel analysis of targeted genomic regions for non-invasive prenatal testing
KR20190019219A (en) Noninvasive prenatal molecular karyotyping from maternal plasma
US20190338350A1 (en) Method, device and kit for detecting fetal genetic mutation
WO2021073604A1 (en) Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof
CN109461473B (en) Method and device for acquiring concentration of free DNA of fetus
WO2016112539A1 (en) Method and device for determining fetal nucleic acid content
GB2559437A (en) Prenatal screening and diagnostic system and method
CN113308548B (en) Method, device and storage medium for detecting fetal gene haplotype
US10106836B2 (en) Determining fetal genomes for multiple fetus pregnancies
CA3068198A1 (en) Enrichment of targeted genomic regions for multiplexed parallel analysis
TW201823472A (en) Universal haplotype-based noninvasive prenatal testing for single gene diseases
CN114531916A (en) System and method for determining a genetic relationship between a sperm provider, an oocyte provider and a corresponding concentiator
US11869630B2 (en) Screening system and method for determining a presence and an assessment score of cell-free DNA fragments
KR102519739B1 (en) Non-invasive prenatal testing method and devices based on double Z-score
Giannico et al. NIPAT as Non-Invasive Prenatal Paternity Testing Using a Panel of 861 SNVs. Genes 2023, 14, 312
CN117925820A (en) Method for detecting variation before embryo implantation
IT201800006418A1 (en) Procedure for determining the risk of preeclampsia in pregnancy through the analysis of free fetal DNA circulating in the maternal blood.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15877457

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.12.2017)

122 Ep: pct application non-entry in european phase

Ref document number: 15877457

Country of ref document: EP

Kind code of ref document: A1