CN112840404A - Methods, systems, and uses for eliminating noisy genetic data, haplotype phasing, and reconstructing progeny genomes - Google Patents

Info

Publication number
CN112840404A
Authority
CN
China
Prior art keywords
progeny
haplotype
genetic
genotype
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080005425.5A
Other languages
Chinese (zh)
Inventor
邹央云
陆思嘉
胡春旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yikang Medical Laboratory Co ltd
Original Assignee
Suzhou Yikang Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yikang Medical Laboratory Co ltd
Publication of CN112840404A
Legal status: Pending

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention relates to a method for acquiring genetic data from a progeny trace nucleic acid sample, eliminating noisy genetic data from it through quality control and filtering, phasing the progeny haplotypes from the quality-controlled and filtered data, and reconstructing the progeny genome, as well as to a device or system for implementing the method. The invention also relates to the use of the method, and of the device or system for carrying it out, for polygenic disease genetic risk rating, aneuploidy detection, monogenic genetic disease detection, and chromosome structural rearrangement detection in pre-implantation embryos and/or fetuses in early gestation.

Description

Methods, systems, and uses for eliminating noisy genetic data, haplotype phasing, and reconstructing progeny genomes
Technical Field
The present invention relates generally to the field of biomedical diagnosis and detection. More particularly, it relates to methods of acquiring, processing and using genetic data to phase the haplotypes of progeny and to reconstruct progeny genomes, and to systems and devices implementing these methods. In particular, it relates to methods, systems and computer devices for haplotype phasing and progeny genome reconstruction from trace progeny nucleic acid samples, and to the use of the phased progeny haplotypes and reconstructed progeny genomes to identify genetic variations, particularly aneuploidies and disease-related genes, that may lead to a variety of phenotypic outcomes.
Background
Assisted Reproductive Technology (ART) has made great progress in overcoming human infertility. Currently, about 3-4% of births worldwide result from assisted reproduction. Despite remarkable theoretical and technical advances in ART, putting the goal of "healthy infants" into practice still faces unique challenges.
Preimplantation Genetic Testing (PGT) is the genetic analysis, before implantation, of embryos from patients at high genetic risk during in vitro fertilization and embryo transfer. Its aim is to select embryos with normal genetic material for transfer into the maternal uterine cavity so as to obtain healthy offspring. According to what is tested, PGT is classified into aneuploidy detection (PGT for Aneuploidies, PGT-A), monogenic disease detection (PGT for Monogenic disorders, PGT-M), and chromosome structural rearrangement detection (PGT for chromosomal Structural Rearrangements, PGT-SR). In current clinical practice, PGT is performed mainly on cells taken by embryo biopsy. However, a growing number of studies show that this invasive cell biopsy can adversely affect the developmental potential of the embryo and its subsequent ontogeny. In recent years, several studies have shown that embryo culture medium contains cell-free DNA (cfDNA) fragments derived from the embryo, which makes noninvasive genetic testing of embryos before implantation possible. The successful application of cfDNA from embryo culture medium in PGT-A, PGT-M and PGT-SR further shows that this approach is noninvasive, accurate and effective for preimplantation genetic testing.
Polygenic or chronic diseases, such as cardiovascular disease, diabetes, obesity and tumors, result from the involvement of multiple genes in the disease process and have become the leading threats to human health. Studies have shown that many chronic diseases have high heritability, so genetic background plays an important role in determining an individual's risk. However, risk detection for polygenic diseases has not yet been achieved in theory or in practice. The technical difficulty is that constructing a polygenic disease risk value requires genotype information across the whole genome of the embryo and/or fetus, whereas embryo and/or fetal DNA obtained in a non-invasive or minimally invasive manner, especially embryonic cell-free DNA in embryo culture medium, is present only in small amounts. Such DNA is characterized by small fragment size, poor uniformity of whole genome amplification and a high genotype error rate, so a high-quality, highly continuous whole genome sequence of the embryo and/or fetus cannot be generated.
Therefore, there is a pressing need in the art for a method and system that can provide high-quality, highly continuous whole genome genotype information for embryos and/or fetuses from trace genetic material obtained in a non-invasive or minimally invasive manner, thereby enabling assessment of the polygenic disease genetic risk of preimplantation embryos and/or fetuses in early gestation.
Summary of The Invention
Through long, extensive and intensive study, and a large number of screens and tests, the inventors surprisingly found that very high haplotype phasing and genome reconstruction accuracy can be obtained for progeny trace nucleic acid samples (for example, fetal cell-free DNA (cfDNA) in embryo culture medium, blastocoel fluid, maternal plasma or other types of maternal body fluid, and/or blastocyst trophoblast cells, blastomeres, or fetal cells in maternal blood or other types of maternal body fluid). This is achieved by building on whole genome amplification, combining the genomic DNA (gDNA) sequence information of the progeny's family members (such as the parents), and performing progeny haplotype phasing and progeny genome reconstruction using sequence information acquisition technologies, such as nucleic acid chips or second-generation sequencing, together with statistical genetics and computational biology algorithms. In some embodiments, the haplotype phasing and reconstruction accuracy is greater than or equal to 90%, and can be as high as 97%.
Specifically, the present invention amplifies the obtained trace progeny nucleic acid, analyzes the resulting data, and applies quality control and filtering to eliminate noisy genetic data (e.g., genotypes at sites with poor genotyping quality, such as those affected by Allele Dropout (ADO)). It then phases the progeny haplotypes on the basis of pedigree haplotype phasing. Finally, it fills in the missing genotypes of the progeny (e.g., sites that were not amplified or that carry genotype errors such as ADO) using pedigree Identity By Descent (IBD) and population linkage disequilibrium, so that the progeny genome is reconstructed with high fidelity and high accuracy. In addition, genetic data obtained from other related individuals, such as other embryos, siblings, grandparents, or other relatives of the progeny, may also be used to further increase the accuracy of the reconstructed progeny genome. On this basis, the present inventors have completed the present invention.
Accordingly, in a first aspect, the present invention relates to a method of cleaning noisy genetic data from progeny, said method comprising the steps of:
(a) providing genomic sequence information from an offspring, wherein the genomic sequence information of the offspring is obtained from an offspring micro nucleic acid sample comprising about 0.1-40 ng DNA, e.g., 1-40ng DNA, 20-40ng DNA, 0.1-40pg DNA, 1-40pg DNA, 10-40pg DNA; for example, the progeny micro nucleic acid sample is fetal cell-free DNA in embryo culture fluid, blastocyst culture fluid, blastocoel fluid, maternal plasma, or other type of maternal body fluid, and/or fetal cells in blastocyst trophoblast cells, blastocyst stage embryonic cells, maternal blood, or other type of maternal body fluid;
(b) performing quality control and filtering on the genomic sequence information of the progeny of step (a), wherein the quality control is selected from the group consisting of quality control of the whole genome amplification efficiency of trace nucleic acids, identification of Mendelian inheritance errors, identification of violations of the chromosomal interference suppression theory, mutual inference among the haplotypes of a plurality of progeny, and combinations thereof.
In some embodiments, the genomic sequence information provided in step (a) of the method for eliminating noisy genetic data from progeny may cover no more than about 30% of the progeny genome, e.g., about 30%, 25%, 20% or 15% of the genome. For example, the genomic sequence information of step (a) is obtained by performing whole genome amplification on the progeny trace nucleic acid sample using a technique selected from the group consisting of primer extension preamplification PCR, degenerate oligonucleotide primed PCR, multiple displacement amplification, multiple annealing and looping-based amplification cycles (MALBAC), blunt-end or sticky-end ligation library construction, and combinations thereof, preferably MALBAC, followed by detection of the progeny genomic sequence using a technique selected from nucleic acid chips, amplification and/or sequencing. The nucleic acid chip, amplification and/or sequencing technique is a single nucleotide polymorphism microarray chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing or a combination thereof. For example, the single nucleotide polymorphism microarray chip is an SNP genotyping chip. For example, the second-generation sequencing comprises whole genome sequencing, whole exome sequencing and sequencing of targeted genomic regions, preferably whole genome sequencing, e.g., low-depth whole genome sequencing, where the sequencing depth can be as low as 2X, 1X or even less.
Further, the quality control of the whole genome amplification efficiency of trace nucleic acids in step (b) of the method for eliminating noisy genetic data from progeny is performed as follows: site genotypes with low amplification efficiency are identified using reference sequencing data from the whole genome amplification products of a plurality of trace nucleic acid samples, and are marked as missing data. For example, the whole genome amplification products of the plurality of trace nucleic acid samples serve as reference samples and are sequenced at low depth (e.g., a sequencing depth of no greater than about 0.5X, 0.4X, 0.3X, 0.2X or 0.1X, e.g., about 0.06X); the sequencing data obtained from the reference samples are aligned to a human reference genome (e.g., hg19 or hg38); and the site amplification efficiency is calculated using the following formula:

$$\text{amplification efficiency}_i = \frac{DP_i}{\overline{DP}}$$

where

$$\overline{DP} = \frac{N \times L}{\text{genome size}}$$

and where $DP_i$ denotes the absolute depth of the i-th site, $\overline{DP}$ denotes the average depth of the genome, N denotes the number of sequencing reads, and L denotes the read length.
When $DP_i / \overline{DP} \geq 1$, the amplification efficiency of the site is not less than 1, which indicates that the site passes the quality control of trace nucleic acid whole genome amplification efficiency; the genotypes of sites that do not pass this quality control are marked as missing data.
In addition, the chromosomal interference suppression theory of step (b) holds that when two crossover (recombination) events occur between two molecular marker loci within a given genetic distance, the molecular marker loci in the recombinant segment are judged to have genotyping errors and are marked as missing data; for example, the genetic distance is any distance below 1 centimorgan (cM).
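The amplification-efficiency check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the dictionary layout, and the use of the reference genome length as the denominator of the average depth are all hypothetical choices made for the example.

```python
from typing import Dict, Optional


def site_amplification_efficiency(
    site_depths: Dict[str, int],
    n_reads: int,
    read_length: int,
    genome_size: int,
) -> Dict[str, float]:
    """Return DP_i / average genome depth for every site.

    The average depth is taken as N * L / genome size, which is an
    assumption of this sketch.
    """
    mean_depth = n_reads * read_length / genome_size
    return {site: dp / mean_depth for site, dp in site_depths.items()}


def mask_low_efficiency(
    genotypes: Dict[str, str],
    efficiency: Dict[str, float],
    threshold: float = 1.0,
) -> Dict[str, Optional[str]]:
    """Keep a genotype only if its site efficiency is >= threshold.

    Sites that fail (or have no efficiency value) become None,
    i.e. they are treated as missing data.
    """
    return {
        site: (gt if efficiency.get(site, 0.0) >= threshold else None)
        for site, gt in genotypes.items()
    }


# Toy example: 1000 reads of 100 bp on a 100 kb "genome" -> average depth 1.0
depths = {"site1": 3, "site2": 0, "site3": 1}
eff = site_amplification_efficiency(depths, 1000, 100, 100_000)
calls = mask_low_efficiency({"site1": "AG", "site2": "AA", "site3": "GG"}, eff)
```

In this toy example `site2` has no coverage in the reference data, so its genotype call is dropped before phasing.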
In a second aspect, the present invention relates to a method of phasing the haplotypes of progeny, said method comprising steps (a) and (b) as defined above, and the step of:
(c) phasing the haplotypes of the progeny, e.g., at the chromosome level, based on pedigree information, namely the genomic sequence information (e.g., genotype information) of the genetic father and/or the genetic mother of the progeny, and optionally the genomic sequence information (e.g., genotype information) of other pedigree individuals of the progeny, using a multi-locus linkage analysis strategy based on Mendelian laws and the theory of genetic linkage and crossover.
In some embodiments, the pedigree information in step (c) of the method of phasing the haplotypes of progeny is obtained from a nucleic acid sample from the pedigree individual comprising at least about 100ng DNA (e.g., 100ng-1000ng DNA); for example, the family individual nucleic acid sample is a nucleic acid sample from blood, saliva, buccal swab, urine, nails, hair follicles, dander, cells, tissue, body fluid of the family individual, and the pedigree information has a coverage of no less than about 90% for the pedigree individual, such as a coverage of about 90%, 95%, 98%, 99% or more, for example, wherein the pedigree information is data obtained by whole genome sequencing of genomic DNA (e.g., whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dandruff gDNA, preferably whole blood gDNA) of the pedigree individual, preferably, a high depth whole genome sequencing strategy is applied to the gDNA, e.g., a sequencing depth of at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, at least 80X.
In some embodiments, step (c) of the method of phasing the haplotypes of progeny is performed using statistical genetics and computational biology algorithms, e.g., using a strategy selected from the group consisting of a likelihood approach (the haplotype composition with maximum probability), a genetic rule approach (the haplotype composition with the minimum number of recombinations), the Expectation Maximization (EM) algorithm, and combinations thereof, to obtain the most probable haplotype composition of the progeny based on the pedigree information.
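To make the "minimum number of recombinations" strategy concrete, the following Python sketch infers, for one parent, which of that parent's two haplotypes the progeny inherited at each informative site. It is a simplified illustration only: the dynamic-programming formulation, the cost values and the function name are assumptions of this example and are not taken from the application.

```python
from typing import List, Optional, Sequence


def min_recombination_path(
    obs: Sequence[Optional[int]],
    switch_cost: float = 2.0,
    mismatch_cost: float = 1.0,
) -> List[int]:
    """Infer the inherited parental haplotype (0 or 1) at each site.

    obs[i] is the parental haplotype that the progeny allele appears to
    match at informative site i (0, 1, or None when the call is missing).
    The returned path minimizes
    switch_cost * recombinations + mismatch_cost * contradicted calls.
    """
    if not obs:
        return []

    def emit(state: int, o: Optional[int]) -> float:
        return 0.0 if o is None or o == state else mismatch_cost

    cost = [emit(0, obs[0]), emit(1, obs[0])]
    back: List[List[int]] = []
    for o in obs[1:]:
        new_cost, ptr = [], []
        for s in (0, 1):
            stay, switch = cost[s], cost[1 - s] + switch_cost
            if stay <= switch:
                new_cost.append(stay + emit(s, o))
                ptr.append(s)
            else:
                new_cost.append(switch + emit(s, o))
                ptr.append(1 - s)
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# One isolated contradictory call (index 2) and one real crossover near index 5/6.
path = min_recombination_path([0, 0, 1, 0, 0, None, 1, 1, 1, 1])
n_switches = sum(a != b for a, b in zip(path, path[1:]))
```

With these costs, the isolated call at index 2 is cheaper to explain as a genotyping error than as two crossovers, so the path contains a single recombination.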
In a third aspect, the present invention relates to a method for reconstructing a progeny genome, said method comprising steps (a), (b) and (c) above, and further comprising the step of:
(d) filling in the missing genotypes of the progeny.
In some embodiments, step (d) of the method for reconstructing the progeny genome is performed by combining the identical-by-descent regions, i.e., the haplotype composition that the embryo inherited from its parents in a given region, with the genotype information of the parental high-density polymorphic loci, and filling in the missing genotype information of the progeny.
In some embodiments, step (d) of the method of reconstructing a progeny genome further involves, for genotypes that could not be filled in from the family information, filling in genotype information at the whole-genome level using population reference haplotype information and population-level allelic Linkage Disequilibrium (LD) rules.
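The family-based part of step (d) can be illustrated with a short Python sketch. It assumes that the parental haplotypes are already phased and that the inherited haplotype at each site is already known from phasing; the data layout and function name are hypothetical, and the population-based (LD) filling step is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[Tuple[str, str]]  # (paternal allele, maternal allele)


def fill_from_ibd(
    child: Sequence[Genotype],
    pat_origin: Sequence[Optional[int]],
    mat_origin: Sequence[Optional[int]],
    father_haps: Tuple[Sequence[str], Sequence[str]],
    mother_haps: Tuple[Sequence[str], Sequence[str]],
) -> List[Genotype]:
    """Fill missing progeny genotypes from the inherited parental haplotypes.

    pat_origin[i] / mat_origin[i] say which haplotype (0 or 1) of the
    father / mother the progeny carries at site i; None means unknown,
    in which case the site stays missing.
    """
    filled: List[Genotype] = []
    for i, gt in enumerate(child):
        if gt is not None:
            filled.append(gt)  # observed genotypes are kept as they are
        elif pat_origin[i] is None or mat_origin[i] is None:
            filled.append(None)  # left for population-based filling
        else:
            filled.append(
                (father_haps[pat_origin[i]][i], mother_haps[mat_origin[i]][i])
            )
    return filled


father = (["A", "C", "G"], ["G", "C", "T"])
mother = (["A", "T", "G"], ["A", "C", "G"])
observed = [("G", "A"), None, None]
result = fill_from_ibd(observed, [1, 1, 1], [0, 0, None], father, mother)
```

Sites that remain `None` after this step are the ones that would be passed on to the population reference panel.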
in a fourth aspect, the present invention relates to a device or system capable of performing the above-described cleaning of noisy genetic data from progeny; a device or system capable of performing haplotype phasing as described above; and a device or system capable of performing the above-described genotype filling.
In some embodiments, the invention relates to a device or system characterized in that it is:
capable of performing whole genome amplification of a DNA sample, e.g., of a progeny DNA sample and/or of a DNA sample from a genetic parent of the progeny (in some embodiments, amplification is not required when the parental DNA sample is sufficient in quantity);
capable of detecting the genomic sequence information of the resulting whole genome amplification product or gDNA sample, e.g., by reading sequence information from a nucleic acid chip or second-generation sequencing;
capable of performing quality control and filtering on the raw genetic data to remove data of unsatisfactory quality, e.g., by marking site genotypes with low amplification efficiency as missing data;
capable of identifying erroneous genotype sites in the raw genetic data and marking them as missing data;
capable of phasing haplotypes; and
capable of filling in genotypes.
In a fifth aspect, the present invention relates to the use of the method of the first to third aspects above or the use of the device or system of the fourth aspect above for polygenic disease genetic risk ranking, aneuploidy detection, monogenic genetic disease detection, chromosomal structural rearrangement detection, and combinations thereof, on a pre-implantation embryo and/or a fetus early in pregnancy.
Other embodiments of the invention will be apparent by reference to the detailed description that follows.
Brief Description of Drawings
The preferred embodiments of the present invention described in detail below will be better understood when read in conjunction with the following drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments which are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities of the embodiments shown in the drawings.
FIG. 1 shows a flow chart of an embodiment of the invention.
FIG. 2 shows the effect of progeny trace nucleic acid whole genome amplification efficiency on progeny genotype quality.
Detailed Description
Before the present invention is described in detail, it is to be understood that this invention is not limited to the particular methodology and experimental conditions set forth herein as such may vary. In addition, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
Definitions
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. For the purposes of the present invention, the following terms are defined below.
The term "about," when used in conjunction with a numerical value, is intended to encompass a numerical value within a range having a lower limit that is 5% less than the stated numerical value and an upper limit that is 5% greater than the stated numerical value.
The term "and/or" when used to connect two or more selectable items should be understood to mean either one of the selectable items or any two or more of the selectable items.
As used herein, the term "comprising" or "comprises" is intended to mean including the stated elements, integers or steps, but not excluding any other elements, integers or steps. When the term "comprising" or "includes" is used herein, unless otherwise specified, it also encompasses the case consisting of the stated elements, integers or steps. For example, when referring to an antibody variable region "comprising" a particular sequence, it is also intended to encompass antibody variable regions consisting of that particular sequence.
The term "progeny" includes, but is not limited to, progeny of a mammal, e.g., a human, and means either born or unborn progeny. Unborn progeny comprise an embryo or a fetus. An embryo generally refers to the product of the division of a fertilized egg up to the end of the embryonic period, i.e., the eighth week after fertilization. The cleavage stage of the embryo occurs during the first three days of culture. "Embryo transfer" is the process of placing one or more embryos and/or blastocysts into the uterus or fallopian tubes. A fetus generally refers to an unborn offspring of a mammal, particularly an unborn human infant, after the eighth week of pregnancy.
The term "blastocyst" is a 5-or 6-day post-fertilization embryo having an internal cell mass, an outer cell layer called trophectoderm, and a fluid-filled blastocoel that contains the internal cell mass from which the embryo is derived in its entirety. Trophectoderm is a precursor of the placenta.
The terms "related individuals" and "pedigree individuals" of a progeny are used interchangeably and refer to any individual that is genetically related to the target progeny individual and therefore shares haplotypes with it. In one instance, the related individual may be a genetic parent of the target individual or any genetic material derived from the parents, such as sperm, polar bodies, other embryos, or fetuses. It may also refer to a sibling, a parent, a paternal grandparent or a maternal grandparent. In this application, a parent refers to the genetic father or mother of an individual. An individual offspring typically has two parents (a female parent and a male parent). A sibling refers to any individual whose genetic parents are the same as those of the offspring individual in question. In some embodiments, a sibling may refer to a born child, an embryo or a fetus, or one or more cells derived from a born child, an embryo or a fetus; siblings may also refer to haploid individuals derived from one parent, such as sperm, polar bodies or any other haploid genetic material.
The progeny-derived DNA is DNA derived from the original part of the progeny cell, which has a genotype substantially equivalent to the genotype of the progeny, or from the body fluid of the progeny, or from the culture fluid of the progeny cell.
The parent-derived DNA is DNA derived from the original part of a parent cell having a genotype substantially identical to the parent genotype, or DNA derived from the original part of a parent body fluid or a culture fluid of the parent cell. For example, a maternal-derived DNA refers to a DNA of the primary portion of a maternal cell whose genotype is substantially equivalent to the maternal genotype, a maternal body fluid, or a culture fluid of the maternal cell.
The term "SNP (single nucleotide polymorphism)" refers to a polymorphism at a site in a chromosomal DNA sequence caused by a single nucleotide change, with the frequency of the SNP in a population generally being > 1%. On average, there is one SNP every 300-1000 bp across the human genome. SNP data are currently available from a variety of public databases, including, for example, http://cgap.ncbi.nih.gov/GAI; http://www.ncbi.nlm.nih.gov/SNP; and the human SNP database at http://hgbase.cgr.ki.sei or http://hgbase.interactiva.de/.
The term "genotype" refers to the type of allele an individual possesses at a locus, referred to as the individual's genotype at that locus. For humans, in addition to sex chromosomes, each pair of homologous chromosomes has a pair of allelic types at the same locus, referred to as the genotype of the locus. Genotyping refers to the process of determining the genotype of an individual.
The term "noisy genetic data" refers to genetic data having any one of: allele Dropout (ADO), ambiguous base pair measurements, erroneous base pair measurements, missing base pair measurements, ambiguous measurements of insertions or deletions, ambiguous measurements of chromosome segment copy numbers, spurious signals, other measurement errors, or combinations thereof.
The term "Sequencing Depth" refers to the ratio of the total base number obtained by Sequencing to the size of the genome to be tested. Assuming a genome size of 2M and a sequencing depth of 10X, the total amount of data obtained is 20M. The depth of sequencing can be expressed as the ratio of the total base number (bp) to the Genome size (Genome) obtained by sequencing.
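The arithmetic in this definition can be written out as a small Python sketch. The function names are hypothetical; the second helper assumes that all reads have the same length, which is a simplification.

```python
def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def depth_from_reads(n_reads: int, read_length: int, genome_size: int) -> float:
    """Same ratio, with total bases computed as reads x read length
    (assumes a uniform read length)."""
    return n_reads * read_length / genome_size


# The example from the definition: 20M bases on a 2M genome is 10X.
example_depth = sequencing_depth(20_000_000, 2_000_000)

# 1.2 million reads of 150 bp on a 3 Gb genome is about 0.06X.
low_depth = depth_from_reads(1_200_000, 150, 3_000_000_000)
```

The second call shows how few reads a low-depth reference run of roughly 0.06X requires on a genome of about 3 Gb.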
The terms "absolute sequencing depth of a site" and "absolute depth of a site" are used interchangeably to refer to the number of reads of the site.
The terms "average sequencing depth of the genome" and "average depth of the genome" are used interchangeably and refer to the addition of the absolute depth of each site over the entire genome, divided by the number of sites, to give the average depth of the genome. The average sequencing depth of a genome can also be understood as the average number of times each base in the genome has been sequenced.
The term "read" refers to a single sequence in the sequencing data; the number of bases in a read is its read length.
The term "coverage" refers to the proportion of sequence portions on a genome or transcriptome or chromosome segment for which sequence information is known to be present in the entire group or segment. In some embodiments, coverage refers to the ratio of the number of bases for which sequence information is detected (e.g., by a sequence detection means, such as sequencing) to the total number of bases for the detected region. For example, when sequencing and detecting a whole genome sequence, due to the problems of gaps (gap) of large fragment splicing, limited sequencing read length, repetitive sequences and the like, the genome sequence obtained after sequencing usually cannot completely cover all regions of the genome, and at this time, the coverage is the proportion of the finally obtained number of sequencing bases to the number of bases of the whole genome. For example, the coverage obtained by sequencing the human genome is 98.5%, which indicates that the genome has no sequence in 1.5% of the region. In other embodiments, coverage refers to the number of genetic sites (e.g., SNP sites or genetic variation sites) for which sequence information is detected (e.g., by SNP chip or sequencing analysis) for a region being detected as a proportion of the total number of genetic sites detected in the region. The detected region may be a whole genome, a specific chromosome, or a specific chromosome segment, or a transcriptome, or a specific transcribed region.
The term "FASTQ" refers to one of the standard formats for storing sequence data, in which each read occupies 4 lines: the read name, the sequence, a separator line beginning with "+", and the sequence quality values.
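As an illustration of the four-line record layout, the following Python sketch parses FASTQ text into records. It is a minimal example rather than a production parser: the record type and function name are hypothetical, and the quality decoding assumes the common Phred+33 encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred quality scores


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per 4 lines: @name, sequence, + separator, qualities.

    The Phred+33 offset is an assumption of this sketch.
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, sep, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or sep is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


records = list(parse_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

Here the quality character `I` decodes to a Phred score of 40 under the assumed offset.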
The term "mendelian genetic law" relates to two basic laws of genetics, namely the law of segregation and the law of free combination, collectively known as mendelian genetic laws. According to Mendelian genetic law, in meiosis, alleles can be separated along with separation of homologous chromosomes, enter two gametes respectively and are independently inherited to offspring along with the gametes; in addition, non-alleles on non-homologous chromosomes appear to combine freely while alleles segregate.
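The law of segregation gives a simple test for genotype errors in a father-mother-progeny trio: the progeny genotype must consist of one allele from each parent. The Python sketch below illustrates such a check; the genotype encoding (two-character strings) and the function names are assumptions of this example.

```python
from itertools import product
from typing import Dict, Optional, Tuple


def is_mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child genotype can be formed from one paternal and one
    maternal allele. Genotypes are 2-character strings such as 'AG'."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mask_mendelian_errors(
    trios: Dict[str, Tuple[str, str, str]]
) -> Dict[str, Optional[str]]:
    """Map site -> child genotype, or None where the trio violates
    Mendelian inheritance. Each value in trios is (father, mother, child)."""
    return {
        site: (child if is_mendelian_consistent(father, mother, child) else None)
        for site, (father, mother, child) in trios.items()
    }


trio_calls = {
    "site1": ("AA", "GG", "GA"),  # consistent: one allele from each parent
    "site2": ("AA", "GG", "AA"),  # inconsistent: e.g. the maternal G dropped out
    "site3": ("AG", "AG", "GG"),  # consistent
}
cleaned = mask_mendelian_errors(trio_calls)
```

A check of this kind only catches errors that produce an impossible genotype; a dropout that happens to leave a Mendelian-consistent call passes it, which is why it is one of several complementary quality controls.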
The term "Linkage Disequilibrium (LD)" refers to the probability that alleles belonging to two or more loci, respectively, occur simultaneously on one chromosome, higher than the frequency of random occurrence. Linkage disequilibrium is also known as allelic association. In general, the intensity of LD is related to the distance between 2 locus sites. Generally, the further apart the two pairs of alleles are, the greater the chance of recombination occurring, i.e., the higher the rate of recombination (crossover) and the weaker the LD; conversely, the closer the distance, the lower the recombination rate and the stronger the LD. Thus, the recombination rate can be used to reflect the relative distance between two genes on the same chromosome. The distance between the two genes was recorded as 1 centiMorgan (cM) when the gene recombination rate was 1%.
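For two biallelic loci, the strength of LD is commonly summarized by the coefficient D and by r². The following Python sketch computes both from a list of phased haplotypes; these are standard population-genetics statistics shown for illustration, and the function name and 0/1 allele coding are assumptions of this example rather than part of the application.

```python
from typing import Sequence, Tuple


def linkage_disequilibrium(haplotypes: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r^2) for two biallelic loci.

    Each haplotype is (allele at locus 1, allele at locus 2), coded 0/1.
    D = p_AB - p_A * p_B, r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Alleles always travel together: complete LD.
d_linked, r2_linked = linkage_disequilibrium([(1, 1)] * 5 + [(0, 0)] * 5)

# All four combinations equally frequent: no LD.
d_free, r2_free = linkage_disequilibrium([(1, 1), (1, 0), (0, 1), (0, 0)])
```

An r² near 1 means that the allele at one locus predicts the allele at the other, which is what makes population-based genotype filling possible.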
The term "chromosomal interference" refers to the phenomenon in which two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and inhibit each other during meiosis. In this context, this suppression phenomenon is used for the quality control of genotyping data.
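Used as a quality-control rule, interference means that two apparent crossovers very close together are more likely to reflect genotyping errors than real recombination. The Python sketch below marks the markers between such a pair of crossovers as missing. It is a simplified illustration: the input layout, the function name and the way the distance is measured between the flanking markers are assumptions of this example.

```python
from typing import List, Optional, Sequence, Tuple


def mask_double_crossovers(
    markers: Sequence[Tuple[float, int]], max_cm: float = 1.0
) -> List[Optional[int]]:
    """Mask markers lying between two crossovers that are too close together.

    markers: (genetic position in cM, inherited parental haplotype 0/1),
    sorted by position. Returns the haplotype origins with suspicious
    markers replaced by None (missing data).
    """
    # Index i is a switch if the origin changes between marker i-1 and marker i.
    switches = [
        i for i in range(1, len(markers)) if markers[i][1] != markers[i - 1][1]
    ]
    masked: List[Optional[int]] = [origin for _, origin in markers]
    for first, second in zip(switches, switches[1:]):
        # Distance between the markers flanking the putative recombinant segment.
        span = markers[second][0] - markers[first - 1][0]
        if span <= max_cm:
            for k in range(first, second):
                masked[k] = None
    return masked


calls = [(0.0, 0), (0.2, 0), (0.4, 1), (0.5, 0), (5.0, 0), (30.0, 1), (60.0, 1)]
qc_result = mask_double_crossovers(calls)
```

In the example, the marker at 0.4 cM implies two crossovers within 0.3 cM and is masked, while the single crossover between 5 cM and 30 cM is kept.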
The formation of the progeny genome corresponds to one round of random recombination of the parental genomes (i.e., recombination of haplotypes through linkage and crossover, followed by random combination of gametes).
The term "Allele Dropout (ADO)" refers to the failure to amplify one of the two alleles in a heterozygous cell due to amplification bias, e.g., one allele is preferentially amplified while the other fails to amplify at all. In whole genome amplification of minute amounts of DNA, ADO can affect more than 40% of amplifications and has caused errors in preimplantation genetic diagnosis (PGD).
The term "haplotype" refers to a combination of alleles at multiple loci that are usually inherited together on the same chromosome. Haplotypes can refer to as few as two sites only, or to the entire chromosome, depending on the number of recombination events that have occurred between a set of designated sites. Haplotypes can also refer to a set of Single Nucleotide Polymorphisms (SNPs) on a single chromatid that are statistically related.
The term "haplotypic data", also referred to as "haplotype genetic data", "phased data" or "ordered genetic data", refers to genetic data that has been determined from a single chromosome in a diploid or polyploid genome.
The term "unordered genetic data" refers to data obtained by pooling together the sequencing data from two or more chromosomes in a diploid or polyploid genome.
The term "haplotype phasing" refers to the act of determining the haplotype genetic data of an individual from unordered diploid or polyploid genetic data. For a set of loci found on one chromosome, it may refer to the act of determining which of the two alleles at each locus is associated with which of the two homologous chromosomes in the individual. Haplotype analysis of multiple sites allows the discovery of haplotype-disease phenotype associations that are significantly stronger than single site-disease phenotype associations.
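As an illustration of this definition only, the following minimal sketch phases a single biallelic site of a child from the unordered genotypes of a father-mother-child trio. It is a textbook simplification and not the pedigree-based phasing procedure of the invention; the function name and the example genotypes are hypothetical.

```python
def phase_child_site(father, mother, child):
    """Phase one site of a child from unordered trio genotypes.

    Each genotype is an unordered pair of alleles, e.g. ("A", "C").
    Returns (paternal_allele, maternal_allele), or None when the site is
    uninformative (both orderings possible) or violates Mendelian inheritance.
    """
    c1, c2 = child
    candidates = set()
    # Try both assignments of the child's two alleles to the two parents.
    for paternal, maternal in ((c1, c2), (c2, c1)):
        if paternal in father and maternal in mother:
            candidates.add((paternal, maternal))
    if len(candidates) == 1:
        return candidates.pop()
    return None


# Father A/C, mother C/C, child A/C: the A allele can only be paternal.
print(phase_child_site(("A", "C"), ("C", "C"), ("A", "C")))
# All three heterozygous: the site cannot be phased from the trio alone.
print(phase_child_site(("A", "C"), ("A", "C"), ("A", "C")))
```

Sites that remain unresolved in this way are the ones for which linkage with neighbouring informative sites is needed.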
The term "SNP chip" refers to a chip that can determine the genotype of a given site using a signal (usually a fluorescent signal) obtained after hybridization to the chip. In practice, the SNP sites contained on a chip differ according to the chip manufacturer, model, etc. For example, the human chips manufactured by Affymetrix and Illumina contain different sets of SNPs.
The term "Identity By Descent (IBD)" means that two or more alleles are inherited from the same ancestor without any intervening genetic recombination event; such alleles are said to share a common ancestry. For methods of identifying IBD regions, see, for example, Browning BL, A fast, powerful method for detecting identity by descent, Am J Hum Genet. 2011 Feb 11; 88(2):173-82; and Augustine Kong et al., Detection of sharing by descent, long-range phasing and haplotype imputation, Nat Genet. 2008 Sep; 40(9):1068-1075.
The phrase "high-density genetic polymorphic site information of parents of an embryo and the like" reflects the fact that, even when the same genetic analysis means is used, the densities of the genetic polymorphic sites obtained for the parents and for the embryo differ. The parental samples are gDNA samples with a high DNA concentration, so most of the genotype site information can be obtained successfully. The embryo sample is usually a single-cell whole-genome amplification product, or a whole-genome amplification product of DNA in embryo culture fluid, and is subject to amplification errors such as non-uniform whole-genome amplification and ADO, so the usable genetic polymorphic site information of the embryo is relatively sparse.
The term "genome-wide association analysis (genome-wide association study)" refers to identifying the sequence variations present across the entire human genome and screening out those associated with disease, so as to find associations between genetic markers and disease at low cost and with high benefit.
As used herein, the term "module" refers to software objects or routines (e.g., as separate threads) that may be executed centrally on a single computing system (e.g., a computer program, a PAD, one or more processors). Programs embodying the methods of the present invention may be stored on computer-readable media having computer program logic or code portions embodied therein for implementing the described system modules and methods. While the system modules and methods described herein are preferably implemented in software, implementations in hardware or a combination of software and hardware are also possible and contemplated by those skilled in the art.
The following describes aspects of the present invention in detail.
Method of the invention
The present invention generally provides a method of reconstructing a progeny genome, the method comprising:
(a) acquiring genomic sequence information of the progeny;
(b) performing quality control and filtering on the genomic sequence information of the progeny from step (a), and removing sites with poor genotyping quality;
(c) phasing the haplotypes of the progeny from the quality-controlled and filtered progeny genomic sequence information, based on pedigree information and the like;
(d) imputing the missing genotypes of the progeny.
In the present invention, the offspring may be born or unborn offspring. In the present invention, the related individual may be any individual that has a genetic relationship with the target individual.
With respect to various aspects of the method of the present invention, the following is further described:
I. acquisition of raw genetic data
The methods of the invention in one aspect relate to genomic information processing and/or reconstruction based on raw genetic data. In the present invention, the raw genetic data that may be suitable for use in the methods of the invention include genomic sequence information of progeny and/or related individuals and their associated raw genetic data, such as genotype information generated based on the sequence information. These raw genetic data are unordered, unphased. In some embodiments, the raw genetic data is in the form of a data set, e.g., a computer-readable data set.
In the present invention, the route of acquisition of the raw genetic data is not particularly limited. For example, the data may be provided directly by a user of the method of the present invention, as a computer-readable medium on which the data is recorded or as a data packet generated on a commercial platform; or, preferably, the data may be obtained from a target nucleic acid sample by any sequence information detection technique known in the art.
In a preferred embodiment, in the methods of the invention, e.g., the methods of genetic information quality control, removal of noisy genetic data, haplotype phasing, and/or genome reconstruction of the invention, the obtaining of the raw genetic data comprises: obtaining genomic sequence information of the progeny, and performing genotype analysis of the progeny based on the information. In other embodiments, the obtaining further comprises: obtaining the genomic sequence information of related individuals and performing genotype analysis.
In some embodiments, preferably, the step of obtaining genomic sequence information of the progeny and/or related individuals comprises:
-performing whole genome amplification of nucleic acid samples of progeny and/or related individuals;
-detecting the genomic sequence information of the progeny from the amplification product.
In some embodiments, the genomic sequence information includes, but is not limited to: sequence information for a whole genome, sequence information for a whole exome, and/or sequence information for a targeted chromosomal region. The sequence information may be, for example, but not limited to, a gene sequencing dataset, a SNP dataset, a gene variation site dataset.
The technique for acquiring the sequence information is not particularly limited. Any sequence detection technique known in the art to be suitable for nucleic acids is suitable for use in the present invention. In some embodiments, preferably, sequencing techniques are used to detect sequence information, including but not limited to: whole genome sequencing, whole exome sequencing, and targeted sequencing. Preferably, whole genome sequence information is detected from the nucleic acid sample using whole genome sequencing, more preferably by high throughput sequencing techniques such as second generation sequencing techniques. In other embodiments, preferably, the sequence information may be detected by a method selected from the group consisting of: genome-based, genome-wide, and targeted-genomic-region-based detection of genetic polymorphic sites (e.g., SNPs or Short Tandem Repeats (STRs)) or genetic variant sites, preferably high-density detection of polymorphic sites or genetic variant sites. In one embodiment, a CBC-PMRA (CapitalBio Precision Medicine Research Array) chip, customized by CapitalBio Technology on the basis of the PMRA (Precision Medicine Research Array) chip from Affymetrix, Inc., is used, which can detect 900,000 SNP sites. In yet another embodiment, an ASA (Asian Screening Array) chip from Illumina is used, which can detect 800,000 labeled SNPs.
Progeny nucleic acid samples for obtaining raw data
In the methods of the invention, the progeny may be born or unborn progeny. In some preferred embodiments, the progeny is an unborn progeny, preferably a fetus or embryo, more preferably an embryo produced by, for example, IVF. In some embodiments, the embryo may be an embryo about 3-10 days old, for example, a blastocyst about 5 days old.
One skilled in the art can take a nucleic acid sample from a progeny for obtaining the progeny raw data using any known method.
In some embodiments, the progeny nucleic acid sample is a sample comprising trace amounts of progeny genomic DNA, e.g., a progeny trace nucleic acid sample comprising about 0.1pg to 40ng DNA, e.g., 1-40ng DNA, 20-40ng DNA, 0.1-40pg DNA, 1-40pg DNA, 10-40pg DNA. In some embodiments, the progeny trace genomic DNA sample includes, but is not limited to, embryo culture fluid (e.g., IVF embryo culture fluid), blastocyst culture fluid (e.g., culture medium of blastocysts about 3-5 days old), blastocoel fluid, fetal cell-free DNA in maternal plasma or other types of maternal bodily fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, and fetal cells in maternal blood or other types of maternal bodily fluids.
Methods for obtaining nucleic acid from embryos cultured in vitro or from embryo culture fluids and performing genomic amplification are known in the art. For example, CN106086199A discloses a method for detecting embryo chromosomes using blastocyst culture fluid, and in particular discloses how the blastocyst culture fluid is obtained and how genome amplification is performed on it. CN105368936A also discloses a method for detecting embryo chromosomes using blastocyst culture fluid, and in particular discloses the collection of blastocyst culture fluid and whole genome amplification from the trace amounts of DNA in the culture fluid, including the design of the amplification primers and of the amplification reaction program. CN109536581A discloses a manner of obtaining blastocyst culture fluid for use as a nucleic acid sample for genotyping. CN105543339A discloses obtaining trophectoderm cells at the blastocyst stage from embryos produced by In Vitro Fertilization (IVF) techniques and performing genomic amplification of the embryo chromosomes. The fetal nucleic acid samples and the methods for their amplification disclosed in the above-mentioned references are all suitable for use in the present invention for acquiring raw genetic data of progeny, and these references are hereby incorporated by reference in their entirety.
In some embodiments, culture fluid is aspirated from the culture of an embryo fertilized by intracytoplasmic sperm injection (ICSI), preferably on days 3-10, more preferably on day 5, of culture, and is used as a progeny trace nucleic acid sample for obtaining genomic sequence information of the progeny.
In some embodiments of the methods of the invention, after removing the zona pellucida, the embryos are cultured in 0.1 µL to 1 mL of culture medium using a single embryo culture system, and a small amount of culture medium (e.g., about 0.1 µL to 1 mL, e.g., about 0.1 µL, 10 µL, 20 µL, 30 µL, 40 µL, 50 µL, 100 µL, 200 µL, 500 µL, 800 µL, 1 mL) is separated from the culture for genetic information detection and genotyping of the progeny.
In some embodiments, the surface of the ovum or fertilized egg may be washed prior to embryo culture to remove DNA impurities from the surface of the fertilized egg, thereby reducing the effects of contaminating DNA in the culture broth. For this cleaning, see for example the description in CN201610584345.5 and TW 10612113. These documents are hereby incorporated by reference.
As will be understood by those skilled in the art, in the present invention, the type of progeny nucleic acid sample used for sequence detection is not particularly limited, and may be a sample containing a large amount of nucleic acid or a sample containing a trace amount of nucleic acid. As demonstrated in the examples, the method of the invention is particularly suitable for reconstructing embryonic genomic information from DNA of embryonic origin, such as cell-free DNA (cfDNA) in blastocyst culture fluid, which is present in small amounts and as small fragments. Thus, in some embodiments, the invention is particularly useful for prenatal diagnosis, e.g., haplotype phasing and/or genome reconstruction of offspring is performed before pregnancy is established (e.g., before embryo implantation in IVF techniques) on gametes and on cells or culture fluid taken from early embryos, or on cell samples taken from the placenta or fetus in late gestation, or on fetal-derived DNA taken from maternal bodily fluids, such as fetal cfDNA in maternal bodily fluids.
In some embodiments, the progeny are not yet born and a sample containing a trace amount of progeny nucleic acid is used, e.g., a single cell of an embryo produced by IVF or a culture of an embryo.
In some preferred embodiments, the progeny is a fetus, and the progeny nucleic acid sample comprises fetal cell-free DNA in an amount of, for example, 0.1pg to 40ng, preferably 1ng to 40ng, more preferably 20ng to 40ng. In other preferred embodiments, the progeny is an embryo and the progeny nucleic acid sample comprises embryonic cell-free DNA in an amount of, for example, 0.1-40pg, preferably 1-40pg, more preferably 10-40pg.
Related individual nucleic acid samples for obtaining raw data
The skilled person can use any known method to take a nucleic acid sample from the related individuals of the offspring, detect the genome sequence information of the related individuals, and further obtain the genotype and haplotype information thereof, thereby providing the family genotype information of the offspring.
In the present invention, the type of the nucleic acid sample used for the acquisition of the raw data of the relevant individual is not particularly limited, and may be a sample containing a large amount of nucleic acid, or a sample containing a trace amount of nucleic acid. In some embodiments, nucleic acid samples that facilitate high density genotype site information across the entire genome would be preferred.
In some embodiments, the nucleic acid sample is any sample comprising genomic DNA nucleic acid of the relevant individual. In some embodiments, the sample can be a nucleic acid sample from the related individual comprising at least about 1ng of DNA (e.g., 1pg-1000ng of DNA); for example, the relevant individual nucleic acid sample is a nucleic acid sample from a tissue, cell, and bodily fluid of the relevant individual, e.g., a nucleic acid sample from blood, saliva, oral epithelium, urine, nails, hair follicles, dander.
Depending on the method used for the detection of sequence information, the nucleic acid sample may or may not be extracted and/or purified. In some embodiments, the nucleic acid contained in the nucleic acid sample is genomic DNA (gDNA) from a source selected from the group consisting of: whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dandruff gDNA, preferably whole blood gDNA.
Whole genome amplification of nucleic acid samples
After obtaining the nucleic acid sample, in some embodiments, amplification of the nucleic acid can be performed. One skilled in the art can perform whole genome amplification of progeny and/or related individual nucleic acids using any known nucleic acid amplification technique.
Preferably, the amplification method is selected from: primer extension preamplification PCR (PEP-PCR); degenerate oligonucleotide primed PCR (DOP-PCR); multiple displacement amplification (MDA); multiple annealing and looping-based amplification cycles (MALBAC); blunt-end or sticky-end ligation library construction; or a combination thereof.
In the case of trace amounts of progeny nucleic acid in the sample, it is preferable to amplify using a whole genome amplification method suitable for single cells, such as the MALBAC method, to reduce erroneous gene sequence information resulting from the amplification.
Gene sequence information acquisition
In some embodiments, the sequence information of the genome is detected, preferably by a technique selected from the group consisting of nucleic acid chip, amplification and/or sequencing. The technique can be any such technique known in the art, including, but not limited to, a single nucleotide polymorphism microarray nucleic acid chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), second generation sequencing, third generation sequencing, or a combination thereof.
In some embodiments, genomic sequence information of progeny and/or related individuals is obtained via a SNP chip. In some embodiments, the SNP chip comprises at least 700K sites, such as 800-1000K sites, for whole genome sequence information acquisition.
In some embodiments, the genomic sequence information is obtained by whole genome sequencing. Preferably, a high throughput sequencing platform is used to sequence whole genome amplification products of a nucleic acid sample. The sequencing platform is not particularly limited. Second generation sequencing platforms include, but are not limited to, GA, GAII, GAIIx, HiSeq 1000/2000/2500/3000/4000, X Ten, X Five, NextSeq 500/550, and MiSeq from Illumina; SOLiD from Applied Biosystems; 454 FLX from Roche; and Ion Torrent, Ion PGM, and Ion Proton I/II from Thermo Fisher Scientific (Life Technologies). Third generation single molecule sequencing platforms include, but are not limited to, the HeliScope system from Helicos BioSciences, the SMRT system from Pacific Biosciences, and GridION and MinION from Oxford Nanopore Technologies. The sequencing type can be single-end or paired-end sequencing, and the read length can be any length of at least 30bp, such as 30bp, 40bp, 50bp, 100bp, 300bp and the like.
In some embodiments, whole genome sequencing is performed with a sequencing depth of, e.g., ≥20X for progeny and related individuals; more preferably, the sequencing depth can be higher for related individuals, e.g., at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, at least 80X, or more.
In some preferred embodiments, in whole genome sequencing, the gDNA of the relevant individual employs a high-depth whole genome sequencing strategy in order to obtain high-density polymorphic molecular markers with higher accuracy.
In some embodiments, because whole genome amplification of nucleic acid from an unborn progeny is inherently non-uniform, a low-depth whole genome sequencing method can be used for the amplification products, which is advantageous for cost control. Thus, in some embodiments, low-depth sequencing can be used to obtain relatively low-density genotype information from samples containing trace amounts of progeny nucleic acid, such as embryo culture fluid, e.g., at sequencing depths as low as 2X or even below 1X. Of course, the higher the sequencing depth, the higher the accuracy of progeny genome reconstruction.
In some embodiments, after the raw sequence information is obtained, quality control and filtering of the sequence information data is performed to remove low quality data. Any tool known in the art that can perform such information quality control and clean up noisy genetic data can be used herein, including, but not limited to, various software that performs data quality control filtering on the original fastq file resulting from sequencing, e.g., fastp software.
Genotyping analysis
A variety of means are known in the art for analyzing the genotype of a subject based on its genomic sequence information, including various algorithms and computer-executable programs. As will be appreciated by those skilled in the art, these means are applicable to genotyping in the methods of the invention.
In some embodiments, in the methods of the invention, the genotyping comprises: determining the genotype of the progeny and/or the related individuals based on the genomic sequence information of the progeny and/or the related individuals. In some preferred embodiments, the subject's SNP polymorphic sites and genotypes are determined, e.g., based on SNP chip detection data, or analyzed for genetic variation sites and genotypes in the subject's genome, e.g., based on sequencing data sets.
In some embodiments, the genotyping comprises:
-performing high throughput sequencing or SNP chip detection on the whole genome amplification products;
-obtaining allelic distribution information of genetic variation sites or SNP polymorphic sites by alignment to a reference genomic sequence; and
-optionally, correcting the sequence information of sequencing data obtained from a trace nucleic acid sample, such as embryo culture fluid, using as a database whole genome amplification data from a plurality (e.g., at least 200, 300, 400, or more) of trace nucleic acid samples.
The type of reference genomic sequence is not particularly limited. For example, a known human reference genome, such as the hg19 and hg38 reference genomes provided by UCSC, can be employed as a reference sequence. As will be clear to those skilled in the art, with different reference genome versions, such as hg19 or hg38, the coordinate system will be different. Therefore, during the analysis, the detected sequence data (such as sequencing data or SNP chip detection data) needs to be mapped to the specific reference genome used, and the consistency of the information is maintained. However, if desired, the genomic coordinates can also be transformed by means known in the art, for example using LiftOver.
In the present invention, the method of performing the alignment is not particularly limited. In a preferred embodiment, the sequences are aligned to a reference genome, such as hg19, using the BWA-MEM algorithm. Preferably, after alignment, the resulting alignment files are sorted and indexed, and duplicate removal and base quality score recalibration are performed.
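The alignment and post-processing steps just described could be scripted, for example, as follows. This is a minimal sketch under stated assumptions rather than the pipeline actually used: it assumes that `bwa`, `samtools`, and `gatk` are installed, it only builds the command lines without running them, and the helper name, file names, and read-group fields are hypothetical. Base quality score recalibration is omitted because it requires a known-sites resource that the text does not specify.

```python
import shlex


def build_alignment_commands(reference, fastq1, fastq2, sample, threads=4):
    """Build shell commands for BWA-MEM alignment, sorting, duplicate marking, indexing."""
    # Read group passed to bwa with literal "\t" separators (hypothetical fields).
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"

    # Align paired-end reads and pipe the output straight into a coordinate sort.
    align_and_sort = (
        f"bwa mem -t {threads} -R {shlex.quote(read_group)} "
        f"{shlex.quote(reference)} {shlex.quote(fastq1)} {shlex.quote(fastq2)} "
        f"| samtools sort -@ {threads} -o {shlex.quote(sorted_bam)} -"
    )
    # Mark duplicate reads with GATK (Picard) MarkDuplicates.
    mark_duplicates = (
        f"gatk MarkDuplicates -I {shlex.quote(sorted_bam)} "
        f"-O {shlex.quote(dedup_bam)} -M {shlex.quote(metrics)}"
    )
    # Index the final BAM file for random access.
    index = f"samtools index {shlex.quote(dedup_bam)}"
    return [align_and_sort, mark_duplicates, index]


for command in build_alignment_commands(
    "hg19.fa", "embryo1_R1.fastq.gz", "embryo1_R2.fastq.gz", "embryo1"
):
    print(command)
```

Each returned string could then be executed in order, for example with `subprocess.run(command, shell=True, check=True)`.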
In some preferred embodiments, the genotyping comprises:
-SNP chip detection of whole genome amplification products;
-aligning to a reference genomic sequence to obtain allelic distribution information for the SNP sites.
In other preferred embodiments, the method further comprises the step of obtaining information about the genotype of progeny of regions of the genome other than the nucleic acid chip site.
In some embodiments, a high density SNP chip is used that evenly covers the entire genome of Asian, especially Chinese, populations, to meet the needs of whole genome association analysis and genotyping. Preferably, the whole genome amplification products of the progeny and their related individuals are genotyped using a chip containing at least 500,000 (also referred to as 500K) SNP sites, at least 600K SNP sites, or 800K SNP sites or 900K SNP sites or even more.
A variety of SNP genotyping tools are known in the art. For example, the Genotyping module of the Axiom™ Analysis Suite platform available from Thermo Fisher Scientific can be used to perform SNP genotype analysis, and SNP sites whose genotype quality meets the PolyHighResolution, NoMinorHom, MonoHighResolution, or Hemizygous criteria are selected for the subsequent steps of the method.
In other embodiments, the methods of the invention comprise: acquiring whole genome sequencing data sets of the progeny and related individuals, and detecting genetic variation sites based on the data sets. In one embodiment, the sequencing data set is preferably a data set, for example in BAM format, obtained by subjecting the raw sequencing data to quality control and cleaning, aligning it to a reference genome, and then sorting and de-duplicating it. Quality control and cleaning of raw sequencing data are described in the prior art; see CN 108573125A. Preferably, the sequencer used to acquire the data includes an Illumina platform. In one embodiment, genetic variation analysis is performed using the Genome Analysis Toolkit (GATK) optimization strategy. In some embodiments, after analysis of the genetic variation sites, the obtained sites are subjected to quality control filtering to obtain sites that have genotype information in both parents and can be used for linkage analysis.
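The site selection by classification category described above can be sketched as a simple filter. This is a minimal illustration that assumes the per-probe classification has already been exported from the analysis platform into a mapping; the function name and the probe identifiers are hypothetical.

```python
# Classification categories that the text names as acceptable for later steps.
PASSING_CLASSES = {
    "PolyHighResolution",
    "NoMinorHom",
    "MonoHighResolution",
    "Hemizygous",
}


def select_passing_snps(classifications):
    """Return the probe IDs whose classification is in PASSING_CLASSES.

    classifications: dict mapping probe ID -> classification category string.
    """
    return [probe for probe, cls in classifications.items() if cls in PASSING_CLASSES]


# Hypothetical probe IDs with example categories.
example = {
    "probe_0001": "PolyHighResolution",
    "probe_0002": "CallRateBelowThreshold",
    "probe_0003": "NoMinorHom",
    "probe_0004": "Other",
}
print(select_passing_snps(example))
```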
Removal of noisy genetic data
When genotyping genetic data is performed using samples containing trace amounts of progeny nucleic acids, for example, cfDNA in embryo culture fluid or biopsy samples of embryo tissue or free DNA of fetuses as samples, there is often a high genotyping error rate due to the small amount and small fragments of progeny nucleic acids present in these samples.
Therefore, in order to obtain a high-quality genome reconstruction of an offspring embryo and/or fetus, it is necessary to clean noisy genetic data from the original genotyping data, removing sites of poor genotyping quality through quality control and filtering. Sites of poor genotyping quality include sites that inherently amplify inefficiently or that carry genotyping errors, such as sites affected by ADO, sites with random amplification errors, or sites that are poorly detected. In some embodiments, the removing of noisy genetic data comprises: identifying a site of genotyping error in the genotyping data and labeling the site as missing data.
The inventors have found that it is advantageous to perform the cleaning of noisy genetic data by a quality control means selected from the group consisting of: quality control of nucleic acid whole genome amplification efficiency, quality control of identification of Mendelian genetic errors, quality control of identification of violations of chromosomal interference suppression theory, quality control of mutual deductions of multiple progeny haplotypes, and combinations thereof.
Quality control based on whole genome amplification efficiency
The invention utilizes the quality control of the whole genome amplification efficiency to identify the locus with poor genotyping caused by low amplification efficiency in the progeny genotyping data, and marks the locus as the missing data.
The heterogeneity of the amplification efficiency of the whole genome is the characteristic of the single cell amplification technology used for the amplification of trace nucleic acid samples, and the region with low amplification efficiency can cause the genotyping quality of the base sites of the region to be poor. The inventor proposes that a reference sequencing data set is constructed by using multi-sample whole genome amplification products to determine a trace nucleic acid whole genome amplification efficiency distribution mode.
In one embodiment, quality control of the efficiency of whole genome amplification of a trace amount of nucleic acid is performed as follows:
performing low depth sequencing (sequencing depth no higher than about 0.5X, no higher than about 0.4X, no higher than about 0.3X, no higher than about 0.2X, no higher than about 0.1X, e.g., sequencing depth of about 0.06X) on a plurality of corresponding amplification product reference samples, obtaining BAM files aligning sequencing data to a reference genome, and merging BAM files of a plurality of reference samples into one large BAM library. For example, the BAM file of the plurality of reference samples is a BAM file of at least 300, 400, 500, 600, 700, 800 reference samples.
Further, the amplification efficiency of each site across the whole genome of the trace nucleic acid is calculated by the following formula:
[Formula image PCTCN2020121432-APPB-000003, not reproduced]
wherein
[Formula image PCTCN2020121432-APPB-000004, not reproduced]
wherein DP_i is the absolute depth of the i-th site, N is the number of sequencing reads, and L is the read length.
When DP_i is not less than the mean genome depth (DP_i ≥ 27), the amplification efficiency of the site is not less than 1, indicating that the site passes the quality control of trace nucleic acid whole genome amplification efficiency; the genotypes of sites that do not pass this quality control are marked as missing data. This quality control means is based on a finding made by the present inventors in their studies: compared with sites whose trace nucleic acid whole genome amplification efficiency is ≥ 1, sites with an amplification efficiency < 1 (DP_i < 27) have a Mendelian genetic error rate nearly 6-fold higher (FIG. 2). Therefore, in order to minimize the effect of trace nucleic acid whole genome amplification efficiency on embryo whole genome genotyping, the present inventors mapped the whole genome amplification efficiency distribution using second generation sequencing data of MALBAC whole genome amplification products of 463 trace nucleic acid samples, identified sites of low amplification efficiency (e.g., amplification efficiency < 1) based on the mapped distribution, and labeled those sites as missing data. Here, the Mendelian error rate refers to the ratio of the number of sites at which Mendelian genetic errors occur to the total number of sites.
In some embodiments, the present invention utilizes Mendelian inheritance rules and chromosome interference theory to identify sites of ADO and other genotype errors that occur in offspring genotyping data and label them as missing data.
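The amplification-efficiency filter can be sketched as follows. Because the formula images are not reproduced here, the sketch rests on two labeled assumptions: that the efficiency of a site is its depth DP_i divided by the mean genome depth (inferred from the statement that DP_i at or above the mean depth corresponds to an efficiency of at least 1), and that the mean depth is N × L divided by the genome size. All function names and example values are hypothetical.

```python
def mean_genome_depth(n_reads, read_length, genome_size):
    """Assumed definition: total sequenced bases (N * L) divided by genome size."""
    return n_reads * read_length / genome_size


def amplification_efficiency(site_depths, mean_depth):
    """Assumed definition: per-site depth DP_i divided by the mean genome depth."""
    return {site: depth / mean_depth for site, depth in site_depths.items()}


def mask_low_efficiency(genotypes, efficiency, threshold=1.0):
    """Replace genotypes at sites below the efficiency threshold with None (missing)."""
    return {
        site: genotype if efficiency.get(site, 0.0) >= threshold else None
        for site, genotype in genotypes.items()
    }


# Depths in the merged reference BAM library; 27 is the mean depth quoted in the text.
depths = {"chr1:10000": 54, "chr1:20000": 13}
efficiency = amplification_efficiency(depths, mean_depth=27.0)
embryo_genotypes = {"chr1:10000": "A/C", "chr1:20000": "C/C"}
print(mask_low_efficiency(embryo_genotypes, efficiency))
```

Sites missing from the reference depth table are treated as failing the filter, which is a conservative choice made for this sketch.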
Quality control based on Mendelian genetic law
Here, the Mendelian inheritance pattern means that if, at a certain site, the father has the A/C genotype and the mother has the C/C genotype, their offspring must have the A/C or C/C genotype unless a new mutation occurs (which happens at very low frequency). If the genotype of the offspring at the site is A/A, this suggests that ADO may have occurred at the site, and the information of the site is marked as missing data.
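The check described in this paragraph can be sketched for a father-mother-offspring trio as follows. This is a minimal illustration with hypothetical function and site names; it treats every inconsistent site as missing and ignores the rare case of a new mutation.

```python
def is_mendelian_consistent(father, mother, child):
    """True if the child could inherit one allele from each parent.

    Genotypes are unordered allele pairs, e.g. ("A", "C").
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def mask_mendelian_errors(trio_genotypes):
    """Mark inconsistent child genotypes as missing (None).

    trio_genotypes: dict mapping site -> (father, mother, child) genotypes.
    Returns (masked child genotypes, Mendelian error rate), where the error
    rate is the number of inconsistent sites divided by the total number of sites.
    """
    masked = {}
    errors = 0
    for site, (father, mother, child) in trio_genotypes.items():
        if is_mendelian_consistent(father, mother, child):
            masked[site] = child
        else:
            masked[site] = None
            errors += 1
    error_rate = errors / len(trio_genotypes) if trio_genotypes else 0.0
    return masked, error_rate


# Example from the text: father A/C, mother C/C.
trios = {
    "site1": (("A", "C"), ("C", "C"), ("A", "C")),  # consistent
    "site2": (("A", "C"), ("C", "C"), ("A", "A")),  # possible ADO, masked
}
print(mask_mendelian_errors(trios))
```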
Quality control based on chromosome interference theory
The chromosome interference theory refers to the phenomenon in which two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and inhibit each other during meiosis. The invention applies this inhibition theory as follows: when two crossovers or recombinations appear to occur between molecular marker sites within a short genetic distance, for example any distance of 1 centimorgan (cM) or less, the molecular markers in the apparently recombinant segment are judged to carry genotyping errors and are marked as missing data. For example, suppose that the haplotypes of the parents and of the offspring are constructed from the obtained genomic sequence information, and that two recombinations occur within a small genetic distance in a constructed haplotype: molecular marker A (an SNP site genotype) and the sites upstream of it indicate that the parental haplotype is inherited from the grandfather, the next molecular marker B indicates that the parental haplotype is inherited from the grandmother, and molecular marker C downstream of B and the sites after it again indicate inheritance from the grandfather. If sites A, B, and C lie within a relatively small genetic distance, e.g., as small as 1 cM, it can be inferred that the genotype at site B is wrong, and that genotype is marked as missing data.
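The A, B, C example above can be sketched as a scan for short segments flanked on both sides by the opposite grandparental origin. This is a minimal, single-pass illustration, not the full procedure of the invention: the function name, the "GF"/"GM" origin labels, and the example positions are hypothetical, and the 1 cM default follows the distance mentioned in the text.

```python
from itertools import groupby


def flag_double_recombinants(markers, max_cm=1.0):
    """Return indices of markers implying two crossovers within max_cm.

    markers: list of (genetic_position_cm, origin) tuples sorted by position,
    where origin labels the grandparental haplotype inferred at that marker
    (e.g. "GF" for grandfather, "GM" for grandmother).
    """
    # Collapse the marker list into runs of identical origin: (start, end, origin).
    runs = []
    index = 0
    for origin, group in groupby(markers, key=lambda m: m[1]):
        length = len(list(group))
        runs.append((index, index + length - 1, origin))
        index += length

    flagged = []
    # An interior run flanked by the same origin on both sides implies two crossovers.
    for k in range(1, len(runs) - 1):
        left_end = runs[k - 1][1]
        right_start = runs[k + 1][0]
        same_flank = runs[k - 1][2] == runs[k + 1][2]
        span = markers[right_start][0] - markers[left_end][0]
        if same_flank and span <= max_cm:
            flagged.extend(range(runs[k][0], runs[k][1] + 1))
    return flagged


# Marker at index 2 switches to "GM" and back within 0.4 cM, so it is flagged;
# the later switch to "GM" at 30 cM is an ordinary single crossover.
markers = [
    (10.0, "GF"), (10.2, "GF"), (10.4, "GM"), (10.6, "GF"),
    (11.0, "GF"), (30.0, "GM"), (31.0, "GM"),
]
print(flag_double_recombinants(markers))
```

The genotypes at the flagged indices would then be marked as missing data before phasing.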
Quality control based on mutual derivation of haplotypes of multiple offspring
In the methods of the invention, a plurality of (preferably more than 2) progeny samples (preferably unborn progeny samples, e.g., 2-4 or more blastocyst culture fluid samples; or samples from siblings of an embryo) are used, and genotype data are obtained for the plurality of progeny. Mutual derivation of the haplotypes of multiple offspring means that the most likely haplotype composition of the offspring is derived using the haplotype phasing method of the present invention together with the genotype data obtained for the multiple offspring, thereby identifying sites with genotype errors and labeling them as missing data.
In some preferred embodiments, the removing of noisy genetic data comprises: performing quality control of the whole genome amplification efficiency of the nucleic acid and at least one quality control selected from the group consisting of: quality control of Mendelian genetic errors, quality control of chromosomal interference suppression, and quality control of multiple progeny haplotypes deduced from each other.
In a preferred embodiment, the removing of noisy genetic data comprises: performing quality control of the whole genome amplification efficiency of the nucleic acid together with all three of the following: quality control of Mendelian genetic errors, quality control of chromosomal interference suppression, and quality control of multiple progeny haplotypes deduced from each other.
Haplotype phasing
After genotyping information has been acquired and noisy genetic data removed from it, haplotype phasing of the progeny may be performed to determine the paternal and maternal haplotype composition of the progeny.
In some embodiments, haplotype phasing is preferably performed on the basis of the pedigree to obtain the paternal and maternal haplotype composition of the offspring.
In some embodiments, the haplotype phasing comprises:
-distinguishing the two haplotypes of each of the father and the mother of the offspring (e.g., an embryo);
-constructing the paternal haplotype composition and the maternal haplotype composition of the offspring (e.g., an embryo), thereby determining which parental haplotype each of the two haplotypes of the offspring was inherited from.
In some embodiments, a multi-site linkage analysis strategy based on Mendelian genetic rules and linkage disequilibrium theory is employed to construct haplotypes of progeny at the chromosomal level. In some embodiments, haplotype phasing is performed using an algorithm selected from the group consisting of: the Lander-Green algorithm, the Elston-Stewart algorithm, and the Idury-Elston algorithm.
In some embodiments, the haplotype phasing methods of the invention further comprise: using nucleic acid samples from the paternal grandparents and/or maternal grandparents of the (e.g., unborn) offspring to construct the haplotypes of the unborn offspring and their parents.
In a preferred embodiment, haplotype analysis is performed using a pedigree-based multi-site linkage analysis method. In a preferred embodiment, the haplotype analysis uses a plurality of, preferably more than 2, progeny samples. In another preferred embodiment, the paternal grandparents and/or maternal grandparents of the unborn offspring may also be included in the haplotype analysis to construct the haplotypes of the unborn offspring and its parents.
In a preferred embodiment, the pedigree-based haplotype analysis is performed using one of the following: the Lander-Green algorithm, the Elston-Stewart algorithm, or the Idury-Elston algorithm.
In a preferred embodiment, haplotype construction is performed based on pedigree information to obtain the most likely haplotype composition that the progeny inherited from the parents. Construction methods include, but are not limited to, a likelihood strategy (the haplotype composition of maximum probability), a genetic rule strategy (the haplotype composition with the minimum number of recombinations), and the Expectation Maximization (EM) algorithm. Preferably, the likelihood strategy includes, but is not limited to, the Lander-Green algorithm with the Viterbi dynamic programming algorithm, and the Elston-Stewart algorithm with a Bayesian network algorithm; the Lander-Green algorithm with the Viterbi dynamic programming algorithm is preferred. Preferably, the genetic rule methods include the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy, and available software implementations include, but are not limited to, ZAPLO and HAPLORE.
In a preferred embodiment, haplotype phasing comprises: after obtaining genotyping information and removing noisy genetic data from the genotyping information, performing the following steps:
i) Constructing a binary gene flow vector $V_i$ for each site based on the pedigree structure, $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$. Here, $i$ denotes the site and $n$ denotes the number of embryos. $P_{n,i}$ denotes which paternal ancestral haplotype the $n$-th embryo inherited at site $i$: $P_{n,i} = 0$ indicates that the embryo inherited the haplotype of the paternal grandfather at this site, and $P_{n,i} = 1$ indicates that the embryo inherited the haplotype of the paternal grandmother. $M_{n,i}$ denotes which maternal ancestral haplotype the $n$-th embryo inherited at site $i$: $M_{n,i} = 0$ indicates that the embryo inherited the haplotype of the maternal grandfather at this site, and $M_{n,i} = 1$ indicates that the embryo inherited the haplotype of the maternal grandmother.
ii) calculating the maximum joint likelihood probability of the haplotype hidden state and genotype observation value of each site based on a hidden Markov model, wherein the formula is as follows:
Figure PCTCN2020121432-APPB-000005
wherein $m$ denotes the number of sites; $P(V_1)$ is the initial probability of the paternal or maternal ancestral haplotype state at the first site; $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from site $i-1$ to the adjacent site $i$, calculated from the recombination rate between the two sites, with the recombination rates between sites estimated from the genetic map of the 1000 Genomes Project Phase 3; and $P(G_i \mid V_i)$ is the probability of the genotype $G_i$ given the ancestral haplotype state $V_i$ at the site, calculated according to the Mendelian laws of inheritance;
iii) using the Viterbi algorithm to calculate the most likely hidden Markov state sequence $V = (V_1, V_2, \ldots, V_m)$, i.e., the most likely combination of ancestral haplotype states across the $m$ sites, thereby obtaining the most likely chromosome-level haplotype composition for each progeny.
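Steps i) to iii) can be illustrated with a small Viterbi sketch for a single embryo. It is a simplified illustration under stated assumptions, not the patented implementation: the joint probability is assumed to factorize in the standard hidden Markov form $P(V, G) = P(V_1) \prod_{i=2}^{m} P(V_i \mid V_{i-1}) \prod_{i=1}^{m} P(G_i \mid V_i)$ (the formula figure is not reproduced in this text); the parental haplotypes `pat_haps` and `mat_haps` are assumed to be already phased by grandparental origin; the initial probability is uniform; and `err` is a hypothetical genotyping error rate. The recombination fractions in `thetas` and the value of `err` must lie strictly between 0 and 1.

```python
import math
from itertools import product

# Hidden states for one embryo: (P, M), each 0 (grandfather) or 1 (grandmother).
STATES = list(product((0, 1), repeat=2))

def log_transition(prev, cur, theta):
    """Log P(V_i | V_{i-1}): each parental meiosis switches with probability theta."""
    switches = sum(a != b for a, b in zip(prev, cur))
    return switches * math.log(theta) + (2 - switches) * math.log(1 - theta)

def log_emission(obs, state, pat_haps, mat_haps, i, err):
    """Log P(G_i | V_i) under a simple error model; missing calls are uninformative."""
    if obs is None:
        return 0.0
    p, m = state
    expected = sorted((pat_haps[p][i], mat_haps[m][i]))
    return math.log(1 - err) if sorted(obs) == expected else math.log(err)

def viterbi(obs, pat_haps, mat_haps, thetas, err=0.01):
    """Most likely sequence of (P, M) states along one chromosome."""
    score = {s: math.log(1 / len(STATES))
                + log_emission(obs[0], s, pat_haps, mat_haps, 0, err)
             for s in STATES}
    backpointers = []
    for i in range(1, len(obs)):
        new_score, pointer = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda prev: score[prev]
                       + log_transition(prev, cur, thetas[i - 1]))
            pointer[cur] = best
            new_score[cur] = (score[best]
                              + log_transition(best, cur, thetas[i - 1])
                              + log_emission(obs[i], cur, pat_haps, mat_haps, i, err))
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(STATES, key=score.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

# Toy data: index 0 = haplotype from the grandfather, 1 = from the grandmother.
pat_haps = ("AAAAA", "CCCCC")
mat_haps = ("GGGGG", "TTTTT")
obs = [("A", "G"), None, ("C", "G"), ("A", "G"), ("G", "A")]
print(viterbi(obs, pat_haps, mat_haps, thetas=[0.001] * 4))
```

In this toy input the isolated C/G call at the third site is cheaper to explain as a genotyping error than as two recombinations, so the decoded path stays in state (0, 0) throughout. Extending the state to $n$ embryos would enlarge the state space to $2^{2n}$ gene flow vectors.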
Reconstruction of progeny subject genomes
The invention provides methods for reconstructing progeny genomes. In one embodiment, haplotype phasing is performed before genotype filling to infer the haplotypes of the sample. Genotype filling is then performed for the alleles missing from the phased haplotypes.
Genotype filling is performed because data are missing from the progeny subject's genome. A missing genotype refers to a site of unknown genotype, i.e., a region of the sample not covered by sequencing data or a site whose sequence information is absent; such sites are also referred to as missing data. Missing genotype data can be divided into genetic missingness and detection-related missingness. Genetic missingness is the absence of a genotype caused by variation in the individual's genetic information (e.g., actual deletion of the DNA segment at the site). Detection-related missingness is the loss of sequence information due to limitations, errors, etc. of the detection technique. All genotyping techniques produce detection-related missing genotypes. For example, in chip probe hybridization sequencing technology, missing genotypes arise from limited probe hybridization capture efficiency. In this context, missing genotypes include both of the above types as well as sites of poor genotyping quality that have been removed from the progeny genomic sequence information by noisy genetic data elimination.
In the methods of the invention, after elimination of noisy genetic data, the sites with missing data in the genome of the progeny (e.g., embryo) may, in some embodiments, account for at least 1/2 of the whole genome, e.g., 4/5 or more. For trace-amount progeny nucleic acid samples, such as cfDNA in the culture fluid of IVF-derived embryos, the proportion of missing genotypes is higher because of the small amount of DNA, its short fragments, poor amplification uniformity, and the like. In one embodiment, up to about 6/7 of the progeny's loci may have missing genotypes before genotype filling.
In one embodiment, the method of reconstructing a progeny subject genome of the present invention comprises: performing haplotype phasing and filling of missing genotypes on the offspring genotypes after elimination of noisy genetic data, in combination with pedigree genotype information from related individuals. The missing genotypes may be genotypes at loci that were not amplified in the progeny, loci marked as missing data by the elimination of noisy genetic data, or both.
In some preferred embodiments, the present invention provides a method of reconstructing a progeny subject genome, comprising the steps of:
(a) providing a data set for the analysis, the data set comprising: a 1st data set from a progeny subject, a 2nd data set from the father of the progeny subject, and/or a 3rd data set from the mother of the progeny subject; wherein each data set is a genotyping information data set obtained by genetic detection and genotyping analysis of nucleic acids or nucleic acid amplification products of the progeny subject or its parents, and the progeny subject is preferably an unborn progeny subject;
(b) performing noisy genetic data elimination on the data set to remove locus genotypes of poor typing quality, wherein the data set subjected to quality control is one or more data sets including the 1st data set;
(c) performing haplotype phasing on the typing data obtained in step (b); and
(d) filling the missing progeny genotypes, thereby reconstructing the whole-genome genotype information of the progeny subject.
In a preferred embodiment of the method, step (d) further comprises: additionally performing genotype filling based on family and/or population typing data, thereby reconstructing the whole-genome genotype information of the progeny subject. The family members are genetically related relatives other than the parents of the progeny subject, such as siblings, paternal grandparents, and/or maternal grandparents. The population typing data may be reference haplotypes and haplotype frequency information from, for example, HapMap and 1000 Genomes.
Those skilled in the art will appreciate that the haplotype phasing and genotype filling of the present invention may be repeated for multiple different progeny chromosomes; preferably, the number of repetitions is determined by the number of chromosomes in the progeny, so as to obtain a genome-wide reconstruction. For example, for diploid progeny, it is repeated 23 times (for female individuals) or 24 times (for male individuals).
In a preferred embodiment, pedigree-based genotype filling comprises: filling the missing genotypes in the offspring using an Identity By Descent (IBD) strategy, based on the parents' high-density polymorphic molecular marker information and the composition of paternal and maternal haplotypes in the offspring constructed by haplotype phasing.
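A minimal sketch of pedigree-based (IBD) filling is given below, assuming that haplotype phasing has already produced, for every site, the pair (P, M) telling which haplotype of the father and which haplotype of the mother the embryo inherited. The function name and data layout are hypothetical; `None` denotes a missing call or allele.

```python
def fill_from_parents(child_calls, states, pat_haps, mat_haps):
    """Fill missing child genotypes by copying the alleles carried on the
    parental haplotypes that the child inherited at each site."""
    filled = []
    for i, (call, (p, m)) in enumerate(zip(child_calls, states)):
        if call is not None:
            filled.append(call)          # keep observed genotypes unchanged
            continue
        pat_allele, mat_allele = pat_haps[p][i], mat_haps[m][i]
        if pat_allele is None or mat_allele is None:
            filled.append(None)          # also missing in a parent: leave unfilled
        else:
            filled.append((pat_allele, mat_allele))
    return filled

# Father's and mother's two phased haplotypes (index 0 and 1) over three sites.
pat_haps = (["G", "A", "C"], ["C", "G", "T"])
mat_haps = (["T", "A", None], ["A", "C", "G"])
child = [("G", "T"), None, None]
states = [(0, 0), (0, 0), (0, 0)]  # inherited haplotype 0 from each parent
print(fill_from_parents(child, states, pat_haps, mat_haps))
# [('G', 'T'), ('A', 'A'), None]
```

Sites left as `None` here are the ones that population-based filling would address afterwards.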
In another preferred embodiment, population-based genotype filling comprises: analyzing the genotype information of the offspring at the whole-genome level based on population linkage disequilibrium patterns and on reference haplotypes and haplotype frequency information such as HapMap and 1000 Genomes. The analysis method may be selected from the group consisting of: expectation-maximization (EM), Hidden Markov Model (HMM), Markov Chain Monte Carlo (MCMC), coalescent theory, or a combination thereof. Genotype filling algorithms based on population linkage disequilibrium include, but are not limited to, IMPUTE(2), MaCH, Beagle, and Minimac.
Preferably, for genotype information that has not been successfully filled in based on family information, progeny genomic information completion is performed using population-based genotype filling.
In the present invention, genotype filling of progeny samples can be performed by comparison with the genotypes of the father and mother samples on the basis of pedigree analysis. In a preferred embodiment, the genotypes of other pedigree individuals may additionally be included in the pedigree analysis and genotype filling of the progeny subject, such as siblings of the progeny or embryos from the same father and mother (including IVF-produced embryos and their culture fluid), and/or the paternal and maternal grandparents of the progeny. For example, in one embodiment, the missing genotypes of the offspring may be completed in combination with haplotype inference for the offspring.
Through extensive research, the present inventors have found that filling the missing genotypes in progeny by combining the identification of identical-by-descent (IBD) regions with genotype information of related individuals (in particular, high-density polymorphic site (SNP) genotype information of the parents) yields allele estimates of higher accuracy, e.g., 90% or more, and even 99% or more.
Thus, in some preferred embodiments, the filling of the missing progeny genotypes in step (d) of the method of the invention comprises: combining the identification of identical-by-descent regions, i.e., determining which parental haplotypes make up a given region of the embryo, with the genotype information of the parents' high-density polymorphic loci to fill in the missing genotype locus information in the offspring;
and optionally, for genotype information that was not successfully filled on the basis of family information, filling genotype information at the whole-genome level using population reference haplotype information and population-level patterns of allelic Linkage Disequilibrium (LD);
preferably, the population reference haplotype information is from HapMap, 1000 Genomes, or HRC (Haplotype Reference Consortium);
preferably, the genotype filling algorithm based on population-level allelic Linkage Disequilibrium (LD) is, for example, the IMPUTE(2), MaCH, Beagle, or Minimac algorithm;
preferably, genotype filling is performed using expectation-maximization (EM), Hidden Markov Model (HMM), Markov Chain Monte Carlo (MCMC), coalescent theory, or a combination thereof.
In some embodiments, the reconstruction preferably comprises:
-determining the embryo chromosome segment that needs to be filled,
-using the population reference haplotype information to find a segment of the population haplotype that is most similar to the haplotype of the offspring in this segment;
-complementing the information missing from the embryo based on the genotypic information for the haplotype of the segment in the population.
In population-based genotype filling, one may consider selecting a reference population panel whose genetic background is closer to that of the progeny subject. For example, when the offspring is a Chinese individual, the Chinese population reference haplotype information of 1000 Genomes Phase 3 may be used. Whole-genome genotype reconstruction of the progeny can be performed using population-based genotype filling software known in the art, including, but not limited to, the MaCH software package.
Coverage of progeny genomes can be further increased by population-based genotype filling; however, at the same time, the genotype estimation accuracy may decrease. Thus, in a preferred embodiment, whether population-based genotype filling is to be performed may be determined based on the following factors: (1) desired genotype fill-in accuracy, (2) desired genome coverage, (3) desired coverage or lack thereof of the target region.
In the present invention, the accuracy of pedigree-based genome reconstruction of the offspring (such as an embryo) is 90% or more, preferably 95% or more, more preferably 97% or more, and most preferably 99% or more. Preferably, the genome coverage of the progeny after pedigree-based genotype filling is above 60%, e.g., above 70% or above 80%.
In the present invention, the accuracy of genome reconstruction of progeny (e.g., embryos) is greater than 90%, preferably greater than 95%, more preferably greater than 97%, and most preferably greater than 99% after population-based genotype filling. Preferably, the genomic coverage of the progeny is further increased after genotype filling of the pedigree.
In addition to the preferred embodiments described above, it is to be understood that other methods suitable for genotype filling based on the genetic characteristics of pedigree samples and for genotype filling based on population genetic characteristics are also contemplated by the present invention.
Generally, genotype filling comprises two steps:
(1) summarizing and classifying the genotype patterns of the target site/region from its non-missing sites, that is, analyzing the haplotype composition of each region of the reference sample (e.g., family sample);
(2) determining which haplotype the region belongs to from the non-missing sites upstream and downstream of the missing site in the progeny sample, and filling in the missing site of the progeny sample according to the genotype of that haplotype.
For example, the complete genotype of the sample can be reconstructed by first determining, from the genotype information at the sample's non-missing sites, which haplotype in the reference haplotype set is most similar to the sample, and then assigning that most similar haplotype to the sample.
In genotype filling based on the genetic characteristics of pedigree samples, the haplotypes of the progeny sample to be filled are generally compared with, for example, the haplotypes of the father and mother to find a haplotype shared between them; the loci on the matched reference template are then copied into the target data set of the progeny to reconstruct the genotype of the progeny sample.
In population-based genotype filling, the progeny sample to be filled is generally aligned with, for example, the reference population haplotypes to find a haplotype shared between them; the loci on the matched reference template are then copied into the target data set of the progeny to reconstruct the genotype of the progeny sample.
In some preferred embodiments, the pedigree-based embryo progeny genotype fill comprises the steps of:
(1) constructing embryo parental haplotype:
As shown in FIG. 1, for a particular chromosome, chromosome-level haplotypes are constructed based on pedigree information (the parents) using a multi-site linkage analysis strategy that relies on the Mendelian laws of inheritance and the theory of genetic linkage and crossing-over. The two haplotypes of the father (and of the mother) of the embryo are distinguished, and the paternal and maternal haplotype composition of the embryo is constructed at the same time, i.e., which haplotypes of the father and the mother the embryo has inherited. When some heterozygous sites in the offspring cannot be phased, additional relatives, such as siblings of the embryo or its grandparents (grandparent information), can be added to increase the number of sites that can be phased.
(2) Genome information completion of embryos:
The genome sequence information of the amplified embryo sample and of the gDNA samples of the embryo's parents is detected to obtain genotyping information. However, the trace amount of fetal DNA in the embryo culture fluid (or in other samples containing trace amounts of progeny DNA) may not be fully amplified because of the non-uniformity of whole genome amplification. Taking an SNP chip as an example, only about 1/5 of the loci on the chip pass the chip genotyping quality control; once the amplification efficiency and genotyping error quality controls of the invention are also applied, genotype information is missing at even more loci of the embryonic genome. Therefore, on the basis of haplotype construction, the missing genotype information on a given chromosome of the embryo is filled by combining the Identity By Descent (IBD) method with the genotyping information of the parents' high-density polymorphic loci.
FIG. 1 shows an example of genotype filling. As shown at 6.1 and 6.2 in FIG. 1, haplotype construction establishes unambiguously which haplotypes the embryo inherited from the father and the mother. Based on the more complete allelic information of the parents on these haplotypes, first-stage genotype filling is performed: the filled embryo haplotype information is G.A.A.C.GA..T and C.G.T.T.CA..A, and the genotype information of 3 loci missing in the embryo, namely G/C, A/T, and A/A, is filled in. By analogy, the haplotype information of the offspring at the whole-chromosome level can be obtained. If, for example, the offspring to be tested has a sibling in the family, as shown at 6.3 in FIG. 1, second-stage genotype filling is performed based on the sibling's haplotypes: the embryo to be tested has the same paternal haplotype as the sibling but a different maternal haplotype, so that a further 3 genotypes missing in the embryo, namely A/C, C/T, and C/A, can be filled in, and the haplotype information after this filling is G.A.AACCGAC.T and C.G.TCTTCAA.A.
In one embodiment, the family-based genotype fill-in is followed by a population-based genotype fill-in of embryos to complement the genotype information that is missing in both parents and embryos. In a preferred aspect, the filling comprises:
using the progeny haplotypes constructed by the method, the locus genotype information on a given chromosome of the embryo that is also missing from the parental genome information is completed on the basis of population linkage disequilibrium patterns and reference haplotypes and haplotype frequency information such as HapMap and 1000 Genomes. In particular, population reference haplotype information can be used to find the population haplotype segment that most closely resembles the embryo's haplotype, and the information missing in the embryo is then completed from the other genotype information of this haplotype segment in the population. As shown at 6.4 in FIG. 1, the two haplotypes in the embryo are G.A.AACCGAC.T and C.G.TCTTCAA.A, and the most similar and most frequent matching haplotypes in the population are GTACAACCGACGT and CGGATCTTCAACA, so the genotype information of the three missing loci in the embryo, T/G, C/A, and G/C, is filled in. Estimation methods that can be used for this filling include, but are not limited to, expectation-maximization (EM), Hidden Markov Models (HMM), Markov Chain Monte Carlo (MCMC), and coalescent theory.
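The FIG. 1 (6.4) example can be reproduced with a short Python sketch of nearest-haplotype matching. This is only an illustration: it picks the reference haplotype with the fewest mismatches at the non-missing sites and ignores haplotype frequencies and linkage disequilibrium modelling, which tools such as IMPUTE, MaCH, Beagle, and Minimac handle statistically. The function names and the two-haplotype `panel` are assumptions made for the example.

```python
def best_reference(target, panel, missing="."):
    """Reference haplotype with the fewest mismatches at non-missing sites."""
    def mismatches(ref):
        return sum(1 for t, r in zip(target, ref) if t != missing and t != r)
    return min(panel, key=mismatches)

def fill_from_reference(target, panel, missing="."):
    """Copy alleles from the best-matching reference haplotype into the gaps."""
    ref = best_reference(target, panel, missing)
    return "".join(r if t == missing else t for t, r in zip(target, ref))

# Haplotypes from the FIG. 1 example; '.' marks a missing allele.
panel = ["GTACAACCGACGT", "CGGATCTTCAACA"]
print(fill_from_reference("G.A.AACCGAC.T", panel))  # GTACAACCGACGT
print(fill_from_reference("C.G.TCTTCAA.A", panel))  # CGGATCTTCAACA
```

Pairing the filled alleles at the three gaps gives T/G, C/A, and G/C, matching the loci listed in the text.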
In some embodiments, to achieve better genotype fill, the methods of the invention may comprise:
(1) Quality control of the genotypes before filling (pre-imputation) to filter out low-quality variant sites and samples, including the noisy genetic data elimination method of the invention described above.
(2) Determining the genomic coordinate system used in the analysis, for example, the UCSC genomic coordinate system (e.g., "hg19").
(3) Selecting a reference template (reference panel), and performing genotype filling of the progeny based on the family-based reference template and/or the population-based reference template.
V. Exemplary embodiments of the method of the invention
In a preferred embodiment, the process of the invention comprises the steps of:
1. Non-invasively obtaining the embryo nucleic acid sample. The embryonic nucleic acid sample may be taken from the free DNA in the embryo culture fluid.
2. Whole genome amplification of the trace embryonic DNA in the culture fluid. The amplification uses a single-cell amplification strategy; the specific method is not limited and includes, but is not limited to, primer extension preamplification PCR (PEP-PCR), degenerate oligonucleotide primed PCR (DOP-PCR), Multiple Displacement Amplification (MDA), Multiple Annealing and Looping-Based Amplification Cycles (MALBAC), blunt-end or cohesive-end ligation library construction, and the like.
3. Extracting gDNA from the parents of the embryo.
4. Performing genetic detection and analysis on the DNA products obtained by whole genome amplification of the embryo culture fluid and on the gDNA of family members such as the embryo's parents. Detection may use nucleic acid chips, next-generation sequencing, or other platforms, and the genetic analysis uses a genotyping method to obtain the genotype information of the parents and the embryo.
5. Performing quality control and filtering of the embryo DNA analysis data.
1) Quality control of trace nucleic acid whole genome amplification efficiency: non-uniform amplification efficiency across the genome is a characteristic of amplification techniques for trace nucleic acids (for example, trace nucleic acid from a single cell), and a region of low amplification efficiency degrades the genotyping quality of the base sites within it. In view of this, the present inventors constructed a reference sequencing data set from whole genome amplification products of multiple samples in order to determine the distribution pattern of trace nucleic acid whole genome amplification efficiency. Specifically, a plurality of reference samples of the corresponding amplification products are sequenced at low depth (e.g., about 0.06X), the sequencing data are aligned to a reference genome to produce BAM files, and the BAM files of the reference samples are merged into one large BAM library.
Figure PCTCN2020121432-APPB-000006
Here, N represents the number of sequencing reads, and L represents the read length;
2)
Figure PCTCN2020121432-APPB-000007
$DP_i$ denotes the absolute depth (total depth) of the $i$-th site, i.e., the number of reads covering the site. In this example, the mean genome sequencing depth of the reference-sample BAM library was 27. The inventors found that, compared with sites whose trace nucleic acid whole genome amplification efficiency is ≥1X ($DP_i \geq 27$, i.e., the absolute depth of the site is not less than the mean genome depth), sites with amplification efficiency <1X ($DP_i < 27$) have a nearly 6-fold higher Mendelian error rate (FIG. 2). Therefore, to minimize the effect of trace nucleic acid whole genome amplification efficiency on whole-genome genotyping of the embryo, the present inventors used next-generation sequencing data from multiple trace nucleic acid amplification products (different amplification methods require corresponding amplification reference samples) to construct a whole genome amplification efficiency distribution map, on which sites of low amplification efficiency (<1X) are identified and labeled as missing data.
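The amplification efficiency filter can be sketched as follows. The two formula figures are not reproduced in this text, so the sketch assumes the conventional definitions: mean depth equals the number of reads N times the read length L divided by the genome size, and the relative amplification efficiency of a site equals $DP_i$ divided by the mean depth. All function and parameter names are hypothetical.

```python
def mean_depth(num_reads, read_length, genome_size):
    """Assumed definition: average sequencing depth = N * L / genome size."""
    return num_reads * read_length / genome_size

def amplification_efficiency(site_depths, average_depth):
    """Assumed definition: per-site efficiency = DP_i / mean genome depth."""
    return [dp / average_depth for dp in site_depths]

def mask_low_efficiency(genotypes, site_depths, average_depth, min_efficiency=1.0):
    """Label genotypes at sites with efficiency below min_efficiency as missing."""
    efficiencies = amplification_efficiency(site_depths, average_depth)
    return [g if e >= min_efficiency else None
            for g, e in zip(genotypes, efficiencies)]

# With the mean depth of 27 from the example, sites with DP_i < 27 are masked.
genotypes = [("A", "C"), ("C", "C"), ("A", "A")]
reference_depths = [54, 27, 9]  # DP_i taken from the merged reference BAM library
print(mask_low_efficiency(genotypes, reference_depths, average_depth=27))
# [('A', 'C'), ('C', 'C'), None]
```

The depths come from the reference library rather than from the embryo sample itself, which is why the same mask can be reused for every embryo amplified with the same method.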
3) Identifying embryo sites with erroneous genotypes and labeling them as missing data: after whole genome amplification of the trace embryonic DNA, in addition to sites that are not amplified or are poorly genotyped because of low amplification efficiency, some sites suffer from Allele Dropout (ADO) caused by amplification bias, in which one of the two alleles is preferentially amplified or the other allele fails to amplify at all, which affects genotyping at those sites. Here, the haplotypes of the offspring and the parents are first constructed using step 6.1, and sites in the embryo where ADO and other genotype errors occur are then identified using the Mendelian laws of inheritance and chromosome interference and labeled as missing data.
6. Reconstructing the whole-genome genotype information of the embryo using statistical genetics and computational biology algorithms. The specific steps are as follows:
6.1 Construction of the embryo's parental haplotypes: for a given chromosome, chromosome-level haplotypes are constructed based on pedigree information (the parents) using a multi-site linkage analysis strategy that relies on the Mendelian laws of inheritance and the theory of genetic linkage and crossing-over, with algorithms such as the Lander-Green algorithm, the Elston-Stewart algorithm, and the Idury-Elston algorithm (FIG. 1). The two haplotypes of the father (and of the mother) of the embryo are distinguished, and the paternal and maternal haplotype composition of the embryo is constructed at the same time, thereby determining which parental haplotypes the embryo has inherited. Optionally, more progeny samples and/or genotype information of other pedigree individuals are added to the analysis. Specific methods or frameworks include, but are not limited to, a likelihood strategy (the haplotype composition of maximum probability), a genetic rule strategy (the haplotype composition with the minimum number of recombinations), and the Expectation Maximization (EM) algorithm. The likelihood strategy includes, but is not limited to, the Lander-Green algorithm with the Viterbi dynamic programming algorithm, and the Elston-Stewart algorithm with a Bayesian network algorithm; the Lander-Green algorithm with the Viterbi dynamic programming algorithm is preferred. Genetic rule methods include the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy, and software implementations include, but are not limited to, ZAPLO and HAPLORE. If information is available for only one offspring, e.g., only one embryo, some heterozygous loci may not be phasable. More sites can be phased if additional relatives are available, such as siblings of the embryo or its grandparents (grandparent information). Haplotype phasing can also be performed with only one parent, but the number of sites that can be phased is reduced, which affects the accuracy of genome information filling.
6.2 Completion of the embryo's genome information. Based on the haplotype construction in step 6.1, the missing genotype information on a given chromosome of the embryo is filled by combining the Identity By Descent (IBD) method with the genotyping information of the parents' high-density polymorphic sites.
6.3 Reconstructing haplotypes based on information from other family members (such as siblings of the offspring to be tested), and further filling the missing genotype information of the embryo based on the Mendelian laws of inheritance and the haplotype information of those siblings. Generally, the more additional family members (e.g., siblings of the offspring to be tested) are available, the more genotypes can be filled and the higher the accuracy.
6.4 Optionally, completing the genotype information that is missing in both the parents and the embryo. Using the haplotypes constructed in step 6.1, the locus genotype information on a given chromosome of the embryo that is also missing from the parental genome information is completed on the basis of population linkage disequilibrium patterns and reference haplotypes and haplotype frequency information such as HapMap and 1000 Genomes.
6.5 Repeating steps 6.1, 6.2, 6.3, and 6.4 for each chromosome, 24 times in total (22 autosomes + X chromosome + Y chromosome), to construct the whole-genome genotype information of the embryo.
In the above embodiment, preferably, if the DNA product is analyzed by whole genome sequencing strategy, embryo parents use whole genome sequencing at a higher depth to obtain accurate whole genome genotype information, for example, the sequencing depth is greater than or equal to 20 x. In the above embodiment, preferably, by utilizing the property of heterogeneity of single cell amplification, the embryo can obtain genotype information of a locus capable of being amplified by using a higher-depth whole genome sequencing method, and can also obtain genotype information of a relatively low density by using a low-depth whole genome sequencing method, for example, the sequencing depth can be as low as 2x, or even lower.
Computer product, system and apparatus for implementing the method of the invention
The invention also provides computer products, systems, and devices for cleaning noisy genetic data, phasing haplotypes, and/or reconstructing progeny genomes.
Computer product
In one aspect, the present invention provides a device for reconstructing a progeny genome (in particular a haplotype), the device comprising:
a non-transitory computer readable storage medium having stored thereon instructions for performing the method of the invention for reconstructing progeny genomic information, comprising:
-one or more instructions for receiving input comprising genomic sequence information of a progeny;
-one or more instructions for quality control and filtering of the input genomic sequence information;
-one or more instructions for reconstructing genomic information of the progeny to determine a haplotype of the progeny.
Preferably, the device comprises the following modules:
(1) a sequence information data acquisition module, for obtaining raw sequence information data of the progeny and/or related individuals;
(2) a genotype analysis module, for analyzing genotypes from the raw sequence information data obtained by module (1);
(3) a quality control and filtering module, for performing quality control and filtering on the genotype information obtained by module (2);
(4) a haplotype phasing module, for phasing haplotypes from the genotypes that passed the quality control and filtering of module (3);
(5) a genotype filling module, for further reconstructing progeny genotypes from the haplotypes phased by module (4);
(6) optionally, a report output module, for processing and integrating the data obtained by modules (1) to (5) to generate a report.
In a preferred embodiment, the present invention provides an apparatus comprising:
at least one processor and at least one memory, the at least one memory having code stored thereon which, when executed by the at least one processor, causes the apparatus to perform a method of the present invention. Preferably, the code, when executed by the at least one processor, causes the apparatus at least to perform:
-receiving sequence information data, e.g., raw sequence information data of the progeny and/or related individuals;
-analyzing the genotype based on the received raw sequence information data;
-quality control and filtering of the genotype information obtained by the analysis;
-haplotype phasing of the quality-controlled and filtered genotypes;
-further reconstruction of progeny genotypes from the phased haplotypes.
In one embodiment, the invention also provides a computer-readable storage medium having stored thereon code for use by an apparatus, which code, when executed by a processor, causes the apparatus to perform the method of the invention for reconstructing progeny genomic information. Preferably, the code, when executed by the at least one processor, causes the apparatus at least to perform:
-receiving sequence information data, e.g., raw sequence information data of the progeny and/or related individuals;
-analyzing the genotype based on the received raw sequence information data;
-quality control and filtering of the genotype information obtained by the analysis;
-haplotype phasing of the quality-controlled and filtered genotypes;
-further reconstruction of progeny genotypes from the phased haplotypes.
System
In one aspect, the invention provides a system for reconstructing a progeny genome (particularly a haplotype), the system comprising a device (apparatus or module) configured to enable the inventive method, e.g., configured to:
-receiving input comprising genomic sequence information of progeny and related individuals,
-quality control and filtering of the input genomic sequence information;
-preferably, determining the haplotype of the progeny to reconstruct the genomic information of the progeny.
In a preferred embodiment, the device is a device of the invention as described above for the reconstruction of progeny genomes, in particular haplotypes.
The system of the present invention may further include:
-amplification means for amplifying, preferably whole genome amplifying, a nucleic acid sample of the progeny and/or related individuals;
-sequence information detection means for detecting sequence information in the amplification product, including but not limited to polymorphic site (e.g., SNP) detection and sequencing.
Device
In yet another aspect, the present invention provides an apparatus for analytical processing of progeny genomic reconstruction, comprising:
an amplification unit, for performing whole genome amplification on the sample to be tested and on DNA samples from the progeny's parents;
a detection and analysis unit, for performing genetic detection and analysis on the amplification products obtained by the amplification unit;
a data processing unit, for performing quality control and filtering on the detection and analysis data of the amplification products, and removing genotypes of sites that were not amplified or were genotyped incorrectly; and
a progeny whole-genome genotype reconstruction unit, for performing haplotype phasing and genotype filling and outputting the reconstructed whole-genome genotype information of the progeny.
Preferably, the system of the invention comprises a tool for querying genomic sequence information, and a programmed memory or medium that allows a computer to analyze the resulting data. The queried sequence information data (including, for example, sequencing data sets, SNP data sets, gene mutation site data sets and genotyping data sets) may be stored data sets or may be provided in real time ("on the fly"). As used herein, "data set" encompasses both types of data sources.
The tool for the genomic sequence information query is not particularly limited. In a preferred embodiment, a high density SNP chip is used. In another preferred embodiment, high throughput sequencing equipment is used to obtain higher depth sequencing data for related individuals of progeny.
The present invention may be executed by a computer. The invention therefore also provides a computer programmed to carry out the above method. The computer typically includes a CPU communicatively connected to a system memory (RAM), a non-transitory memory (ROM), and one or more other storage devices such as a hard drive, a floppy drive, and a CD-ROM drive. The computer may also include an output device, such as a printer, CRT monitor or LCD display, and an input device, such as a keyboard, mouse, pen, touch screen or voice-activated system. The input device may receive data, for example directly from a sequence information query tool through an interface.
Application of the method, computer product, system and device of the invention
The methods, computer products (in particular the above-described apparatus, systems and devices of the invention) according to the invention can be used for disease detection or disease susceptibility detection of pre-implantation embryos and/or fetuses in early gestation, including but not limited to: aneuploidy detection, monogenic genetic disease detection, chromosome structure rearrangement detection, and polygenic disease genetic risk assessment.
In one embodiment, the use comprises diagnosing a predisposition to a common disease or cancer, for example by comparing the progeny haplotype reconstructed according to the methods of the invention with known disease-associated haplotypes. Associations between such haplotypes and diseases are being established in the art. For example, the International HapMap Consortium has mapped genome-wide SNP haplotype variation in human populations, facilitating disease association studies (International HapMap Consortium, 2005). Thus, combining the genome reconstruction methods of the invention with these haplotype analyses also forms an aspect of the invention.
Beneficial technical effects of the invention
(1) The invention provides, for the first time, a method for accurately and non-invasively obtaining whole-genome genotype information of an embryo, which makes non-invasive assessment of the genetic risk of polygenic diseases in pre-implantation embryos possible.
(2) The invention shows for the first time that, starting from whole genome amplification combined with chip, second-generation or third-generation sequencing technologies, and using high-density polymorphic site information of family members such as the embryo's parents together with statistical genetics and computational biology algorithms, genotypes can be filled in at unamplified sites and at sites with genotyping errors such as ADO in the embryo genome, thereby obtaining whole-genome information of the embryo.
(3) The invention shows for the first time that quality control and filtering of the embryo DNA analysis data can remove sites with poor genotyping quality, in particular sites with poor single-cell whole genome amplification efficiency, thereby improving the accuracy of genome reconstruction.
(4) The invention shows for the first time that the integrated use of family-based and population-based genotype filling methods recovers the site information of the embryo's whole genome to the greatest possible extent.
Acronym descriptions
Figure PCTCN2020121432-APPB-000008
Figure PCTCN2020121432-APPB-000009
Examples
The following examples are described to aid in the understanding of the present invention. The examples are not intended to, and should not be construed as, limiting the scope of the invention in any way.
Example 1 chip platform
1) Collecting samples:
The samples collected were 1 ml of peripheral blood from the father and 1 ml of peripheral blood from the mother of one family, each collected in an EDTA anticoagulant tube. The mother's oocytes were fertilized in vitro with the father's sperm by intracytoplasmic sperm injection (ICSI). The fertilized eggs were cultured in GM medium (Quinn's Advantage Plus Protein Cleavage Medium; SAGE, product number ART-1526) in an incubator at 37 °C, 5% CO2 and 5% O2 until they developed into blastocysts at about day 5, and about 20 µl of blastocyst culture medium was aspirated. Culture medium from 4 blastocysts from the same parents was prepared.
2) DNA extraction, amplification and quantification
Whole genomic DNA was extracted from the father's and the mother's peripheral blood samples using a routine whole blood genomic DNA extraction procedure. The kit used in this step was the commercially available DNeasy Blood & Tissue Kit (50) (QIAGEN, cat. no. 69504), and whole genomic DNA was extracted according to the manufacturer's instructions.
Each culture medium sample (approximately 10-15 μl in volume) from the 4 blastocysts from the same parents was transferred into 5 μl of lysis buffer (30 mM Tris-Cl, 2 mM EDTA, 20 mM KCl, 0.2% Triton X-100, pH 7.8), and the sample name was written on each collection tube with a marker pen. The tubes were centrifuged in a microcentrifuge for 30 seconds. The samples can proceed immediately to the whole genome amplification step or be stored frozen at -20 °C or -80 °C. Whole genome amplification of the blastocyst culture medium was performed according to the instructions of the MALBAC single cell whole genome amplification kit (product No. KT110700150) of Shuikang Medical Science and Technology (Suzhou) Co., Ltd.
The whole genome amplification products of each blastocyst culture medium sample were quantified using the Qubit dsDNA HS Assay Kit (Invitrogen, Q32584). The quantification results indicated that the total amount of nucleic acid in each sample was approximately 500-1000 ng.
3) Acquisition of raw data
A human genome-wide SNP chip was obtained from Beijing Boo classical Biotechnology, Inc.; it is named CBC-PMRA (Capital Biotechnology-Precision Medicine Research Array) SNP chip 900K. This high-density SNP chip contains about 900,000 SNP loci from the GWAS Catalog database and evenly covers the whole genome of Asian populations, particularly the Chinese population, so it meets the requirements of genome-wide association analysis and genotyping.
The DNA obtained in step 2) above was processed on the SNP chip 900K according to the manufacturer's instructions to obtain the raw genotyping data.
4) Genotyping analysis
Axiom™ Analysis Suite software from Thermo Fisher Scientific was selected as the platform for analyzing the raw data obtained from the SNP chip 900K in step 3) above. Genotyping analysis was carried out with the Genotyping function module of the Axiom™ Analysis Suite software, and loci whose genotype quality met the PolyHighResolution, NoMinorHom, MonoHighResolution or Hemizygous criteria were selected.
The numbers of loci in each sample whose genotype quality met the PolyHighResolution, NoMinorHom, MonoHighResolution or Hemizygous criteria are shown in Table 1 below.
TABLE 1
Figure PCTCN2020121432-APPB-000010
As can be seen from Table 1, the amount of SNP genotype information obtained from each blastocyst culture medium sample is roughly 1/4 of that obtained from its parents.
5) Because whole genome amplification of embryo DNA from blastocyst culture medium with MALBAC, as described in step 2) above, does not amplify all regions with uniform efficiency, further quality control was performed on the SNP genotyping data of each blastocyst culture medium sample. The specific operations are as follows:
i) The whole genome amplification products obtained by MALBAC amplification of embryo DNA from blastocyst culture medium were fragmented by sonication to a fragment size distribution of 200-800 bp; a sequencing library was prepared with a kit (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd.; trade name: NGS Fast DNA Library Prep Set for Illumina, a rapid DNA library construction kit for second-generation sequencing; catalog number: CW2585M), and whole genome sequencing was then performed on the NextSeq550 sequencing platform of Illumina, with an average sequencing depth of 0.06X per sample;
ii) The sequencing reads were aligned to a reference genome using the BWA-MEM algorithm; in this example, the hg19 reference genome was used. This yielded 463 BAM files, which were then merged into a single large BAM file.
iii) The average sequencing depth of the genome is calculated by summing the absolute depth of every site in the whole genome and dividing by the number of sites. It can be calculated using the following formula:
Figure PCTCN2020121432-APPB-000011
Here, N is the number of sequencing reads, L is the read length, and 3 × 10^9 is the size of the human genome in base pairs, so that the average depth equals N × L divided by 3 × 10^9.
iv) If the absolute depth of a SNP site (i.e., the number of reads covering the site) is greater than or equal to the average sequencing depth of the genome, the site passes the amplification-efficiency quality control; the genotypes of sites that fail this quality control are marked as missing data.
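Steps iii) and iv) can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-site depths, the read count and the read length in the example are made-up values, and the function names are hypothetical.

```python
GENOME_SIZE = 3e9  # human genome size used in the text, in base pairs


def average_depth(num_reads: int, read_length: int,
                  genome_size: float = GENOME_SIZE) -> float:
    """Average sequencing depth = N * L / genome size."""
    return num_reads * read_length / genome_size


def amplification_qc(site_depths: dict, num_reads: int, read_length: int) -> dict:
    """Return {site: True/False}; True means the site passes the QC,
    i.e. its absolute depth is >= the genome-wide average depth."""
    mean_depth = average_depth(num_reads, read_length)
    return {site: depth >= mean_depth for site, depth in site_depths.items()}


# Hypothetical merged reference library: 600 million reads of 150 bp (30X).
depths = {"rs1": 42, "rs2": 12, "rs3": 30}
qc = amplification_qc(depths, num_reads=600_000_000, read_length=150)
failed_sites = [site for site, passed in qc.items() if not passed]
```

Sites listed in `failed_sites` would have their genotypes set to missing before phasing.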
6) Sites with erroneous genotypes, such as those caused by allele dropout (ADO), are identified using the Mendelian inheritance rules described in the detailed description of the invention and are marked as missing data.
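A minimal sketch of such a Mendelian consistency check for a father-mother-embryo trio is given below. The genotype representation (unordered allele pairs, `None` for missing) and the function names are assumptions made for illustration. Note that a check of this kind only catches errors that break Mendelian inheritance; an ADO event that happens to leave a Mendelian-consistent genotype is not detected by it.

```python
from itertools import product


def mendelian_consistent(child, father, mother) -> bool:
    """True if the child's unordered genotype can be formed from one
    paternal allele and one maternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mask_mendelian_errors(child_gts, father_gts, mother_gts):
    """Replace Mendelian-inconsistent child genotypes with None (missing)."""
    masked = []
    for child, father, mother in zip(child_gts, father_gts, mother_gts):
        if child is None or mendelian_consistent(child, father, mother):
            masked.append(child)
        else:
            masked.append(None)
    return masked


# Site 1: father A/A, mother G/G, embryo called A/A (the G allele dropped out).
# Site 2: consistent genotype. Site 3: already missing.
embryo = [("A", "A"), ("A", "G"), None]
father = [("A", "A"), ("A", "G"), ("A", "A")]
mother = [("G", "G"), ("A", "A"), ("A", "A")]
cleaned = mask_mendelian_errors(embryo, father, mother)
```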
7) Sites with genotype errors in the blastocyst culture medium are further identified using mutual derivation of embryo haplotypes and the theory of chromosome interference, and are marked as missing data.
Here, mutual derivation of embryo haplotypes means that the most likely haplotype composition of each embryo is derived using the genotype phasing method of step 8) together with the genotype data of multiple embryos. Then, based on the chromosome interference suppression strategy, sites at which two crossover recombinations occur within a genetic distance of 1 cM are identified as having erroneous genotypes.
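The chromosome-interference filter can be sketched as follows, assuming the phasing step has produced, for one parent and one chromosome, a per-site inherited-haplotype state (0 or 1) and that each site has a genetic map position in cM. How the distance between two crossovers is measured here (from the last site before the short block to the first site after it) is an assumption of this sketch, as are the function name and the toy data.

```python
def mask_double_crossovers(states, positions_cm, max_distance_cm=1.0):
    """Return a copy of `states` in which short blocks flanked by two
    crossovers lying within `max_distance_cm` are set to None (missing)."""
    # A crossover lies between site i-1 and site i when the state changes.
    crossovers = [i for i in range(1, len(states)) if states[i] != states[i - 1]]
    masked = list(states)
    k = 0
    while k + 1 < len(crossovers):
        start, end = crossovers[k], crossovers[k + 1]
        # Genetic distance spanned by the two crossover intervals.
        if positions_cm[end] - positions_cm[start - 1] <= max_distance_cm:
            for j in range(start, end):
                masked[j] = None
            k += 2  # both crossovers are explained by the masked block
        else:
            k += 1
    return masked


# The single-site switch at index 3 implies two crossovers within 0.2 cM,
# whereas the switch at index 6 is an isolated, plausible crossover.
states = [0, 0, 0, 1, 0, 0, 1, 1]
positions = [0.0, 0.2, 0.4, 0.5, 0.6, 5.0, 40.0, 41.0]
cleaned_states = mask_double_crossovers(states, positions)
```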
The number of sites obtained after quality control by steps 5), 6) and 7) is shown in Table 2.
TABLE 2
Figure PCTCN2020121432-APPB-000012
Figure PCTCN2020121432-APPB-000013
As can be seen from Table 2, after this further quality control the number of SNP genotypes retained for each blastocyst culture medium sample was reduced to about 1/7 of the number of parental SNP genotypes.
8) Haplotype phasing and genotype filling based on pedigree [1]
i) Constructing the binary gene flow vector $V_i$ of each site based on the pedigree structure, $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$. Here, $i$ denotes the site and $n$ the number of embryos. $P_{n,i}$ denotes the haplotype that the $n$-th embryo inherited at site $i$ from the paternal grandparents: $P_{n,i} = 0$ indicates that the embryo inherited the paternal grandfather's haplotype at that site, and $P_{n,i} = 1$ the paternal grandmother's haplotype. $M_{n,i}$ denotes the haplotype that the $n$-th embryo inherited at site $i$ from the maternal grandparents: $M_{n,i} = 0$ indicates that the embryo inherited the maternal grandfather's haplotype at that site, and $M_{n,i} = 1$ the maternal grandmother's haplotype.
ii) calculating the maximum joint likelihood probability of the haplotype hidden state and genotype observation value of each site based on a hidden Markov model, wherein the formula is as follows:
Figure PCTCN2020121432-APPB-000014
Here, $m$ denotes the number of sites; $P(V_1)$ is the initial probability of the paternal or maternal ancestral haplotype state at the first site; $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from site $i-1$ to the adjacent site $i$, calculated from the recombination rate between the two sites. The recombination rate between sites is estimated from the genetic map of the 1000 Genomes Project Phase 3. $P(G_i \mid V_i)$ is the probability of the genotype $G_i$ given the ancestral haplotype state $V_i$ at the site, calculated according to Mendelian inheritance rules.
iii) using the Viterbi algorithm to compute the hidden Markov state sequence $V = (V_1, V_2, \ldots, V_m)$, i.e., the most likely combination of ancestral haplotype states over the $m$ sites, which gives the most likely chromosome-level haplotype composition of each embryo;
iv) combining this with the identification of identical-by-descent (IBD) regions, i.e., determining which parental haplotypes make up the embryo in a given region, and with the parents' high-density polymorphic site genotype information, to infer (at the chromosome level) the genotypes of embryo sites that were not amplified or were genotyped incorrectly;
v) because there are 22 autosomes plus one X chromosome and/or one Y chromosome, steps i) to iv) are repeated 23 or 24 times to reconstruct the embryo's whole-genome genotype information at the chip level (all sites on the chip).
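The hidden Markov decoding in steps ii) and iii) can be illustrated with a deliberately simplified Python sketch. It is not the joint gene flow model over all embryos described above: it handles one embryo and one parent, so the hidden state at each site is a single bit (which of the parent's two phased haplotypes was transmitted). Additional assumptions of the sketch are the Haldane map function for turning genetic distance into a recombination probability, a fixed genotyping error rate of 0.01 for the emission term, and all function names and toy data.

```python
import math


def haldane(distance_cm: float) -> float:
    """Recombination probability for a genetic distance in cM (Haldane)."""
    r = 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))
    return min(max(r, 1e-9), 0.5)  # keep strictly positive for log()


def viterbi_inheritance(obs, parent_haps, recomb, error=0.01):
    """Most likely transmitted haplotype (0 or 1) at every site.

    obs[i]       allele received from this parent at site i, or None
    parent_haps  the parent's two phased haplotypes (lists of alleles)
    recomb[i]    recombination probability between site i-1 and i
                 (recomb[0] is unused)
    """
    def emit(state, i):
        if obs[i] is None:
            return 0.0  # log(1): a missing observation is uninformative
        match = parent_haps[state][i] == obs[i]
        return math.log(1.0 - error if match else error)

    score = [[math.log(0.5) + emit(s, 0) for s in (0, 1)]]
    back = []
    for i in range(1, len(obs)):
        row, ptr = [], []
        for s in (0, 1):
            cands = [
                score[-1][p] + math.log(recomb[i] if p != s else 1.0 - recomb[i])
                for p in (0, 1)
            ]
            best = 0 if cands[0] >= cands[1] else 1
            row.append(cands[best] + emit(s, i))
            ptr.append(best)
        score.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    state = 0 if score[-1][0] >= score[-1][1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


parent_haps = (list("ACGTA"), list("GTACG"))
obs = ["A", "C", None, "C", "G"]            # third site failed to amplify
positions_cm = [0.0, 0.1, 0.2, 5.2, 5.3]    # hypothetical map positions
recomb = [0.0] + [haldane(b - a) for a, b in zip(positions_cm, positions_cm[1:])]
path = viterbi_inheritance(obs, parent_haps, recomb)
```

In this toy input the decoded path places the crossover in the 5 cM gap between the third and fourth sites, so the missing third site is assigned to haplotype 0 and could then be filled with that haplotype's allele.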
The polymorphic site information obtained after the embryo genome reconstruction is shown in Table 3.
TABLE 3
Figure PCTCN2020121432-APPB-000015
As can be seen from Table 3, the genome coverage of the reconstructed embryonic genome can be increased from 14% (1/7) to about 70%.
The results of this experiment were compared with paired biopsy samples. Each biopsy sample consists of blastocyst trophoblast cells from the blastocyst corresponding to a blastocyst culture medium sample. The specific experimental steps are as follows: (1) the blastocyst to be biopsied is transferred into an in-vitro manipulation medium free of calcium and magnesium ions (for example, G-PGD containing 5% HSA); (2) the embryo is held in place with a holding pipette under an inverted microscope (200X); (3) an opening of 35-40 μm is cut or drilled in the zona pellucida; (4) a nucleated cell is aspirated with a needle with an inner diameter of 35-40 μm; (5) the embryo is transferred out of the manipulation medium, washed in blastocyst culture medium and cultured, and the patient's name and the embryo number are recorded; (6) DNA extraction, amplification, quantification and genotype analysis are performed as for the blastocyst culture medium.
Sites that were successfully genotyped in the biopsy sample but not in the corresponding blastocyst culture medium sample were selected for validation. The comparison shows that the allele reconstruction accuracy of this example reaches about 99.2% (Table 4).
TABLE 4 chip platform embryo genome reconstruction accuracy
Figure PCTCN2020121432-APPB-000016
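A comparison of this kind can be sketched in Python as below. The exact accuracy definition used for Table 4 is not spelled out in the text, so the allele-level concordance computed here (shared alleles divided by total alleles at sites typed in both samples) is an assumption, and the genotypes are made-up toy data.

```python
from collections import Counter


def allele_accuracy(reconstructed, biopsy) -> float:
    """Fraction of alleles in the reconstructed genotypes that agree with
    the biopsy genotypes, over sites available in both."""
    matched = total = 0
    for rec, ref in zip(reconstructed, biopsy):
        if rec is None or ref is None:
            continue  # skip sites missing in either sample
        shared = Counter(rec) & Counter(ref)  # multiset intersection
        matched += sum(shared.values())
        total += 2
    return matched / total if total else float("nan")


reconstructed = [("A", "G"), ("C", "C"), ("T", "T"), None]
biopsy = [("G", "A"), ("C", "T"), ("T", "T"), ("A", "A")]
accuracy = allele_accuracy(reconstructed, biopsy)  # 5 of 6 alleles agree
```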
9) For sites not on the CBC-PMRA SNP chip 900K, genotypes are predicted using a population-information-based genotype filling strategy.
In this example, using the Chinese population reference haplotype information in 1000 Genomes Phase 3, and starting from the haplotypes already phased with pedigree information, the embryo's genome-wide genotypes were predicted with a hidden Markov model (HMM) using the MACH software package [2].
Example 2 second Generation sequencing
1) Collecting samples:
as in example 1 above.
2) DNA extraction, amplification and quantification
As in example 1 above.
3) Construction of a second Generation sequencing library
The DNA samples from step 2) were fragmented by sonication to a fragment size distribution of 200-800 bp, and a sequencing library was prepared with a kit (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd.; trade name: NGS Fast DNA Library Prep Set for Illumina, a rapid DNA library construction kit for second-generation sequencing; catalog number: CW2585M).
4) Whole genome sequencing
The second-generation sequencing libraries obtained in step 3) were subjected to whole genome sequencing on the NovaSeq 6000 sequencing platform of Illumina, with an average whole-genome sequencing depth of 20X for the parental gDNA samples and 2X for the embryo culture medium samples.
The raw fastq read files were obtained and stored on a server.
5) Quality control, filtration and correction and genotyping
Variants found by whole genome sequencing are filtered in multiple steps for quality control. Quality control, filtering and recalibration were performed following the Genome Analysis Toolkit (GATK) Best Practices for germline SNPs + Indels described on the GATK official website (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145), as follows:
i) performing data quality control filtering on the original fastq file by using fastp software;
ii) aligning the obtained reads to a reference genome using the BWA-MEM algorithm; in this example, the hg19 reference genome was used;
iii) sorting and indexing the aligned files using the Picard SortSam command and Samtools, finally obtaining a BAM file;
iv) deduplication using the Picard MarkDuplicates command;
v) performing base quality score recalibration (BQSR) using the GATK BaseRecalibrator and ApplyBQSR commands;
vi) calling variants in each single sample using the GATK HaplotypeCaller;
vii) performing multi-sample joint variant calling using the GATK CombineGVCFs and GenotypeGVCFs;
viii) performing variant quality score recalibration (VQSR) using VariantRecalibrator and ApplyVQSR.
The sites were then filtered according to the following criteria: (1) sites whose VQSR filter status is PASS; (2) for the father and the mother, sites filtered on sequencing depth DP and having genotype information other than "./."; and (3) for the embryo, sites with sequencing depth DP > 5 and genotype information other than "./.". Here "./." denotes a site that could not be genotyped.
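The three site-level criteria can be sketched as a filter over the lines of a multi-sample VCF file. This is an illustrative sketch only: it assumes a FORMAT column containing GT and DP, the column indices of the father, mother and embryo are hypothetical, and because the parental depth threshold is not given in the text it is left as a required parameter (`min_parent_dp`); the value 10 used in the example is arbitrary.

```python
def parse_sample(fmt: str, sample: str) -> dict:
    """Map FORMAT keys (e.g. GT, DP) to the values of one sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def site_passes(vcf_line: str, parent_cols, embryo_col,
                min_parent_dp: int, min_embryo_dp: int = 5) -> bool:
    """Apply the three criteria: FILTER is PASS, parents and embryo are
    genotyped, and their depths exceed the respective thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    if fields[6] != "PASS":          # column 7 of a VCF line is FILTER
        return False
    fmt = fields[8]                   # column 9 is FORMAT

    def sample_ok(col: int, threshold: int) -> bool:
        sample = parse_sample(fmt, fields[col])
        genotype = sample.get("GT", "./.")
        depth = sample.get("DP", ".")
        return (genotype not in ("./.", ".|.")
                and depth.isdigit() and int(depth) > threshold)

    return (all(sample_ok(col, min_parent_dp) for col in parent_cols)
            and sample_ok(embryo_col, min_embryo_dp))


# Toy records: father in column 9, mother in column 10, embryo in column 11.
good = "\t".join(["chr1", "10177", ".", "A", "C", "50", "PASS", ".",
                  "GT:DP", "0/1:25", "0/0:31", "0/1:7"])
untyped = "\t".join(["chr1", "10352", ".", "T", "TA", "50", "PASS", ".",
                     "GT:DP", "0/1:25", "0/0:31", "./.:2"])
kept = [line for line in (good, untyped)
        if site_passes(line, parent_cols=(9, 10), embryo_col=11, min_parent_dp=10)]
```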
6) Sites with erroneous genotypes, such as those caused by ADO, are identified using Mendelian inheritance rules and marked as missing data, as in Example 1.
7) Sites with genotype errors in the blastocyst culture medium are further identified using mutual derivation of embryo haplotypes and the theory of chromosome interference, and are marked as missing data, as in Example 1.
The final numbers of sites that passed quality control are shown in Table 5; 1,608,593 sites had genotype information in both parents and could be used for linkage analysis.
TABLE 5 Total gene variation detected by Whole genome sequencing
Figure PCTCN2020121432-APPB-000017
8) Haplotype phasing and genotype filling based on pedigrees:
as in example 1 above.
The information of the polymorphic sites of the finally obtained embryos is shown in Table 6.
TABLE 6
Figure PCTCN2020121432-APPB-000018
The results of this experiment were compared with paired biopsy samples (the specific experimental procedure is as in Example 1). The genome reconstruction accuracy was essentially above 97% for all embryos except one sample (blastocyst culture medium 6), whose allele prediction accuracy was slightly lower because of the low concentration of amplified DNA from step 2) (166 ng/μl for this sample versus above 200 ng/μl for all other samples) (Table 7). Compared with the chip data of Example 1, the genotype filling accuracy is relatively low because whole genome sequencing includes more sites with low allele frequency.
Table 7 second generation sequencing platform embryo genome reconstruction accuracy
Figure PCTCN2020121432-APPB-000019
References
1.Abecasis GR,Cherny SS,Cookson WO,&Cardon LR(2002)Merlin—rapid analysis of dense genetic maps using sparse gene flow trees.Nat Genet 30(1):97-101.
2. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, et al. (2007) A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316(5829):1341-1345.
All documents referred to herein are incorporated by reference into this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes and modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the present invention as defined by the appended claims.

Claims (16)

  1. A method of cleaning noisy genetic data from progeny, said method comprising the steps of:
    (a) providing genomic sequence information from a progeny, wherein the genomic sequence information of the progeny is obtained from a trace nucleic acid sample of the progeny comprising about 0.1-40 ng DNA, e.g., 1-40ng DNA, 20-40ng DNA, 0.1-40pg DNA, 1-40pg DNA, 10-40pg DNA; for example, the trace nucleic acid sample of the progeny is fetal cell-free DNA in embryo culture fluid, blastocyst culture fluid, blastocoel fluid, maternal plasma, or another type of maternal body fluid, and/or fetal cells in blastocyst trophoblast cells, blastocyst stage embryonic cells, maternal blood, or another type of maternal body fluid;
    (b) performing quality control and filtering on the genomic sequence information of the progeny of step (a), wherein the quality control is selected from the group consisting of quality control of the whole genome amplification efficiency of trace nucleic acids, identification of Mendelian inheritance errors, identification based on chromosome interference suppression, mutual derivation of multiple progeny haplotypes, and combinations thereof.
  2. The method of eliminating noisy genetic data from progeny according to claim 1, wherein the genomic sequence information from progeny provided in step (a) covers no more than about 30% of its genome, e.g., about 30%, 25%, 20%, 15% of its genome, e.g., wherein the genomic sequence information from progeny of step (a) is obtained by subjecting the progeny trace nucleic acid sample to whole genome amplification selected from the group consisting of: primer extension preamplification PCR, degenerate oligonucleotide primed PCR, multiple displacement amplification, multiple annealing and looping-based amplification cycles (MALBAC), library construction by blunt-end or sticky-end ligation, or the like, or combinations thereof, preferably MALBAC amplification, followed by detection of the genomic sequence of the progeny by techniques selected from nucleic acid chips, amplification and/or sequencing.
  3. The method of eliminating noisy genetic data from progeny according to claim 2, wherein said nucleic acid chip, amplification and/or sequencing technique is a single nucleotide polymorphism microarray nucleic acid chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), second generation sequencing, third generation sequencing, or a combination thereof; for example, the single nucleotide polymorphism microarray nucleic acid chip is an SNP genotyping chip; for example, the second generation sequencing comprises whole genome sequencing, whole exome sequencing and sequencing of targeted genomic regions, preferably whole genome sequencing, e.g., low depth whole genome sequencing, e.g., sequencing depths as low as 2x or even below 1x.
  4. The method for eliminating noise genetic data from progeny according to any one of claims 1 to 3, wherein the quality control of the whole genome amplification efficiency of the trace nucleic acid of step (b) is performed by: identifying a site genotype with low amplification efficiency using reference sequencing data of whole genome amplification products of a plurality of trace nucleic acid samples and designating the site genotype as deletion data, e.g., performing low depth sequencing of the whole genome amplification products of the plurality of trace nucleic acid samples as reference samples, e.g., sequencing depth of no more than about 0.5X, no more than about 0.4X, no more than about 0.3X, no more than about 0.2X, no more than about 0.1X, e.g., sequencing depth of about 0.06X, aligning sequencing data obtained from the reference samples onto a human reference genome (e.g., hg19 or hg38), calculating site amplification efficiency using the following formula
    Figure PCTCN2020121432-APPB-100001
    wherein
    Figure PCTCN2020121432-APPB-100002
    wherein $DP_i$ denotes the absolute depth of the $i$-th site, $N$ denotes the number of sequencing reads, and $L$ denotes the read length;
    when $DP_i$ divided by the average depth of the genome is greater than or equal to 1, i.e., the amplification efficiency of the site is not less than 1, the site passes the quality control of trace nucleic acid whole genome amplification efficiency; the genotypes of sites that do not meet this quality control are marked as missing data.
  5. A method for eliminating noisy genetic data from progeny according to any one of claims 1-4, wherein the chromosome interference suppression of step (b) means that when two crossovers or recombinations occur between molecular marker sites within a given genetic distance, the molecular markers in the recombined segment are judged to be genotyped incorrectly and the molecular marker sites are marked as missing data, for example, wherein the genetic distance is any distance below 1 centiMorgan (cM).
  6. A method of haplotype phasing of progeny, the method comprising steps (a) and (b) of any one of claims 1-3, and the following step:
    (c) phasing the haplotypes of the progeny, such as the haplotypes of the progeny at the chromosome level, based on pedigree information, i.e., the genomic sequence information of the genetic father of the progeny (e.g., the genotype information of the genetic father) and/or the genomic sequence information of the genetic mother of the progeny (e.g., the genotype information of the genetic mother), optionally wherein the pedigree information further includes the genomic sequence information (e.g., the genotype information) of other pedigree individuals of the progeny, and based on a multi-site linkage analysis strategy that applies Mendelian rules and the theory of genetic linkage and crossover.
  7. The method for haplotype phasing of progeny according to claim 6, wherein the pedigree information is obtained from a nucleic acid sample of the pedigree individual comprising at least about 100 ng DNA (e.g., 100 ng-1000 ng DNA); for example, the nucleic acid sample of the pedigree individual is a nucleic acid sample from blood, saliva, a buccal swab, urine, nail, hair follicle, dandruff, cells, tissue, or body fluid of the pedigree individual, and the pedigree information covers not less than about 90% of the whole genome of the pedigree individual, e.g., about 90%, 95%, 98%, 99% or more of its whole genome; e.g., wherein the pedigree information is data obtained by whole genome sequencing of genomic DNA (gDNA) of the pedigree individual (e.g., whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, or dandruff gDNA, preferably whole blood gDNA), preferably employing a high-depth whole genome sequencing strategy for the gDNA, e.g., a sequencing depth of at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, or at least 80X.
  8. The method for haplotype phasing of progeny according to claim 6 or 7, wherein step (c) is performed using statistical genetics and computational biology algorithms, e.g., based on the pedigree information, the most probable haplotype composition of the progeny is obtained using a strategy selected from the group consisting of a likelihood strategy (the haplotype composition with maximum probability), a genetic rule strategy (the haplotype composition with the minimum number of recombinations), the Expectation-Maximization (EM) algorithm, and combinations thereof,
    for example, based on the genomic sequence information of the genetic father and/or mother of the progeny, the most probable haplotype composition that the progeny inherited from the father and/or mother is obtained using a strategy selected from the group consisting of a likelihood strategy (the haplotype composition with maximum probability), a genetic rule strategy (the haplotype composition with the minimum number of recombinations), the Expectation-Maximization (EM) algorithm, and combinations thereof, and preferably, the algorithm is further applied to the genomic sequence information of other pedigree individuals to obtain more of the haplotype composition of the progeny,
    preferably, the likelihood strategies are the Lander-Green algorithm and the Viterbi dynamic programming algorithm, or the Elston-Stewart algorithm and the Bayesian network algorithm, more preferably the Lander-Green algorithm and the Viterbi dynamic programming algorithm,
    preferably, the genetic rule strategies comprise a zero-recombination hypothesis strategy and a minimum-recombination hypothesis strategy, and the software implementing the genetic rule strategies is, for example, ZAPLO or HAPLORE.
  9. The method for haplotype phasing of progeny according to any one of claims 6-8, wherein step (c) is performed as follows:
    i) constructing a binary gene flow vector V_i for each site based on the pedigree structure, V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, ..., P_{n,i}, M_{n,i}), where i represents a site; n represents the number of embryos; P_{n,i} represents the haplotype that the n-th embryo inherited from the paternal grandparents at the i-th site, where P_{n,i} = 0 indicates that the embryo inherited the haplotype of the paternal grandfather at the site and P_{n,i} = 1 indicates that the embryo inherited the haplotype of the paternal grandmother at the site; M_{n,i} represents the haplotype that the n-th embryo inherited from the maternal grandparents at the i-th site, where M_{n,i} = 0 indicates that the embryo inherited the haplotype of the maternal grandfather at the site and M_{n,i} = 1 indicates that the embryo inherited the haplotype of the maternal grandmother at the site;
    ii) calculating the maximum joint likelihood probability of the haplotype hidden state and genotype observation value of each site based on a hidden Markov model, wherein the formula is as follows:
    Figure PCTCN2020121432-APPB-100003
    wherein m represents the number of sites; P(V_1) is the initial probability of the paternal or maternal ancestral haplotype state at the first site; P(V_i | V_{i-1}) is the transition probability of the haplotype state from the (i-1)-th site to the adjacent i-th site, and is calculated from the recombination rate between the two sites; the recombination rate between sites is estimated from the genetic map of the 1000 Genomes Project Phase 3; P(G_i | V_i) is the probability of the genotype (G_i) given the ancestral haplotype state (V_i) at the site, and is calculated according to Mendelian inheritance rules;
    iii) using the Viterbi algorithm to calculate the most likely sequence of hidden Markov states V = (V_1, V_2, ..., V_m), i.e., the most likely composition of ancestral haplotype states over the m sites, thereby obtaining the most likely chromosome-level haplotype composition for each progeny.
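Steps i) to iii) can be illustrated with a minimal Viterbi sketch for a single embryo. The formula image of step ii) is not reproduced in this text, so the sketch assumes the standard hidden Markov factorisation P(V_1) x product of P(V_i | V_{i-1}) x product of P(G_i | V_i). It further assumes that the parental haplotypes are already phased by grandparental origin, that genotypes are coded as alternate-allele counts (0, 1, 2, or `None` for missing data), and that a small genotyping error rate softens the Mendelian emission; all names and default values are hypothetical.

```python
import math
from itertools import product
from typing import List, Optional, Sequence, Tuple

# Hidden state (P, M): 0 = grandfather's haplotype, 1 = grandmother's haplotype.
State = Tuple[int, int]
STATES: List[State] = list(product((0, 1), repeat=2))


def log_emission(obs: Optional[int], state: State,
                 father_haps: Sequence[Sequence[int]],
                 mother_haps: Sequence[Sequence[int]],
                 site: int, error_rate: float) -> float:
    """log P(G_i | V_i): Mendelian expectation with an assumed error rate."""
    if obs is None:  # missing data carries no information
        return 0.0
    p, m = state
    expected = father_haps[p][site] + mother_haps[m][site]
    return math.log(1.0 - error_rate) if obs == expected else math.log(error_rate / 2.0)


def log_transition(prev: State, cur: State, r: float) -> float:
    """log P(V_i | V_{i-1}): each parental meiosis switches with probability r."""
    return sum(math.log(r) if a != b else math.log(1.0 - r)
               for a, b in zip(prev, cur))


def viterbi_phase(genotypes: Sequence[Optional[int]],
                  father_haps: Sequence[Sequence[int]],
                  mother_haps: Sequence[Sequence[int]],
                  recomb_rates: Sequence[float],
                  error_rate: float = 0.01) -> List[State]:
    """Most likely (P, M) state per site; recomb_rates[i] lies in (0, 1) and
    is the recombination rate between site i and site i + 1."""
    score = {s: math.log(1.0 / len(STATES))
             + log_emission(genotypes[0], s, father_haps, mother_haps, 0, error_rate)
             for s in STATES}
    backpointers = []
    for i in range(1, len(genotypes)):
        r = recomb_rates[i - 1]
        new_score, ptr = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda prev: score[prev] + log_transition(prev, cur, r))
            new_score[cur] = (score[best] + log_transition(best, cur, r)
                              + log_emission(genotypes[i], cur, father_haps,
                                             mother_haps, i, error_rate))
            ptr[cur] = best
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

In practice the recombination rates would be derived from a genetic map such as that of the 1000 Genomes Project Phase 3, and the state space would grow to 2^(2n) inheritance vectors when n embryos are phased jointly; the single-embryo case is used here only to keep the sketch short.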
  10. A method of reconstructing a progeny genome, the method comprising step (a), step (b) and step (c) of any one of claims 6-9, and step (d):
    (d) filling in the missing genotypes of the progeny,
    for example, by identifying identical-by-descent (IBD) regions, i.e., determining which parental haplotypes make up a given region of the embryo genome, and combining this with the genotype information of the high-density polymorphic loci of the parents to fill in the missing genotype site information in the progeny;
    and optionally, for genotype information that was not successfully filled based on family information, filling in genotype information at the genome-wide level using population reference haplotype information and population-level allelic linkage disequilibrium (LD) patterns;
    preferably, the population reference haplotype information is from HapMap, the 1000 Genomes Project, or the HRC (Haplotype Reference Consortium);
    preferably, the genotype filling algorithm based on population-level allelic linkage disequilibrium (LD) patterns is, for example, the IMPUTE/IMPUTE2, MaCH, Beagle, or Minimac algorithm;
    preferably, genotype filling is performed using Expectation-Maximization (EM), a Hidden Markov Model (HMM), Markov Chain Monte Carlo (MCMC), coalescent theory, or a combination thereof.
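The family-based part of step (d) can be sketched as below: once the inherited parental haplotype at each site is known, a missing progeny genotype is filled with the sum of the alleles carried by those two haplotypes. This is a minimal illustration, assuming alternate-allele-count coding, fully phased and fully genotyped parental haplotypes, and a per-site (P, M) state list such as one produced by a phasing step; the function name is hypothetical, and the optional population-based filling with tools such as Beagle or Minimac is not shown.

```python
from typing import List, Optional, Sequence, Tuple


def fill_missing_genotypes(genotypes: Sequence[Optional[int]],
                           states: Sequence[Tuple[int, int]],
                           father_haps: Sequence[Sequence[int]],
                           mother_haps: Sequence[Sequence[int]]) -> List[int]:
    """Fill missing progeny genotypes from the inherited parental haplotypes.

    genotypes: alternate-allele count per site (0, 1, 2) or None if missing.
    states: per-site (P, M) index of the inherited paternal/maternal haplotype.
    father_haps, mother_haps: two phased haplotypes per parent (0/1 alleles).
    """
    filled = []
    for i, (g, (p, m)) in enumerate(zip(genotypes, states)):
        if g is None:
            # The genotype is implied by the two haplotypes shared by descent.
            g = father_haps[p][i] + mother_haps[m][i]
        filled.append(g)
    return filled
```

Observed genotypes are left untouched in this sketch; only sites previously marked as missing data are replaced.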
  11. An apparatus, system and device comprising a quality control filtering module capable of eliminating noisy genetic data from progeny according to the method of any one of claims 1 to 5.
  12. The apparatus, system and device according to claim 11, further comprising a haplotype phasing module capable of performing haplotype phasing according to the method of any one of claims 6-9.
  13. The apparatus, system and device according to claim 11 or 12, further comprising a genotype filling module capable of performing genotype filling as described in claim 10.
  14. The apparatus, system and device according to any one of claims 11-13, comprising:
    an amplification unit capable of performing whole genome amplification of a DNA sample, e.g., of a DNA sample of the progeny and/or of a DNA sample of a genetic parent of the progeny;
    a raw genetic data acquisition unit capable of reading the genomic sequence information of the obtained whole genome amplification product, for example, reading sequence information obtained from a nucleic acid chip or second-generation sequencing;
    a quality control filtering unit capable of performing quality control and filtering on the raw genetic data and removing data of unsatisfactory quality, for example, marking site genotypes with low amplification efficiency as missing data, and identifying erroneous genotype sites in the raw genetic data and marking them as missing data; and
    a haplotype phasing unit capable of performing phasing of the haplotypes.
  15. The apparatus, system and device according to claim 14, further comprising a genotype filling unit capable of performing genotype filling.
  16. Use of the method according to any one of claims 1-10, or of the apparatus, system or device according to any one of claims 11-15, for an application selected from polygenic disease genetic risk assessment, aneuploidy detection, monogenic genetic disease detection, chromosomal structural rearrangement detection, and combinations thereof, in a pre-implantation embryo and/or an early-gestation fetus.
CN202080005425.5A 2019-10-18 2020-10-16 Methods, systems, and uses for eliminating noisy genetic data, haplotype phasing, and reconstructing progeny genomes Pending CN112840404A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910995291 2019-10-18
CN2019109952915 2019-10-18
PCT/CN2020/121432 WO2021073604A1 (en) 2019-10-18 2020-10-16 Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof

Publications (1)

Publication Number Publication Date
CN112840404A (en) 2021-05-25

Family

ID=75537714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080005425.5A Pending CN112840404A (en) 2019-10-18 2020-10-16 Methods, systems, and uses for eliminating noisy genetic data, haplotype phasing, and reconstructing progeny genomes

Country Status (2)

Country Link
CN (1) CN112840404A (en)
WO (1) WO2021073604A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023225951A1 (en) * 2022-05-26 2023-11-30 深圳华大生命科学研究院 Method for detecting fetal genotype on basis of haplotype
CN117230175A (en) * 2023-06-21 2023-12-15 广州序源医学科技有限公司 Embryo preimplantation genetics detection method based on third generation sequencing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114606302A (en) * 2022-04-08 2022-06-10 复旦大学附属中山医院 Method for extracting oral mucosa nucleic acid to perform whole genome high-throughput sequencing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790731A (en) * 2007-03-16 2010-07-28 吉恩安全网络公司 System and method for cleaning noisy genetic data and determining chromosome copy number
CN105335625A (en) * 2015-11-04 2016-02-17 和卓生物科技(上海)有限公司 Preimplantation embryo genetic testing device
CN107723364A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 Screening method for colorectal cancer susceptibility genes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CARL A ANDERSON ET AL: "Data quality control in genetic case-control association studies", 《NATURE PROTOCOLS》, vol. 5, no. 9, pages 1564 - 1573, XP055801113, DOI: 10.1038/nprot.2010.116 *

Also Published As

Publication number Publication date
WO2021073604A1 (en) 2021-04-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination