WO2021073604A1 - Method and system for cleaning noisy genetic data, haplotype phasing and reconstruction of the offspring genome, and use thereof - Google Patents

Method and system for cleaning noisy genetic data, haplotype phasing and reconstruction of the offspring genome, and use thereof

Info

Publication number
WO2021073604A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
haplotype
offspring
genotype
information
Prior art date
Application number
PCT/CN2020/121432
Other languages
English (en)
Chinese (zh)
Inventor
邹央云
陆思嘉
胡春旭
Original Assignee
苏州亿康医学检验有限公司
Priority date
Filing date
Publication date
Application filed by 苏州亿康医学检验有限公司
Priority to CN202080005425.5A (CN112840404A)
Publication of WO2021073604A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The present invention generally relates to the field of biomedical diagnosis and detection. More specifically, it relates to obtaining, manipulating and using genetic data to phase the haplotypes of progeny, to methods for reconstructing progeny genomes, and to systems and devices for implementing these methods. In particular, it relates to a method, system and computer device that use trace nucleic acid samples from progeny to phase the progeny haplotypes and reconstruct the progeny genome, and to the use of the phased haplotypes and the reconstructed genome for identifying genetic variations that may lead to various phenotypic outcomes, especially aneuploidy and disease-related genes.
  • Assisted reproductive technology (ART) has made great progress in overcoming human infertility. At present, about 3-4% of births worldwide result from assisted reproduction. Although ART has made some surprising theoretical and technological advances, the actual implementation of the concept of "healthy babies" still faces unique challenges.
  • PGT Preimplantation Genetic Test
  • PGT is a test that performs preimplantation genetic analysis on the embryos of patients with high genetic risk during the process from in vitro fertilization to embryo transfer. It aims to select embryos with normal genetic material for implantation into the mother's uterine cavity so as to obtain healthy offspring.
  • PGT can be divided into aneuploidy test (PGT for Aneuploidies, PGT-A), single-gene genetic disease test (PGT for Monogenic gene defects, PGT-M) and chromosomal structure rearrangement test (PGT for chromosomal Structural Rearrangements, PGT-SR).
  • embryo culture fluid contains embryo-derived free DNA (cfDNA) fragments, making it possible to perform non-invasive preimplantation genetic testing.
  • cfDNA embryo-derived free DNA
  • polygenic diseases or chronic diseases, such as cardiovascular diseases, diabetes, obesity and tumors, are the result of multiple genes participating in the disease process.
  • many chronic diseases have a high heritability rate. Therefore, the genetic basis plays a more important role in determining the risk of an individual.
  • The technical difficulty is that constructing risk values for polygenic diseases requires genome-wide genotype information for the embryo and/or fetus, whereas the amount of embryonic and/or fetal DNA obtained in a non-invasive or minimally invasive way is small. In particular, the embryonic cell-free DNA present in embryo culture medium is characterized by small fragments, poor uniformity of whole genome amplification and a high genotype error rate, which makes it impossible to produce a high-quality, highly contiguous whole genome sequence of the embryo and/or fetus.
  • Trace nucleic acid samples of the progeny (embryo culture fluid, blastocyst culture fluid, blastocoel fluid, fetal cell-free DNA (cfDNA) in maternal plasma or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, or fetal cells in maternal blood or other types of maternal body fluids) are subjected to whole genome amplification. The offspring haplotypes are then phased and the offspring genome is reconstructed using sequence information acquisition technologies such as nucleic acid chips or next-generation sequencing, together with statistical genetics and computational biology algorithms. This yields very high accuracy of haplotype phasing and genome reconstruction.
  • The present invention performs amplification, data analysis, quality control and filtering on the obtained trace progeny nucleic acid, thereby eliminating noisy genetic data (for example, genotype information at sites with poor genotyping quality, such as allele dropout (ADO)). It then obtains the haplotype phasing of the offspring based on pedigree-based haplotype phasing. Finally, it uses identity by descent (IBD) within the pedigree and a population linkage disequilibrium strategy to fill in the missing genotypes of the offspring (for example, sites that were not amplified and sites with genotype errors such as ADO), thereby reconstructing the genome of the offspring with high fidelity and very high accuracy.
  • genetic data obtained from other related individuals such as other embryos, siblings, grandparents, or other relatives related to the offspring can also be used to further increase the accuracy of the reconstructed offspring genome.
  • the present invention relates to a method for removing noisy genetic data from offspring, the method comprising the steps of:
  • (a) Providing genomic sequence information from the offspring, where the genomic sequence information is obtained from a trace nucleic acid sample of the offspring containing about 0.1 pg-40 ng DNA, for example, 1-40 ng DNA, 20-40 ng DNA, 0.1-40 pg DNA, 1-40 pg DNA or 10-40 pg DNA;
  • for example, the trace nucleic acid sample of the offspring is embryo culture fluid, blastocyst culture fluid, blastocoel fluid, fetal cell-free DNA in maternal plasma or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryo cells, or fetal cells in maternal blood or other types of maternal body fluids;
  • (b) Performing quality control and filtering on the genomic sequence information of the progeny from step (a), wherein the quality control is selected from the group consisting of: quality control of the whole genome amplification efficiency of trace nucleic acid, quality control identifying Mendelian inheritance errors, quality control identifying violations of the theory of chromosomal interference suppression, quality control deduced from multiple progeny haplotypes, and combinations thereof.
  • step (a) of the method for removing noisy genetic data from progeny of the present invention provides genome sequence information from the progeny whose coverage does not exceed about 30% of the progeny genome, for example, a coverage of about 30%, 25%, 20% or 15%.
  • the genome sequence information from the progeny in step (a) is obtained by performing whole-genome amplification selected from the following group on the progeny trace nucleic acid sample.
  • the nucleic acid chip, amplification and/or sequencing technology is a single nucleotide polymorphism (SNP) microarray nucleic acid chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing, or a combination thereof;
  • the single nucleotide polymorphism microarray nucleic acid chip is a SNP genotyping chip;
  • the second-generation sequencing includes whole-genome sequencing, whole-exome sequencing and sequencing of targeted genomic regions, for example low-depth whole-genome sequencing, where the sequencing depth can be as low as 2X, or even 1X or less.
  • the quality control of the whole genome amplification efficiency of trace nucleic acid described in step (b) of the method for removing noisy genetic data from progeny of the present invention is implemented as follows: sequencing data from the whole genome amplification products of multiple trace nucleic acid samples are used as a reference to identify sites with low amplification efficiency, and the genotypes at those sites are marked as missing data. For example, the whole genome amplification products of multiple trace nucleic acid samples are used as reference samples for low-depth sequencing, where the sequencing depth is not higher than about 0.5X, 0.4X, 0.3X, 0.2X or 0.1X, for example about 0.06X.
  • the sequencing data obtained from the reference sample is aligned to a human reference genome (for example, hg19 or hg38), and the following formula is used to calculate the site amplification efficiency
  • DP_i represents the absolute depth of the i-th site
  • N represents the number of sequencing reads
  • L represents the read length
  • When DP_i ≥ the average depth of the genome and the site amplification efficiency ≥ 1, the site has passed the quality control of the whole genome amplification efficiency of trace nucleic acid; the genotype of any site that does not meet this quality control is marked as missing data.
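The site-level filter described above can be sketched as follows. The efficiency formula itself is not reproduced in this text, so the sketch assumes, purely for illustration, that the site amplification efficiency is DP_i divided by the expected depth N·L/G, where G (genome size) is a hypothetical parameter not named in the source. It also reads the pass condition as DP_i ≥ average depth and efficiency ≥ 1. Function and variable names are illustrative only.

```python
from typing import Dict, Optional


def site_amplification_efficiency(dp_i: float, n_reads: int, read_len: int,
                                  genome_size: int) -> float:
    """ASSUMED definition: observed site depth over the expected depth N*L/G."""
    expected_depth = n_reads * read_len / genome_size
    return dp_i / expected_depth


def mask_low_efficiency_sites(genotypes: Dict[str, str],
                              depths: Dict[str, int],
                              n_reads: int, read_len: int,
                              genome_size: int) -> Dict[str, Optional[str]]:
    """Keep a genotype only if its site passes both thresholds; else mark it missing."""
    # Average depth: sum of per-site absolute depths divided by the number of sites.
    mean_depth = sum(depths.values()) / len(depths)
    masked: Dict[str, Optional[str]] = {}
    for site, genotype in genotypes.items():
        dp_i = depths.get(site, 0)
        efficiency = site_amplification_efficiency(dp_i, n_reads, read_len, genome_size)
        passes = dp_i >= mean_depth and efficiency >= 1
        masked[site] = genotype if passes else None  # None stands for missing data
    return masked
```

Sites that fail are kept in the output with a missing value rather than dropped, so that later genotype filling can still address them.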
  • the chromosomal interference suppression theory described in step (b) is applied as follows: when two exchanges (recombinations) occur between two molecular marker sites within a given genetic distance, the molecular markers in that recombination interval are judged to have genotyping errors, and those molecular marker sites are marked as missing data; for example, the genetic distance is any distance below 1 centimorgan (cM).
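A minimal sketch of this double-recombination check is given below. It assumes that each marker already carries an inferred parental haplotype of origin (0 or 1) and a genetic map position in cM. The input layout and function name are illustrative assumptions, not part of the described method.

```python
from typing import List


def flag_double_recombinants(positions_cm: List[float],
                             origins: List[int],
                             max_cm: float = 1.0) -> List[int]:
    """Return indices of markers lying between two exchanges closer than max_cm.

    positions_cm: genetic map positions of the markers, sorted ascending.
    origins: inferred parental haplotype (0 or 1) transmitted at each marker.
    """
    # An exchange is inferred wherever the haplotype of origin switches.
    switches = [i for i in range(1, len(origins)) if origins[i] != origins[i - 1]]
    flagged = set()
    for first, second in zip(switches, switches[1:]):
        # Distance between the markers flanking the two exchanges.
        span = positions_cm[second] - positions_cm[first - 1]
        if span < max_cm:
            # Markers between the two exchanges are treated as genotyping errors.
            flagged.update(range(first, second))
    return sorted(flagged)
```

The flagged indices would then be set to missing data before phasing.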
  • the present invention relates to a method for phasing the haplotypes of progeny, the method comprising the above-mentioned steps (a) and (b), and the following steps:
  • (c) Based on the pedigree information, Mendelian inheritance laws and a multi-locus linkage analysis strategy grounded in the theory of gene linkage and crossing over, the quality-controlled and filtered progeny genome sequence information (for example, the genotype information of the progeny) is used to phase the haplotype of the progeny, such as the chromosome-level haplotype of the progeny, wherein the pedigree information is the genome sequence information of the genetic father of the progeny (for example, the genotype information of the genetic father) and/or the genome sequence information of the genetic mother of the progeny (for example, the genotype information of the genetic mother); optionally, the pedigree information also includes the genome sequence information (for example, genotype information) of other pedigree individuals of the progeny.
  • the pedigree information described in step (c) of the method for phasing the haplotype of the offspring is obtained from a nucleic acid sample of a family individual comprising at least about 100 ng DNA (for example, 100-1000 ng DNA); for example, the family individual nucleic acid sample is a nucleic acid sample from blood, saliva, buccal swabs, urine, nails, hair follicles, dander, cells, tissues or body fluids of the family individual, and the coverage of the pedigree information for the pedigree individual is not less than about 90%, for example about 90%, 95%, 98%, 99% or more; for example, the pedigree information is obtained by analyzing genomic DNA of the pedigree individual (such as whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA and dander gDNA, preferably whole blood gDNA); the pedigree information is the data obtained by whole blood g
  • step (c) of the method for phasing the haplotypes of the offspring is implemented using statistical genetics and computational biology algorithms, for example, using a strategy based on the pedigree information selected from a likelihood method (finding the haplotype composition with the greatest probability), a genetic rule strategy (seeking the haplotype composition with the smallest number of recombinations), the Expectation Maximisation (EM) algorithm, and combinations thereof, to obtain the most likely haplotype composition of the offspring.
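As a simplified illustration of the pedigree-based part of phasing, the sketch below assigns the parent of origin to each allele of an offspring genotype at a single biallelic autosomal site using only Mendelian rules. It does not implement the likelihood, minimum-recombination or EM strategies named above; sites that Mendelian rules alone cannot resolve are returned as unphased. All names are hypothetical.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]


def phase_site(child: Genotype, father: Genotype,
               mother: Genotype) -> Optional[Tuple[str, str]]:
    """Return (paternal_allele, maternal_allele), or None if the site is
    ambiguous (for example all three heterozygous) or Mendelian-inconsistent."""
    first, second = child
    options = set()
    # Try both assignments of the two child alleles to the two parents.
    for paternal, maternal in ((first, second), (second, first)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    # Exactly one consistent assignment means the site is phased.
    return options.pop() if len(options) == 1 else None
```

In a fuller pipeline, the sites left unresolved here would be the ones handed to the multi-locus linkage analysis.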
  • EM Expectation Maximisation
  • the present invention relates to a method for reconstructing progeny genomes, the method comprising the above-mentioned steps (a), (b) and (c), and steps
  • step (d) of the method for reconstructing the offspring genome is to identify identical-by-descent (IBD) regions, that is, to determine which parental haplotype the embryo has inherited in a given region, and then to combine this with the parent's high-density polymorphic locus genotype information to fill in the genotype information missing at loci in the offspring.
  • step (d) of the method for reconstructing offspring genomes also involves, for genotype information that has not been successfully filled based on family information, using population reference haplotype information and the population-level rule of allelic linkage disequilibrium (LD) to fill in genotype information at the whole genome level;
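The family-based part of this filling step can be sketched as follows, under the assumptions that the parent's two haplotypes are already phased and that the transmitted parental haplotype (0 or 1) is known at some informative sites. A missing offspring allele is filled only when the informative sites on both sides agree on the transmitted haplotype. The population LD-based filling is not shown, and all names are illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def fill_from_parent(parent_haps: Tuple[Sequence[str], Sequence[str]],
                     transmitted: List[Optional[int]],
                     child_alleles: List[Optional[str]]) -> List[Optional[str]]:
    """Fill missing child alleles inherited from one parent.

    parent_haps: the parent's two phased haplotypes, one allele per site.
    transmitted: 0/1 index of the inherited parental haplotype at informative
                 sites, None where unknown.
    child_alleles: allele inherited from this parent, None where missing.
    """
    filled = list(child_alleles)
    known = [i for i, t in enumerate(transmitted) if t is not None]
    for left, right in zip(known, known[1:]):
        # Fill only inside segments whose two flanking informative sites agree,
        # i.e. segments with no evidence of recombination.
        if transmitted[left] == transmitted[right]:
            hap = parent_haps[transmitted[left]]
            for i in range(left + 1, right):
                if filled[i] is None:
                    filled[i] = hap[i]
    return filled
```

Sites outside any agreeing segment stay missing and would fall through to the population reference haplotype step.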
  • the present invention relates to a device or system capable of performing the above-mentioned removal of noisy genetic data from offspring; a device or system capable of performing the above-mentioned haplotype phasing; and a device or system capable of performing the above-mentioned genotyping.
  • the present invention relates to a device or system characterized in that,
  • It can perform whole-genome amplification of DNA samples, for example, whole-genome amplification of DNA samples of the offspring and/or of the offspring's genetic parents (in some embodiments, when the amount of the parent's DNA sample is sufficient, no amplification is required);
  • It can perform the detection of the sequence genetic information of the genome of the obtained whole-genome amplification product or gDNA sample, for example, read the sequence information after the nucleic acid chip or the second-generation sequencing;
  • the present invention relates to the use of the methods of the first to third aspects above or the use of the device or system of the fourth aspect to perform polygenic pre-implantation embryos and/or fetuses in early pregnancy.
  • Figure 1 shows a flow chart of a technical solution of the present invention.
  • Figure 2 shows the effect of the whole genome amplification efficiency of progeny trace nucleic acid on the quality of progeny genotypes.
  • the term “comprising” or “including” means including the stated elements, integers or steps, but does not exclude any other elements, integers or steps.
  • when the term "comprises" or "includes" is used, unless otherwise specified, it also covers the case consisting of the stated elements, integers or steps.
  • for example, when referring to an antibody variable region that "comprises" a specific sequence, it is also intended to encompass an antibody variable region consisting of that specific sequence.
  • offspring includes, but is not limited to, the offspring of a mammal, such as a human, and means a born or unborn offspring.
  • Unborn offspring include embryos or fetuses.
  • Embryo usually refers to the product of the division of the fertilized egg up to the end of the embryonic period, eight weeks after fertilization. The cleavage stage of the embryo occurs during the first three days of culture.
  • "Embryo transfer” is the operation of putting one or more embryos and/or blastocysts into the uterus or fallopian tube. Fetus usually refers to the unborn offspring of mammals after eight weeks of pregnancy, especially unborn human babies.
  • blastocyst is an embryo 5 or 6 days after fertilization, which has an inner cell mass, an outer cell layer called trophectoderm, and a fluid-filled blastocyst cavity that contains the inner cell mass from which the entire embryo is derived.
  • the trophectoderm is the precursor of the placenta.
  • "related individuals" or "family individuals" of the offspring are used interchangeably, and refer to any individual that is genetically related to the target offspring individual and therefore shares genetic material with the target offspring.
  • the related individual may be the genetic parent of the target individual or any genetic material derived from the parent, such as sperm, polar bodies, other embryos or fetuses. It can also refer to siblings, parents or grandparents. In this application, parent refers to the genetic father or mother of an individual.
  • Offspring individuals usually have two parents (a mother and a father). A sibling refers to any individual whose genetic parents are the same as those of the offspring individual in question.
  • siblings can refer to a born child, an embryo or fetus, or one or more cells derived from an embryo or fetus; siblings can also refer to haploid individuals derived from one parent, for example, sperm, polar bodies, or any other haploid genetic material.
  • DNA derived from the progeny refers to DNA that was originally part of a progeny cell, progeny body fluid or progeny cell culture fluid and whose genotype is essentially equivalent to the genotype of the progeny.
  • Parent-derived DNA refers to DNA that was originally part of a parent cell, parent body fluid or parent cell culture fluid and whose genotype is essentially equivalent to the parental genotype.
  • maternal DNA refers to DNA that was originally part of a maternal cell, maternal body fluid or maternal cell culture fluid and whose genotype is essentially equivalent to the maternal genotype.
  • SNP Single Nucleotide Polymorphism
  • the frequency of a SNP in a population is generally >1%. On average, there is one SNP every 300-1000 bp across the human genome.
  • SNP data are currently available from a number of public databases, including, for example, http://cgap.ncbi.nih.gov/GAI; http://www.ncbi.nlm.nih.gov/SNP; and the human SNP database at http://hgbas.cgr.ki.sei or http://hgbase.interactiva.de/.
  • genotype refers to the type of alleles possessed by an individual at a locus, which is called the genotype of the individual at that locus.
  • For humans, except for the sex chromosomes, the pair of alleles that each pair of homologous chromosomes carries at the same locus is called the genotype of that locus. Genotyping refers to the process of determining the genotype of an individual.
  • noisy genetic data refers to genetic data with any of the following: allele dropout (ADO), uncertain base pair measurement, wrong base pair measurement, missing base pair measurement, indeterminate measurement of insertion or deletion, indeterminate measurement of chromosome segment copy number, false signal, other measurement errors, or a combination thereof.
  • ADO Allele Dropout
  • Sequencing depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the size of the genome being tested. For example, for a genome size of 2M and a sequencing depth of 10X, the total amount of data obtained is 20M.
  • absolute sequencing depth of a site and “absolute depth of a site” are used interchangeably and refer to the number of reads of the site.
  • average sequencing depth of the genome and “average depth of the genome” are used interchangeably, and refer to adding the absolute depth of each site on the whole genome and dividing by the number of sites to obtain the average depth of the genome.
  • the average sequencing depth of the genome can also be understood as the average number of times each base in the genome has been sequenced.
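The two depth definitions above reduce to simple ratios. The sketch below reproduces the 2M genome, 10X example from the text. The function names are illustrative.

```python
from typing import Sequence


def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Ratio of the total number of sequenced bases to the genome size."""
    return total_bases / genome_size


def average_genome_depth(site_depths: Sequence[int]) -> float:
    """Sum of the absolute depth of every site divided by the number of sites."""
    return sum(site_depths) / len(site_depths)


# Example from the text: 20M bases over a 2M genome is a depth of 10X.
depth = sequencing_depth(20_000_000, 2_000_000)
```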
  • Read is also called “read length”. Each sequence in the sequencing data is a read.
  • the term "coverage" refers to the proportion of a genome, transcriptome or chromosome segment for which sequence information is known, relative to the entire genome, transcriptome or segment.
  • the coverage refers to the ratio of the number of bases of sequence information detected (for example, by means of sequence detection, such as sequencing) to the total number of bases in the detected region.
  • the coverage is the ratio of the number of sequenced bases finally obtained to the number of bases of the entire genome.
  • the coverage obtained by sequencing the human genome is 98.5%, which indicates that there are still 1.5% regions of the genome that have not been sequenced.
  • the coverage can also refer to the number of genetic sites (such as SNP sites or genetic variation sites) in the detected region for which sequence information is detected (for example, by SNP chip or sequencing analysis), as a proportion of the total number of genetic sites in that region.
  • the detected region can be the whole genome, a specific chromosome, or a specific chromosome segment, or a transcript set, or a specific transcription region.
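Both senses of coverage described above, base-level and locus-level, can be computed as simple proportions. In this sketch a base counts as covered when its depth is greater than zero, and a locus counts as detected when it has a non-missing genotype call. Both conventions, and the site identifiers in the usage example, are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def base_coverage(site_depths: Sequence[int]) -> float:
    """Fraction of bases in the detected region with at least one read."""
    covered = sum(1 for depth in site_depths if depth > 0)
    return covered / len(site_depths)


def locus_coverage(genotypes: Dict[str, Optional[str]]) -> float:
    """Fraction of genetic sites in the detected region with a genotype call."""
    called = sum(1 for genotype in genotypes.values() if genotype is not None)
    return called / len(genotypes)


# Hypothetical region with four SNP sites, one of them without a call.
example = locus_coverage({"rs1": "AG", "rs2": None, "rs3": "AA", "rs4": "GG"})
```

The "not more than about 30%" and "not less than about 90%" thresholds quoted for offspring and pedigree samples could be checked against either function, depending on which sense of coverage is intended.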
  • FASTQ is one of the standard formats for sequence data storage. Each read occupies four lines: the sequencing read name, the sequence, a separator line beginning with "+", and the sequence quality values.
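The four-line record layout described above matches the widely used FASTQ format. A minimal reader might look like the sketch below. It assumes uncompressed input with exactly four lines per record and Phred+33 quality encoding. The encoding offset is an assumption, since the text does not specify it.

```python
from typing import Iterable, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> List[FastqRecord]:
    """Parse four-line records: '@name', sequence, '+' separator, quality string."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    records = []
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append(FastqRecord(header[1:], sequence, quality))
    return records


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]
```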
  • Mendelian law of inheritance refers to the two basic laws of genetics, the law of segregation and the law of independent assortment, collectively referred to as Mendel's laws of inheritance. According to these laws, during meiosis, alleles separate as homologous chromosomes separate, enter different gametes, and are passed independently via the gametes to the offspring; in addition, as the alleles separate, non-allelic genes on non-homologous chromosomes assort independently.
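The law of segregation gives a direct test for genotyping errors in a trio: at an autosomal site, one of the offspring's alleles must be present in the father and the other in the mother. The sketch below applies that test and marks inconsistent offspring genotypes as missing. It assumes diploid autosomal sites and ignores de novo mutation. The data layout is hypothetical.

```python
from typing import Dict, Optional, Tuple

Genotype = Tuple[str, str]
Trio = Tuple[Genotype, Genotype, Genotype]  # (child, father, mother)


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other from the mother."""
    first, second = child
    return ((first in father and second in mother) or
            (second in father and first in mother))


def mask_mendelian_errors(trios: Dict[str, Trio]) -> Dict[str, Optional[Genotype]]:
    """Return child genotypes per site, with Mendelian-inconsistent sites set to None."""
    return {
        site: child if is_mendelian_consistent(child, father, mother) else None
        for site, (child, father, mother) in trios.items()
    }
```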
  • linkage disequilibrium (LD) refers to the situation in which alleles at two or more gene loci appear together on one chromosome more frequently than expected by chance. Linkage disequilibrium is also called allelic association. Generally, the strength of LD is related to the distance between the two gene loci: the farther apart the two loci are, the greater the chance of recombination, that is, the higher the recombination rate (exchange rate) and the weaker the LD; conversely, the closer they are, the lower the recombination rate and the stronger the LD. The recombination rate can therefore be used to reflect the relative distance between two genes on the same chromosome. When the recombination rate is 1%, the distance between the two genes is recorded as 1 centimorgan (cM).
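The "higher than expected by chance" idea is commonly quantified with the coefficient D = p_AB - p_A·p_B and the squared correlation r². These are standard population-genetics statistics rather than measures named in this text. The sketch below computes them from a list of phased two-locus haplotypes, with illustrative allele labels.

```python
from typing import Sequence, Tuple


def ld_statistics(haplotypes: Sequence[Tuple[str, str]],
                  allele_a: str = "A", allele_b: str = "B") -> Tuple[float, float]:
    """Return (D, r2) for two biallelic loci from phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n
    # D is the excess of the observed haplotype frequency over the value
    # expected if the two loci were independent.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denominator if denominator > 0 else float("nan")
    return d, r2
```

When the two loci are independent, D and r² are both 0; when each allele at one locus always travels with the same allele at the other, r² is 1.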
  • chromosomal interference refers to the phenomenon in which two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and inhibit each other during meiosis. In this document, the suppression of nearby crossovers by chromosomal interference is used for quality control of genotyping data.
  • the genome formation process of the offspring is equivalent to a random recombination of the parental genomes (that is, haplotype recombination through crossing over, followed by random combination of gametes).
  • haplotype refers to a combination of alleles at multiple sites that are usually inherited together on the same chromosome. Depending on the number of recombination events that have occurred between a set of designated sites, a haplotype can refer to as few as two sites, or to the entire chromosome. Haplotype can also refer to a set of statistically associated single nucleotide polymorphisms (SNPs) on a single chromatid.
  • haplotype data is also referred to as "phased data" or "ordered genetic data", and refers to the genetic data determined for a single chromosome in a diploid or polyploid genome.
  • Unordered Genetic Data refers to data obtained by combining the sequencing data of two or more chromosomes in a diploid or polyploid genome.
  • haplotype phasing refers to the act of determining an individual's haplotype genetic data from unordered diploid or polyploid genetic data. It can refer to determining, for a set of alleles found on a chromosome, which of the two alleles at each locus is associated with each of the two homologous chromosomes in the individual.
  • the haplotype phasing of multiple sites can find the haplotype-disease phenotype correlation, which is significantly stronger than the single site-disease phenotype correlation.
  • SNP chip is a chip that uses the signal (usually a fluorescent signal) obtained after hybridization of the chip to determine the genotype of a certain site.
  • SNP chips will contain different SNP sites depending on chip manufacturers and models.
  • human chips produced by Affymetrix and Illumina contain different sets of SNPs.
  • IBD Identity By Descent
  • "high-density genetic polymorphism information of parents and other family members of embryos" reflects the fact that, when the same genetic analysis method is used, the density of genetic polymorphisms obtained for parents and for embryos differs. The reason is that parental samples are gDNA samples whose concentration is high, so most genotype locus information can be obtained without difficulty, whereas embryos usually rely on whole genome amplification products of single cells or of DNA in embryo culture fluid, which suffer from uneven whole genome amplification and amplification errors such as ADO, making the available genetic polymorphic locus information of embryos relatively sparse.
  • Genome-wide association study (GWAS) refers to screening sequence variations across the whole human genome to identify those associated with a disease, in order to find associations between genetic markers and disease cost-effectively.
  • module refers to a software object or routine (e.g., as an independent thread) that can be executed on a single computing system (e.g., computer program, tablet computer (PAD), one or more processors).
  • the program for implementing the method of the present invention may be stored on a computer-readable medium, which contains computer program logic or code parts, for implementing the system modules and methods.
  • although the system modules and methods described herein are preferably implemented by software, implementation by hardware or by a combination of software and hardware is also possible and can be conceived by those skilled in the art.
  • the present invention generally provides a method for reconstructing progeny genomes, the method comprising:
  • step (b) Perform quality control and filtering on the genomic sequence information of the offspring of step (a), and remove the sites with poor genotyping quality;
  • the offspring may be born or unborn offspring.
  • the related individual may be any individual who is genetically related to the target individual.
  • the method of the present invention involves genomic information processing and/or reconstruction based on original genetic data.
  • the original genetic data applicable to the method of the present invention includes genomic sequence information of the offspring and/or related individuals and related original genetic data, such as genotype information generated based on the sequence information. These original genetic data are disordered and unphased.
  • the original genetic data is in the form of a data set, for example, in the form of a computer-readable data set.
  • the user of the method of the present invention can directly provide a computer-readable medium recording the data, or a data package generated on a commercial platform, or, preferably, the data is obtained from the target nucleic acid sample by any sequence information detection technique known in the art.
  • the acquisition of original genetic data includes: obtaining genome sequence information of the offspring, and performing genotype analysis of the offspring based on this information.
  • the method further includes: obtaining genomic sequence information of related individuals and performing genotype analysis.
  • the step of obtaining genomic sequence information of offspring and/or related individuals includes:
  • the genomic sequence information includes, but is not limited to: sequence information of the whole genome, sequence information of the whole exome, and/or sequence information of a targeted chromosome region.
  • the sequence information may be, for example, but not limited to, a gene sequencing data set, a SNP data set, and a gene mutation site data set.
  • sequence information is not particularly limited. Any sequence detection technology suitable for nucleic acid known in the art is applicable to the present invention.
  • sequencing technology is used to detect sequence information, including but not limited to: whole genome sequencing, whole exome sequencing, and targeted sequencing.
  • whole-genome sequencing is used, and more preferably, high-throughput sequencing technology such as second-generation sequencing technology is used to detect sequence information of the whole genome from a nucleic acid sample.
  • the sequence information can be detected by a method selected from the group consisting of detection of the whole genome, the whole exome, and gene polymorphic sites (for example, SNPs or short tandem repeats) in targeted genomic regions.
  • a CBC-PMRA (Capital Biotechnology Precision Medicine Research Array) chip from Boao Jingdian, based on Affymetrix's PMRA (Precision Medicine Research Array) chip, which can detect 900,000 SNP sites.
  • an ASA (Asian Screening Array) chip.
  • the offspring can be born or unborn offspring.
  • the offspring are unborn offspring, preferably fetuses or embryos, more preferably embryos produced by, for example, IVF.
  • the embryo may be an embryo about 3-10 days old, for example, a blastocyst about 5 days old.
  • the progeny nucleic acid sample is a sample containing a trace amount of genomic DNA of the progeny, for example, a trace nucleic acid sample of the offspring containing about 0.1 pg-40 ng DNA, for example, 1-40 ng DNA, 20-40 ng DNA, 0.1-40 pg DNA, 1-40 pg DNA, or 10-40 pg DNA.
  • the progeny trace genomic DNA nucleic acid samples include, but are not limited to, embryo culture fluid (for example, IVF embryo culture fluid), blastocyst culture fluid (such as about 3-5 day-old blastocyst culture fluid), fetal cell-free DNA in blastocoel fluid, maternal plasma, or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, and fetal cells in maternal blood or other types of maternal body fluids;
  • CN106086199A discloses a method for detecting embryo chromosomes using blastocyst culture fluid, in particular it discloses the method of obtaining blastocyst culture fluid and the method of performing genome amplification on the obtained blastocyst culture fluid.
  • CN105368936A also discloses a method for detecting embryonic chromosomes using blastocyst culture fluid; in particular, it discloses the collection of blastocyst culture fluid and whole genome amplification from trace DNA in the culture fluid, including the design of the primers used for amplification and the design of the amplification reaction program.
  • CN109536581A discloses a method for obtaining a blastocyst culture solution used as a nucleic acid sample for genotyping analysis.
  • CN105543339A discloses embryos produced from in vitro fertilization (IVF) technology, obtaining outer trophoblast cells at the blastocyst stage, and performing embryo chromosome genome amplification.
  • the embryonic nucleic acid samples and their amplification methods disclosed in the above-mentioned documents are all suitable for obtaining the original genetic data of the offspring in the present invention, and they are hereby incorporated in their entirety into the present invention as a reference.
  • the culture fluid is aspirated from the culture of an embryo fertilized by the intracytoplasmic sperm injection technique (ICSI), preferably on day 3-10 of culture, preferably on day 5, and is used as a trace nucleic acid sample of the offspring to obtain the genomic sequence information of the offspring.
  • a single embryo culture system is used to culture the embryos in 0.1 µl-1 ml of culture medium, and a small amount of culture medium (for example, about 0.1 µl-1 ml, for example, about 0.1 µl, 10 µl, 20 µl, 30 µl, 40 µl, 50 µl, 100 µl, 200 µl, 500 µl, 800 µl, or 1 ml) is taken for genetic information detection and genotype analysis of the offspring.
  • the surface of the egg or the fertilized egg may be washed before the embryo culture is performed to remove the DNA impurities on the surface of the fertilized egg, thereby reducing the influence of the DNA impurities in the culture fluid.
  • for this cleaning, refer, for example, to the descriptions in CN201610584345.5 and TW10612113. These documents are hereby incorporated by reference.
  • the type of progeny nucleic acid sample used for sequence detection is not particularly limited, and it may be a sample containing a large amount of nucleic acid or a sample containing a small amount of nucleic acid.
  • the method of the present invention is particularly suitable for reconstructing embryonic genome information from embryo-derived DNA that is low in quantity and present as small fragments, such as cell-free DNA (cfDNA) in blastocyst culture fluid.
  • the present invention is particularly useful for prenatal diagnosis, for example, before the establishment of pregnancy (for example, before embryo implantation in IVF technology), in ligands and cells or culture medium taken from early embryos, or in the later stages of pregnancy in cell samples taken from the placenta or fetus or in fetal DNA taken from maternal body fluids, such as fetal cfDNA in maternal body fluids, for offspring haplotype phasing and/or genome reconstruction.
  • the offspring are unborn offspring, and a sample containing a small amount of offspring nucleic acid is used, for example, a single cell of an embryo produced by IVF or a culture medium of an embryo.
  • the offspring is a fetus, and the free DNA content of the fetus contained in the offspring nucleic acid sample is, for example, 0.1 pg-40 ng, preferably 1-40 ng, more preferably 20-40 ng.
  • the offspring are embryos, and the free DNA content of the embryo contained in the offspring nucleic acid sample is, for example, 0.1-40 pg, preferably, 1-40 pg, and more preferably, 10-40 pg.
  • Those skilled in the art can use any known method to take nucleic acid samples from related individuals of the offspring, detect the genomic sequence information of the related individuals, and then obtain the genotype and haplotype information, thereby providing the family genotype information of the offspring.
  • the type of nucleic acid sample used for obtaining the raw data of related individuals is not particularly limited, and may be a sample containing a large amount of nucleic acid or a sample containing a small amount of nucleic acid.
  • nucleic acid samples that are conducive to obtaining high-density genotypic site information in the whole genome will be preferred.
  • the nucleic acid sample is any sample that contains genomic DNA nucleic acids of related individuals.
  • the sample may be a nucleic acid sample of the related individual that contains at least about 1 ng DNA (for example, 1 pg-1000 ng DNA); for example, the nucleic acid sample of the related individual is a nucleic acid sample from tissue, cells, and/or body fluid of the related individual, for example, a nucleic acid sample from blood, saliva, oral epithelium, urine, nails, hair follicles, or dander.
  • the nucleic acid sample may or may not be extracted and/or purified.
  • the nucleic acid contained in the nucleic acid sample is genomic DNA (gDNA) selected from the following sources: whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dandruff gDNA, preferably whole blood gDNA.
  • nucleic acid amplification can be performed.
  • Those skilled in the art can use any known nucleic acid amplification technology to perform whole-genome amplification of nucleic acids of progeny and/or related individuals.
  • the amplification method is selected from: primer extension preamplification PCR (PEP-PCR); degenerate oligonucleotide primer PCR (DOP-PCR); multiple displacement amplification (MDA); multiple annealing and looping-based amplification cycles (MALBAC); a blunt-end or sticky-end ligation library method; or a combination thereof.
  • when the progeny nucleic acid content in the sample is small, it is preferable to use a whole-genome amplification method suitable for single cells, such as the MALBAC method, so as to reduce the erroneous gene sequence information caused by amplification.
  • the sequence information of the genome is preferably detected by a technique selected from nucleic acid chips, amplification and/or sequencing.
  • the technology can be any such technology known in the art, including, but not limited to, single nucleotide polymorphism (SNP) microarray chips, MassARRAY time-of-flight mass spectrometry chips, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing, or a combination thereof.
  • SNP chips are used to obtain genomic sequence information of progeny and/or related individuals.
  • the SNP chip contains at least 700k sites, such as 800-1000K sites.
  • genome sequence information is obtained through whole genome sequencing.
  • a high-throughput sequencing platform is used to sequence the whole genome amplification products of the nucleic acid sample.
  • the sequencing platform is not particularly limited.
  • the second-generation sequencing platforms include, but are not limited to: Illumina's GA, GAII, GAIIx, HiSeq 1000/2000/2500/3000/4000, X Ten, X Five, NextSeq 500/550, and MiSeq; Applied Biosystems' SOLiD; Roche's 454 FLX; and Thermo Fisher Scientific (Life Technologies)'s Ion Torrent, Ion PGM, and Ion Proton I/II. The third-generation single-molecule sequencing platforms include, but are not limited to: Helicos BioSciences' HeliScope system, Pacific Biosciences' SMRT system, and Oxford Nanopore Technologies' GridION and MinION.
  • the sequencing type can be single-end (SingleEnd) sequencing or paired-end (PairedEnd) sequencing, and the sequencing length can be 30bp, 40bp, 50bp, 100bp, 300bp, etc., any length greater than 30bp.
  • whole-genome sequencing is performed using a sequencing depth such as ⁇ 20x. More preferably, for related individuals, the sequencing depth can be higher, such as at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, at least 80X, or above.
  • the gDNA of related individuals adopts a high-depth whole-genome sequencing strategy in order to obtain high-accuracy and high-density polymorphic molecular markers.
  • a low-depth whole-genome sequencing method can be used, which is beneficial to cost control. Therefore, in some embodiments, low-depth sequencing can be used to obtain relatively low-density genotype information from samples containing trace progeny nucleic acids, such as embryo culture fluid; for example, the sequencing depth is as low as 2x, or even less than 1x. Of course, the higher the sequencing depth, the higher the accuracy of the offspring genome reconstruction.
  • quality control and filtering of the sequence information data are performed to remove low-quality data.
  • Any tool known in the art that can perform such information quality control and clean up noisy genetic data can be used for this, including but not limited to various software that performs data quality control and filtering on the original fastq files generated by sequencing, for example, the fastp software.
  • a variety of methods for analyzing the genotype of a subject based on the genome sequence information of a subject are known in the art, including various algorithms and computer executable programs. As understood by those skilled in the art, these methods are all applicable to the genotype analysis in the method of the present invention.
  • genotype analysis includes: determining the genotype of the offspring and/or related individuals based on the genome sequence information of the offspring and/or related individuals. In some preferred embodiments, this means, for example, determining the SNP polymorphism sites and genotypes of the subject based on SNP chip detection data, or analyzing the genetic variation sites and genotypes in the subject's genome based on a sequencing data set.
  • genotype analysis includes:
  • for nucleic acid samples such as embryo culture fluid, trace nucleic acid whole-genome amplification data from multiple cases (such as at least 200, 300, 400, or more cases) are used as a database for correcting the sequence information.
  • the type of reference genome sequence is not particularly limited.
  • a known human reference genome can be used as a reference sequence, such as the hg19 and hg38 reference genomes provided by UCSC.
  • for different reference genome versions, the coordinate system will be different. Therefore, in the analysis process, it is necessary to map the detected sequence data (such as sequencing data or SNP chip detection data) to the specific reference genome used, in order to keep the information consistent.
  • the genomic coordinates can also be converted by means known in the art, such as LiftOver.
  • the method of comparison is not particularly limited.
  • the BWA-MEM algorithm is used to align the sequence to a reference genome such as hg19.
  • the obtained comparison files are sorted and indexed, as well as deduplication and base quality correction.
  • genotype analysis includes:
  • the method further includes the step of obtaining the genotype information of the progeny of other genomic regions other than the nucleic acid chip site.
  • a high-density SNP chip that evenly covers the entire genome of Asians, especially Chinese, is used to meet the needs of genome-wide association analysis and genotyping.
  • a chip containing at least 500,000 (also referred to as 500K) SNP sites, at least 600K SNP sites, 800K SNP sites, 900K SNP sites, or even more sites is used to perform genotyping analysis on the whole-genome amplification products of the offspring and related individuals.
  • SNP genotype analysis tools are known in the art.
  • the Genotyping function module in the Axiom™ Analysis Suite analysis platform of Thermo Fisher Scientific can be used to perform SNP genotype analysis, and SNP sites whose genotype quality meets the PolyHighResolution, NoMinorHom, MonoHighResolution, and Hemizygous standards can be selected for use in the subsequent steps of the method of the present invention.
  • the method of the present invention includes: obtaining a whole genome sequencing data set of offspring and related individuals, and detecting gene mutation sites based on the data set.
  • the sequencing data set is preferably a data set obtained after quality control and cleaning of the original sequencing data, alignment to a reference genome, sorting, and deduplication, for example in BAM format.
  • the prior art describes the quality control and cleaning of raw sequencing data, see CN108573125A.
  • the sequencer for acquiring the data includes an Illumina platform.
  • the Genome Analysis Toolkit (GATK) best-practices strategy is used for gene mutation analysis.
  • quality control filtering is performed on the obtained gene mutation site to obtain sites that have genotype information in the parents and can be used for linkage analysis.
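  • As a minimal sketch of this filtering step: the text does not spell out what makes a site usable for linkage analysis, so the example below assumes (hypothetically) that a site qualifies when both parents have a genotype call and at least one parent is heterozygous. The function name and the dictionary layout are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None means no genotype call


def is_het(gt: Genotype) -> bool:
    """True when a genotype is called and its two alleles differ."""
    return gt is not None and gt[0] != gt[1]


def linkage_informative_sites(father: Dict[str, Genotype],
                              mother: Dict[str, Genotype]) -> List[str]:
    """Keep sites genotyped in both parents with at least one heterozygous parent.

    Assumption: this is one plausible reading of 'can be used for linkage
    analysis'; the source does not define the criterion.
    """
    keep = []
    for site, f_gt in father.items():
        m_gt = mother.get(site)
        if f_gt is None or m_gt is None:
            continue  # a parent lacks genotype information at this site
        if is_het(f_gt) or is_het(m_gt):
            keep.append(site)
    return keep


father = {"rs1": ("A", "C"), "rs2": ("G", "G"), "rs3": None, "rs4": ("T", "C")}
mother = {"rs1": ("C", "C"), "rs2": ("G", "G"), "rs3": ("A", "A"), "rs4": ("T", "C")}
informative = linkage_informative_sites(father, mother)
print(informative)
```

  Sites where both parents are homozygous carry no information about which parental haplotype was transmitted, which is why they are dropped in this sketch.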
  • when a sample containing a small amount of progeny nucleic acid is used for genotyping, for example, when cfDNA in embryo culture fluid, an embryonic tissue biopsy, or fetal free DNA is used as the sample, the amount of progeny nucleic acid in the sample is small and the fragments are short, so there is often a high rate of genotyping errors.
  • the removal of noisy genetic data includes: identifying a genotyping error site in the genotyping data and marking the site as missing data.
  • the present invention uses the quality control of the amplification efficiency of the whole genome to identify the poorly typed sites in the progeny genotyping data caused by the low amplification efficiency, and mark the sites as missing data.
  • the heterogeneity of genome-wide amplification efficiency is a feature of single-cell amplification technology used for the amplification of trace nucleic acid samples, and regions with low amplification efficiency will lead to poor genotyping quality of base sites in this region.
  • the inventor proposes to construct a reference sequencing data set using multi-sample whole-genome amplification products to determine the distribution pattern of the whole-genome amplification efficiency of trace nucleic acids.
  • the quality control of the whole genome amplification efficiency of trace nucleic acid is implemented as follows:
  • the sequencing depth is not higher than about 0.5X, not higher than about 0.4X, not higher than about 0.3X, not higher than about 0.2X, not higher than about 0.1X
  • the sequencing depth is about 0.06X
  • the sequencing data are aligned to the reference genome to produce BAM files, and the BAM files of multiple reference samples are merged into one large BAM library.
  • the BAM files of the multiple reference samples are BAM files of at least 300, 400, 500, 600, 700, 800 reference samples.
  • $DP_i$ represents the absolute depth of the i-th site
  • N represents the number of sequencing reads
  • L represents the read length
  • the present inventors used the next-generation sequencing data of the MALBAC whole genome amplification products of 463 trace nucleic acids to plot the whole genome amplification efficiency distribution
  • the genotype information of a locus with low amplification efficiency (for example, amplification efficiency ⁇ 1) is treated as poorly typed, and the locus is marked as missing data.
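  • The formula relating the site depth, the number of reads N, and the read length L is not reproduced in the text above. The sketch below therefore assumes one common normalization: the amplification efficiency of a site is its absolute depth divided by the genome-wide expected depth N·L/G, where the genome length G is a parameter introduced here for illustration. Sites below the threshold of 1 (the strict inequality is also an assumption) are set to missing.

```python
from typing import List, Optional


def amplification_efficiency(depths: List[int], n_reads: int,
                             read_len: int, genome_len: int) -> List[float]:
    """Relative depth per site: DP_i / (N * L / G).

    Assumption: this normalization is illustrative; the source formula is missing.
    """
    expected_depth = n_reads * read_len / genome_len
    return [dp / expected_depth for dp in depths]


def mask_low_efficiency(genotypes: List[Optional[str]],
                        efficiency: List[float],
                        threshold: float = 1.0) -> List[Optional[str]]:
    """Replace genotypes at low-efficiency sites with None (missing data)."""
    return [gt if eff >= threshold else None
            for gt, eff in zip(genotypes, efficiency)]


# Toy pooled reference library: 1,000,000 reads of 100 bp on a 10 Mb genome.
depths = [25, 10, 4, 0]
eff = amplification_efficiency(depths, n_reads=1_000_000,
                               read_len=100, genome_len=10_000_000)
masked = mask_low_efficiency(["A/C", "C/C", "A/A", "G/G"], eff)
print(eff, masked)
```

  In practice the depths would come from the merged BAM library of reference samples rather than from a hand-written list.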
  • the Mendelian error rate refers to the ratio of the loci where Mendelian genetic errors occur to the total loci.
  • the present invention uses Mendelian laws of inheritance and chromosomal interference theory to identify allele drop-out (ADO) and other genotyping errors in progeny genotyping data and mark them as missing data.
  • Mendelian inheritance law means that if, at a certain locus, the father is of the A/C genotype and the mother is of the C/C genotype, their offspring must be of the A/C or C/C genotype, unless a new mutation occurs (the frequency of which is extremely low). If the genotype of the offspring at this locus is A/A, it means that ADO may have occurred at this locus, and the information at this locus is marked as missing data.
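  • The Mendelian check and the Mendelian error rate described above can be sketched as follows. This is a minimal illustration for a father-mother-offspring trio at biallelic autosomal loci; the function names are hypothetical, and new mutations are ignored.

```python
from itertools import product
from typing import List, Tuple

Genotype = Tuple[str, str]
Trio = Tuple[Genotype, Genotype, Genotype]  # (father, mother, child)


def mendel_consistent(father: Genotype, mother: Genotype, child: Genotype) -> bool:
    """True if the child genotype can be formed from one paternal and one maternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mendelian_error_rate(trios: List[Trio]) -> float:
    """Ratio of loci with a Mendelian inheritance error to the total loci checked."""
    if not trios:
        return 0.0
    errors = sum(not mendel_consistent(f, m, c) for f, m, c in trios)
    return errors / len(trios)


trios = [
    (("A", "C"), ("C", "C"), ("A", "C")),  # consistent
    (("A", "C"), ("C", "C"), ("C", "C")),  # consistent
    (("A", "C"), ("C", "C"), ("A", "A")),  # inconsistent: possible ADO
    (("G", "G"), ("G", "T"), ("G", "T")),  # consistent
]
rate = mendelian_error_rate(trios)
print(rate)
```

  A locus that fails the check would be set to missing in the offspring rather than corrected, matching the marking step described in the text.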
  • chromosomal interference refers to the phenomenon in which two adjacent single crossovers between non-sister chromatids of homologous chromosomes affect and inhibit each other during meiosis.
  • the present invention adopts the interference inhibition theory: when two exchanges (recombinations) occur between molecular marker sites within a given genetic distance, it is determined that the molecular marker in the recombined segment has a genotyping error, and the marker site is marked as missing data; the genetic distance is, for example, any distance below 1 centimorgan (cM).
  • when constructing the paternal-origin and maternal-origin haplotypes of the offspring, the constructed haplotype may show two recombinations within a small genetic distance. For example, molecular marker A (a SNP locus genotype) and the sites before it all indicate that the paternal haplotype was inherited from the grandfather, the next molecular marker B indicates that the paternal haplotype was inherited from the grandmother, and molecular marker C downstream of B and the sites after it all indicate that the paternal haplotype was inherited from the grandfather, with the A, B, and C loci lying within a relatively small genetic distance such as 1 centimorgan. In this case, it can be inferred that the genotype of the B locus is wrong, and it is marked as missing data.
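  • The A, B, C example can be sketched in code as follows. The representation is an assumption made for illustration: each marker is a pair of genetic position in cM and inferred grandparental origin ("GF" or "GM"), sorted by position. A run of markers whose origin differs from both flanking runs is flagged when the two switch points lie less than `max_cm` apart.

```python
from itertools import groupby
from typing import List, Tuple


def flag_double_recombinants(markers: List[Tuple[float, str]],
                             max_cm: float = 1.0) -> List[int]:
    """Return indices of markers implying two recombinations within max_cm.

    markers: (genetic position in cM, grandparental origin), sorted by position.
    """
    # Collapse consecutive markers with the same origin into runs.
    runs = []  # (origin, first index, last index)
    start = 0
    for origin, group in groupby(markers, key=lambda m: m[1]):
        size = len(list(group))
        runs.append((origin, start, start + size - 1))
        start += size

    flagged = []
    for k in range(1, len(runs) - 1):
        prev_run, run, next_run = runs[k - 1], runs[k], runs[k + 1]
        left_pos = markers[prev_run[2]][0]   # last marker before the switch
        right_pos = markers[next_run[1]][0]  # first marker after the switch back
        if prev_run[0] == next_run[0] and right_pos - left_pos < max_cm:
            flagged.extend(range(run[1], run[2] + 1))
    return flagged


markers = [(10.0, "GF"), (10.2, "GF"), (10.4, "GM"), (10.6, "GF"),
           (11.5, "GF"), (30.0, "GM"), (31.0, "GM")]
flagged = flag_double_recombinants(markers)
print(flagged)
```

  The marker at 10.4 cM plays the role of B and is flagged; the switch at 30 cM is a single recombination far from the others and is left alone. Adjacent short runs are not resolved by this simple pass.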
  • multiple (preferably more than 2) offspring samples are used (preferably unborn offspring samples, such as 2-4 or more blastocyst culture fluid samples, or siblings of embryos), and the genotype data of multiple offspring are obtained.
  • the mutual derivation of the haplotypes of the multiple offspring refers to using the haplotype phasing method of the present invention, together with the obtained genotype data of multiple offspring, to deduce the most likely haplotype composition of the offspring. This identifies the sites with genotype errors, and these sites are marked as missing data.
  • the removal of noisy genetic data includes: performing the above-mentioned quality control of whole-genome amplification efficiency and at least one quality control selected from the following: quality control identifying Mendelian inheritance errors, quality control identifying chromosomal interference inhibition, and quality control by mutual derivation of multiple progeny haplotypes.
  • the removal of noisy genetic data includes: quality control of whole-genome amplification efficiency and all three of the following: quality control identifying Mendelian inheritance errors, quality control identifying chromosomal interference inhibition, and quality control by mutual derivation of multiple progeny haplotypes.
  • the haplotype phasing of the offspring can be performed to determine the paternal and maternal haplotype composition of the offspring.
  • it is preferable to perform haplotype phasing based on the pedigree to obtain the paternal and maternal haplotype composition of the offspring.
  • the haplotype phasing includes:
  • a multi-locus linkage analysis strategy based on Mendelian law of inheritance and linkage disequilibrium theory is used to construct haplotypes of progeny at the chromosome level.
  • an algorithm selected from the group consisting of Lander-Green algorithm, Elston-Stewart algorithm, and Idury-Elston algorithm is used for haplotype phasing.
  • the haplotype phasing method of the present invention further includes: using nucleic acid samples of the paternal grandparents and/or maternal grandparents of the (for example, unborn) offspring to construct the haplotypes of the unborn offspring and their parents.
  • a family-based multi-site linkage analysis method is used for haplotype analysis.
  • the haplotype analysis includes the use of multiple, preferably more than two progeny samples.
  • the paternal grandparents and/or maternal grandparents of the unborn offspring can also be used to construct the haplotypes of the unborn offspring and their parents in the haplotype analysis.
  • the following methods are used for family-based haplotype analysis: Lander-Green algorithm, Elston-Stewart algorithm, or Idury-Elston algorithm.
  • haplotype construction is performed based on pedigree information to obtain the most likely haplotype composition inherited by the offspring from the parents.
  • Construction methods include, but are not limited to, the likelihood method strategy (seeking the haplotype composition with the maximum probability), the genetic rule strategy (the haplotype composition with the minimum recombination number), and the Expectation Maximisation (EM) algorithm.
  • the likelihood method strategy includes, but is not limited to: Lander-Green algorithm and Viterbi dynamic programming algorithm, Elston-Stewart algorithm and Bayesian network algorithm, and the preferred solutions are Lander-Green algorithm and Viterbi dynamic programming algorithm.
  • the genetic rule method includes a zero recombination hypothesis strategy and a minimum recombination hypothesis strategy, and available software carriers include but are not limited to ZAPLO and HAPLORE.
  • haplotype phasing includes: performing the following steps after obtaining genotyping information and removing noisy genetic data in the genotyping information
  • $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$.
  • $m$ represents the number of sites;
  • $P(V_1)$ is the initial probability of the paternal or maternal ancestor haplotype at the first site;
  • $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from the $(i-1)$-th site to its neighboring $i$-th site, obtained by calculating the recombination rate between the two sites;
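  • To illustrate how an initial probability and recombination-based transition probabilities combine into a most likely inheritance path, the following is a simplified two-state Viterbi sketch. It is not the full Lander-Green inheritance-vector model: it tracks a single meiosis (state 0 = grandpaternal haplotype transmitted, state 1 = grandmaternal), and the uniform initial probability, the genotyping error rate `error`, and the toy recombination fractions are assumptions made for the example.

```python
import math
from typing import List, Optional


def viterbi_inheritance(obs: List[Optional[int]], theta: List[float],
                        error: float = 0.01) -> List[int]:
    """Most likely sequence of transmitted-haplotype states along a chromosome.

    obs[i]   : observed origin call at site i (0, 1) or None if missing.
    theta[i] : recombination fraction between sites i and i+1, with 0 < theta < 1.
    """
    states = (0, 1)

    def emit(o: Optional[int], s: int) -> float:
        if o is None:
            return 0.0  # missing data carries no information
        return math.log(1 - error) if o == s else math.log(error)

    score = [math.log(0.5) + emit(obs[0], s) for s in states]
    back = []
    for i in range(1, len(obs)):
        t = theta[i - 1]
        new_score, ptr = [], []
        for s in states:
            cands = [score[p] + math.log(1 - t if p == s else t) for p in states]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + emit(obs[i], s))
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    path = [0 if score[0] >= score[1] else 1]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


obs = [0, 0, 1, 0, None, 1, 1, 1]
theta = [0.001, 0.001, 0.001, 0.001, 0.05, 0.001, 0.001]
path = viterbi_inheritance(obs, theta)
print(path)
```

  In this toy input the isolated call at the third site is absorbed as a genotyping error, while the sustained change after the fifth site is placed in the interval with the largest recombination fraction.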
  • the present invention provides a method for reconstructing the genome of the progeny.
  • haplotype phasing is performed before genotype filling to infer the haplotype of the sample.
  • genotype filling is performed for alleles that are missing in the phased haplotypes obtained after phasing.
  • Genotype filling is performed because of missing data in the genome of the offspring object.
  • a genotype deletion refers to a site with an unknown genotype, that is, an area in a sample that is not covered by sequencing data or a site with missing sequence information data, also known as missing data.
  • the lack of genotype data can be divided into genetic deletion and detection deletion.
  • Hereditary deletion refers to a genotypic deletion caused by a variation of an individual's genetic information (for example, a true deletion of a DNA fragment at a locus).
  • Loss of detection refers to the loss of sequence information due to the limitations of detection technology, errors and other factors.
  • Various genotyping techniques can produce detection-related genotype deletions.
  • genotype deletion occurs due to the efficiency of probe hybridization and capture.
  • genotypic deletions include the above two types of deletions, as well as sites with poor genotyping quality that are removed from the offspring genome sequence information based on noise genetic data removal.
  • the sites with missing data on the genome of the offspring (such as embryos) will, in some embodiments, account for at least 1/2 of the whole genome, or more, such as 4/5 or above.
  • the progeny may have up to about 6/7 locus genotype deletions before genotype filling.
  • the method for reconstructing the genome of a progeny object of the present invention includes: combining family genotype information from related individuals, and performing haplotype phasing and filling of missing genotypes on the genotypes of the progeny after removing the noisy genetic data.
  • the missing genotype can be the genotype of the locus where the offspring is not amplified, the locus marked as missing data by removing the noise genetic data, or both.
  • the present invention provides a method for reconstructing the genome of a progeny object, which is characterized in that it comprises the steps:
  • (a) Provide a data set for the analysis and processing, the data set including: a first data set from a progeny subject, a second data set from the father of the progeny subject, and/or a third data set from the mother of the progeny subject; wherein each data set is the corresponding genotyping information data set obtained by performing genetic testing and typing analysis on the nucleic acid or nucleic acid amplification products of the progeny subject and its parents,
  • the offspring object is preferably an unborn offspring object;
  • step (c) Perform haplotype phasing on the typing data obtained in step (b);
  • step (d) it further includes: adding family and/or population-based typing data for genotype filling, so as to obtain information about the whole genome genotype of the offspring object.
  • the pedigree consists of genetically related relatives other than the parents of the offspring object, such as siblings, paternal grandparents, and/or maternal grandparents.
  • the population typing data may be reference haplotype and haplotype frequency information from, for example, HapMap and 1000Genomes.
  • the denoising and genotype filling of the present invention can be repeated for multiple different progeny chromosomes.
  • the number of repetitions is determined according to the number of progeny chromosomes to obtain the reconstruction of the whole genome information. For example, for diploid offspring, repeat 23 times (for female individuals) or 24 times (for male individuals).
  • pedigree-based genotype filling includes: based on parental high-density polymorphic molecular marker information and the composition of paternal and maternal haplotypes in the offspring constructed by haplotype phasing, using an identity by descent (IBD) strategy to fill in missing genotypes in the offspring.
  • population-based genotype filling includes: analyzing the genome-wide genotype information of the progeny based on the population linkage disequilibrium law and reference haplotype and haplotype frequency information such as HapMap and 1000Genomes.
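  • A minimal sketch of the pedigree-based filling step, under simplifying assumptions: the offspring haplotype inherited from one parent is a list of alleles with `None` at missing sites, that parent's two phased haplotypes are keyed by the hypothetical labels "GF" and "GM", and phasing has already assigned each site to one of them.

```python
from typing import Dict, List, Optional

Allele = Optional[str]


def fill_by_ibd(child_hap: List[Allele],
                parent_haps: Dict[str, List[Allele]],
                origin: List[Optional[str]]) -> List[Allele]:
    """Fill missing offspring alleles from the parental haplotype shared by descent.

    origin[i] names which parental haplotype ('GF' or 'GM') the offspring carries
    at site i, or None when the region could not be phased.
    """
    filled = list(child_hap)
    for i, allele in enumerate(child_hap):
        source = origin[i]
        if allele is None and source is not None:
            filled[i] = parent_haps[source][i]  # stays None if the parent is missing too
    return filled


father_haps = {"GF": ["A", "C", "G", "T", "A"],
               "GM": ["A", "T", "G", "C", "G"]}
origin = ["GF", "GF", "GF", "GM", "GM"]
child_paternal = ["A", None, "G", None, None]
filled = fill_by_ibd(child_paternal, father_haps, origin)
print(filled)
```

  The same call would be repeated for the maternally inherited haplotype; sites that remain `None` are candidates for population-based filling.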
  • the analysis method can be selected from the following groups: Expectation Maximization (EM), Hidden Markov Model (HMM), Markov Chain Monte Carlo (MCMC), Coalescent Theory, or a combination thereof.
  • Genotype filling algorithms based on the law of population linkage disequilibrium include but are not limited to IMPUTE(2), MaCH, Beagle, Minimac.
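  • The tools listed above use hidden Markov models and related methods over large reference panels. As a deliberately crude stand-in that only conveys the idea of borrowing alleles from a reference haplotype, the sketch below copies missing alleles from the reference haplotype with the fewest mismatches at observed sites, breaking ties by the higher population frequency. It does not reproduce the behaviour of IMPUTE(2), MaCH, Beagle, or Minimac, and the panel and frequencies are made-up toy values.

```python
from typing import List, Optional, Tuple

Allele = Optional[str]


def impute_from_reference(hap: List[Allele],
                          reference: List[Tuple[List[str], float]]) -> List[Allele]:
    """Fill missing alleles from the best-matching reference haplotype.

    reference: (haplotype alleles, population frequency) pairs.
    Assumption: a single best-match copy, not a probabilistic LD model.
    """
    def mismatches(ref: List[str]) -> int:
        return sum(1 for a, r in zip(hap, ref) if a is not None and a != r)

    best_hap, _ = min(reference, key=lambda rf: (mismatches(rf[0]), -rf[1]))
    return [r if a is None else a for a, r in zip(hap, best_hap)]


reference_panel = [
    (["A", "C", "G", "T"], 0.5),
    (["A", "T", "G", "C"], 0.3),
    (["G", "C", "A", "T"], 0.2),
]
imputed = impute_from_reference(["A", None, "G", None], reference_panel)
print(imputed)
```

  Real imputation tools model the offspring haplotype as a mosaic of many reference haplotypes and report per-site probabilities, which this sketch does not attempt.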
  • for genotype information that is not successfully filled based on family information, population-based genotype filling is used to complete the offspring genome information.
  • the genotype filling of offspring samples can be performed by reference to the genotypes of the paternal and maternal samples.
  • the genotypes of other family members can be further added for pedigree analysis and genotype filling of the offspring object, such as siblings of the offspring or embryos from the same father and mother (including embryos produced by IVF and their culture media), and/or the paternal and maternal grandparents of the offspring.
  • the haplotypes of embryonic progeny can be combined to complement the genotypes of the progeny that are missing.
  • the inventors found that combining the identification of identity by descent (IBD) regions with the genotype information of related individuals (especially the parental high-density polymorphic locus (SNP) genotype information) to fill in the missing genotypes in the offspring yields allele estimates with higher accuracy, for example, at least 90% accuracy, or even 99% or more.
  • step (d), filling in the missing genotypes of the offspring in the method of the present invention, includes: combining the identification of identity by descent (IBD) regions, that is, determining which parental haplotype the embryo's haplotype in a given region comes from, with the genotype information of the parents' high-density polymorphic loci to fill in the genotype information missing in the offspring;
  • for genotype information that has not been successfully filled based on the family information, the population reference haplotype information and the population-level allelic linkage disequilibrium (LD) law are used to fill in the genome-wide genotype information;
  • LD population-level allelic linkage disequilibrium
  • the population reference haplotype information is HapMap, 1000Genomes, HRC (Haplotype Reference Consortium);
  • the genotype filling algorithm of the population-level allelic linkage disequilibrium (LD) law is, for example, IMPUTE(2), MaCH, Beagle, Minimac algorithm;
  • Expectation Maximization (EM), Hidden Markov Model (HMM), Markov chain Monte Carlo (MCMC), coalescent theory, or a combination thereof is used to implement genotype filling.
  • the reconstruction preferably includes:
  • for population-based genotype filling, a reference population template that is closer to the offspring in genetic background can be selected. For example, when the offspring are Chinese individuals, the Chinese population reference haplotype information of 1000 Genomes Phase 3 can be used.
  • the population-based genotype filling software known in the art, including but not limited to the MACH software package, can be used to reconstruct the whole genome genotype of the progeny.
  • whether to perform population-based genotype filling can be determined based on the following factors: (1) the desired accuracy of genotype filling, (2) the desired genome coverage, and (3) whether the desired target region is covered.
  • the accuracy of genome reconstruction of progeny (such as embryos) based on the family reaches more than 90%, preferably more than 95%, more preferably more than 97%, and most preferably more than 99%.
  • the genome coverage of the offspring reaches more than 60%, for example, more than 70%, more than 80%.
  • the accuracy of the progeny (such as embryo) genome reconstruction reaches 90% or more, preferably 95% or more, more preferably 97% or more, and most preferably 99% or more.
  • the genome coverage of the offspring is further improved.
  • genotype filling includes two steps:
  • first, determine which haplotype in the reference haplotype set is most similar to the haplotype in the sample, based on the genotype information of the non-missing sites of the sample to be filled; then assign the most similar haplotype to the sample, thereby reconstructing the complete genotype of the sample.
  • Genotype filling based on the genetic characteristics of family samples generally compares the haplotypes of the offspring sample to be filled with the haplotypes of the father and mother to find the haplotypes shared between them. The sites on the matched reference template are then copied to the target data set of the offspring, and the genotype of the offspring sample is reconstructed.
  • Population-based genotype filling generally compares the haplotypes of the offspring sample to be filled with the reference population haplotypes to find the haplotypes shared between them. The sites on the matched reference template are then copied to the target data set of the offspring to reconstruct the genotype of the offspring sample.
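The "find the most similar haplotype, then copy its sites" idea described above can be illustrated with a minimal Python sketch. It is not the algorithm used by IMPUTE2, MaCH, Beagle or Minimac (those use probabilistic models); the function names, the `.` missing-data symbol and the toy haplotypes are assumptions made only for illustration.

```python
from typing import List, Optional

MISSING = "."  # assumed placeholder for a site with no genotype call


def best_reference_match(target: str, references: List[str]) -> Optional[str]:
    """Return the reference haplotype that agrees with `target` at the most observed sites."""
    best, best_score = None, -1
    for ref in references:
        if len(ref) != len(target):
            raise ValueError("haplotypes must cover the same sites")
        # Count agreements only at sites that are observed in the target.
        score = sum(1 for t, r in zip(target, ref) if t != MISSING and t == r)
        if score > best_score:
            best, best_score = ref, score
    return best


def fill_from_reference(target: str, references: List[str]) -> str:
    """Copy alleles from the best-matching reference into the missing sites of `target`."""
    ref = best_reference_match(target, references)
    if ref is None:  # empty reference panel: nothing to copy
        return target
    return "".join(r if t == MISSING else t for t, r in zip(target, ref))


filled = fill_from_reference("A.C.T", ["AGCTT", "TGCAA"])
```

Observed alleles in the target are never overwritten; only missing sites are copied from the matched template.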
  • family-based genotype filling of embryo progeny includes the following steps:
  • a chromosome-level haplotype is constructed based on pedigree information (the parents), Mendelian inheritance rules, and the gene linkage and crossover theory of multi-locus linkage analysis. The two haplotypes of the embryo's father (and mother) are distinguished, and the paternal and maternal haplotypes of the embryo are constructed at the same time, which makes clear which haplotype the embryo has inherited from each parent. When some heterozygous sites in the offspring cannot be haplotyped, genotype information from more relatives, such as the embryo's siblings or the embryo's paternal and maternal grandparents, can be added to increase the number of loci that can be haplotyped.
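As a minimal sketch of how Mendelian inheritance rules assign parental origin at a single site, the Python function below phases one offspring genotype from a father/mother/offspring trio. It handles only one site at a time and ignores linkage between sites; the function name and the tuple representation of genotypes are assumptions for illustration, not the method of the patent.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]  # unordered pair of alleles, e.g. ("A", "G")


def phase_child_site(father: Genotype, mother: Genotype,
                     child: Genotype) -> Optional[Tuple[str, str]]:
    """Return (paternal_allele, maternal_allele) for the child at one site.

    Returns None when the parental origin is ambiguous (for example, all three
    individuals are heterozygous) or when the trio is Mendelian-inconsistent.
    """
    a, b = child
    candidates = set()
    # Try both assignments of the child's alleles to the two parents.
    for pat, mat in ((a, b), (b, a)):
        if pat in father and mat in mother:
            candidates.add((pat, mat))
    return candidates.pop() if len(candidates) == 1 else None


phased = phase_child_site(("A", "A"), ("A", "G"), ("A", "G"))
```

Sites that return None are the ones for which, as the text notes, additional relatives such as siblings or grandparents can help.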
  • genotyping information can be obtained from the amplified embryo samples and the parents' gDNA samples after genomic sequence information detection.
  • the small amount of embryonic DNA (or other samples containing small amounts of progeny DNA) in the embryo culture medium cannot be fully amplified due to the heterogeneity of the whole genome amplification.
  • taking the SNP chip as an example, only about 1/5 of the sites on the chip pass the quality control of chip genotyping; after the amplification efficiency and genotyping error quality control of the present invention is added, even more locus genotype information is missing from the embryo genome. Therefore, on the basis of haplotype construction, the identity by descent (IBD) method is combined with the parental high-density polymorphic locus genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
  • IBD identity By Descent
  • Figure 1 shows an example of genotype filling.
  • as shown in Figure 1, 6.1 and 6.2, after the haplotypes are constructed, the haplotypes that the embryo inherited from the father and the mother are determined to be ..A...C.G...T and ..G...T.C...A, respectively.
  • the completed embryo haplotype information is GAACGA..T and CGTTCA..A, which fills in the genotype information of 3 loci missing in the embryo, namely G/C, A/T, and A/A.
  • the offspring haplotype information can be obtained at the level of the entire chromosome.
  • if the offspring to be tested has siblings in the family, as shown in Figure 1, 6.3, a second stage of genotype filling is performed based on the haplotypes of the siblings. When the embryo to be tested and its sibling share the same paternal haplotype but differ in the maternal haplotype, another 3 genotypes missing in the embryo can be filled in, namely A/C, C/T, and C/A, and the completed haplotype information is GAAACCGAC.T and CGTCTTCAA.A.
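The family-based filling step in this example can be sketched in Python as follows. The sketch assumes that the parental haplotype inherited by the embryo over the segment has already been identified (that is, the segment is IBD with one parental haplotype and contains no recombination), and it uses short made-up haplotypes rather than the strings of Figure 1; the function names are hypothetical.

```python
MISSING = "."  # assumed symbol for a site without a genotype call


def fill_from_parent(embryo_hap: str, parent_hap: str) -> str:
    """Fill missing embryo alleles from the parental haplotype it inherited (IBD segment)."""
    if len(embryo_hap) != len(parent_hap):
        raise ValueError("haplotypes must cover the same sites")
    # Observed embryo alleles are kept; only missing sites are copied from the parent.
    return "".join(p if e == MISSING else e for e, p in zip(embryo_hap, parent_hap))


def to_genotypes(paternal: str, maternal: str) -> list:
    """Combine the two filled haplotypes into per-site genotypes such as 'G/C'."""
    return [f"{p}/{m}" for p, m in zip(paternal, maternal)]


paternal = fill_from_parent("G.AC", "GTAC")
maternal = fill_from_parent("C.T.", "CATG")
genotypes = to_genotypes(paternal, maternal)
```

Sites that are missing in the parent as well remain missing, which is the case the population-based filling step is meant to cover.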
  • after the family-based genotype filling, the method further includes population-based genotype filling of the embryo, so as to fill in the genotype information missing from both the parents and the embryo.
  • the filling includes:
  • based on the haplotypes of the progeny constructed according to the above-mentioned method of the present invention, the population linkage disequilibrium law, and reference haplotype and haplotype frequency information such as HapMap and 1000 Genomes, the genotype information of loci on a given chromosome of the embryo that are also missing in the parental genome information is filled in. Specifically, the population reference haplotype information is used to find the population haplotype segment that is most similar to the embryo haplotype, and the other genotype information of this haplotype segment in the population is then used to complete the information missing in the embryo.
  • the two haplotypes in the embryo are GAAACCGAC.T and CGTCTTCAA.A, and the most similar and most frequent haplotypes in the population are GTACAACCGACGT and CGGATCTTCAACA, thus filling in three missing sites of the embryo.
  • the estimation methods that can be used for this filling include, but are not limited to, Expectation Maximization (EM), Hidden Markov Model (HMM), Markov chain Monte Carlo (MCMC), and coalescent theory.
  • the method of the present invention may include:
  • the quality control method includes the noise genetic data removal method of the present invention as described above.
  • the method of the present invention includes the following steps:
  • Embryonic nucleic acid samples can be taken from free DNA in embryo culture medium.
  • the amplification method adopts a single-cell amplification strategy, and the specific method is not limited, including but not limited to primer extension preamplification PCR (PEP-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), Multiple Displacement Amplification (MDA), Multiple Annealing and Looping Based Amplification Cycles (MALBAC), blunt-end or sticky-end ligation library construction, and other methods.
  • PEP-PCR primer extension PCR
  • DOP-PCR degenerate oligonucleotide primer PCR
  • MDA Multiple Displacement Amplification
  • MALBAC Multiple Annealing and Looping Based Amplification Cycles
  • Detection methods can use nucleic acid chips, second-generation sequencing and other platforms, and genetic analysis uses genotyping methods to obtain genotype information of parents and embryos.
  • Quality control of the whole-genome amplification efficiency of trace nucleic acid: uneven whole-genome amplification efficiency is characteristic of amplification techniques for trace nucleic acid (for example, trace nucleic acid from a single cell), and regions with low amplification efficiency degrade the genotyping quality of the base sites in those regions.
  • the inventors used multi-sample whole-genome amplification products to construct a reference sequencing data set to determine the distribution pattern of the whole-genome amplification efficiency of trace nucleic acids.
  • specifically, low-depth sequencing (for example, about 0.06X) is performed on the amplification products of multiple reference samples, the sequencing data are aligned to the reference genome to obtain BAM files, and the BAM files of the multiple reference samples are merged into one large BAM library.
  • N represents the number of sequencing reads
  • L represents the length of reads
  • DP i represents the absolute depth of the i-th position (total depth), that is, the number of reads at that position.
  • the average genome sequencing depth of the reference sample BAM library is 27.
  • the inventors' research found that, compared with sites with a trace nucleic acid genome amplification efficiency ≥1X (DP i ≥ 27, that is, absolute locus depth ≥ genome average depth), sites with an amplification efficiency <1X (DP i < 27) have a Mendelian inheritance error rate nearly 6 times higher (Figure 2).
  • the present inventors used second-generation sequencing data of multiple trace nucleic acid amplification products (different amplification methods require corresponding amplification reference samples) to draw a distribution map of whole-genome amplification efficiency and, on this basis, identify the genotype information of loci with low amplification efficiency (<1X) and mark it as missing data.
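The amplification-efficiency filter described above can be sketched in Python as below. The per-site depths are assumed to have been extracted already from the merged reference BAM library; the site identifiers, the dictionary layout and the `./.` missing-genotype code are illustrative assumptions, and sites without a recorded depth are treated here as depth 0.

```python
from typing import Dict

NO_CALL = "./."  # assumed code for a genotype marked as missing


def mask_low_efficiency(genotypes: Dict[str, str],
                        ref_depth: Dict[str, int],
                        mean_depth: float) -> Dict[str, str]:
    """Mark as missing every site whose reference-library depth is below the genome mean.

    A site passes when DP_i >= mean_depth (amplification efficiency >= 1X).
    """
    return {
        site: (gt if ref_depth.get(site, 0) >= mean_depth else NO_CALL)
        for site, gt in genotypes.items()
    }


embryo = {"rs1": "A/G", "rs2": "C/C", "rs3": "T/T"}   # hypothetical site IDs
depths = {"rs1": 30, "rs2": 12}                        # rs3 has no coverage
masked = mask_low_efficiency(embryo, depths, mean_depth=27)
```

With the mean depth of 27 quoted in the text, only the first site is retained in this toy example.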
  • 3) Identify the wrong genotype sites of embryos and mark them as missing data: after whole-genome amplification of trace embryonic DNA, in addition to the sites that are not amplified because of low amplification efficiency or that have poor genotyping quality, some sites are affected by amplification bias, in which one of the two alleles is preferentially amplified or the other allele fails to amplify at all; this causes allele dropout (ADO) and thereby affects the genotyping of the site.
  • ADO allele dropout
  • first, the paternal and maternal haplotypes of the offspring are constructed as in step 6.1; Mendelian inheritance and chromosomal crossover interference are then used to identify the sites where ADO and other genotype errors occur in the embryo, and these sites are marked as missing data.
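A minimal Python sketch of the Mendelian-inheritance part of this check is shown below. It flags a site only when the embryo genotype cannot be produced by the parental genotypes, so it catches only some ADO events (for example, a heterozygous embryo called homozygous when the parents are opposite homozygotes); the function names and genotype representation are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[str, str]
Trio = Tuple[Genotype, Genotype, Genotype]  # (father, mother, embryo)


def is_mendelian_consistent(father: Genotype, mother: Genotype, child: Genotype) -> bool:
    """True if the child can receive one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mask_mendelian_errors(trios: List[Trio]) -> List[Optional[Genotype]]:
    """Return the embryo genotypes, with Mendelian-inconsistent sites set to None (missing)."""
    return [child if is_mendelian_consistent(f, m, child) else None
            for f, m, child in trios]


sites = [
    (("A", "A"), ("G", "G"), ("A", "G")),  # consistent
    (("A", "A"), ("G", "G"), ("A", "A")),  # maternal allele lost: ADO-like error
]
cleaned = mask_mendelian_errors(sites)
```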
  • Specific methods or frameworks include, but are not limited to, the likelihood method strategy (seeking the haplotype composition with the maximum probability), the genetic rule strategy (the haplotype composition with the minimum recombination number), and the Expectation Maximisation (EM) algorithm.
  • the likelihood method includes but is not limited to Lander-Green algorithm and Viterbi dynamic programming algorithm, Elston-Stewart algorithm and Bayesian network algorithm.
  • the preferred schemes are the Lander-Green algorithm and the Viterbi dynamic programming algorithm; the genetic rule method includes the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy, and software implementations include but are not limited to ZAPLO and HAPLORE. If information on only one offspring is available, such as only one embryo, some heterozygous loci may not be haplotyped.
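To illustrate the Viterbi dynamic programming idea mentioned above, the Python sketch below decodes which of one parent's two phased haplotypes an embryo inherited at each site. It is a simplified two-state model for a single parent, not the full Lander-Green inheritance-vector model; the genotyping error rate, the use of the Haldane map function to turn genetic distance into a recombination fraction, and all names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction from genetic distance (Haldane map function), clamped above 0."""
    theta = 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))
    return max(theta, 1e-9)


def viterbi_inheritance(hap0: str, hap1: str, observed: str,
                        thetas: Sequence[float], error: float = 0.01) -> List[int]:
    """Most probable path of inherited parental haplotype (0 or 1) along the sites.

    `observed` holds the allele the embryo received from this parent ('.' = missing);
    `thetas[i]` is the recombination fraction between site i and site i + 1.
    """
    haps = (hap0, hap1)

    def emit(state: int, i: int) -> float:
        if observed[i] == ".":
            return 0.0  # log(1): a missing call carries no information
        return math.log(1 - error) if observed[i] == haps[state][i] else math.log(error)

    score = [math.log(0.5) + emit(s, 0) for s in (0, 1)]
    back = []
    for i in range(1, len(observed)):
        stay, switch = math.log(1 - thetas[i - 1]), math.log(thetas[i - 1])
        new, ptr = [], []
        for s in (0, 1):
            cand = [score[p] + (stay if p == s else switch) for p in (0, 1)]
            best_prev = 0 if cand[0] >= cand[1] else 1
            new.append(cand[best_prev] + emit(s, i))
            ptr.append(best_prev)
        score = new
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


path = viterbi_inheritance("AAAAAA", "GGGGGG", "AAAGGG", [0.001] * 5)
```

Because a recombination is penalised more heavily than a single genotyping error, an isolated discordant allele is absorbed as an error rather than decoded as two closely spaced crossovers.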
  • step 6.1 Completion of embryo genome information.
  • on the basis of haplotype construction, the identity by descent (IBD) method is combined with the parental high-density polymorphic locus genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
  • IBD identity By Descent
  • when the haplotype is reconstructed based on the information of other members of the family (for example, the siblings of the offspring to be tested), the haplotype information of those siblings is also used to further fill in the missing genotype information of the embryo according to Mendelian inheritance rules. Generally, the more other family members (for example, siblings of the offspring to be tested) are available, the more genotypes can be filled and the higher the accuracy.
  • for genotype information missing from both the parents and the embryo: using the haplotype constructed in step 6.1, and based on the population linkage disequilibrium law and reference haplotype and haplotype frequency information such as HapMap and 1000 Genomes, fill in the genotype information of loci on a chromosome of the embryo that are also missing in the parental genome information.
  • the parents of the embryo undergo higher-depth whole-genome sequencing to obtain accurate whole-genome genotype information, for example at a sequencing depth ≥20x.
  • the embryos can instead undergo low-depth whole-genome sequencing to obtain relatively low-density genotype information; for example, the sequencing depth can be as low as 2x, or even lower.
  • the present invention also provides computer products, systems and equipment for the implementation of removing noise genetic data, phasing haplotypes and/or reconstructing progeny genomes
  • the present invention provides a device for reconstructing progeny genomes (especially haplotypes), the device comprising:
  • a non-transitory computer-readable storage medium carrying instructions for executing the method for reconstructing offspring genome information of the present invention including:
  • the device includes the following modules:
  • Sequence information data acquisition module used to acquire the original sequence information data of the offspring and/or related individuals
  • Genotype analysis module used to analyze the genotype of the original sequence information data of module (1);
  • Quality control filtering module used to perform quality control and filtering on the genotype information obtained by module (2);
  • Haplotype phasing module used to perform haplotype phasing on the genotype after quality control and filtering of module (3);
  • Genotype filling module used to further reconstruct the genotype of the progeny from the phased haplotype of module (4);
  • Report output module used to process and integrate the data obtained by modules (1)-(5) to generate a report.
  • the present invention provides a device comprising:
  • At least one processor and at least one memory the at least one memory has a code stored thereon, and when executed by the at least one processor, the code causes the apparatus to be able to execute the method of the present invention.
  • the code when executed by the at least one processor, the code causes the apparatus to execute at least:
  • sequence information data for example, original sequence information data of offspring and/or related individuals
  • the present invention also provides a computer-readable storage medium having code stored thereon for use by a device, and when executed by a processor, the code causes the device to execute the progeny genome information reconstruction method of the present invention.
  • when executed by the at least one processor, the code causes the apparatus to execute at least:
  • sequence information data for example, original sequence information data of offspring and/or related individuals
  • the present invention provides a system for reconstructing progeny genomes (especially haplotypes).
  • the system includes a device (device or module) configured to implement the method of the present invention, such as one configured as follows:
  • the input includes genomic sequence information of offspring and related individuals
  • the haplotype of the progeny is determined to reconstruct the genomic information of the progeny.
  • the device is the aforementioned device of the present invention for reconstructing progeny genomes (especially haplotypes).
  • it may further include:
  • -Amplification device used to amplify nucleic acid samples of progeny and/or related individuals, preferably whole genome amplification
  • sequence information detection device used for sequence information detection of amplified products, including but not limited to polymorphic loci (such as SNP) detection and sequencing detection.
  • the present invention provides a device for analyzing and processing progeny genome reconstruction, including:
  • Amplification unit used for whole genome amplification of the DNA sample of the test sample and the parent family of the offspring
  • the detection and analysis unit is used for genetic detection and analysis of the amplified products obtained by the amplification unit;
  • the data processing unit is used for quality control and filtering of the detection and analysis data of the amplified products obtained by the amplification unit, and remove the genotypes of the sites that are not amplified or genotyping errors;
  • Information reconstruction unit for the whole genome genotype of the offspring wherein the information reconstruction unit for the whole genome genotype of the offspring is used to perform haplotype phasing and genotype filling, and output the obtained whole genome of the offspring The results of genotype information reconstruction.
  • the system of the present invention will include tools for querying genomic sequence information, and a programmed storage or medium for the computer to analyze the obtained data.
  • Sequence information query data can be a stored data set or can be provided "on the fly".
  • the term "data set" covers both of these types of data source.
  • the tools used for querying genome sequence information are not particularly limited.
  • a high-density SNP chip is used.
  • a high-throughput sequencing device is used to obtain high-depth sequencing data of related individuals of the offspring.
  • the present invention can be executed by a computer. Therefore, the present invention also provides a computer programmed to perform the above method.
  • a computer typically includes: a CPU connected to a computer communication interface, a system memory (RAM), a non-transitory memory (ROM), and one or more other storage devices such as a hard disk, a floppy disk, and a CD-ROM drive.
  • the computer may also include an output device, such as a printer, a CRT monitor, or an LCD display, and an input device, such as a keyboard, mouse, pen, touch screen, or voice activation system.
  • the input device can receive data, for example, directly from a sequence information query tool through an interface.
  • the method, computer product (especially the above-mentioned device of the present invention), system and equipment according to the present invention can be used for disease detection or disease susceptibility detection of pre-implantation embryos and/or fetuses in early pregnancy, including but not limited to: Aneuploidy detection, single gene genetic disease detection, chromosome structure rearrangement detection, polygenic disease genetic risk assessment.
  • the use includes: diagnosing common diseases or cancer susceptibility, including: for example, comparing the progeny haplotype reconstructed according to the method of the present invention with known disease-related haplotypes.
  • the relationship between such haplotypes and diseases is being established in the art.
  • "International HapMap consortium” maps and locates the genome-wide variation of SNP haplotypes in the human population, which is conducive to disease-related research (international, HapMap consortium, 2005). Therefore, combining the genome reconstruction method of the present invention with these haplotype analyses also forms an aspect of the present invention.
  • the present invention finds for the first time that, on the basis of whole genome amplification technology combined with chip or second- and third-generation sequencing technologies, the high-density gene polymorphism information of the embryo's parents and other family members, together with statistical genetics and computational biology algorithms, can be used to fill in the genotypes of unamplified sites in the embryo's genome and of sites with ADO and other genotyping errors, so as to obtain the embryo's whole genome information.
  • the present invention finds for the first time that quality control and filtering of embryo DNA analysis data can filter out information on sites with poor genotyping quality, especially sites with poor single-cell whole-genome amplification efficiency, thereby improving the accuracy of genome reconstruction.
  • the present invention finds for the first time that the integrated application of gene filling methods based on family and population can obtain the locus information of the entire genome of the embryo to the greatest extent.
  • the collected samples were 1 ml of the father's peripheral blood and 1 ml of the mother's peripheral blood from one family, collected in EDTA anticoagulation tubes; the father's sperm was also collected, and intracytoplasmic sperm injection (ICSI) was used to fertilize the mother's eggs in vitro using GM medium (Quinn's Advantage Protein Plus Cleavage Medium) (manufacturer: SAGE, product number: ART-1526) at 37°C, 5
  • GM medium Quinn's Advantage Protein Plus Cleavage Medium
  • SAGE product number: ART-1526
  • the fertilized egg was cultured in a 5% CO2 and 5% O2 incubator and grew into a blastocyst on the fifth day, and about 20 µl of the blastocyst culture fluid was aspirated.
  • the culture fluid of 4 blastocysts from the same parents was prepared.
  • the father’s peripheral blood samples and the mother’s peripheral blood samples were taken to extract the whole genome DNA by conventional whole blood genome extraction steps.
  • the kit used in this step is the commercially available DNeasy Blood&Tissue Kit (50) (manufacturer QIAGEN, article number 69504), and the extraction of whole genome DNA is carried out according to the manufacturer's instructions.
  • whole-genome amplification of the blastocyst culture medium was performed following the instructions of the MALBAC single-cell whole-genome amplification kit (Cat. No. KT110700150) of Xukang Medical Technology (Suzhou) Co., Ltd.
  • Qubit dsDNA HS Assay Kit (Invitrogen, Q32584) was used to quantify the whole genome amplification products of each blastocyst culture medium. The quantitative results show that the total amount of nucleic acid in each sample is about 500-1000ng.
  • CBC-PMRA Capital Biotechnology-Precision Medicine Research Array
  • the product of step 2) above was run on the 900K SNP chip according to the manufacturer's instructions to obtain the original data for genotyping.
  • the Axiom TM Analysis Suite software of Thermo Fisher Scientific was selected as the platform for analyzing the original data obtained from the SNP chip 900K in the above step 3).
  • the embryonic SNP genotype information of each blastocyst culture medium is basically 1/4 of the parental SNP genotype information.
  • the kits are commercially available, and the library is constructed according to the instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd., trade name: NGS Fast DNA Library Prep Set for Illumina, catalog number: CW2585M); Illumina's NextSeq550 sequencing platform is then used to perform whole-genome sequencing, with an average sequencing depth of 0.06X per sample;
  • where N is the number of sequencing reads,
  • L is the read length, and
  • 3 × 10^9 is the size of the human genome.
  • if the absolute depth of a SNP locus, that is, the number of reads covering the locus, is greater than or equal to the average sequencing depth of the genome, the locus has passed the amplification efficiency quality control; the genotype of any locus that does not meet this quality control is marked as missing data.
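The equation to which these symbol definitions belong is not reproduced in this text. Assuming the usual definition of average sequencing depth as total sequenced bases divided by genome size, the quantities above would combine as follows, where $\overline{DP}$ is a symbol introduced here for the genome-wide average depth and $DP_i$ is the absolute depth at site $i$:

```latex
\overline{DP} = \frac{N \times L}{3 \times 10^{9}},
\qquad
\text{site } i \text{ passes quality control if } DP_i \ge \overline{DP}.
```

For the merged reference BAM library described earlier, $\overline{DP}$ is reported as 27, so a site would pass when at least 27 reads cover it.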
  • the mutual derivation of embryo haplotypes refers to using the genotype phasing method of step 8) together with the genotype data of multiple embryos to deduce the most probable haplotype composition of each embryo. Then, based on the chromosomal crossover interference strategy, a site at which two crossover recombinations occur within a genetic distance of 1 cM is identified as a wrong genotype.
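The double-crossover rule can be sketched in Python as follows. The input is assumed to be a per-site path of inherited parental haplotype labels (0 or 1) and the genetic map position of each site in cM. A run of sites is flagged when it is flanked by the opposite state on both sides and the two flanking sites lie within 1 cM of each other, which is a conservative reading of "two crossovers within 1 cM"; the names and data layout are illustrative assumptions.

```python
from typing import List, Sequence


def flag_double_crossovers(states: Sequence[int], cm_positions: Sequence[float],
                           max_cm: float = 1.0) -> List[int]:
    """Return indices of sites implying two crossovers within `max_cm` centimorgans."""
    # Split the path into runs of identical state, stored as (start, end) inclusive.
    runs = []
    start = 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            runs.append((start, i - 1))
            start = i
    flagged: List[int] = []
    # Only interior runs are bounded by a crossover on both sides.
    for k in range(1, len(runs) - 1):
        s, e = runs[k]
        span = cm_positions[e + 1] - cm_positions[s - 1]
        if span <= max_cm:
            flagged.extend(range(s, e + 1))
    return flagged


states = [0, 0, 1, 0, 0, 1, 1]
positions = [0.0, 0.2, 0.4, 0.6, 5.0, 9.0, 9.5]
suspect = flag_double_crossovers(states, positions)
```

The flagged sites would then be marked as missing data, in the same way as the other quality-control failures.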
  • $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$.
  • $m$ represents the number of sites;
  • $P(V_1)$ is the initial probability of the paternal or maternal ancestor haplotype at the first site;
  • $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from the $(i-1)$-th site to its neighboring $i$-th site, which is obtained by calculating the recombination rate between the two sites.
  • the present invention estimates the recombination rate between sites using the genetic map of the 1000 Genomes Project Phase 3; P(G i
  • IBD identity by descent
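The likelihood expression that these terms belong to is not reproduced in this text. For orientation only, the standard hidden Markov (Lander-Green style) pedigree likelihood that uses an initial probability, site-to-site transition probabilities and per-site genotype probabilities has the form below; treating the truncated term as the emission probability $P(G_i \mid V_i)$ is an assumption, as is the exact form used in the patent:

```latex
P(G_1, \ldots, G_m)
  = \sum_{V_1, \ldots, V_m}
    P(V_1) \prod_{i=2}^{m} P(V_i \mid V_{i-1}) \prod_{i=1}^{m} P(G_i \mid V_i)
```

Here $G_i$ denotes the observed genotypes at site $i$ and $V_i$ the inheritance vector at that site; the Viterbi algorithm replaces the sum with a maximum to obtain the most probable sequence of inheritance vectors.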
  • Table 3 shows the polymorphic site information obtained after embryo genome reconstruction.
  • the results of this experiment were compared with paired biopsy samples.
  • the biopsy sample is the blastocyst trophoblast cells corresponding to the blastocyst culture medium.
  • the specific experimental steps are as follows: (1) transfer the blastocyst to be biopsied to an in vitro operation culture medium without calcium and magnesium ions (such as G-PGD containing 5% HSA); (2) under an inverted microscope (200X), fix the embryo with a holding needle; (3) cut or perforate the zona pellucida to make an opening with a diameter of 35-40 μm; (4) use a needle with an inner diameter of 35-40 μm to aspirate a nucleated cell; (5) remove the embryo from the operating fluid, wash it, and culture it in blastocyst culture fluid, labeling the patient's name and embryo number; (6) the DNA extraction, amplification, quantification, and genotype analysis procedures are the same as for the blastocyst culture medium.
  • the allele reconstruction accuracy of this example can reach about 99.2% (Table 4).
  • the Chinese population reference haplotype information in 1000 Genomes Phase 3 is further used: based on the haplotype information that has been phased using pedigree information, a Hidden Markov Model (HMM) is applied with the MACH software package to predict the genome-wide genotypes of embryo 2.
  • HMM Hidden Markov Model
  • the DNA samples from the above step 2) were fragmented by sonication to a fragment size distribution of 200-800 bp, and the Illumina ligation library method was then used to construct the second-generation sequencing library.
  • the kits are commercially available, and library construction proceeds according to the instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd., trade name: NGS Fast DNA Library Prep Set for Illumina, Item No.: CW2585M).
  • next-generation sequencing library of step 3 use the NovaSeq 6000 sequencing platform of Illumina to perform whole-genome sequencing.
  • the average sequencing depth of the whole genome is 20X
  • the average sequencing depth of the whole genome is 2X.
  • GATK Genome Analysis Toolkit
  • VQSR Variant Quality Score Recalibration
  • the quality control filtering criteria for a site are: (1) the VQSR result is "PASS"; (2) the sequencing depth at the site in the parents is DP>20 and the genotype information is not "./."; and (3) the sequencing depth at the site in the embryo is DP>5 and the genotype information is not "./.".
  • the "./.” is a site where genotyping cannot be performed.
  • the loci for which the parents have genotype information and which can be used for linkage analysis number 1,608,593.

Abstract

The present invention relates to a method for obtaining genetic data from a nucleic acid sample of an offspring and cleaning the noisy genetic data therein by quality control and filtering, a method for phasing the haplotype of the offspring from the quality-controlled and filtered offspring genetic data, a method for reconstructing an offspring genome, and a device or system for implementing the method. The present invention further relates to the use of the method, and of the device or system for implementing the method, for polygenic disease genetic risk scoring, aneuploidy detection, single-gene genetic disorder detection, and structural chromosomal rearrangement detection in a preimplantation embryo and/or a first-trimester fetus.
PCT/CN2020/121432 2019-10-18 2020-10-16 Method and system for cleaning noisy genetic data, haplotype phasing, and reconstructing the offspring genome, and use thereof WO2021073604A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080005425.5A CN112840404A (zh) 2019-10-18 2020-10-16 Method and system for removing noisy genetic data, haplotype phasing, and reconstructing the offspring genome, and use thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910995291 2019-10-18
CN201910995291.5 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073604A1 true WO2021073604A1 (fr) 2021-04-22

Family

ID=75537714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121432 WO2021073604A1 (fr) 2019-10-18 2020-10-16 Method and system for cleaning noisy genetic data, haplotype phasing, and reconstructing the offspring genome, and use thereof

Country Status (2)

Country Link
CN (1) CN112840404A (fr)
WO (1) WO2021073604A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114606302A (zh) * 2022-04-08 2022-06-10 复旦大学附属中山医院 Method for extracting oral mucosa nucleic acid for whole-genome high-throughput sequencing

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023225951A1 (fr) * 2022-05-26 2023-11-30 深圳华大生命科学研究院 Method for detecting fetal genotype based on haplotype
CN117230175B (zh) * 2023-06-21 2024-05-28 广州序源医学科技有限公司 Preimplantation genetic testing method based on third-generation sequencing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335625A (en) * 2015-11-04 2016-02-17 和卓生物科技(上海)有限公司 Preimplantation genetic testing device for embryos
CN107723364A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 Screening method for colorectal cancer susceptibility genes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008115497A2 (en) * 2007-03-16 2008-09-25 Gene Security Network System and method for cleaning noisy genetic data and determining chromosome copy number

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
"Current Protocols in Human Genetics", 1 May 2011, JOHN WILEY & SONS, INC., Hoboken, NJ, USA, ISBN: 978-0-471-14290-4, ISSN: 1934-8266, article STEPHEN TURNER, ARMSTRONG LOREN L., BRADFORD YUKI, CARLSON CHRISTOPHER S., CRAWFORD DANA C., CRENSHAW ANDREW T., DE ANDRADE MARIZA: "Quality Control Procedures for Genome-Wide Association Studies", pages: 1.19.1 - 1.19.18, XP055241966, DOI: 10.1002/0471142905.hg0119s68 *
ANDERSON CARL A, PETTERSSON FREDRIK H, CLARKE GERALDINE M, CARDON LON R, MORRIS ANDREW P, ZONDERVAN KRINA T: "Data quality control in genetic case-control association studies", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, ENGLAND, 1 September 2010 (2010-09-01), England, pages 1564 - 1573, XP055801113, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025522/pdf/ukmss-33586.pdf> [retrieved on 20210504], DOI: 10.1038/nprot.2010.116 *
ANDRIES T. MAREES, HILDE DE KLUIVER, SVEN STRINGER, FLORENCE VORSPAN, EMMANUEL CURIS, CYNTHIA MARIE-CLAIRE, ESKE M. DERKS: "A tutorial on conducting genome-wide association studies: Quality control and statistical analysis", INTERNATIONAL JOURNAL OF METHODS IN PSYCHIATRIC RESEARCH, vol. 27, no. 2, 1 June 2018 (2018-06-01), pages e1608, XP055677508, ISSN: 1049-8931, DOI: 10.1002/mpr.1608 *
CATHY C. LAURIE, KIMBERLY F. DOHENY, DANIEL B. MIREL, ELIZABETH W. PUGH, LAURA J. BIERUT, TUSHAR BHANGALE, FREDERICK BOEHM, NEIL E: "Quality control and quality assurance in genotypic data for genome-wide association studies", GENETIC EPIDEMIOLOGY, LISS, NEW YORK, NY,, US, vol. 34, no. 6, 1 September 2010 (2010-09-01), US, pages 591 - 602, XP055470592, ISSN: 0741-0395, DOI: 10.1002/gepi.20516 *
ELEONORA PORCU, SERENA SANNA, CHRISTIAN FUCHSBERGER, LARS G FRITSCHE: "Genotype Imputation in Genome-Wide Association Studies.", CURRENT PROTOCOLS IN HUMAN GENETICS, no. supplement 78, 31 July 2013 (2013-07-31), XP009527334, DOI: 10.1002/0471142905.hg0125s78 *
HUANG ZHICONG, LIN HUANG, FELLAY JACQUES, KUTALIK ZOLTÁN, HUBAUX JEAN-PIERRE: "SQC: secure quality control for meta-analysis of genome-wide association studies", BIOINFORMATICS, OXFORD UNIVERSITY PRESS , SURREY, GB, vol. 33, no. 15, 1 August 2017 (2017-08-01), GB, pages 2273 - 2280, XP055801107, ISSN: 1367-4803, DOI: 10.1093/bioinformatics/btx193 *
VERMA SHEFALI S., DE ANDRADE MARIZA, TROMP GERARD, KUIVANIEMI HELENA, PUGH ELIZABETH, NAMJOU-KHALES BAHRAM, MUKHERJEE SHUBHABRATA,: "Imputation and quality control steps for combining multiple genome-wide datasets", FRONTIERS IN GENETICS, vol. 5, 11 December 2014 (2014-12-11), XP055801112, DOI: 10.3389/fgene.2014.00370 *
WINKLER THOMAS W, DAY FELIX R, CROTEAU-CHONKA DAMIEN C, WOOD ANDREW R, LOCKE ADAM E, MÄGI REEDIK, FERREIRA TERESA, FALL TOVE, GRAF: "Quality control and conduct of genome-wide association meta-analyses", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, GB, vol. 9, no. 5, 1 May 2014 (2014-05-01), GB, pages 1192 - 1212, XP055801109, ISSN: 1754-2189, DOI: 10.1038/nprot.2014.071 *

Also Published As

Publication number Publication date
CN112840404A (zh) 2021-05-25

Similar Documents

Publication Publication Date Title
US20200362415A1 (en) System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals
US10266893B2 (en) System and method for cleaning noisy genetic data and determining chromosome copy number
US20200190591A1 (en) System and method for cleaning noisy genetic data and determining chromosome copy number
US9639657B2 (en) Methods for allele calling and ploidy calling
EP2437191B1 (en) Method and system for detecting chromosomal abnormalities
WO2021073604A1 (en) Method and system for cleaning noisy genetic data, haplotype phasing, and reconstructing the offspring genome, and use thereof
US20140206552A1 (en) Methods for preimplantation genetic diagnosis by sequencing
US20110092763A1 (en) Methods for Embryo Characterization and Comparison
WO2013052557A2 (en) Methods for preimplantation genetic diagnosis by sequencing
Chen et al. DNA methylome reveals cellular origin of cell-free DNA in spent medium of human preimplantation embryos
Aston et al. A review of genome-wide approaches to study the genetic basis for spermatogenic defects
JP7362789B2 (en) System, computer program, and method for determining genetic relationships between a sperm donor, an oocyte donor, and their respective conceptus
WO2023246949A1 (en) Non-invasive method for prenatal parentage determination using microhaplotypes
US20160371432A1 (en) Methods for allele calling and ploidy calling
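JP7446343B2 (en) System, computer program, and method for determining genome ploidy
JP2022537442A (en) System, computer program product, and method using single-nucleotide variant density to verify copy number variations in human embryos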
US20240185957A1 (en) Methods for allele calling and ploidy calling
CA3143723C (en) Systems and methods for determining pattern of inheritance in embryos
Sanchez-Mazas et al. Genetic variability and epigenetic alterations in Down syndrome

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877266

Country of ref document: EP

Kind code of ref document: A1