WO2021073604A1 - Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof - Google Patents

Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof

Info

Publication number
WO2021073604A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
haplotype
offspring
genotype
information
Prior art date
Application number
PCT/CN2020/121432
Other languages
French (fr)
Chinese (zh)
Inventor
邹央云
陆思嘉
胡春旭
Original Assignee
苏州亿康医学检验有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州亿康医学检验有限公司
Priority to CN202080005425.5A (patent CN112840404A)
Publication of WO2021073604A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The present invention generally relates to the field of biomedical diagnosis and detection. More specifically, it relates to obtaining, manipulating and using genetic data to phase the haplotypes of progeny; to methods for reconstructing progeny genomes; and to systems and devices for implementing those methods. In particular, it relates to a method, system and computer device that use trace nucleic acid samples from progeny to phase the progeny haplotypes and reconstruct the progeny genome.
  • The phased progeny haplotypes and the reconstructed progeny genome can be used to identify genetic variation that may lead to a variety of phenotypic outcomes, with particular application to aneuploidy and disease-related genes.
  • Assisted reproductive technology (ART) has made great progress in overcoming human infertility. At present, about 3-4% of births worldwide result from assisted reproduction. Although ART has seen some surprising theoretical and technological advances, actually delivering on the concept of "healthy babies" still faces unique challenges.
  • PGT Preimplantation Genetic Test
  • PGT performs genetic analysis on the embryos of patients with high genetic risk at the stage between in vitro fertilization and embryo transfer. It aims to select embryos with normal genetic material for implantation into the mother's uterine cavity, so as to obtain healthy offspring.
  • PGT can be divided into aneuploidy test (PGT for Aneuploidies, PGT-A), single-gene genetic disease test (PGT for Monogenic gene defects, PGT-M) and chromosomal structure rearrangement test (PGT for chromosomal Structural Rearrangements, PGT-SR).
  • Embryo culture fluid contains embryo-derived cell-free DNA (cfDNA) fragments, which makes non-invasive preimplantation genetic testing possible.
  • cfDNA: embryo-derived cell-free DNA
  • Polygenic or chronic diseases, such as cardiovascular diseases, diabetes, obesity and tumors, are the result of multiple genes participating in the disease process.
  • many chronic diseases have a high heritability rate. Therefore, the genetic basis plays a more important role in determining the risk of an individual.
  • The technical difficulty is that constructing a risk value for polygenic diseases requires genome-wide genotype information for the embryo and/or fetus, whereas the amount of embryo and/or fetal DNA that can be obtained in a non-invasive or minimally invasive way is small. In particular, the embryonic cell-free DNA present in embryo culture medium consists of small fragments, amplifies unevenly under whole-genome amplification, and has a high genotype error rate, which makes it impossible to produce a high-quality, highly contiguous whole-genome sequence of the embryo and/or fetus.
  • The trace nucleic acid samples of the progeny (embryo culture fluid, blastocyst culture fluid, blastocoel fluid, fetal cell-free DNA (cfDNA) in maternal plasma or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, or fetal cells in maternal blood or other types of maternal body fluids) are amplified across the whole genome. Sequence information is then acquired with technologies such as nucleic acid chips or next-generation sequencing, and statistical genetics and computational biology algorithms are applied to phase the offspring haplotypes and reconstruct the offspring genome. Both the haplotype phasing and the genome reconstruction can reach very high accuracy.
  • The present invention performs amplification, data analysis, quality control and filtering on the trace progeny nucleic acid obtained, thereby eliminating noisy genetic data (for example, genotype information at loci with poor genotyping quality, such as allele dropout (ADO)). It then obtains the haplotype phasing of the offspring based on haplotype phasing within the pedigree. Finally, it uses identity by descent (IBD) within the pedigree and the linkage disequilibrium of the population to fill in the missing genotypes of the offspring (for example, sites that were not amplified and sites with genotype errors such as ADO), thereby reconstructing the offspring genome with high fidelity and a very high accuracy rate.
  • genetic data obtained from other related individuals such as other embryos, siblings, grandparents, or other relatives related to the offspring can also be used to further increase the accuracy of the reconstructed offspring genome.
  • the present invention relates to a method for removing noisy genetic data from offspring, the method comprising the steps of:
  • genomic sequence information from the offspring, where the genomic sequence information is obtained from a trace nucleic acid sample of the offspring containing about 0.1 pg-40 ng DNA, for example, 1-40 ng DNA, 20-40 ng DNA, 0.1-40 pg DNA, 1-40 pg DNA or 10-40 pg DNA;
  • the trace nucleic acid sample of the offspring is embryo culture fluid, blastocyst culture fluid, blastocoel fluid, fetal cell-free DNA in maternal plasma or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryo cells, or fetal cells in maternal blood or other types of maternal body fluids;
  • step (b) Quality control and filtering of the genomic sequence information of the progeny from step (a), wherein the quality control is selected from the group consisting of: quality control of the whole-genome amplification efficiency of trace nucleic acid; quality control that identifies Mendelian inheritance errors; quality control that identifies violations of the theory of chromosomal interference suppression; quality control deduced from multiple progeny haplotypes; and combinations thereof.
  • step (a) of the method for removing noisy genetic data from progeny of the present invention provides genome sequence information from the progeny that covers no more than about 30% of its genome, for example, a genome coverage of about 30%, 25%, 20% or 15%.
  • the genome sequence information from the progeny in step (a) is obtained by performing whole-genome amplification, selected from the following group, on the progeny trace nucleic acid sample.
  • the nucleic acid chip, amplification and/or sequencing technology is a single nucleotide polymorphism (SNP) microarray chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing, or a combination thereof;
  • the single nucleotide polymorphism microarray chip is a SNP genotyping chip;
  • the second-generation sequencing includes whole-genome sequencing, whole-exome sequencing and sequencing of targeted genomic regions;
  • the whole-genome sequencing is, for example, low-depth whole-genome sequencing, where the sequencing depth can be as low as 2X, or even 1X or less.
  • the quality control of the whole-genome amplification efficiency of trace nucleic acid in step (b) of the method for removing noisy genetic data from progeny is implemented as follows: reference sequencing data from the whole-genome amplification products of multiple trace nucleic acid samples are used to identify sites with low amplification efficiency, and the genotypes at those sites are marked as missing data. For example, the whole-genome amplification products of multiple trace nucleic acid samples are used as reference samples for low-depth sequencing,
  • where the sequencing depth is not higher than about 0.5X, 0.4X, 0.3X, 0.2X or 0.1X; for example, the sequencing depth is about 0.06X.
  • the sequencing data obtained from the reference sample is compared to a human reference genome (for example, hg19 or hg38), and the following formula is used to calculate the site amplification efficiency
  • DP_i represents the absolute depth of the i-th site
  • N represents the number of sequencing reads
  • L represents the read length
  • step (b) When DP_i ≥ the average depth of the genome and the site amplification efficiency ≥ 1, the site has passed the quality control of the whole-genome amplification efficiency of trace nucleic acid; the genotype of any site that does not meet this quality control is marked as missing data.
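  • The amplification-efficiency filter can be sketched as follows. The formula itself does not survive in this text, so the sketch assumes that site amplification efficiency is the observed depth DP_i divided by the expected uniform depth N × L / G, where G (genome size) is a hypothetical parameter not defined above. It also assumes that both thresholds mean "at least". All function and parameter names are illustrative.

```python
from typing import Dict, Optional


def site_amplification_efficiency(dp_i: float, n_reads: int, read_len: int,
                                  genome_size: int) -> float:
    """ASSUMED formula (not given in the text): DP_i / (N * L / G)."""
    # Depth every site would have if amplification were perfectly uniform.
    expected_depth = n_reads * read_len / genome_size
    return dp_i / expected_depth


def filter_low_efficiency_sites(genotypes: Dict[str, str],
                                depths: Dict[str, float],
                                n_reads: int, read_len: int, genome_size: int,
                                mean_depth: float) -> Dict[str, Optional[str]]:
    """Keep a genotype only if its site passes both checks; otherwise mark it
    as missing (None)."""
    filtered: Dict[str, Optional[str]] = {}
    for site, genotype in genotypes.items():
        dp_i = depths[site]
        efficiency = site_amplification_efficiency(dp_i, n_reads, read_len, genome_size)
        # ASSUMPTION: both comparisons are "greater than or equal to".
        passed = dp_i >= mean_depth and efficiency >= 1.0
        filtered[site] = genotype if passed else None
    return filtered
```

  • In this sketch, the depths would come from the low-depth reference sequencing of pooled amplification products, and the filtered genotypes would be those of the offspring sample.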
  • the theory of chromosomal interference suppression described in step (b) holds that when two crossovers or recombination events occur between two molecular marker sites lying within a given genetic distance, the molecular markers in the recombined segment are judged to carry a genotyping error,
  • and those molecular marker sites are marked as missing data, for example, where the genetic distance is any distance below 1 centimorgan (cM).
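  • A minimal sketch of this double-recombination check, under stated assumptions: markers are sorted by genetic position in cM, and each marker already carries an inferred parental haplotype of origin (0 or 1) in the offspring. The function name, the input encoding and the choice to measure the distance between the two flanking markers are all illustrative, not taken from the patent.

```python
from typing import List


def flag_double_recombinants(positions_cm: List[float], origins: List[int],
                             max_cm: float = 1.0) -> List[int]:
    """Return indices of markers lying between two crossovers that fall
    within max_cm of each other; these would be marked as missing data."""
    # A crossover at index i means the origin switches between marker i-1 and i.
    crossovers = [i for i in range(1, len(origins)) if origins[i] != origins[i - 1]]
    flagged = set()
    for first, second in zip(crossovers, crossovers[1:]):
        # Genetic distance between the markers flanking the doubly
        # recombined segment.
        span = positions_cm[second] - positions_cm[first - 1]
        if span < max_cm:
            flagged.update(range(first, second))
    return sorted(flagged)
```

  • For example, with positions `[0.0, 0.2, 0.4, 0.6, 5.0, 9.0]` and origins `[0, 0, 1, 0, 0, 1]`, only the marker at index 2 is flagged: it sits between two switches only 0.4 cM apart, which chromosomal interference makes implausible as a real event.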
  • the present invention relates to a method for phasing the haplotypes of progeny, the method comprising the above-mentioned steps (a) and (b), and the following steps:
  • the quality-controlled and filtered progeny genome sequence information (for example, the genotype information of the progeny) is used, together with pedigree information, Mendelian inheritance laws and a multi-locus linkage analysis strategy based on the theory of gene linkage and crossing over, to phase the haplotype of the progeny, such as the chromosome-level haplotype of the progeny, wherein the pedigree information is the genome sequence information of the genetic father of the progeny (for example, the genotype information of the genetic father) and/or the genome sequence information of the genetic mother of the progeny (for example, the genotype information of the genetic mother); optionally, the pedigree information also includes the genome sequence information (for example, genotype information) of other pedigree individuals of the progeny.
  • the pedigree information described in step (c) of the method for phasing the haplotype of the offspring is obtained from a nucleic acid sample of a family individual comprising at least about 100 ng DNA (for example, 100 ng-1000 ng DNA).
  • for example, the family individual nucleic acid sample is a nucleic acid sample from the blood, saliva, buccal swabs, urine, nails, hair follicles, dander, cells, tissues or body fluids of the family individual, and the pedigree information covers not less than about 90% of the genome of the pedigree individual, for example, about 90%, 95%, 98%, 99% or more; for example, where the pedigree information, obtained by analyzing the genomic DNA of the pedigree individual (such as whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA and dander gDNA, preferably whole blood gDNA), is the data obtained by whole blood g
  • step (c) of the method for phasing the haplotypes of the offspring is implemented using statistical genetics and computational biology algorithms, for example, using a strategy based on the pedigree information and selected from the likelihood method (finding the haplotype composition with the greatest probability), the genetic rule strategy (finding the haplotype composition with the smallest number of recombinations), the Expectation Maximisation (EM) algorithm, and combinations thereof, to obtain the most probable haplotype composition of the offspring.
  • the present invention relates to a method for reconstructing progeny genomes, the method comprising the above-mentioned steps (a), (b) and (c), and steps
  • step (d) of the method for reconstructing the offspring genome is to identify identical-by-descent (IBD) regions, that is, to determine which parental haplotype the embryo carries in a given region, and then to combine this with the parents' high-density polymorphic locus genotype information to fill in the genotype information missing in the offspring.
  • step (d) of the method for reconstructing offspring genomes also involves, for genotype information that has not been successfully filled in based on family information, using population reference haplotype information and population-level allele linkage disequilibrium (LD) to fill in genotype information at the whole-genome level;
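  • The family-based part of genotype filling can be sketched as below. The sketch assumes that an IBD region has already been identified, so that the index of the transmitted paternal and maternal haplotype in that region is known, and that both parents' haplotypes are phased at every site. The population-level LD step is not shown. All names and the data layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Tuple[str, str]  # (paternal allele, maternal allele)


def fill_from_ibd(child: Sequence[Optional[Genotype]],
                  father_haps: Sequence[str], mother_haps: Sequence[str],
                  pat_idx: int, mat_idx: int) -> List[Genotype]:
    """Fill missing offspring genotypes (None) within one IBD region from
    the parental haplotypes known to have been transmitted there."""
    filled: List[Genotype] = []
    for i, genotype in enumerate(child):
        if genotype is None:
            # Copy the alleles at site i from the transmitted haplotypes.
            genotype = (father_haps[pat_idx][i], mother_haps[mat_idx][i])
        filled.append(genotype)
    return filled
```

  • Here each parental haplotype is a string with one allele per site; observed offspring genotypes are left untouched and only the gaps are filled.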
  • the present invention relates to a device or system capable of performing the above-mentioned removal of noisy genetic data from offspring; a device or system capable of performing the above-mentioned haplotype phasing; and a device or system capable of performing the above-mentioned genotyping.
  • the present invention relates to a device or system characterized in that,
  • It can perform whole-genome amplification of DNA samples, for example, of DNA samples of the offspring and/or of the offspring's genetic parents (in some embodiments, when the amount of parental DNA is sufficient, no amplification is required);
  • It can detect the genomic sequence information of the resulting whole-genome amplification product or gDNA sample, for example, by reading the sequence information from a nucleic acid chip or second-generation sequencing;
  • the present invention relates to the use of the methods of the first to third aspects above or the use of the device or system of the fourth aspect to perform polygenic pre-implantation embryos and/or fetuses in early pregnancy.
  • Figure 1 shows a flow chart of a technical solution of the present invention.
  • Figure 2 shows the effect of the whole genome amplification efficiency of progeny trace nucleic acid on the quality of progeny genotypes.
  • the term “comprising” or “including” means including the stated elements, integers or steps, but does not exclude any other elements, integers or steps.
  • when the term “comprises” or “includes” is used, unless otherwise specified, it also covers the case consisting of the stated elements, integers or steps.
  • for example, when referring to an antibody variable region that "comprises" a specific sequence, it is also intended to encompass an antibody variable region consisting of that specific sequence.
  • offspring includes, but is not limited to, the offspring of a mammal, such as a human, and means a born or unborn offspring.
  • Unborn offspring include embryos or fetuses.
  • Embryo usually refers to the product of the division of the fertilized egg up to the end of the embryonic period, eight weeks after fertilization. The cleavage stage of the embryo occurs during the first three days of culture.
  • "Embryo transfer” is the operation of putting one or more embryos and/or blastocysts into the uterus or fallopian tube. Fetus usually refers to the unborn offspring of mammals after eight weeks of pregnancy, especially unborn human babies.
  • blastocyst is an embryo 5 or 6 days after fertilization, which has an inner cell mass, an outer cell layer called trophectoderm, and a fluid-filled blastocyst cavity that contains the inner cell mass from which the entire embryo is derived.
  • the trophectoderm is the precursor of the placenta.
  • related individuals or "family individuals” of the offspring are used interchangeably, and refer to any individual that is genetically related to the target offspring individual, for example, is genetically related to the target offspring individual and therefore shares an individual with the target offspring.
  • the related individual may be the genetic parent of the target individual or any genetic material derived from the parent, such as sperm, polar bodies, or other embryos or fetuses. It can also refer to siblings, parents or grandparents. In this application, parent refers to the genetic father or mother of an individual.
  • Offspring individuals usually have two parents (a mother and a father). A sibling refers to any individual whose genetic parents are the same as those of the offspring individual in question.
  • siblings can refer to a born child, an embryo or fetus, or one or more cells derived from an embryo or fetus; siblings can also refer to haploid individuals derived from one parent, for example, sperm, polar bodies, or any other haplotypic genetic material.
  • the DNA derived from the progeny refers to the DNA of the original part of the progeny cell, the body fluid of the progeny or the original DNA of the culture fluid of the progeny cell whose genotype is basically equivalent to the genotype of the progeny.
  • Parent-derived DNA refers to the DNA of the original part of the parent cell whose genotype is basically equivalent to the parental genotype, the parent body fluid, or the original DNA of the parent cell culture fluid.
  • maternal DNA refers to the DNA of the original part of the maternal cell whose genotype is basically equivalent to the maternal genotype, the maternal body fluid, or the original DNA of the maternal cell culture fluid.
  • SNP Single Nucleotide Polymorphism
  • the frequency of SNPs in a population is generally >1%. On average, there is one SNP every 300-1000 bp across the human genome.
  • SNP data are currently available from a number of public databases, including, for example, http://cgap.ncbi.nih.gov/GAI; http://www.ncbi.nlm.nih.gov/SNP; the human SNP database http://hgbas.cgr.ki.sei or http://hgbase.interactiva.de/.
  • genotype refers to the type of alleles possessed by an individual at a locus, which is called the genotype of the individual at that locus.
  • For humans, except for the sex chromosomes, the pair of alleles that each pair of homologous chromosomes carries at the same locus is called the genotype of that locus. Genotyping refers to the process of determining the genotype of an individual.
  • noisy genetic data refers to genetic data with any of the following: allele dropout (ADO), uncertain base pair measurement, wrong base pair measurement, missing base pair measurement, indeterminate measurement of insertion or deletion, indeterminate measurement of chromosome segment copy number, false signal, other measurement errors, or a combination thereof.
  • ADO Allele Dropout
  • Sequencing depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the size of the genome being tested. For example, for a genome size of 2M and a sequencing depth of 10X, the total amount of data obtained is 20M.
  • absolute sequencing depth of a site and “absolute depth of a site” are used interchangeably and refer to the number of reads of the site.
  • average sequencing depth of the genome and “average depth of the genome” are used interchangeably, and refer to adding the absolute depth of each site on the whole genome and dividing by the number of sites to obtain the average depth of the genome.
  • the average sequencing depth of the genome can also be understood as the average number of times each base in the genome has been sequenced.
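  • The depth definitions above reduce to simple arithmetic, sketched here with illustrative function names. The first two functions reproduce the 2M genome at 10X example; the third computes the average depth of the genome from per-site absolute depths.

```python
from typing import Sequence


def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def total_bases_needed(genome_size: float, depth: float) -> float:
    """Amount of data required to reach a target depth."""
    return genome_size * depth


def mean_site_depth(site_depths: Sequence[float]) -> float:
    """Average depth of the genome: sum of absolute site depths divided by
    the number of sites."""
    return sum(site_depths) / len(site_depths)
```

  • For a 2M genome, `total_bases_needed(2_000_000, 10)` gives 20M bases, and `sequencing_depth(20_000_000, 2_000_000)` recovers the 10X depth.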
  • A read is a single sequence produced by the sequencer; each sequence in the sequencing data is a read.
  • the term “coverage” refers to the proportion of a genome, transcriptome or chromosome segment for which sequence information is known, relative to the entire genome, transcriptome or segment.
  • the coverage refers to the ratio of the number of bases of sequence information detected (for example, by means of sequence detection, such as sequencing) to the total number of bases in the detected region.
  • the coverage is the ratio of the number of sequenced bases finally obtained to the number of bases of the entire genome.
  • the coverage obtained by sequencing the human genome is 98.5%, which indicates that there are still 1.5% regions of the genome that have not been sequenced.
  • the coverage refers to the proportion of genetic sites (such as SNP sites or genetic variation sites) in the detected region for which sequence information is detected (for example, by SNP chip or sequencing analysis), relative to the total number of genetic sites in that region.
  • the detected region can be the whole genome, a specific chromosome, or a specific chromosome segment, or a transcript set, or a specific transcription region.
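  • As a small illustration of the site-based definition of coverage, the sketch below counts how many genetic sites in a region have a genotype call. Representing a missing call as `None` is an assumption of this sketch, as is the function name.

```python
from typing import Optional, Sequence


def site_coverage(genotypes: Sequence[Optional[str]]) -> float:
    """Fraction of sites in the detected region with a genotype call."""
    if not genotypes:
        raise ValueError("at least one site is required")
    called = sum(genotype is not None for genotype in genotypes)
    return called / len(genotypes)
```

  • With this measure, the text describes offspring trace samples covering no more than about 30% of the genome, and pedigree samples covering not less than about 90%.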
  • FASTQ is one of the standard formats for storing sequence data. Each read occupies four lines: the read name, the sequence, a separator line beginning with "+", and the sequence quality values.
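  • A minimal parser for the four-line FASTQ layout might look like the sketch below. It assumes unwrapped records (exactly four lines each) and Phred+33 quality encoding; the latter is a common convention but is not stated in the text. Names are illustrative.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @name, sequence, + separator, quality."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def phred33(quality: str) -> List[int]:
    """Decode a quality string, ASSUMING Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]
```

  • Usage: `list(parse_fastq(open(path)))` for a small file, where `path` is a placeholder for an uncompressed FASTQ file.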
  • Mendelian law of inheritance refers to the two basic laws of genetics, the law of segregation and the law of independent assortment, collectively referred to as the Mendelian laws of inheritance. According to these laws, during meiosis, alleles separate as homologous chromosomes separate, enter different gametes, and are passed independently through the gametes to the offspring; in addition, as the alleles separate, non-allelic genes on non-homologous chromosomes assort independently.
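  • The Mendelian-error quality control mentioned earlier follows directly from the law of segregation: at an autosomal site, the offspring must carry one allele from each parent. The sketch below encodes genotypes as two-character strings such as "AG"; that encoding and the function names are assumptions of this illustration.

```python
from typing import Dict, Optional


def is_mendelian_consistent(child: str, father: str, mother: str) -> bool:
    """True if one child allele can come from the father and the other from
    the mother (autosomal site, unordered genotypes)."""
    first, second = child
    return ((first in father and second in mother)
            or (second in father and first in mother))


def mask_mendelian_errors(child: Dict[str, str], father: Dict[str, str],
                          mother: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Mark offspring genotypes that violate Mendelian inheritance as
    missing (None)."""
    return {
        site: genotype if is_mendelian_consistent(genotype, father[site], mother[site]) else None
        for site, genotype in child.items()
    }
```

  • For example, a child called "AA" with parents "AA" and "GG" is inconsistent (no "A" can come from the mother), which in trace samples often points to allele dropout rather than a true genotype.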
  • linkage disequilibrium (LD) refers to alleles at two or more gene loci appearing together on one chromosome with a probability higher than expected from random occurrence. Linkage disequilibrium is also called allelic association. Generally, the strength of LD is related to the distance between two gene loci: the farther apart the two loci are, the greater the chance of recombination, that is, the higher the recombination rate (crossover rate) and the weaker the LD; conversely, the closer they are, the lower the recombination rate and the stronger the LD. The recombination rate can therefore be used to reflect the relative distance between two genes on the same chromosome. When the recombination rate is 1%, the distance between the two genes is recorded as 1 centimorgan (cM).
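  • The definition of LD can be made concrete with the standard pairwise statistics D and r², which are not named in the text and are used here only as a familiar illustration. The sketch assumes a list of phased two-locus haplotypes; all names are illustrative.

```python
from typing import Sequence, Tuple


def ld_stats(haplotypes: Sequence[Tuple[str, str]],
             allele_a: str, allele_b: str) -> Tuple[float, float]:
    """Return (D, r^2) for allele_a at locus 1 and allele_b at locus 2."""
    n = len(haplotypes)
    p_a = sum(hap[0] == allele_a for hap in haplotypes) / n
    p_b = sum(hap[1] == allele_b for hap in haplotypes) / n
    p_ab = sum(hap[0] == allele_a and hap[1] == allele_b for hap in haplotypes) / n
    # D is the excess of the observed haplotype frequency over the frequency
    # expected if the two alleles occurred together only by chance.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared
```

  • D = 0 corresponds to random co-occurrence (no LD); an r² of 1 means the allele at one locus fully predicts the allele at the other, which is what makes population-based genotype filling possible.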
  • chromosomal interference refers to the phenomenon in which, during meiosis, two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and inhibit each other. In this document, the suppression of nearby crossovers by chromosomal interference is used to control the quality of genotyping data.
  • the genome formation process of the offspring is equivalent to a random recombination of the parental genomes (that is, haplotype recombination through crossing over combined with random combination of gametes).
  • haplotype refers to a combination of alleles at multiple sites on the same chromosome that are usually inherited together. Depending on the number of recombination events that have occurred between a given set of sites, a haplotype can refer to as few as two sites or to an entire chromosome. Haplotype can also refer to a set of statistically associated single nucleotide polymorphisms (SNPs) on a single chromatid.
  • haplotype data is also referred to as “haplotypic data”, “phased data” or “ordered genetic data”, and refers to the determined genetic data of a single chromosome in a diploid or polyploid genome.
  • Unordered Genetic Data refers to data obtained by combining the sequencing data of two or more chromosomes in a diploid or polyploid genome.
  • haplotype phasing refers to the act of determining an individual's haplotype genetic data from unordered diploid or polyploid genetic data. It can refer to determining, for a set of alleles found on a chromosome, which of the two alleles at each locus is associated with each of the two homologous chromosomes in an individual.
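  • At a single site, phasing with both parents reduces to asking which child allele can only have come from which parent. The sketch below shows only this site-by-site rule; it is a simplification and not the likelihood, minimum-recombination or EM strategies described for the method, which work across many linked sites. Genotypes are two-character strings and all names are hypothetical.

```python
from typing import Optional, Tuple


def phase_site(child: str, father: str, mother: str) -> Optional[Tuple[str, str]]:
    """Return (paternal allele, maternal allele) when the trio genotypes
    determine it uniquely, otherwise None."""
    first, second = child
    options = set()
    if first in father and second in mother:
        options.add((first, second))
    if second in father and first in mother:
        options.add((second, first))
    # Zero options is a Mendelian error; two options means the site is
    # ambiguous (for example, all three individuals heterozygous).
    return options.pop() if len(options) == 1 else None
```

  • Sites that return `None` are where linkage information from neighbouring informative sites is needed to complete the phase.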
  • the haplotype phasing of multiple sites can find the haplotype-disease phenotype correlation, which is significantly stronger than the single site-disease phenotype correlation.
  • SNP chip is a chip that uses the signal (usually a fluorescent signal) obtained after hybridization of the chip to determine the genotype of a certain site.
  • SNP chips will contain different SNP sites depending on chip manufacturers and models.
  • human chips produced by Affymetrix and Illumina contain different sets of SNPs.
  • IBD Identity By Descent
  • high-density genetic polymorphism information of the parents and other family members of the embryo reflects the fact that, when the same genetic analysis method is used, the density of genetic polymorphisms obtained for parents and embryos differs. The parents' samples are gDNA samples at high concentration, so most genotype locus information can be obtained without difficulty. Embryo samples, by contrast, are often the whole-genome amplification product of a single cell or of the DNA in embryo culture fluid, and amplification problems such as uneven whole-genome amplification and ADO make the available genetic polymorphic locus information of embryos relatively sparse.
  • Genome-wide association study (GWAS) refers to screening sequence variations across the whole human genome to identify those associated with a disease, in order to find associations between genetic markers and disease cost-effectively.
  • module refers to a software object or routine (e.g., as an independent thread) that can be executed on a single computing system (e.g., computer program, tablet computer (PAD), one or more processors).
  • the program for implementing the method of the present invention may be stored on a computer-readable medium, which contains computer program logic or code parts, for implementing the system modules and methods.
  • Although the system modules and methods described herein are preferably implemented in software, implementation in hardware or a combination of software and hardware is also possible and can be conceived by those skilled in the art.
  • the present invention generally provides a method for reconstructing progeny genomes, the method comprising:
  • step (b) Perform quality control and filtering on the genomic sequence information of the offspring of step (a), and remove the sites with poor genotyping quality;
  • the offspring may be born or unborn offspring.
  • the related individual may be any individual who is genetically related to the target individual.
  • the method of the present invention involves genomic information processing and/or reconstruction based on original genetic data.
  • the original genetic data applicable to the method of the present invention includes genomic sequence information of the offspring and/or related individuals and related original genetic data, such as genotype information generated based on the sequence information. These original genetic data are disordered and unphased.
  • the original genetic data is in the form of a data set, for example, in the form of a computer-readable data set.
  • the user of the method of the present invention can directly provide a computer-readable medium recording the data, or a data package generated on a commercial platform; or, preferably, the data are obtained from the target nucleic acid sample by any sequence information detection technique known in the art.
  • the acquisition of original genetic data includes: obtaining genome sequence information of the offspring, and performing genotype analysis of the offspring based on this information.
  • the method further includes: obtaining genomic sequence information of related individuals and performing genotype analysis.
  • the step of obtaining genomic sequence information of offspring and/or related individuals includes:
  • the genomic sequence information includes, but is not limited to: sequence information of the whole genome, sequence information of the whole exome, and/or sequence information of a targeted chromosome region.
  • the sequence information may be, for example, but not limited to, a gene sequencing data set, a SNP data set, and a gene mutation site data set.
  • sequence information is not particularly limited. Any sequence detection technology suitable for nucleic acid known in the art is applicable to the present invention.
  • sequencing technology is used to detect sequence information, including but not limited to: whole genome sequencing, whole exome sequencing, and targeted sequencing.
  • whole-genome sequencing is used, and more preferably, high-throughput sequencing technology such as second-generation sequencing technology is used to detect sequence information of the whole genome from a nucleic acid sample.
  • the sequence information can be obtained by detecting a target selected from the group consisting of the whole genome, the whole exome, and polymorphic sites (for example, SNPs or short tandem repeats) in targeted genomic regions.
  • a CBC-PMRA (Capital Biotechnology Precision Medicine Research Array) chip
  • Boao Jingdian based on Affymetrix's PMRA (Precision Medicine Research Array) chip, which can detect 900,000 SNP sites.
  • an ASA (Asian Screening Array) chip
  • the offspring can be born or unborn offspring.
  • the offspring are unborn offspring, preferably fetuses or embryos, more preferably embryos produced by, for example, IVF.
  • the embryo may be an embryo about 3-10 days old, for example, a blastocyst about 5 days old.
  • the progeny nucleic acid sample is a sample containing a trace amount of genomic DNA of the progeny; for example, the sample contains about 0.1 pg-40 ng DNA, such as 1-40 ng, 20-40 ng, 0.1-40 pg, 1-40 pg, or 10-40 pg DNA of the offspring.
  • the progeny trace genomic DNA nucleic acid samples include, but are not limited to, embryo culture fluid (for example, IVF embryo culture fluid), blastocyst culture fluid (such as about 3-5 day-old blastocyst culture fluid), fetal cell-free DNA in blastocoel fluid, maternal plasma, or other types of maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, or fetal cells in maternal blood or other types of maternal body fluids;
  • CN106086199A discloses a method for detecting embryo chromosomes using blastocyst culture fluid, in particular it discloses the method of obtaining blastocyst culture fluid and the method of performing genome amplification on the obtained blastocyst culture fluid.
  • CN105368936A also discloses a method for detecting embryonic chromosomes using blastocyst culture fluid, and in particular discloses the collection of blastocyst culture fluid and whole genome amplification from trace DNA in the culture fluid, including the design of the primers used for amplification and the design of the amplification reaction program.
  • CN109536581A discloses a method for obtaining a blastocyst culture solution used as a nucleic acid sample for genotyping analysis.
  • CN105543339A discloses obtaining outer trophoblast cells at the blastocyst stage from embryos produced by in vitro fertilization (IVF) and performing genome amplification of the embryo chromosomes.
  • the embryonic nucleic acid samples and their amplification methods disclosed in the above-mentioned documents are all suitable for obtaining the original genetic data of the offspring in the present invention, and they are hereby incorporated in their entirety into the present invention as a reference.
  • the culture fluid is aspirated from the culture of an embryo fertilized by intracytoplasmic sperm injection (ICSI), preferably on day 3-10 of culture, more preferably on day 5, and is used as a trace nucleic acid sample of the offspring to obtain the genomic sequence information of the offspring.
  • a single embryo culture system is used to culture the embryos in 0.1 µl-1 ml of culture medium, and a small amount of the culture medium (for example, about 0.1 µl-1 ml, such as about 0.1 µl, 10 µl, 20 µl, 30 µl, 40 µl, 50 µl, 100 µl, 200 µl, 500 µl, 800 µl, or 1 ml) is taken for genetic information detection and genotype analysis of the offspring.
  • the surface of the egg or the fertilized egg may be washed before the embryo culture is performed to remove the DNA impurities on the surface of the fertilized egg, thereby reducing the influence of the DNA impurities in the culture fluid.
  • For this washing, refer, for example, to the descriptions in CN201610584345.5 and TW10612113. These documents are hereby incorporated by reference.
  • the type of progeny nucleic acid sample used for sequence detection is not particularly limited, and it may be a sample containing a large amount of nucleic acid or a sample containing a small amount of nucleic acid.
  • the method of the present invention is particularly suitable for reconstructing embryonic genome information from embryo-derived DNA that is present in small amounts and as small fragments, such as cell-free DNA (cfDNA) in blastocyst culture fluid.
  • the present invention is particularly useful for prenatal diagnosis: haplotype phasing and/or genome reconstruction of the offspring can be performed before the establishment of pregnancy (for example, before embryo implantation in IVF) on gametes and on cells or culture medium taken from early embryos, or in the later stages of pregnancy on cell samples taken from the placenta or fetus or on fetal DNA taken from maternal body fluids, such as fetal cfDNA in maternal body fluids.
  • the offspring are unborn offspring, and a sample containing a small amount of offspring nucleic acid is used, for example, a single cell of an embryo produced by IVF or a culture medium of an embryo.
  • the offspring is a fetus, and the free DNA content of the fetus contained in the offspring nucleic acid sample is, for example, 0.1 pg-40 ng, preferably 1-40 ng, and more preferably 20-40 ng.
  • the offspring are embryos, and the free DNA content of the embryo contained in the offspring nucleic acid sample is, for example, 0.1-40 pg, preferably, 1-40 pg, and more preferably, 10-40 pg.
  • Those skilled in the art can use any known method to take nucleic acid samples from related individuals of the offspring, detect the genomic sequence information of the related individuals, and then obtain the genotype and haplotype information, thereby providing the family genotype information of the offspring .
  • the type of nucleic acid sample used for obtaining the raw data of related individuals is not particularly limited, and may be a sample containing a large amount of nucleic acid or a sample containing a small amount of nucleic acid.
  • nucleic acid samples that are conducive to obtaining high-density genotypic site information in the whole genome will be preferred.
  • the nucleic acid sample is any sample that contains genomic DNA nucleic acids of related individuals.
  • the sample may be a nucleic acid sample of the related individual that contains at least about 1 ng DNA (for example, 1 pg-1000 ng DNA); for example, the nucleic acid sample of the related individual is a nucleic acid sample from tissue, cells, and/or body fluids of the related individual, such as a nucleic acid sample from blood, saliva, oral epithelium, urine, nails, hair follicles, or dander.
  • the nucleic acid sample may or may not be extracted and/or purified.
  • the nucleic acid contained in the nucleic acid sample is genomic DNA (gDNA) selected from the following sources: whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dander gDNA, preferably whole blood gDNA.
  • nucleic acid amplification can be performed.
  • Those skilled in the art can use any known nucleic acid amplification technology to perform whole-genome amplification of nucleic acids of progeny and/or related individuals.
  • the amplification method is selected from: primer extension preamplification PCR (PEP-PCR); degenerate oligonucleotide primed PCR (DOP-PCR); multiple displacement amplification (MDA); multiple annealing and looping-based amplification cycles (MALBAC); a blunt-end or sticky-end ligation library construction method; or a combination thereof.
  • When the progeny nucleic acid content in the sample is small, it is preferable to use a whole-genome amplification method suitable for single cells, such as the MALBAC method, so as to reduce the erroneous gene sequence information caused by amplification.
  • the sequence information of the genome is preferably detected by a technique selected from nucleic acid chips, amplification and/or sequencing.
  • the technology can be any such technology known in the art, including, but not limited to, single nucleotide polymorphism (SNP) microarray chips, MassARRAY time-of-flight mass spectrometry chips, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing, or a combination thereof.
  • SNP chips are used to obtain genomic sequence information of progeny and/or related individuals.
  • the SNP chip contains at least 700k sites, such as 800-1000K sites.
  • genome sequence information is obtained through whole genome sequencing.
  • a high-throughput sequencing platform is used to sequence the whole genome amplification products of the nucleic acid sample.
  • the sequencing platform is not particularly limited.
  • second-generation sequencing platforms include, but are not limited to, Illumina's GA, GAII, GAIIx, HiSeq 1000/2000/2500/3000/4000, X Ten, X Five, NextSeq 500/550, and MiSeq; Applied Biosystems' SOLiD; Roche's 454 FLX; and Thermo Fisher Scientific (Life Technologies)'s Ion Torrent, Ion PGM, and Ion Proton I/II. Third-generation single-molecule sequencing platforms include, but are not limited to, Helicos BioSciences' HeliScope system, Pacific Biosciences' SMRT system, and Oxford Nanopore Technologies' GridION and MinION.
  • the sequencing type can be single-end or paired-end sequencing, and the read length can be 30 bp, 40 bp, 50 bp, 100 bp, 300 bp, or any other length of 30 bp or more.
  • whole-genome sequencing is performed using a sequencing depth, such as ⁇ 20x. More preferably, for related individuals, the sequencing depth can be higher, such as at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, at least 80X, or above.
  • the gDNA of related individuals adopts a high-depth whole-genome sequencing strategy in order to obtain high-accuracy and high-density polymorphic molecular markers.
  • a low-depth whole-genome sequencing method can be used, which is beneficial for cost control. Therefore, in some embodiments, low-depth sequencing can be used to obtain relatively low-density genotype information from samples containing trace progeny nucleic acids, such as embryo culture fluid; for example, the sequencing depth is as low as 2x, or even less than 1x. Of course, the higher the sequencing depth, the higher the accuracy of the offspring genome reconstruction.
  • quality control and filtering of the sequence information data are performed to remove low-quality data.
  • Any tool known in the art that can perform such quality control and clean up noisy data can be used for this purpose, including but not limited to software that performs quality control and filtering on the raw fastq files generated by sequencing, for example, fastp.
  • a variety of methods for analyzing the genotype of a subject based on the genome sequence information of a subject are known in the art, including various algorithms and computer executable programs. As understood by those skilled in the art, these methods are all applicable to the genotype analysis in the method of the present invention.
  • genotype analysis includes: determining the genotype of the offspring and/or related individuals based on the genome sequence information of the offspring and/or related individuals. In some preferred embodiments, the SNP polymorphism sites and genotypes of the subject are determined based on, for example, SNP chip detection data, or the genetic variation sites and genotypes in the subject's genome are analyzed based on, for example, a sequencing data set.
  • genotype analysis includes:
  • nucleic acid samples such as embryo culture fluid
  • multiple cases such as at least 200 or 300 or 400 or more cases
  • whole-genome amplification data of trace nucleic acids are used as a database for sequence information correction.
  • the type of reference genome sequence is not particularly limited.
  • a known human reference genome can be used as a reference sequence, such as the hg19 and hg38 reference genomes provided by UCSC.
  • the coordinate system will be different. Therefore, in the analysis process, it is necessary to map the detected sequence data (such as sequencing data or SNP chip detection data) to the specific reference genome used to maintain the consistency of the information.
  • the genomic coordinates can also be converted by means known in the art, such as LiftOver.
  • the method of alignment is not particularly limited.
  • the BWA-MEM algorithm is used to align the sequences to a reference genome such as hg19.
  • the obtained alignment files are sorted and indexed, and deduplication and base quality correction are performed.
  • genotype analysis includes:
  • the method further includes the step of obtaining the genotype information of the progeny of other genomic regions other than the nucleic acid chip site.
  • a high-density SNP chip that evenly covers the entire genome of Asians, especially Chinese, is used to meet the needs of genome-wide association analysis and genotyping.
  • a chip containing at least 500,000 (also referred to as 500K) SNP sites, at least 600K SNP sites, or 800K, 900K, or even more SNP sites is used to perform genotyping analysis on the whole-genome amplification products of the offspring and related individuals.
  • SNP genotype analysis tools are known in the art.
  • the Genotyping function module in the Axiom™ Analysis Suite analysis platform of Thermo Fisher Scientific can be used to perform SNP genotype analysis, and SNP sites whose genotype quality meets the PolyHighResolution, NoMinorHom, MonoHighResolution, or Hemizygous standards can be selected for use in the subsequent steps of the method of the present invention.
  • the method of the present invention includes: obtaining a whole genome sequencing data set of offspring and related individuals, and detecting gene mutation sites based on the data set.
  • the sequencing data set is preferably a data set obtained after quality control and cleaning of the original sequencing data, comparison with a reference genome, sorting and deduplication, such as a BAM data format.
  • the prior art describes the quality control and cleaning of raw sequencing data, see CN108573125A.
  • the sequencer for acquiring the data includes an Illumina platform.
  • the Genome Analysis Toolkit (GATK) Best Practices workflow is used for gene mutation analysis.
  • quality control filtering is performed on the obtained gene mutation site to obtain sites that have genotype information in the parents and can be used for linkage analysis.
  • When a sample containing a small amount of progeny nucleic acid is used for genotyping, for example, when cfDNA in embryo culture fluid, an embryonic tissue biopsy, or fetal free DNA is used as the sample, the amount of progeny nucleic acid is small and the fragments are short, so there is often a high rate of genotyping errors.
  • the removal of noisy genetic data includes: identifying a genotyping error site in the genotyping data and marking the site as missing data.
  • the present invention uses the quality control of the amplification efficiency of the whole genome to identify the poorly typed sites in the progeny genotyping data caused by the low amplification efficiency, and mark the sites as missing data.
  • the heterogeneity of genome-wide amplification efficiency is a feature of single-cell amplification technology used for the amplification of trace nucleic acid samples, and regions with low amplification efficiency will lead to poor genotyping quality of base sites in this region.
  • the inventor proposes to construct a reference sequencing data set using multi-sample whole-genome amplification products to determine the distribution pattern of the whole-genome amplification efficiency of trace nucleic acids.
  • the quality control of the whole genome amplification efficiency of trace nucleic acid is implemented as follows:
  • the sequencing depth is not higher than about 0.5X, not higher than about 0.4X, not higher than about 0.3X, not higher than about 0.2X, not higher than about 0.1X
  • the sequencing depth is about 0.06X
  • the sequencing data are aligned to the reference genome to generate BAM files, and the BAM files of multiple reference samples are merged into one large BAM library.
  • the BAM files of the multiple reference samples are BAM files of at least 300, 400, 500, 600, 700, 800 reference samples.
  • $DP_i$ represents the absolute depth of the $i$-th site
  • $N$ represents the number of sequencing reads
  • $L$ represents the read length
  • the present inventors used the next-generation sequencing data of the MALBAC whole genome amplification products of 463 trace nucleic acids to plot the whole genome amplification efficiency distribution
  • the genotype information of loci with low amplification efficiency (for example, amplification efficiency ⁇ 1) is removed, and these loci are marked as missing data.
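  • As an illustrative sketch only, the amplification-efficiency quality control could be expressed as follows. The exact efficiency formula is not reproduced in this text, so the sketch assumes that efficiency is the absolute depth of a site in the pooled reference library divided by the expected mean depth N·L/G, where G (genome size) is a symbol introduced here. The function names and the default cutoff of 1.0 are likewise hypothetical.

```python
from typing import Dict, Optional


def amplification_efficiency(dp: Dict[str, float], n_reads: int,
                             read_len: int, genome_size: int) -> Dict[str, float]:
    """Assumed definition: DP_i divided by the expected mean depth N * L / G."""
    mean_depth = n_reads * read_len / genome_size
    return {site: depth / mean_depth for site, depth in dp.items()}


def mask_low_efficiency(genotypes: Dict[str, Optional[str]],
                        efficiency: Dict[str, float],
                        threshold: float = 1.0) -> Dict[str, Optional[str]]:
    """Replace genotypes at sites below the (assumed) cutoff with None (missing)."""
    return {site: (gt if efficiency.get(site, 0.0) >= threshold else None)
            for site, gt in genotypes.items()}


# Toy numbers: 1,000,000 reads of 100 bp over a 1,000,000 bp "genome"
# give an expected mean depth of 100.
eff = amplification_efficiency({"s1": 250.0, "s2": 40.0},
                               n_reads=1_000_000, read_len=100,
                               genome_size=1_000_000)
masked = mask_low_efficiency({"s1": "A/C", "s2": "C/C"}, eff)
```

  • In this toy case site s2 falls below the cutoff and is marked as missing, while s1 keeps its genotype; a site absent from the efficiency table is treated as having efficiency 0.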
  • the Mendelian error rate refers to the ratio of the loci where Mendelian genetic errors occur to the total loci.
  • the present invention uses Mendelian laws of inheritance and chromosomal interference theory to identify allele dropout (ADO) and other genotyping errors in the progeny genotyping data and mark them as missing data.
  • The Mendelian law of inheritance means that if, at a certain locus, the father has the A/C genotype and the mother has the C/C genotype, their offspring must have the A/C or C/C genotype unless a new mutation occurs (the frequency of which is extremely low). If the genotype of the offspring at this locus is A/A, ADO may have occurred at this locus, and the information at this locus is marked as missing data.
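  • The Mendelian check described above can be sketched in a few lines. This is a minimal illustration for autosomal biallelic trios, not the specific implementation of the present invention; the "A/C" string encoding, the function names, and the choice to compute the error rate over typed (non-missing) offspring loci are assumptions made for the example.

```python
from itertools import product


def mendel_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child's genotype can be formed from one allele of each parent."""
    f, m = father.split("/"), mother.split("/")
    c = sorted(child.split("/"))
    return any(sorted(pair) == c for pair in product(f, m))


def mask_mendelian_errors(trio_sites):
    """trio_sites maps site -> (father, mother, child); child may be None.

    Returns the offspring genotypes with inconsistent sites set to None,
    plus the Mendelian error rate over the typed offspring sites.
    """
    cleaned, errors, typed = {}, 0, 0
    for site, (father, mother, child) in trio_sites.items():
        if child is None:
            cleaned[site] = None
            continue
        typed += 1
        if mendel_consistent(father, mother, child):
            cleaned[site] = child
        else:
            cleaned[site] = None  # possible ADO: mark as missing data
            errors += 1
    return cleaned, (errors / typed if typed else 0.0)


trio = {
    "rs1": ("A/C", "C/C", "A/A"),  # the example from the text: inconsistent
    "rs2": ("A/C", "C/C", "A/C"),
    "rs3": ("G/G", "G/T", None),   # already missing
}
cleaned, rate = mask_mendelian_errors(trio)
```

  • Here rs1 is marked as missing, rs2 is kept, and one of the two typed loci is inconsistent.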
  • chromosomal interference refers to the phenomenon in which two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and inhibit each other during meiosis.
  • the present invention adopts the interference (inhibition) theory: when two crossovers or recombination events occur between molecular marker sites within a given genetic distance, the molecular marker in the recombined segment is determined to have a genotyping error and the marker site is marked as missing data; the genetic distance is, for example, any distance below 1 centimorgan (cM).
  • for example, when constructing the paternal and maternal haplotypes of the offspring, the constructed haplotype may show two recombination events within a small genetic distance: molecular marker A (a SNP locus genotype) and the loci upstream of it all indicate that the paternal haplotype was inherited from the grandfather, the next molecular marker B indicates that it was inherited from the grandmother, and the molecular marker C downstream of B and the subsequent loci all indicate that it was inherited from the grandfather, with loci A, B, and C lying within a relatively small genetic distance such as 1 centimorgan. In this case, it can be inferred that the genotype of locus B is wrong, and locus B is marked as missing data.
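  • The A/B/C example above can be sketched as a scan for short segments flanked by two switches of grandparental origin. This is a simplified illustration under assumed inputs: each informative marker carries a genetic position in cM and an origin label ("GF" for grandfather, "GM" for grandmother), and the function name and the 1 cM default are hypothetical choices taken from the example in the text.

```python
def mask_double_recombinants(markers, max_cm=1.0):
    """markers: list of (position_cM, origin), sorted by position.

    Returns the origin labels with markers inside a too-close double
    recombination replaced by None (missing data).
    """
    # Group consecutive markers of the same origin into runs (start, end).
    runs = []
    for i, (_, origin) in enumerate(markers):
        if runs and markers[runs[-1][1]][1] == origin:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))

    cleaned = [origin for _, origin in markers]
    # An interior run is flanked by two switches; if the flanking markers
    # (A and C in the text) lie within max_cm, the run (B) is suspect.
    for k in range(1, len(runs) - 1):
        left_end = runs[k - 1][1]
        right_start = runs[k + 1][0]
        if markers[right_start][0] - markers[left_end][0] <= max_cm:
            for i in range(runs[k][0], runs[k][1] + 1):
                cleaned[i] = None
    return cleaned


markers = [(10.0, "GF"), (10.2, "GF"), (10.4, "GM"), (10.6, "GF"),
           (30.0, "GF"), (45.0, "GM"), (60.0, "GM")]
result = mask_double_recombinants(markers)
```

  • The isolated "GM" marker at 10.4 cM is flagged because its flanking markers are 0.4 cM apart, while the switch between 30 cM and 45 cM is kept as a plausible single recombination.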
  • multiple (preferably more than 2) offspring samples are used (preferably unborn offspring samples, such as 2-4 or more blastocyst culture fluid samples, or samples from siblings of the embryos), and the genotype data of the multiple offspring are obtained.
  • the mutual deduction of the haplotypes of the multiple offspring refers to using the haplotype phasing method of the present invention and the obtained genotype data of the multiple offspring to deduce the most likely haplotype composition of the offspring. This identifies the sites with genotype errors, which are marked as missing data.
  • the removal of noisy genetic data includes: performing the above-mentioned quality control of the whole-genome amplification efficiency of the nucleic acid and at least one quality control selected from the following: quality control for identifying Mendelian inheritance errors, quality control for identifying chromosomal interference, and quality control by mutual deduction of the haplotypes of multiple progeny.
  • the removal of noisy genetic data includes: quality control of the whole-genome amplification efficiency of the nucleic acid and all three of the following: quality control for identifying Mendelian inheritance errors, quality control for identifying chromosomal interference, and quality control by mutual deduction of the haplotypes of multiple progeny.
  • the haplotype phasing of the offspring can be performed to determine the paternal and maternal haplotype composition of the offspring.
  • haplotype phasing based on genealogy it is preferable to perform haplotype phasing based on genealogy to obtain the paternal and maternal haplotype composition of the offspring.
  • the haplotype phasing includes:
  • a multi-locus linkage analysis strategy based on Mendelian law of inheritance and linkage disequilibrium theory is used to construct haplotypes of progeny at the chromosome level.
  • an algorithm selected from the group consisting of Lander-Green algorithm, Elston-Stewart algorithm, and Idury-Elston algorithm is used for haplotype phasing.
  • the haplotype phasing method of the present invention further includes: using nucleic acid samples of the paternal grandparents and/or maternal grandparents of the (for example, unborn) offspring to construct the haplotypes of the unborn offspring and their parents.
  • a family-based multi-site linkage analysis method is used for haplotype analysis.
  • the haplotype analysis includes the use of multiple, preferably more than two progeny samples.
  • the paternal grandparents and/or maternal grandparents of the unborn offspring can also be used to construct the haplotypes of the unborn offspring and their parents in the haplotype analysis.
  • the following methods are used for family-based haplotype analysis: Lander-Green algorithm, Elston-Stewart algorithm, or Idury-Elston algorithm.
  • haplotype construction is performed based on pedigree information to obtain the most likely haplotype composition inherited by the offspring from the parents.
  • Construction methods include, but are not limited to, the likelihood method strategy (seeking the haplotype composition with the maximum probability), the genetic rule strategy (the haplotype composition with the minimum recombination number), and the Expectation Maximisation (EM) algorithm.
  • the likelihood method strategy includes, but is not limited to: Lander-Green algorithm and Viterbi dynamic programming algorithm, Elston-Stewart algorithm and Bayesian network algorithm, and the preferred solutions are Lander-Green algorithm and Viterbi dynamic programming algorithm.
  • the genetic rule method includes a zero recombination hypothesis strategy and a minimum recombination hypothesis strategy, and available software carriers include but are not limited to ZAPLO and HAPLORE.
  • haplotype phasing includes: performing the following steps after obtaining genotyping information and removing noisy genetic data from the genotyping information:
  • $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$.
  • $m$ represents the number of sites;
  • $P(V_1)$ is the initial probability of the paternal or maternal ancestor haplotype at the first site;
  • $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from the $(i-1)$-th site to its adjacent $i$-th site, obtained by calculating the recombination rate between the two sites;
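  • The quantities defined above (an initial probability at the first site and a site-to-site transition probability derived from the recombination rate) have the form of a hidden Markov model. As a hedged illustration, the following sketch runs a Viterbi search for a single meiosis with two hidden states, whereas the Lander-Green inheritance vector covers all meioses in the pedigree jointly. The emission model with a fixed genotyping error rate, the uniform initial probability, and all names are assumptions of this example.

```python
import math


def viterbi_inheritance(obs, hap0, hap1, theta, err=0.01):
    """Most likely path of inherited parental haplotype (0 or 1) per site.

    obs[i]   : allele the child received from this parent (None if missing)
    hap0/1   : the parent's two phased haplotypes (indexable by site)
    theta[i] : recombination fraction between site i and site i + 1
    err      : assumed probability that an observed allele is wrong
    """
    def emit(state, i):
        if obs[i] is None:
            return 0.0  # log(1): a missing site carries no information
        allele = (hap0, hap1)[state][i]
        return math.log(1 - err if obs[i] == allele else err)

    score = [math.log(0.5) + emit(s, 0) for s in (0, 1)]
    back = []
    for i in range(1, len(obs)):
        stay = math.log(1 - theta[i - 1])
        switch = math.log(theta[i - 1])
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + (stay if p == s else switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + emit(s, i))
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


theta = [0.001] * 5
noisy = viterbi_inheritance(["A", "A", None, "C", "A", "A"], "AAAAAA", "CCCCCC", theta)
crossover = viterbi_inheritance(["A", "A", "A", "C", "C", "C"], "AAAAAA", "CCCCCC", theta)
```

  • With closely linked sites, the isolated discordant allele in the first call is explained as a genotyping error rather than as two recombinations, while the sustained change in the second call is explained as a single crossover.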
  • the present invention provides a method for reconstructing the genome of the progeny.
  • haplotype phasing is performed before genotype filling to infer the haplotype of the sample.
  • genotype filling is performed for alleles that are missing in the phased haplotypes obtained after phasing.
  • Genotype filling is performed because of missing data in the genome of the offspring object.
  • a missing genotype refers to a site with an unknown genotype, that is, a region in a sample that is not covered by sequencing data or a site with missing sequence information, also known as missing data.
  • missing genotype data can be divided into genetic deletion and detection loss.
  • Genetic deletion refers to a missing genotype caused by variation in an individual's genetic information (for example, a true deletion of a DNA fragment at a locus).
  • Detection loss refers to the loss of sequence information due to limitations of the detection technology, errors, and other factors.
  • Various genotyping techniques will produce detection loss.
  • missing genotypes occur due to the efficiency of probe hybridization and capture.
  • missing genotypes include the above two types, as well as sites with poor genotyping quality that are removed from the offspring genome sequence information during noisy genetic data removal.
  • in some embodiments, the sites with missing data account for at least 1/2 of the whole genome of the offspring (such as embryos), for example 4/5 or more.
  • the progeny may have missing genotypes at up to about 6/7 of loci before genotype filling.
  • the method for reconstructing the genome of a progeny object of the present invention includes: combining family genotype information from related individuals, and performing haplotype phasing and filling of missing genotypes on the genotypes of the progeny after removing the noisy genetic data.
  • the missing genotype can be the genotype of the locus where the offspring is not amplified, the locus marked as missing data by removing the noise genetic data, or both.
  • the present invention provides a method for reconstructing the genome of a progeny object, which is characterized in that it comprises the steps:
  • (a) Provide a data set for the analysis and processing, the data set including: a first data set from a progeny subject, a second data set from the father of the progeny subject, and/or a third data set from the mother of the progeny subject; wherein each data set is the corresponding genotyping information data set obtained by performing genetic testing and typing analysis on the nucleic acid or nucleic acid amplification products of the progeny subject and its parents,
  • the offspring object is preferably an unborn offspring object;
  • step (c) Perform haplotype phasing on the typing data obtained in step (b);
  • step (d) it further includes: adding family and/or population-based typing data for genotype filling, so as to obtain information about the whole genome genotype of the offspring object.
  • the pedigree members are genetically related relatives other than the parents of the offspring object, such as siblings, paternal grandparents, and/or maternal grandparents.
  • the population typing data may be reference haplotype and haplotype frequency information from, for example, HapMap and 1000 Genomes.
  • the noise removal and genotype filling of the present invention can be repeated for multiple different progeny chromosomes.
  • the number of repetitions is determined according to the number of progeny chromosomes to obtain the reconstruction of the whole genome information. For example, for diploid offspring, repeat 23 times (for female individuals) or 24 times (for male individuals).
  • pedigree-based genotype filling includes: based on the parental high-density polymorphic molecular marker information and the paternal and maternal haplotype composition of the offspring constructed by haplotype phasing, using an identity-by-descent (IBD) strategy to fill in the missing genotypes of the offspring.
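  • A minimal sketch of this pedigree-based filling step follows, assuming that phasing has already produced, for each site, the index (0 or 1) of the paternal and of the maternal haplotype that the offspring inherited. The data layout and the function name are hypothetical and chosen only for illustration.

```python
def fill_from_parents(child_gt, pat_origin, mat_origin, father_haps, mother_haps):
    """Fill missing offspring genotypes from the parental haplotypes shared by descent.

    child_gt    : per-site (paternal allele, maternal allele) tuple, or None if missing
    pat_origin  : per-site index of the father's haplotype inherited (None if unknown)
    mat_origin  : per-site index of the mother's haplotype inherited (None if unknown)
    father_haps : the father's two phased haplotypes
    mother_haps : the mother's two phased haplotypes
    """
    filled = []
    for i, gt in enumerate(child_gt):
        if gt is not None:
            filled.append(gt)  # observed genotype is kept as is
            continue
        p, m = pat_origin[i], mat_origin[i]
        if p is None or m is None:
            filled.append(None)  # left for population-based filling
        else:
            filled.append((father_haps[p][i], mother_haps[m][i]))
    return filled


father_haps = ("ACGT", "ATGC")
mother_haps = ("GGAA", "GCAT")
child_gt = [("A", "G"), None, None, ("T", "A")]
filled = fill_from_parents(child_gt,
                           pat_origin=[0, 0, None, 0],
                           mat_origin=[0, 0, 0, 0],
                           father_haps=father_haps,
                           mother_haps=mother_haps)
```

  • The second site is filled from the inherited parental haplotypes, and the third site stays missing because its paternal origin is unknown.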
  • population-based genotype filling includes: analyzing the genome-wide genotype information of the progeny based on the population linkage disequilibrium law and on reference haplotype and haplotype frequency information from, for example, HapMap and 1000 Genomes.
  • the analysis method can be selected from the following groups: Expectation Maximization (EM), Hidden Markov Model (HMM), Markov Chain Monte Carlo (MCMC), Coalescent Theory, or a combination thereof.
  • Genotype filling algorithms based on the law of population linkage disequilibrium include but are not limited to IMPUTE(2), MaCH, Beagle, Minimac.
  • for genotype information that is not successfully filled based on family information, population-based genotype filling is used to complete the offspring genome information.
  • the genotype filling of offspring samples can be performed by comparison with the genotypes of the paternal and maternal samples.
  • the genotypes of other family members can be further added for pedigree analysis and genotype filling of the offspring object, such as siblings of the offspring or embryos from the same father and mother (including embryos produced by IVF and their culture media), and/or the paternal and maternal grandparents of the offspring.
  • the haplotypes of embryonic progeny can be combined to fill in the missing genotypes of the progeny.
  • the inventors found that combining the identification of identity-by-descent (IBD) regions with the genotype information of related individuals (in particular, the genotype information of parental high-density polymorphic loci (SNPs)) to fill in the missing genotypes of the offspring yields allele estimates with higher accuracy, for example, an accuracy of at least 90%, or even 99% or more.
  • step (d), filling in the missing genotypes of the offspring, includes: identifying identity-by-descent (IBD) regions, that is, determining from which parental haplotype the haplotype of the embryo in a given region is derived, and combining this with the genotype information of the parents' high-density polymorphic loci to fill in the genotype information missing in the offspring;
  • for genotype information that has not been successfully filled based on the family information, the population reference haplotype information and the population-level allelic linkage disequilibrium (LD) law are used to fill in the genome-wide genotype information;
  • LD population-level allelic linkage disequilibrium
  • the population reference haplotype information is from HapMap, 1000 Genomes, or HRC (Haplotype Reference Consortium);
  • the genotype filling algorithm based on the population-level allelic linkage disequilibrium (LD) law is, for example, the IMPUTE(2), MaCH, Beagle, or Minimac algorithm;
  • an Expectation Maximization (EM) algorithm, a Hidden Markov Model (HMM), Markov chain Monte Carlo (MCMC), coalescent theory, or a combination thereof is used to implement genotype filling.
  • the reconstruction preferably includes:
  • For population-based genotype filling, a reference population template that is closer to the offspring in genetic background can be selected. For example, when the offspring are Chinese individuals, the 1000 Genomes Phase 3 Chinese population reference haplotype information can be used.
  • Population-based genotype filling software known in the art, including but not limited to the MaCH software package, can be used to reconstruct the whole-genome genotype of the progeny.
  • Whether to perform population-based genotype filling can be determined based on the following factors: (1) the desired accuracy of genotype filling, (2) the desired genome coverage, and (3) whether the desired target region is covered.
  • the accuracy of genome reconstruction of progeny (such as embryos) based on the family reaches more than 90%, preferably more than 95%, more preferably more than 97%, and most preferably more than 99%.
  • the genome coverage of the offspring reaches more than 60%, for example, more than 70%, more than 80%.
  • the accuracy of the progeny (such as embryo) genome reconstruction reaches 90% or more, preferably 95% or more, more preferably 97% or more, and most preferably 99% or more.
  • the genome coverage of the offspring is further improved.
  • genotype filling includes two steps:
  • first, determine which haplotype in the reference haplotype set is most similar to the haplotype of the sample to be filled, based on the genotype information at the sample's non-missing sites; then assign that most similar haplotype to the sample, thereby reconstructing the complete genotype of the sample.
  • Genotype filling based on the genetic characteristics of family samples generally compares the haplotypes of the offspring sample to be filled with those of the father and mother to find the haplotypes they share. The sites on the matching reference template are then copied to the target data set of the offspring, and the genotype of the offspring sample is reconstructed.
  • Population-based genotype filling generally compares the haplotypes of the offspring sample to be filled with the reference population haplotypes to find the haplotypes they share. The sites on the matched reference template are then copied to the target data set of the offspring to reconstruct the genotype of the offspring sample.
  • family-based genotype filling of embryo progeny includes the following steps:
  • A chromosome-level haplotype is constructed based on pedigree information (the parents), Mendelian inheritance rules, and the linkage and crossover theory of multi-locus linkage analysis. The two haplotypes of the embryo's father (or mother) are distinguished, and the paternal and maternal haplotypes of the embryo are constructed at the same time, that is, it is determined which haplotype the embryo has inherited from each parent. When some heterozygous sites in the offspring cannot be haplotyped, genotype information from more relatives, such as the embryo's siblings or the embryo's paternal or maternal grandparents, can be added to increase the number of loci that can be haplotyped.
  • Genotyping information can be obtained from the amplified embryo samples and the gDNA samples of the embryo's parents after genomic sequence information detection.
  • the small amount of embryonic DNA (or other samples containing small amounts of progeny DNA) in the embryo culture medium cannot be fully amplified due to the heterogeneity of the whole genome amplification.
  • Taking the SNP chip as an example, only about 1/5 of the sites on the chip pass the quality control of chip genotyping; after the amplification-efficiency and genotyping-error quality control of the present invention is also applied, even more locus genotype information is missing from the embryo genome. Therefore, on the basis of haplotype construction, the identity by descent (IBD) method is combined with the parental high-density polymorphic locus genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
  • IBD identity By Descent
  • Figure 1 shows an example of genotype filling.
  • In 6.1 and 6.2, after the haplotypes are constructed, the haplotypes that the embryo inherited from the father and the mother are determined to be ..A...C.G...T and ..G...T.C...A, respectively.
  • The completed embryo haplotype information is GAACGA..T and CGTTCA..A, which fills in the genotype information of the 3 loci missing in the embryo, namely G/C, A/T and A/A.
  • the offspring haplotype information can be obtained at the level of the entire chromosome.
  • When the offspring to be tested has siblings in the family, as shown in Figure 1, 6.3, a second stage of genotype filling is performed based on the haplotypes of the siblings. Where the embryo to be tested and its sibling share the same paternal haplotype but have different maternal haplotypes, the other 3 genotypes missing in the embryo can be further filled in, namely A/C, C/T and C/A, and the completed haplotype information is GAAACCGAC.T and CGTCTTCAA.A.
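The family-based filling described above can be illustrated with a minimal Python sketch. It assumes that phasing has already established which parental haplotype the embryo inherited over the region (the IBD assignment), and that haplotypes are written as strings with `.` for a missing site, as in the example above. The function name `fill_from_parent` and the toy haplotypes are hypothetical and are not taken from the patent.

```python
MISSING = "."  # symbol used for a site without a genotype call


def fill_from_parent(embryo_hap: str, parent_hap: str) -> str:
    """Fill missing sites of one embryo haplotype from the parental
    haplotype that the embryo inherited (identical by descent) in this region."""
    if len(embryo_hap) != len(parent_hap):
        raise ValueError("haplotypes must cover the same sites")
    filled = []
    for embryo_allele, parent_allele in zip(embryo_hap, parent_hap):
        both_observed = embryo_allele != MISSING and parent_allele != MISSING
        if both_observed and embryo_allele != parent_allele:
            # An observed conflict means the IBD assignment or the call is wrong.
            raise ValueError("embryo allele conflicts with the IBD parental haplotype")
        # Copy the parental allele only where the embryo has no call.
        filled.append(parent_allele if embryo_allele == MISSING else embryo_allele)
    return "".join(filled)


# Toy data (hypothetical): three sites are missing in the embryo's paternal haplotype.
paternal_hap = "GAACGATT"
embryo_paternal_hap = "G.AC..TT"
print(fill_from_parent(embryo_paternal_hap, paternal_hap))
```

Sites that are missing in both the embryo and the parent stay missing in this sketch, which is the case the population-based filling step is meant to handle.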
  • After the family-based genotype filling, the method further includes population-based genotype filling of the embryo, so as to fill in the genotype information missing from both the parents and the embryo.
  • the filling includes:
  • Using the haplotypes of the progeny constructed according to the above method of the present invention, and based on the population linkage disequilibrium law and reference haplotypes and haplotype frequency information such as HapMap and 1000 Genomes, the genotype information of loci on a given chromosome of the embryo that are also missing in the parental genome information is filled in. Specifically, the population reference haplotype information is used to find the population haplotype segment most similar to the embryo haplotype, and the other genotype information of that segment in the population is then used to complete the embryo's missing information.
  • For example, if the two haplotypes in the embryo are GAAACCGAC.T and CGTCTTCAA.A, and the most similar and most frequent haplotypes in the population are GTACAACCGACGT and CGGATCTTCAACA, the three missing genotypes of the embryo are thereby filled in.
  • The estimation methods that can be used for this filling include, but are not limited to, Expectation Maximization (EM), Hidden Markov Model (HMM), Markov chain Monte Carlo (MCMC), and coalescent theory.
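The idea of picking the most similar and most frequent reference haplotype can be sketched as a simple nearest-neighbour lookup. This is a deliberately simplified stand-in and is not the HMM-based approach used by tools such as MaCH or Beagle: it assumes a small panel of equal-length reference haplotypes with known population frequencies, counts mismatches only at the observed sites, and breaks ties by frequency. All names and toy data are hypothetical.

```python
from typing import Sequence, Tuple

MISSING = "."


def mismatches(target: str, reference: str) -> int:
    """Count disagreements at the sites that are observed in the target."""
    return sum(1 for t, r in zip(target, reference) if t != MISSING and t != r)


def impute_from_panel(target: str, panel: Sequence[Tuple[str, float]]) -> str:
    """Fill missing sites of `target` from the closest reference haplotype.

    `panel` holds (haplotype, population frequency) pairs. The best match has
    the fewest mismatches; ties go to the more frequent haplotype.
    """
    candidates = [(hap, freq) for hap, freq in panel if len(hap) == len(target)]
    if not candidates:
        raise ValueError("no reference haplotype covers the same sites")
    best_hap, _ = min(candidates, key=lambda hf: (mismatches(target, hf[0]), -hf[1]))
    # Keep every observed allele and copy the reference only at missing sites.
    return "".join(b if t == MISSING else t for t, b in zip(target, best_hap))


# Toy reference panel (hypothetical haplotypes and frequencies).
panel = [("GAACT", 0.2), ("GATCT", 0.5), ("CTTGA", 0.3)]
print(impute_from_panel("GA.CT", panel))
```

A production imputation method models the target as a mosaic of reference haplotypes with recombination between them, so the single-best-match rule here should be read only as an illustration of the copying step.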
  • the method of the present invention may include:
  • the quality control method includes the noise genetic data removal method of the present invention as described above.
  • the method of the present invention includes the following steps:
  • Embryonic nucleic acid samples can be taken from free DNA in embryo culture medium.
  • The amplification method adopts a single-cell amplification strategy, and the specific method is not limited, including but not limited to primer extension preamplification PCR (PEP-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), Multiple Displacement Amplification (MDA), Multiple Annealing and Looping Based Amplification Cycles (MALBAC), blunt-end or sticky-end ligation library construction, and other methods.
  • PEP-PCR primer extension PCR
  • DOP-PCR degenerate oligonucleotide primer PCR
  • MDA Multiple Displacement Amplification
  • MALBAC Multiple Annealing and Looping Based Amplification Cycles
  • Detection methods can use nucleic acid chips, second-generation sequencing and other platforms, and genetic analysis uses genotyping methods to obtain genotype information of parents and embryos.
  • Quality control of the whole-genome amplification efficiency of trace nucleic acid: uneven whole-genome amplification efficiency is characteristic of amplification technologies for trace nucleic acid (for example, trace nucleic acid from a single cell), and regions with low amplification efficiency affect the genotyping quality of the base sites in those regions.
  • the inventors used multi-sample whole-genome amplification products to construct a reference sequencing data set to determine the distribution pattern of the whole-genome amplification efficiency of trace nucleic acids.
  • The specific method is to perform low-depth sequencing (for example, about 0.06X) of the amplification products of multiple reference samples, align the sequencing data to the reference genome to obtain BAM files, and merge the BAM files of the multiple reference samples into one large BAM library.
  • N represents the number of sequencing reads
  • L represents the read length
  • DP_i represents the absolute depth (total depth) of the i-th position, that is, the number of reads at that position.
  • The average genome sequencing depth of the reference sample BAM library is 27.
  • The inventors found that, compared with sites whose trace nucleic acid genome amplification efficiency is ≥1X (DP_i ≥ 27, absolute locus depth ≥ genome average depth), sites with an amplification efficiency <1X (DP_i < 27) have a Mendelian inheritance error rate that is nearly 6 times higher (Figure 2).
  • The present inventors used second-generation sequencing data from multiple trace nucleic acid amplification products (different amplification methods require corresponding amplification reference samples) to draw a distribution map of amplification efficiency across the whole genome and, on this basis, identified the genotype information of loci with low amplification efficiency (<1X) and marked it as missing data.
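The amplification-efficiency filter described above can be sketched in a few lines of Python. The sketch assumes that amplification efficiency at a site is the site depth in the reference BAM library divided by the genome-wide mean depth (27 in the text), and that genotypes failing the 1X threshold are set to missing. The function names, the dictionary layout and the toy site identifiers are hypothetical.

```python
from typing import Dict, Optional


def amplification_efficiency(site_depth: int, mean_depth: float) -> float:
    """Relative amplification efficiency: site depth over genome mean depth."""
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    return site_depth / mean_depth


def mask_low_efficiency(
    genotypes: Dict[str, str],
    reference_depths: Dict[str, int],
    mean_depth: float,
    min_efficiency: float = 1.0,
) -> Dict[str, Optional[str]]:
    """Return genotypes with calls at poorly amplified sites replaced by None."""
    masked: Dict[str, Optional[str]] = {}
    for site, genotype in genotypes.items():
        # A site absent from the reference depth table is treated as depth 0.
        depth = reference_depths.get(site, 0)
        efficiency = amplification_efficiency(depth, mean_depth)
        masked[site] = genotype if efficiency >= min_efficiency else None
    return masked


# Toy data (hypothetical site names and depths); the mean depth of 27 follows the text.
calls = {"site1": "A/G", "site2": "C/C", "site3": "T/T"}
depths = {"site1": 30, "site2": 27, "site3": 12}
print(mask_low_efficiency(calls, depths, mean_depth=27))
```

Keeping the threshold as a parameter makes it easy to rebuild the efficiency map for a different amplification method, which the text notes requires its own reference samples.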
  • 3) Identify the wrong genotype sites of embryos and mark them as missing data: after whole-genome amplification of the trace embryonic DNA, in addition to sites that are not amplified or are genotyped poorly because of low amplification efficiency, at some sites amplification bias causes one of the two alleles to be preferentially amplified, or the other allele to fail to amplify entirely, causing allele dropout (ADO) and thereby affecting the genotyping of the site.
  • ADO allele dropout
  • First, use step 6.1 to construct the paternal and maternal haplotypes of the offspring; then use Mendelian inheritance and chromosomal interference to identify the sites where ADO and other genotype errors occur in the embryo, and mark them as missing data.
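As a minimal illustration of the Mendelian-inheritance part of this check, the sketch below tests whether an embryo genotype at one site can be formed from one paternal and one maternal allele. It is an assumption-laden simplification: it looks at a single autosomal site in a trio, ignores phase, and does not model chromosomal interference. The function name and genotype encoding are hypothetical.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[str, str]  # unordered pair of alleles, e.g. ("A", "G")


def mendel_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child genotype can be built from one allele of each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# A child called A/A with an A/G father and a G/G mother cannot be explained by
# Mendelian inheritance; dropout of the maternal G allele is one possible cause.
print(mendel_consistent(("A", "A"), ("A", "G"), ("G", "G")))
print(mendel_consistent(("G", "A"), ("A", "G"), ("G", "G")))
```

Not every dropout event produces a Mendelian inconsistency, which is presumably why the text combines this rule with haplotype-based checks.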
  • Specific methods or frameworks include, but are not limited to, the likelihood method strategy (seeking the haplotype composition with the maximum probability), the genetic rule strategy (the haplotype composition with the minimum recombination number), and the Expectation Maximisation (EM) algorithm.
  • The likelihood method includes but is not limited to the Lander-Green algorithm and Viterbi dynamic programming algorithm, the Elston-Stewart algorithm, and Bayesian network algorithms.
  • The preferred schemes are the Lander-Green algorithm and the Viterbi dynamic programming algorithm. The genetic rule method includes the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy, and software implementations include but are not limited to ZAPLO and HAPLORE. If information is available for only one offspring, such as only one embryo, some heterozygous loci may not be haplotyped.
  • step 6.1 Completion of embryo genome information.
  • On the basis of haplotype construction, the identity by descent (IBD) method is combined with the parental high-density polymorphic locus genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
  • IBD identity By Descent
  • When haplotypes are reconstructed based on the information of other family members (for example, the siblings of the offspring to be tested), the haplotype information of those siblings is also used to further fill in the missing embryo genotype information according to Mendelian inheritance rules. Generally, the more other family members are available (for example, siblings of the offspring to be tested), the more genotypes can be filled and the higher the accuracy.
  • Filling in genotype information missing from both the parents and the embryo: using the haplotype constructed in step 6.1, and based on the population linkage disequilibrium law and reference haplotype and haplotype frequency information such as HapMap and 1000 Genomes, fill in the genotype information of loci on a given chromosome of the embryo that are also missing in the parental genome information.
  • The parents of the embryo are sequenced by higher-depth whole-genome sequencing to obtain accurate whole-genome genotype information, for example at a sequencing depth of at least 20x.
  • The embryos can also be sequenced by low-depth whole-genome sequencing to obtain relatively low-density genotype information; for example, the sequencing depth can be as low as 2x, or even lower.
  • the present invention also provides computer products, systems and equipment for the implementation of removing noise genetic data, phasing haplotypes and/or reconstructing progeny genomes
  • the present invention provides a device for reconstructing progeny genomes (especially haplotypes), the device comprising:
  • a non-transitory computer-readable storage medium carrying instructions for executing the method for reconstructing offspring genome information of the present invention including:
  • the device includes the following modules:
  • (1) Sequence information data acquisition module: used to acquire the original sequence information data of the offspring and/or related individuals;
  • (2) Genotype analysis module: used to analyze the genotype of the original sequence information data of module (1);
  • (3) Quality control filtering module: used to perform quality control and filtering on the genotype information obtained by module (2);
  • (4) Haplotype phasing module: used to perform haplotype phasing on the genotype after quality control and filtering of module (3);
  • (5) Genotype filling module: used to further reconstruct the genotype of the progeny from the phased haplotype of module (4);
  • (6) Report output module: used to process and integrate the data obtained in steps (1)-(5) to generate a report.
  • the present invention provides a device comprising:
  • At least one processor and at least one memory, the at least one memory having code stored thereon which, when executed by the at least one processor, causes the apparatus to execute the method of the present invention.
  • the code when executed by the at least one processor, the code causes the apparatus to execute at least:
  • sequence information data for example, original sequence information data of offspring and/or related individuals
  • the present invention also provides a computer-readable storage medium having code stored thereon for use by a device, and when executed by a processor, the code causes the device to execute the progeny genome information reconstruction method of the present invention.
  • when executed by the at least one processor, the code causes the apparatus to execute at least:
  • sequence information data for example, original sequence information data of offspring and/or related individuals
  • the present invention provides a system for reconstructing progeny genomes (especially haplotypes).
  • the system includes a device (device or module) configured to implement the method of the present invention, for example, configured as follows:
  • the input includes genomic sequence information of offspring and related individuals
  • the haplotype of the progeny is determined to reconstruct the genomic information of the progeny.
  • the device is the aforementioned device of the present invention for reconstructing progeny genomes (especially haplotypes).
  • it may further include:
  • Amplification device: used to amplify nucleic acid samples of progeny and/or related individuals, preferably by whole genome amplification;
  • sequence information detection device used for sequence information detection of amplified products, including but not limited to polymorphic loci (such as SNP) detection and sequencing detection.
  • the present invention provides a device for analyzing and processing progeny genome reconstruction, including:
  • Amplification unit used for whole genome amplification of the DNA sample of the test sample and the parent family of the offspring
  • the detection and analysis unit is used for genetic detection and analysis of the amplified products obtained by the amplification unit;
  • Data processing unit: used for quality control and filtering of the detection and analysis data of the amplified products obtained by the amplification unit, removing the genotypes of sites that were not amplified or that have genotyping errors;
  • Whole-genome genotype information reconstruction unit for the offspring: used to perform haplotype phasing and genotype filling, and to output the reconstructed whole-genome genotype information of the offspring.
  • The system of the present invention will include tools for querying genomic sequence information, and programmed storage or a medium that allows the computer to analyze the obtained data.
  • Sequence information query data can be a stored data set or can be obtained "on the fly".
  • The term "data set" covers both types of data source.
  • the tools used for querying genome sequence information are not particularly limited.
  • a high-density SNP chip is used.
  • a high-throughput sequencing device is used to obtain high-depth sequencing data of related individuals of the offspring.
  • the present invention can be executed by a computer. Therefore, the present invention also provides a computer programmed to perform the above method.
  • a computer typically includes: a CPU connected to a computer communication interface, a system memory (RAM), a non-transitory memory (ROM), and one or more other storage devices such as a hard disk, a floppy disk, and a CD-ROM drive.
  • the computer may also include a display device, such as a printer, a CRT monitor, or an LCD display, and an input device, such as a keyboard, mouse, pen, touch screen, or voice activation system.
  • the input device can receive data, for example, directly from a sequence information query tool through an interface.
  • the method, computer product (especially the above-mentioned device of the present invention), system and equipment according to the present invention can be used for disease detection or disease susceptibility detection of pre-implantation embryos and/or fetuses in early pregnancy, including but not limited to: Aneuploidy detection, single gene genetic disease detection, chromosome structure rearrangement detection, polygenic disease genetic risk assessment.
  • the use includes: diagnosing common diseases or cancer susceptibility, including: for example, comparing the progeny haplotype reconstructed according to the method of the present invention with known disease-related haplotypes.
  • the relationship between such haplotypes and diseases is being established in the art.
  • The International HapMap Consortium has mapped the genome-wide variation of SNP haplotypes in the human population, which benefits disease-related research (International HapMap Consortium, 2005). Therefore, combining the genome reconstruction method of the present invention with these haplotype analyses also forms an aspect of the present invention.
  • The present invention finds for the first time that, on the basis of whole genome amplification technology combined with chip or second- and third-generation sequencing technologies, using the high-density gene polymorphism information of the embryo's parents and other family members together with statistical genetics and computational biology algorithms, the genotypes of unamplified sites in the embryo's genome and of sites with ADO and other genotyping errors can be filled in, so as to obtain the embryo's whole genome information.
  • The present invention finds for the first time that quality control and filtering of embryo DNA analysis data can filter out information on sites with poor genotyping quality, especially sites with poor single-cell whole-genome amplification efficiency, thereby improving the accuracy of genome reconstruction.
  • The present invention finds for the first time that the integrated application of family-based and population-based genotype filling methods can obtain the locus information of the entire genome of the embryo to the greatest extent.
  • The samples collected were 1 ml of the father's peripheral blood and 1 ml of the mother's peripheral blood from one family, collected in EDTA anticoagulant tubes. The father's sperm was also collected, and the mother's eggs were fertilized in vitro by intracytoplasmic sperm injection (ICSI) using GM medium (Quinn's Advantage Protein Plus Cleavage Medium) (manufacturer: SAGE, product number: ART-1526) at 37°C.
  • GM medium Quinn's Advantage Protein Plus Cleavage Medium
  • SAGE product number: ART-1526
  • The fertilized egg was cultured in a 5% CO2 and 5% O2 incubator and grew into a blastocyst on the fifth day, and about 20 µl of the blastocyst culture fluid was aspirated.
  • the culture fluid of 4 blastocysts from the same parents was prepared.
  • the father’s peripheral blood samples and the mother’s peripheral blood samples were taken to extract the whole genome DNA by conventional whole blood genome extraction steps.
  • the kit used in this step is the commercially available DNeasy Blood&Tissue Kit (50) (manufacturer QIAGEN, article number 69504), and the extraction of whole genome DNA is carried out according to the manufacturer's instructions.
  • Whole-genome amplification of the blastocyst culture medium was performed according to the instructions of the MALBAC single-cell whole-genome amplification kit (Cat. No. KT110700150) of Xukang Medical Technology (Suzhou) Co., Ltd.
  • Qubit dsDNA HS Assay Kit (Invitrogen, Q32584) was used to quantify the whole genome amplification products of each blastocyst culture medium. The quantitative results show that the total amount of nucleic acid in each sample is about 500-1000ng.
  • CBC-PMRA Capital Biotechnology-Precision Medicine Research Array
  • The product of step 2) above was run on the 900K SNP chip according to the manufacturer's instructions to obtain the original data for genotyping.
  • The Axiom™ Analysis Suite software of Thermo Fisher Scientific was selected as the platform for analyzing the original data obtained from the 900K SNP chip in the above step 3).
  • the embryonic SNP genotype information of each blastocyst culture medium is basically 1/4 of the parental SNP genotype information.
  • The library is constructed with a commercially available kit according to the manufacturer's instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd., trade name: NGS Fast DNA Library Prep Set for Illumina, catalog number: CW2585M), and whole-genome sequencing is then performed on Illumina's NextSeq550 sequencing platform, with an average sequencing depth of 0.06X per sample;
  • N is the number of sequencing reads
  • L is the read length
  • 3×10^9 is the size of the human genome.
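The equation that these symbols belong to did not survive in the text. Assuming it is the usual definition of mean sequencing depth (total sequenced bases divided by genome size), it would read as follows, where the symbol $\bar{D}$ for the average genome sequencing depth is introduced here only for illustration:

```latex
\bar{D} = \frac{N \times L}{3 \times 10^{9}}
```

Under this reading, a locus with absolute depth $DP_i$ has a relative amplification efficiency of $DP_i / \bar{D}$, so the 1X threshold corresponds to $DP_i \geq \bar{D}$.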
  • When the absolute depth of an SNP locus, that is, the number of reads covering the locus, is greater than or equal to the average sequencing depth of the genome, the locus passes the amplification-efficiency quality control; the genotype of a locus that does not meet this quality control is marked as missing data.
  • The mutual derivation of embryo haplotypes refers to using the genotype phasing method of step 8) together with the genotype data acquired from multiple embryos to deduce the most likely haplotype composition of each embryo. Then, based on the chromosomal interference suppression strategy, a site at which two crossover recombinations occur within a genetic distance of 1 cM is identified as a wrong genotype.
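The double-crossover rule can be sketched as follows. The sketch assumes that each informative site already carries a genetic-map position in cM and an inferred parental haplotype of origin (0 or 1) for the embryo, so a crossover shows up as a change of origin between neighbouring sites. When two consecutive crossovers are flanked by sites no more than 1 cM apart, the sites between them are flagged. The function name, the input layout and the use of the flanking-site span as the distance between crossovers are illustrative assumptions, not the patent's stated procedure.

```python
from typing import List, Tuple


def flag_double_crossovers(
    sites: List[Tuple[float, int]], max_cm: float = 1.0
) -> List[int]:
    """Return indices of sites enclosed by two crossovers within `max_cm`.

    `sites` is sorted by genetic position and holds (position_cM, origin),
    where origin is 0 or 1 for the parental haplotype inferred at that site.
    """
    # A switch at index i means the origin changes between site i-1 and site i.
    switches = [i for i in range(1, len(sites)) if sites[i][1] != sites[i - 1][1]]
    flagged: List[int] = []
    for first, second in zip(switches, switches[1:]):
        # Upper bound on the distance between the two crossovers: the span
        # from the last site before the first switch to the first site after
        # the second switch.
        span = sites[second][0] - sites[first - 1][0]
        if span <= max_cm:
            flagged.extend(range(first, second))
    return flagged


# Toy data (hypothetical): one site at 0.8 cM switches origin and switches back.
toy_sites = [(0.0, 0), (0.5, 0), (0.8, 1), (1.2, 0), (5.0, 0), (9.0, 1), (12.0, 1)]
print(flag_double_crossovers(toy_sites))
```

The flagged genotypes would then be set to missing and left to the filling steps, in line with how other suspected errors are handled in the text.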
  • $V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, \ldots, P_{n,i}, M_{n,i})$.
  • $m$ represents the number of sites;
  • $P(V_1)$ is the initial probability of the paternal or maternal ancestral haplotype at the first site;
  • $P(V_i \mid V_{i-1})$ is the transition probability of the haplotype state from site $i-1$ to its neighboring site $i$, obtained by calculating the recombination rate between the two sites.
  • The present invention estimates the recombination rate between sites from the genetic map of the 1000 Genomes Project Phase 3; P(G_i |
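The quantities defined above are the ingredients of a hidden Markov model over inheritance states, which the Viterbi algorithm can decode. The sketch below is a reduced illustration rather than the full Lander-Green model: it follows a single meiosis, so the hidden state at each site is simply which of one parent's two haplotypes the embryo inherited. It assumes a uniform initial probability of 0.5, a transition probability equal to the recombination fraction between neighbouring sites, and an emission term based on a fixed genotyping error rate, with missing observations contributing no information. The function name, the error rate and the toy data are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def viterbi_origin(
    observed: Sequence[Optional[str]],
    hap0: str,
    hap1: str,
    recomb: Sequence[float],
    error: float = 0.01,
) -> List[int]:
    """Most likely parental haplotype of origin (0 or 1) at every site.

    observed[i] is the allele the embryo inherited from this parent, or None
    if missing; recomb[i] is the recombination fraction between site i and i+1.
    """
    m = len(observed)
    if m == 0:
        return []
    if not (len(hap0) == len(hap1) == m and len(recomb) == m - 1):
        raise ValueError("inputs must describe the same sites")
    haps = (hap0, hap1)

    def emit(state: int, i: int) -> float:
        # Log emission probability; a missing observation is uninformative.
        if observed[i] is None:
            return 0.0
        return math.log(1 - error) if observed[i] == haps[state][i] else math.log(error)

    score = [math.log(0.5) + emit(s, 0) for s in (0, 1)]
    back: List[List[int]] = []
    for i in range(1, m):
        theta = min(max(recomb[i - 1], 1e-9), 0.5)  # keep the logs finite
        stay, switch = math.log(1 - theta), math.log(theta)
        new_score, pointers = [], []
        for s in (0, 1):
            from_same, from_other = score[s] + stay, score[1 - s] + switch
            prev = s if from_same >= from_other else 1 - s
            new_score.append(max(from_same, from_other) + emit(s, i))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy data (hypothetical): the third site is missing and a crossover is most
# likely in the interval with the highest recombination fraction.
print(viterbi_origin(["A", "A", None, "G", "G", "G"], "AAAAAA", "GGGGGG",
                     [0.01, 0.01, 0.05, 0.01, 0.01]))
```

Once the path is known, a missing site can be filled with the allele of the parental haplotype assigned to it, which connects this decoding step to the family-based filling described earlier.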
  • IBD identity by descent
  • Table 3 shows the polymorphic site information obtained after embryo genome reconstruction.
  • the results of this experiment were compared with paired biopsy samples.
  • the biopsy sample is the blastocyst trophoblast cells corresponding to the blastocyst culture medium.
  • The specific experimental steps are as follows: (1) transfer the blastocyst to be biopsied to an in vitro manipulation medium without calcium and magnesium ions (such as G-PGD containing 5% HSA); (2) under an inverted microscope (200X), fix the embryo with a holding pipette; (3) cut or perforate the zona pellucida to make an opening 35-40 µm in diameter; (4) use a needle with an inner diameter of 35-40 µm to aspirate a nucleated cell; (5) remove the embryo from the manipulation medium, wash it, and culture it in blastocyst culture medium, labelling it with the patient's name and embryo number; (6) the DNA extraction, amplification, quantification and genotype analysis procedures are the same as for the blastocyst culture medium.
  • The allele reconstruction accuracy of this example can reach about 99.2% (Table 4).
  • The Chinese population reference haplotype information in 1000 Genomes Phase 3 is further used: based on the haplotype information already phased using pedigree information, a Hidden Markov Model (HMM) is applied with the MACH software package to predict the genome-wide genotype of embryo 2.
  • HMM Hidden Markov Model
  • The DNA samples from the above step 2) were fragmented by sonication, with the fragments distributed between 200 and 800 bp, and the Illumina ligation-based library method was then used to construct the second-generation sequencing library.
  • The library was constructed with a commercially available kit according to the manufacturer's instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd., trade name: NGS Fast DNA Library Prep Set for Illumina, Item No.: CW2585M).
  • For the next-generation sequencing library of step 3), the Illumina NovaSeq 6000 sequencing platform was used to perform whole-genome sequencing.
  • the average sequencing depth of the whole genome is 20X
  • the average sequencing depth of the whole genome is 2X.
  • GATK Genome Analysis Toolkit
  • VQSR Variant Quality Score Recalibration
  • The quality control filtering criteria for a site are: (1) the VQSR status of the site is "PASS"; (2) the sequencing depth of the parents at the site is DP>20 and their genotype information is not "./."; and (3) the sequencing depth of the embryo at the site is DP>5 and its genotype information is not "./.".
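The three criteria can be expressed as a small predicate. This is a minimal sketch that assumes the per-site values (filter status, depths and genotype strings) have already been extracted from the variant calls into a simple record, and that both parents must meet the parental depth and genotype criteria. The `SiteCall` record and its field names are hypothetical and do not correspond to any particular VCF library.

```python
from dataclasses import dataclass

NO_CALL = "./."  # genotype string for a site that could not be genotyped


@dataclass
class SiteCall:
    """Per-site values for one family trio (hypothetical layout)."""
    filter_status: str
    father_dp: int
    father_gt: str
    mother_dp: int
    mother_gt: str
    embryo_dp: int
    embryo_gt: str


def passes_site_qc(site: SiteCall, parent_min_dp: int = 20, embryo_min_dp: int = 5) -> bool:
    """Apply the three criteria: VQSR PASS, parental DP > 20, embryo DP > 5."""
    parents_ok = all(
        depth > parent_min_dp and genotype != NO_CALL
        for depth, genotype in (
            (site.father_dp, site.father_gt),
            (site.mother_dp, site.mother_gt),
        )
    )
    embryo_ok = site.embryo_dp > embryo_min_dp and site.embryo_gt != NO_CALL
    return site.filter_status == "PASS" and parents_ok and embryo_ok


# Toy records (hypothetical values).
good = SiteCall("PASS", 25, "0/1", 31, "0/0", 6, "0/1")
shallow_embryo = SiteCall("PASS", 25, "0/1", 31, "0/0", 5, "0/1")
print(passes_site_qc(good), passes_site_qc(shallow_embryo))
```

The depth comparisons are strict, matching the "DP>20" and "DP>5" wording, so a site at exactly the threshold depth is filtered out.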
  • the "./.” is a site where genotyping cannot be performed.
  • A total of 1608593 loci were obtained for which the parents have genotype information and which can be used for linkage analysis.

Abstract

The present invention relates to a method for obtaining genetic data from a nucleic acid sample of an offspring, and clearing the noisy genetic data therein by quality control and filtering, a method for phasing a haplotype of the offspring for the offspring genetic data subjected to quality control and filtering, a method for reconstructing an offspring genome, and a device or system for implementing the method. The present invention further relates to the use of the method and the device or system for implementing the method, which is used for polygenic disease genetic risk rating, aneuploidy detection, single-gene genetic disorder detection, and structural chromosomal rearrangement detection of a preimplantation embryo and/or fetus in the first trimester.

Description

Method, system and use for removing noisy genetic data, phasing haplotypes, and reconstructing progeny genomes
Technical field
The present invention generally relates to the field of biomedical diagnosis and detection. More specifically, the present invention relates to methods of obtaining, manipulating and using genetic data to phase the haplotypes of progeny and to reconstruct progeny genomes, and to systems and devices for implementing the methods. In particular, it relates to methods, systems and computer devices for haplotype phasing and genome reconstruction of progeny using trace DNA samples from the progeny, and to the use of the phased progeny haplotypes and reconstructed progeny genomes in identifying genetic variations that may lead to various phenotypic outcomes, especially aneuploidy and disease-related genes.
Background art
Assisted reproductive technology (ART) has made great progress in overcoming human infertility. At present, about 3-4% of all births worldwide result from assisted reproduction procedures. Although ART has achieved some remarkable theoretical and technological advances, the practical implementation of the "healthy baby" concept still faces unique challenges.
Preimplantation genetic testing (PGT) is a test in which the embryos of patients at high genetic risk undergo genetic analysis before implantation, during the process from in vitro fertilization to embryo transfer, with the aim of selecting embryos with normal genetic material for implantation into the mother's uterine cavity so as to obtain healthy offspring. According to what is tested, PGT can be divided into PGT for aneuploidies (PGT-A), PGT for monogenic gene defects (PGT-M) and PGT for chromosomal structural rearrangements (PGT-SR). At present, the clinical application of PGT still mainly relies on cells obtained by embryo biopsy for these genetic tests. However, a growing number of studies show that this invasive cell biopsy adversely affects the developmental potential of the embryo and subsequent individual development. In recent years, several studies have found that embryo culture medium contains embryo-derived cell-free DNA (cfDNA) fragments, making noninvasive preimplantation genetic testing possible. The successful application of cfDNA from embryo culture medium to PGT-A, PGT-M and PGT-SR further indicates that this approach is noninvasive, accurate and effective for preimplantation genetic testing.
Polygenic or chronic diseases, such as cardiovascular disease, diabetes, obesity and tumors, have become the leading threat to human health; such diseases result from multiple genes participating in the disease process. Studies show that many chronic diseases have high heritability, so the genetic basis plays an important role in determining an individual's disease risk. However, polygenic disease risk testing has so far not been achievable in theory or in practice. The technical difficulty is that constructing a polygenic disease risk score requires genome-wide genotype information for the embryo and/or fetus, whereas embryonic and/or fetal DNA obtained non-invasively or minimally invasively is present only in small amounts. In particular, embryonic cell-free DNA in embryo culture medium has short fragments, poor uniformity of whole-genome amplification, and a high genotype error rate, so a high-quality, highly contiguous whole-genome sequence of the embryo and/or fetus cannot be produced.
Therefore, there is an urgent need in the art for a method and system that can provide high-quality, highly contiguous genome-wide genotype information for an embryo and/or fetus from trace genetic material obtained non-invasively or minimally invasively, thereby making it possible to grade the genetic risk of polygenic disease in preimplantation embryos and/or first-trimester fetuses.
Summary of the invention
After long-term, extensive and in-depth research and a large amount of screening and testing, the present inventors unexpectedly found for the first time that very high accuracy of haplotype phasing and genome reconstruction can be obtained for a trace nucleic acid sample from an offspring (for example, fetal cell-free DNA (cfDNA) in embryo culture medium, blastocyst culture medium, blastocoel fluid, maternal plasma or other maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryo cells, or fetal cells in maternal blood or other maternal body fluids). This is achieved by starting from whole-genome amplification, combining the genomic DNA (gDNA) sequence information of the offspring's family members (for example, the parents), and applying sequence acquisition technologies such as nucleic acid chips or next-generation sequencing together with statistical genetics and computational biology algorithms to phase the offspring's haplotypes and reconstruct the offspring's genome. In some embodiments, the accuracy of haplotype phasing and reconstruction is ≥90%, and can be as high as 97%.
Specifically, the present invention amplifies the trace offspring nucleic acid obtained, analyzes the data, and applies quality control and filtering, thereby removing noisy genetic data (for example, genotype information at sites with poor genotyping quality, such as sites affected by allele dropout (ADO)). The offspring's haplotypes are then phased on the basis of the pedigree. Finally, identity by descent (IBD) within the pedigree and a population-level linkage disequilibrium strategy are used to impute genotypes that are missing in the offspring (for example, sites that were not amplified or that carry genotype errors such as ADO), so that the offspring's genome is reconstructed with high fidelity and very high accuracy. In addition, genetic data obtained from other related individuals, such as other embryos, siblings, grandparents or other relatives of the offspring, can be used to further increase the accuracy of the reconstructed offspring genome. The present inventors completed the present invention on this basis.
Therefore, in a first aspect, the present invention relates to a method for removing noisy genetic data from an offspring, the method comprising the steps of:
(a) providing genomic sequence information from the offspring, wherein the genomic sequence information is obtained from a trace nucleic acid sample of the offspring containing about 0.1 pg to 40 ng of DNA, for example 1-40 ng, 20-40 ng, 0.1-40 pg, 1-40 pg or 10-40 pg of DNA; for example, the trace nucleic acid sample is fetal cell-free DNA in embryo culture medium, blastocyst culture medium, blastocoel fluid, maternal plasma or other maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryo cells, or fetal cells in maternal blood or other maternal body fluids;
(b) performing quality control and filtering on the genomic sequence information of step (a), wherein the quality control is selected from quality control of the efficiency of whole-genome amplification of trace nucleic acid, quality control that identifies Mendelian inheritance errors, quality control that identifies violations of the chromosomal interference theory, quality control by mutual inference among the haplotypes of multiple offspring, and combinations thereof.
In some embodiments, the genomic sequence information from the offspring provided in step (a) of the method for removing noisy genetic data covers no more than about 30% of the offspring's genome, for example about 30%, 25%, 20% or 15%. For example, the genomic sequence information of step (a) is obtained by subjecting the offspring's trace nucleic acid sample to whole-genome amplification selected from primer extension preamplification PCR, degenerate oligonucleotide-primed PCR, multiple displacement amplification, multiple annealing and looping-based amplification cycles (MALBAC), library construction by blunt-end or sticky-end ligation, or combinations thereof, preferably MALBAC, and then detecting the offspring's genomic sequence with a technique selected from nucleic acid chips, amplification and/or sequencing. The nucleic acid chip, amplification and/or sequencing technique is a single nucleotide polymorphism microarray chip, a MassARRAY time-of-flight mass spectrometry chip, multiplex ligation-dependent probe amplification (MLPA), next-generation sequencing, third-generation sequencing, or a combination thereof. For example, the single nucleotide polymorphism microarray chip is a SNP genotyping chip. For example, the next-generation sequencing includes whole-genome sequencing, whole-exome sequencing and sequencing of targeted genomic regions, preferably whole-genome sequencing, for example low-depth whole-genome sequencing at a depth as low as 2X or even below 1X.
Further, the quality control of whole-genome amplification efficiency of trace nucleic acid described in step (b) of the method for removing noisy genetic data is carried out as follows: reference sequencing data from the whole-genome amplification products of multiple trace nucleic acid samples are used to identify site genotypes with low amplification efficiency, and those site genotypes are marked as missing data. For example, the whole-genome amplification products of multiple trace nucleic acid samples are used as reference samples and sequenced at low depth, for example at a depth of no more than about 0.5X, 0.4X, 0.3X, 0.2X or 0.1X, for example about 0.06X; the sequencing data obtained from the reference samples are aligned to a human reference genome (for example, hg19 or hg38), and the site amplification efficiency is calculated using the following formula:
[Formula image: Figure PCTCN2020121432-appb-000001]
where
[Formula image: Figure PCTCN2020121432-appb-000002]
where DP_i denotes the absolute depth of the i-th site, N denotes the number of sequencing reads, and L denotes the read length.
When DP_i ≥ the average genome depth, the site amplification efficiency is ≥1, which means the site passes the quality control of whole-genome amplification efficiency of trace nucleic acid; the genotypes of sites that fail this quality control are marked as missing data. In addition, the chromosomal interference theory referred to in step (b) means that when two crossovers or recombinations occur between two molecular marker sites within a given genetic distance, the molecular markers within the recombined segment are judged to have genotyping errors and are marked as missing data; for example, the genetic distance is any distance of 1 centimorgan (cM) or less.
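The amplification-efficiency quality control described above can be sketched as follows. The formula itself is only available as an image, so this sketch assumes that the site amplification efficiency is DP_i divided by the average genome depth, and that the average genome depth is N × L divided by the genome size; both forms, the function names and the site identifiers are illustrative assumptions rather than the formula of the application.

```python
from typing import Dict, Optional


def mean_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average genome depth, assumed here to be N * L / genome size."""
    return num_reads * read_length / genome_size


def amplification_efficiency(site_depth: float, avg_depth: float) -> float:
    """Assumed form: absolute depth of the site over the average genome depth."""
    return site_depth / avg_depth


def mask_low_efficiency(
    genotypes: Dict[str, str],
    site_depths: Dict[str, int],
    avg_depth: float,
) -> Dict[str, Optional[str]]:
    """Keep genotypes at sites with efficiency >= 1; mark the rest as missing (None)."""
    masked: Dict[str, Optional[str]] = {}
    for site, genotype in genotypes.items():
        eff = amplification_efficiency(site_depths.get(site, 0), avg_depth)
        masked[site] = genotype if eff >= 1.0 else None
    return masked
```

With 1,200,000 reads of 150 bases on a 3,000,000,000-base genome, `mean_depth` gives 0.06, matching the 0.06X example depth; at that depth any reference site hit by at least one read passes under the assumed formula.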
In a second aspect, the present invention relates to a method for phasing the haplotypes of an offspring, the method comprising steps (a) and (b) above and the following step:
(c) phasing the haplotypes of the offspring, for example chromosome-level haplotypes, from the quality-controlled and filtered offspring genomic sequence information (for example, the offspring's genotype information) using a multi-locus linkage analysis strategy based on pedigree information, Mendelian inheritance laws, and the theory of gene linkage and crossing over, wherein the pedigree information is the genomic sequence information of the offspring's genetic father (for example, his genotype information) and/or of the offspring's genetic mother (for example, her genotype information); optionally, the pedigree information further includes genomic sequence information (for example, genotype information) of other family members of the offspring.
In some embodiments, the pedigree information in step (c) of the method for phasing the haplotypes of an offspring is obtained from a nucleic acid sample of the family member containing at least about 100 ng of DNA (for example, 100 ng-1000 ng). For example, the sample is a nucleic acid sample from the family member's blood, saliva, buccal swab, urine, nails, hair follicles, dander, cells, tissue or body fluid, and the pedigree information covers no less than about 90% of the family member's genome, for example about 90%, 95%, 98%, 99% or more. For example, the pedigree information is data obtained by whole-genome sequencing of the family member's genomic DNA (for example, gDNA from whole blood, oral epithelial cells, urothelial cells, nail bed, hair follicles or dander, preferably whole blood); preferably, the gDNA is sequenced with a high-depth whole-genome strategy, for example at a depth of at least 20X, 30X, 40X, 50X, 60X, 70X or 80X.
In some embodiments, step (c) of the method for phasing the haplotypes of an offspring is carried out using statistical genetics and computational biology algorithms; for example, based on the pedigree information, a likelihood strategy (finding the haplotype configuration with maximum probability), a genetic-rule strategy (finding the haplotype configuration with the minimum number of recombinations), the expectation-maximization (EM) algorithm, or a combination thereof is used to obtain the most probable haplotype configuration of the offspring.
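As a much simpler stand-in for the likelihood, minimum-recombination and EM strategies named above, the sketch below phases a single autosomal site of the offspring from trio genotypes using Mendelian rules only. It is an illustrative assumption, not the multi-locus linkage analysis of the application; the function name and the tuple representation of genotypes are hypothetical.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]  # unordered pair of alleles at one site


def phase_site(
    child: Genotype, father: Genotype, mother: Genotype
) -> Optional[Tuple[str, str]]:
    """Return (paternal allele, maternal allele) for the child at one site.

    Returns None when the parental origin is ambiguous (for example, all
    three individuals are heterozygous) or when the trio is inconsistent
    with Mendelian inheritance.
    """
    a, b = child
    candidates = set()
    if a in father and b in mother:
        candidates.add((a, b))
    if b in father and a in mother:
        candidates.add((b, a))
    if len(candidates) == 1:
        return candidates.pop()
    return None
```

Sites left as `None` by such a per-site rule are the ones for which information from neighbouring linked sites is needed, which is why the application relies on multi-locus strategies.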
In a third aspect, the present invention relates to a method for reconstructing the genome of an offspring, the method comprising steps (a), (b) and (c) above and the following step:
(d) imputing the missing genotypes of the offspring.
In some embodiments, step (d) of the method for reconstructing the genome of an offspring is carried out by identifying identical-by-descent regions, that is, determining which parental haplotypes the embryo carries in a given region, and combining this with the parents' genotype information at high-density polymorphic sites to impute the genotype information missing in the offspring.
In some embodiments, step (d) of the method for reconstructing the genome of an offspring further involves, for genotype information that could not be imputed from family information, using population reference haplotype information and the pattern of population-level allelic linkage disequilibrium (LD) to impute genotype information genome-wide.
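The family-based part of step (d) can be sketched as follows for one parent and one chromosome. The sketch assumes that the parent's two haplotypes are already phased and that the alleles the offspring inherited from that parent are known at some sites and missing (`None`) at others. It labels each site with the parental haplotype that was transmitted while minimizing the number of switches (a simple dynamic program standing in for IBD region identification), then copies the parental allele into the missing sites. All names are hypothetical, and the population-level LD imputation is not covered.

```python
from typing import List, Optional, Sequence


def infer_transmitted_haplotype(
    hap0: Sequence[str], hap1: Sequence[str], observed: Sequence[Optional[str]]
) -> List[int]:
    """Label every site 0 or 1 (which parental haplotype was transmitted),
    minimizing switches; a site may not contradict an observed allele."""
    haps = (hap0, hap1)
    n = len(observed)
    if n == 0:
        return []
    inf = float("inf")

    def mismatch(state: int, i: int) -> float:
        obs = observed[i]
        return 0.0 if obs is None or obs == haps[state][i] else inf

    cost = [[mismatch(0, 0), mismatch(1, 0)]]
    back = [[0, 1]]
    for i in range(1, n):
        row, ptr = [], []
        for state in (0, 1):
            stay = cost[i - 1][state]
            switch = cost[i - 1][1 - state] + 1.0  # one recombination
            ptr.append(state if stay <= switch else 1 - state)
            row.append(min(stay, switch) + mismatch(state, i))
        cost.append(row)
        back.append(ptr)
    if min(cost[-1]) == inf:
        raise ValueError("observed alleles fit neither parental haplotype")
    state = 0 if cost[-1][0] <= cost[-1][1] else 1
    path = [state]
    for i in range(n - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]


def impute_missing(
    hap0: Sequence[str], hap1: Sequence[str], observed: Sequence[Optional[str]]
) -> List[str]:
    """Fill missing offspring alleles from the inferred transmitted haplotype."""
    haps = (hap0, hap1)
    path = infer_transmitted_haplotype(hap0, hap1, observed)
    return [
        obs if obs is not None else haps[state][i]
        for i, (obs, state) in enumerate(zip(observed, path))
    ]
```

When several switch positions are equally good, the position chosen here is arbitrary, so alleles imputed between two informative sites on different haplotypes should be treated with caution.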
In a fourth aspect, the present invention relates to a device or system capable of carrying out the above removal of noisy genetic data from an offspring; a device or system capable of carrying out the above haplotype phasing; and a device or system capable of carrying out the above genotype imputation.
In some embodiments, the present invention relates to a device or system characterized in that it is:
capable of performing whole-genome amplification of a DNA sample, for example of a DNA sample of the offspring and/or of DNA samples of the offspring's genetic parents (in some embodiments, amplification is not needed when the amount of parental DNA is sufficient);
capable of detecting genomic sequence information in the whole-genome amplification products or gDNA samples obtained, for example reading sequence information from a nucleic acid chip or next-generation sequencing;
capable of performing quality control and filtering of the raw genetic data and removing data that do not meet quality requirements, for example marking site genotypes with low amplification efficiency as missing data;
capable of identifying erroneous genotype sites in the raw genetic data and marking them as missing data;
capable of performing haplotype phasing; and
capable of performing genotype imputation.
In a fifth aspect, the present invention relates to use of the methods of the first to third aspects, or of the device or system of the fourth aspect, for grading the genetic risk of polygenic disease, aneuploidy testing, monogenic disease testing, chromosomal structural rearrangement testing, and combinations thereof, in preimplantation embryos and/or first-trimester fetuses.
Other embodiments of the present invention will become apparent from the detailed description that follows.
Brief description of the drawings
The preferred embodiments of the present invention described in detail below will be better understood when read in conjunction with the following drawings. For the purpose of illustrating the invention, the drawings show embodiments that are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities of the embodiments shown in the drawings.
Figure 1 shows a flow chart of one technical solution of the present invention.
Figure 2 shows the effect of the whole-genome amplification efficiency of offspring trace nucleic acid on the quality of offspring genotypes.
Detailed description of the invention
Before the present invention is described in detail, it should be understood that the invention is not limited to the particular methods and experimental conditions in this specification, because such methods and conditions may vary. In addition, the terms used herein are for the purpose of describing particular embodiments only and are not intended to be limiting.
Definitions
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art. For the purposes of the present invention, the following terms are defined below.
The term "about", when used in connection with a numerical value, is meant to cover values within a range whose lower limit is 5% smaller than the specified value and whose upper limit is 5% larger than the specified value.
The term "and/or", when used to connect two or more alternatives, should be understood to mean any one of the alternatives or any two or more of them.
As used herein, the term "comprising" or "including" means including the stated elements, integers or steps, without excluding any other elements, integers or steps. Herein, when the term "comprising" or "including" is used, it also covers the case of consisting of the stated elements, integers or steps, unless otherwise indicated. For example, a reference to an antibody variable region "comprising" a particular sequence is also intended to cover an antibody variable region consisting of that particular sequence.
The term "offspring" includes, but is not limited to, the offspring of a mammal such as a human, and means born or unborn offspring. Unborn offspring include the embryo and the fetus. An embryo generally refers to the product of division of the fertilized egg from fertilization until the end of the embryonic period at the eighth week. The cleavage stage of the embryo occurs during the first three days of culture. "Embryo transfer" is the procedure of placing one or more embryos and/or blastocysts into the uterus or fallopian tube. A fetus generally refers to the unborn offspring of a mammal after eight weeks of pregnancy, in particular an unborn human baby.
The term "blastocyst" refers to an embryo 5 or 6 days after fertilization, which has an inner cell mass, an outer cell layer called the trophectoderm, and a fluid-filled blastocoel containing the inner cell mass, from which the whole embryo is derived. The trophectoderm is the precursor of the placenta.
The terms "related individual" and "family member" of an offspring are used interchangeably and refer to any individual who is genetically related to the target offspring individual and therefore shares haplotypes with it. In one case, the related individual may be a genetic parent of the target individual or any genetic material derived from a parent, such as sperm, polar bodies, other embryos or fetuses. The term may also refer to siblings, parents, or paternal or maternal grandparents. In this application, a parent is the genetic father or mother of an individual. An offspring individual usually has two parents (a mother and a father). A sibling is any individual whose genetic parents are the same as those of the offspring individual in question. In some embodiments, a sibling may be a born child, an embryo or a fetus, or one or more cells derived from a born child, an embryo or a fetus; a sibling may also be a haploid individual derived from one parent, such as sperm, polar bodies or any other haploid genetic material.
DNA of offspring origin refers to DNA originally from a part of an offspring cell, or originally from an offspring body fluid or the culture medium of offspring cells, whose genotype is substantially identical to the genotype of the offspring.
DNA of parental origin refers to DNA originally from a part of a parental cell, or originally from a parental body fluid or the culture medium of parental cells, whose genotype is substantially identical to the genotype of the parent. For example, DNA of maternal origin refers to DNA originally from a part of a maternal cell, or originally from a maternal body fluid or the culture medium of maternal cells, whose genotype is substantially identical to the maternal genotype.
The term "SNP (single nucleotide polymorphism)" refers to a polymorphism at a site in a chromosomal DNA sequence caused by a change in a single nucleotide; the frequency of a SNP in the population is generally >1%. On average there is one SNP every 300-1000 bp across the human genome. SNP data are currently available from a number of public databases, including, for example, http://cgap.ncbi.nih.gov/GAI; http://www.ncbi.nlm.nih.gov/SNP; and the human SNP database http://hgbas.cgr.ki.sei or http://hgbase.interactiva.de/.
The term "genotype" refers to the type of alleles an individual carries at a locus, which is called the individual's genotype at that locus. In humans, apart from the sex chromosomes, the type of the pair of alleles carried at the same locus on each pair of homologous chromosomes is called the genotype of that locus. Genotyping is the process of determining an individual's genotype.
The term "noisy genetic data" refers to genetic data with any of the following: allele dropout (ADO), uncertain base pair measurements, incorrect base pair measurements, missing base pair measurements, uncertain measurements of insertions or deletions, uncertain measurements of chromosome segment copy number, spurious signals, other measurement errors, or combinations thereof.
The term "sequencing depth" refers to the ratio of the total number of bases obtained by sequencing to the size of the genome being sequenced. For a genome of size 2M sequenced at a depth of 10X, the total amount of data obtained is 20M. Sequencing depth can thus be expressed as the ratio of the total number of sequenced bases (bp) to the genome size.
The terms "absolute sequencing depth of a site" and "absolute depth of a site" are used interchangeably and refer to the number of reads covering that site.
The terms "average sequencing depth of the genome" and "average depth of the genome" are used interchangeably and refer to the sum of the absolute depths of all sites across the genome divided by the number of sites. The average sequencing depth of the genome can also be understood as the average number of times each base in the genome is sequenced.
The term "read", sometimes also called "read length": each individual sequence in the sequencing data is one read.
The term "coverage" refers to the proportion of a genome, transcriptome or chromosome segment for which sequence information has been obtained, relative to the whole genome, transcriptome or segment. In some embodiments, coverage is the ratio of the number of bases for which sequence information was detected (for example, by a sequence detection method such as sequencing) to the total number of bases in the region examined. For example, when a whole genome is sequenced, problems such as gaps in the assembly of large fragments, limited read length and repetitive sequences usually prevent the resulting sequence from covering every region of the genome; coverage is then the ratio of the number of bases finally sequenced to the number of bases in the whole genome. For example, a coverage of 98.5% from sequencing a human genome means that the sequence of 1.5% of the genome was not obtained. In other embodiments, coverage is, for the region examined, the ratio of the number of genetic sites (for example, SNP sites or variant sites) for which sequence information was detected (for example, by SNP chip or sequencing analysis) to the total number of genetic sites examined in that region. The region examined may be the whole genome, a specific chromosome, a specific chromosome segment, the transcriptome, or a specific transcribed region.
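The depth and coverage definitions above reduce to simple ratios, sketched below. The function names and the per-site depth list are illustrative assumptions; coverage is computed here in the per-base sense, counting a site as covered when at least one read overlaps it.

```python
from typing import Sequence


def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Total sequenced bases divided by genome size."""
    return total_bases / genome_size


def average_genome_depth(site_depths: Sequence[int]) -> float:
    """Sum of the absolute depths of all sites divided by the number of sites."""
    return sum(site_depths) / len(site_depths)


def coverage(site_depths: Sequence[int]) -> float:
    """Fraction of sites with sequence information (absolute depth > 0)."""
    return sum(1 for depth in site_depths if depth > 0) / len(site_depths)
```

For the example in the definition, `sequencing_depth(20_000_000, 2_000_000)` gives 10.0, that is, 10X.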
The term "Fastq" refers to one of the standard formats for storing sequence data. Every four lines describe one read: the read name, the sequence, the strand indicator line, and the sequence quality values.
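The four-line record layout described above can be read with a short parser such as the following sketch. It assumes the common unwrapped form in which the first line starts with "@" and the third line starts with "+"; the class and function names are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines of a FASTQ stream."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)
```

Because `parse_fastq` accepts any iterable of lines, an open text file can be passed to it directly.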
The term "Mendelian inheritance laws" refers to the two basic laws of genetics, the law of segregation and the law of independent assortment. According to these laws, during meiosis the alleles separate as the homologous chromosomes separate, enter different gametes, and are transmitted independently to the offspring with the gametes; in addition, while alleles segregate, non-allelic genes on non-homologous chromosomes assort independently.
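The law of segregation can be illustrated by enumerating the offspring genotypes expected from two parental genotypes at one autosomal site; a genotype that cannot arise from the parents is a Mendelian inheritance error. The sketch below is illustrative only, represents genotypes as two-character strings, and uses hypothetical function names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from typing import Dict


def offspring_genotype_probs(father: str, mother: str) -> Dict[str, Fraction]:
    """Expected offspring genotypes, one allele drawn from each parent."""
    counts = Counter(
        "".join(sorted(paternal + maternal))
        for paternal, maternal in product(father, mother)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def is_mendelian_error(child: str, father: str, mother: str) -> bool:
    """True when the child genotype cannot arise from the parental genotypes."""
    return "".join(sorted(child)) not in offspring_genotype_probs(father, mother)
```

For two heterozygous "AG" parents this gives AA, AG and GG with probabilities 1/4, 1/2 and 1/4; sites for which `is_mendelian_error` is true would be marked as missing data in the quality control.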
The term "linkage disequilibrium (LD)" means that alleles at two or more loci occur together on the same chromosome more often than expected by chance. Linkage disequilibrium is also called allelic association. In general, the strength of LD is related to the distance between the two loci. The farther apart two pairs of alleles are, the greater the chance of recombination, that is, the higher the recombination rate (crossover rate) and the weaker the LD; conversely, the closer they are, the lower the recombination rate and the stronger the LD. The recombination rate can therefore be used to reflect the relative distance between two genes on the same chromosome. The distance between two genes at a recombination rate of 1% is defined as 1 centimorgan (cM).
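The application does not name a particular LD statistic. As an illustration, the sketch below computes two commonly used measures from a list of two-locus haplotypes: the coefficient D = p_AB - p_A·p_B and the squared correlation r² = D² / (p_A(1 - p_A)·p_B(1 - p_B)). The function name and the input representation are assumptions.

```python
from typing import Sequence, Tuple


def linkage_disequilibrium(
    haplotypes: Sequence[Tuple[str, str]], allele_a: str, allele_b: str
) -> Tuple[float, float]:
    """Return (D, r^2) for allele_a at the first locus and allele_b at the second."""
    n = len(haplotypes)
    p_a = sum(1 for first, _ in haplotypes if first == allele_a) / n
    p_b = sum(1 for _, second in haplotypes if second == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, r_squared
```

Two loci whose alleles always travel together give r² = 1, and loci that combine at random give D = 0.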
The term "chromosomal interference" refers to the phenomenon in which, during meiosis, two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and suppress each other. Herein, this suppression is used for quality control of genotyping data.
The formation of an offspring genome is equivalent to one round of random recombination of the parental genomes (that is, haplotype recombination through linkage and crossing-over, together with the random combination of gametes).
The term "allele dropout (ADO)" refers to the failure, caused by amplification bias, to amplify one of the two alleles in a heterozygous cell; for example, one allele is preferentially amplified while the other fails to amplify at all. In whole genome amplification of trace DNA, ADO can affect more than 40% of amplifications, and it has caused errors in pre-implantation genetic diagnosis (PGD).
The term "haplotype" refers to a combination of alleles at multiple loci on the same chromosome that are usually inherited together. Depending on the number of recombination events that have occurred between a given set of loci, a haplotype may refer to as few as two loci or to an entire chromosome. A haplotype may also refer to a set of statistically associated single nucleotide polymorphisms (SNPs) on a single chromatid.
The term "haplotypic data", also called "haplotype genetic data", "phased data", or "ordered genetic data", refers to genetic data that have been determined for a single chromosome in a diploid or polyploid genome.
The term "unordered genetic data" refers to the pooled sequencing data from two or more chromosomes in a diploid or polyploid genome.
The term "haplotype phasing" refers to the act of determining an individual's haplotypic genetic data from unordered diploid or polyploid genetic data. It may refer to the act of determining, for a set of alleles found on one chromosome, which of the two alleles at each locus is associated with which of the individual's two homologous chromosomes. Haplotype phasing across multiple loci can reveal associations between haplotypes and disease phenotypes, and such associations are markedly stronger than those between a single locus and a disease phenotype.
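For a single site, the simplest form of phasing assigns each of the child's two alleles to its parent of origin using the parental genotypes alone. The sketch below illustrates that idea only; it is not the phasing procedure of the present method, and the function name and genotype representation are assumptions made for this example.

```python
def phase_child_site(father, mother, child):
    """Return (paternal_allele, maternal_allele) for the child at one site.

    Genotypes are unordered pairs of allele symbols, e.g. ("A", "G").
    Returns None when the trio is Mendelian-inconsistent or when the
    parental origin of the child's alleles cannot be decided.
    """
    target = sorted(child)
    options = {
        (paternal, maternal)
        for paternal in father
        for maternal in mother
        if sorted((paternal, maternal)) == target
    }
    return options.pop() if len(options) == 1 else None
```

When both parents and the child are heterozygous for the same two alleles, the site alone is uninformative and the function returns None; resolving such sites requires information from neighbouring linked sites.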
The term "SNP chip" refers to a chip for which the signal obtained after hybridization (usually a fluorescent signal) can be used to determine the genotype at a given site. In practice, SNP chips contain different SNP sites depending on the manufacturer, model, and so on. For example, the human chips produced by Affymetrix and Illumina contain different SNP sets.
The term "identity by descent (IBD)" means that two or more alleles are inherited from the same ancestor without any recombination event occurring in the process; such alleles are said to be identical by descent. For methods of identifying IBD regions, see, for example, Browning BL, A fast, powerful method for detecting identity by descent, Am J Hum Genet. 2011 Feb 11;88(2):173-82; and Augustine Kong et al., Detection of sharing by descent, long-range phasing and haplotype imputation, Nat Genet. 2008 Sep;40(9):1068-1075.
The phrase "high-density genetic polymorphic site information of the embryo's pedigree, such as its parents" reflects the fact that, when the same genetic analysis method is used, the density of genetic polymorphic sites differs between the parents and the embryo. The parental samples are gDNA samples with a high DNA concentration, so genotype information can be obtained for most sites. For embryos, by contrast, the material is usually the whole genome amplification product of a single cell or of the DNA in embryo culture medium; amplification errors such as uneven whole genome amplification and ADO make the usable polymorphic site information of the embryo relatively sparse.
The term "genome-wide association study" refers to identifying sequence variants across the entire human genome and screening out those associated with a disease, so that associations between genetic markers and diseases can be found at low cost and with high efficiency.
As used herein, the term "module" refers to a software object or routine (e.g., running as an independent thread) that can be executed on a single computing system (e.g., a computer program, a tablet computer (PAD), or one or more processors). A program implementing the method of the present invention may be stored on a computer-readable medium containing computer program logic or code portions for implementing the system modules and methods. Although the system modules and methods described herein are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and can be envisaged by those skilled in the art.
Various aspects of the present invention are described in detail below.
Methods of the invention
The present invention generally provides a method for reconstructing the genome of an offspring, the method comprising:
(a) obtaining genomic sequence information of the offspring;
(b) performing quality control and filtering on the genomic sequence information of the offspring from step (a) to remove sites with poor genotyping quality;
(c) phasing the haplotypes of the offspring from the quality-controlled and filtered genomic sequence information, based on pedigree information and the like;
(d) imputing the missing genotypes of the offspring.
In the present invention, the offspring may be born or unborn. A related individual may be any individual who is genetically related to the target individual.
Aspects of the method of the present invention are further described as follows:
I. Acquisition of raw genetic data
In one aspect, the method of the present invention involves processing and/or reconstructing genomic information on the basis of raw genetic data. Raw genetic data suitable for the method include genomic sequence information of the offspring and/or related individuals and associated raw genetic data, such as genotype information generated from that sequence information. These raw genetic data are unordered and unphased. In some embodiments, the raw genetic data are in the form of a data set, for example, a computer-readable data set.
In the present invention, the way in which the raw genetic data are obtained is not particularly limited. For example, the user of the method may directly provide a computer-readable medium on which the data are recorded, or a data package generated on a commercial platform; or, preferably, the data are obtained from a target nucleic acid sample by any sequence detection technique known in the art.
In a preferred embodiment, in the methods of the present invention, for example the methods for genetic information quality control, clearing noisy genetic data, haplotype phasing, and/or genome reconstruction, obtaining the raw genetic data includes obtaining genomic sequence information of the offspring and genotyping the offspring based on that information. In other embodiments, it further includes obtaining genomic sequence information of related individuals and genotyping them.
In some embodiments, the step of obtaining genomic sequence information of the offspring and/or related individuals preferably includes:
- performing whole genome amplification on nucleic acid samples of the offspring and/or related individuals;
- detecting the genomic sequence information of the offspring from the amplification products.
In some embodiments, the genomic sequence information includes, but is not limited to, whole-genome sequence information, whole-exome sequence information, and/or sequence information of targeted chromosomal regions. The sequence information may be, for example but not limited to, a gene sequencing data set, a SNP data set, or a gene variant site data set.
The technique used to obtain sequence information is not particularly limited. Any sequence detection technique known in the art to be suitable for nucleic acids is applicable to the present invention. In some embodiments, sequence information is preferably detected by sequencing, including but not limited to whole genome sequencing, whole exome sequencing, and targeted sequencing. Preferably, whole genome sequencing is used; more preferably, whole-genome sequence information is detected from the nucleic acid sample by high-throughput sequencing such as second-generation sequencing. In other embodiments, sequence information is preferably detected by detecting genetic polymorphic sites (e.g., SNPs or short tandem repeats (STRs)) or gene variant sites across the whole genome, the whole exome, or targeted genomic regions, preferably at high density. In one embodiment, the CBC-PMRA (CapitalBiotechnology Precision Medicine Research Array) chip, customized by Boao Jingdian (博奥晶典) on the basis of Affymetrix's PMRA (Precision Medicine Research Array) chip, is used; it can detect 900,000 SNP sites. In another embodiment, Illumina's ASA (Asian Screening Array) chip is used; it can detect 800,000 tag SNPs.
Offspring nucleic acid samples used to obtain raw data
In the method of the present invention, the offspring may be born or unborn. In some preferred embodiments, the offspring is unborn, preferably a fetus or an embryo, more preferably an embryo produced by, for example, IVF. In some embodiments, the embryo may be about 3-10 days old, for example, a blastocyst about 5 days old.
Those skilled in the art may use any known method to collect a nucleic acid sample from the offspring for obtaining the offspring's raw data.
In some embodiments, the offspring nucleic acid sample contains a trace amount of offspring genomic DNA; for example, it is a trace offspring nucleic acid sample containing about 0.1 pg-40 ng DNA, e.g., 1-40 ng, 20-40 ng, 0.1-40 pg, 1-40 pg, or 10-40 pg DNA. In some embodiments, the trace offspring genomic DNA sample includes, but is not limited to, embryo culture medium (e.g., IVF embryo culture medium), blastocyst culture medium (e.g., culture medium of blastocysts about 3-5 days old), blastocoel fluid, fetal cell-free DNA in maternal plasma or other maternal body fluids, and/or blastocyst trophoblast cells, cleavage-stage embryonic cells, and fetal cells in maternal blood or other maternal body fluids.
Methods for obtaining embryonic nucleic acid from in vitro cultured embryos or from embryo culture medium, and for performing genome amplification on it, are known in the art. For example, CN106086199A discloses a method for detecting embryonic chromosomes using blastocyst culture medium, including in particular how the blastocyst culture medium is obtained and how genome amplification is performed on it. CN105368936A also discloses a method for detecting embryonic chromosomes using blastocyst culture medium, including in particular the collection of the medium and whole genome amplification from the trace DNA in it, with the design of the amplification primers and of the amplification reaction program. CN109536581A discloses how to obtain blastocyst culture medium for use as a nucleic acid sample for genotyping. CN105543339A discloses obtaining trophectoderm cells at the blastocyst stage from embryos produced by in vitro fertilization (IVF) and performing embryonic genome amplification. The embryonic nucleic acid samples and amplification methods disclosed in these documents are all suitable for obtaining the raw genetic data of the offspring in the present invention, and the documents are hereby incorporated by reference in their entirety.
In some embodiments, culture medium is aspirated from the culture of an embryo fertilized by intracytoplasmic sperm injection (ICSI), preferably on days 3-10 of culture, more preferably on day 5, and used as the trace offspring nucleic acid sample for obtaining the genomic sequence information of the offspring.
In some embodiments of the method of the present invention, after removal of the zona pellucida, the embryo is cultured in 0.1 µl-1 ml of culture medium in a single-embryo culture system, and a small volume of medium (e.g., about 0.1 µl-1 ml, such as about 0.1 µl, 10 µl, 20 µl, 30 µl, 40 µl, 50 µl, 100 µl, 200 µl, 500 µl, 800 µl, or 1 ml) is separated from the culture for genetic information detection and genotyping of the offspring.
In some embodiments, the surface of the oocyte or fertilized egg may be washed before embryo culture to remove contaminating DNA from its surface, thereby reducing the influence of contaminating DNA in the culture medium. For this washing, see, for example, the descriptions in CN201610584345.5 and TW10612113. These documents are hereby incorporated by reference.
As those skilled in the art will appreciate, the type of offspring nucleic acid sample used for sequence detection is not particularly limited in the present invention; it may be a sample containing a large amount of nucleic acid or a sample containing a trace amount. As demonstrated in the examples, the method of the present invention is particularly suitable for reconstructing embryonic genome information from embryo-derived DNA that is low in quantity and small in fragment size, such as cell-free DNA (cfDNA) in blastocyst culture medium. Thus, in some embodiments, the present invention is particularly useful for prenatal diagnosis, for example, for performing haplotype phasing and/or genome reconstruction of the offspring before pregnancy is established (e.g., before embryo implantation in IVF), on ligands and on cells or culture medium taken from early embryos, or later in pregnancy, on cell samples taken from the placenta or fetus or on fetal-derived DNA taken from maternal body fluids, such as fetal cfDNA in maternal body fluids.
In some embodiments, the offspring is unborn, and a sample containing a trace amount of offspring nucleic acid is used, for example, a single cell of an embryo produced by IVF or the culture medium of the embryo.
In some preferred embodiments, the offspring is a fetus, and the fetal cell-free DNA content of the offspring nucleic acid sample is, for example, 0.1 pg-40 ng, preferably 1-40 ng, more preferably 20-40 ng. In other preferred embodiments, the offspring is an embryo, and the embryonic cell-free DNA content of the offspring nucleic acid sample is, for example, 0.1-40 pg, preferably 1-40 pg, more preferably 10-40 pg.
Nucleic acid samples from related individuals used to obtain raw data
Those skilled in the art may use any known method to collect nucleic acid samples from individuals related to the offspring, detect the genomic sequence information of those individuals, and then derive their genotype and haplotype information, thereby providing pedigree genotype information for the offspring.
In the present invention, the type of nucleic acid sample used to obtain the raw data of related individuals is not particularly limited; it may be a sample containing a large amount of nucleic acid or a sample containing a trace amount. In some embodiments, nucleic acid samples that facilitate obtaining high-density genotype site information across the whole genome are preferred.
In some embodiments, the nucleic acid sample is any sample containing genomic DNA of the related individual. In some embodiments, the sample may be a nucleic acid sample of the related individual containing at least about 1 ng DNA (e.g., 1 pg-1000 ng DNA); for example, it is a nucleic acid sample from the tissues, cells, or body fluids of the related individual, such as a nucleic acid sample from blood, saliva, oral epithelium, urine, nails, hair follicles, or dander.
Depending on the method used for sequence information detection, the nucleic acid sample may or may not be extracted and/or purified. In some embodiments, the nucleic acid contained in the sample is genomic DNA (gDNA) from a source selected from whole blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dander gDNA, preferably whole blood gDNA.
Whole genome amplification of nucleic acid samples
After the nucleic acid sample is obtained, in some embodiments, the nucleic acid may be amplified. Those skilled in the art may use any known nucleic acid amplification technique to perform whole genome amplification of the nucleic acids of the offspring and/or related individuals.
Preferably, the amplification method is selected from: primer extension preamplification PCR; degenerate oligonucleotide primed PCR (DOP-PCR); multiple displacement amplification (MDA); multiple annealing and looping-based amplification cycles (MALBAC); blunt-end or sticky-end ligation library construction; or a combination thereof.
When the sample contains only a trace amount of offspring nucleic acid, a whole genome amplification method suitable for single cells, such as MALBAC, is preferably used in order to reduce erroneous sequence information introduced by amplification.
Acquisition of gene sequence information
In some embodiments, genomic sequence information is preferably detected by a technique selected from nucleic acid chips, amplification, and/or sequencing. The technique may be any such technique known in the art, including but not limited to single nucleotide polymorphism microarray chips, MassARRAY time-of-flight mass spectrometry chips, multiplex ligation-dependent probe amplification (MLPA), second-generation sequencing, third-generation sequencing, or a combination thereof.
In some embodiments, the genomic sequence information of the offspring and/or related individuals is obtained with a SNP chip. In some embodiments, for obtaining whole-genome sequence information, the SNP chip contains at least 700K sites, for example 800K-1000K sites.
In some embodiments, genomic sequence information is obtained by whole genome sequencing. Preferably, the whole genome amplification products of the nucleic acid sample are sequenced on a high-throughput sequencing platform. The sequencing platform is not particularly limited. Second-generation sequencing platforms include, but are not limited to, Illumina's GA, GAII, GAIIx, HiSeq 1000/2000/2500/3000/4000, X Ten, X Five, NextSeq 500/550, and MiSeq; Applied Biosystems' SOLiD; Roche's 454 FLX; and Thermo Fisher Scientific (Life Technologies)' Ion Torrent, Ion PGM, and Ion Proton I/II. Third-generation single-molecule sequencing platforms include, but are not limited to, Helicos BioSciences' HeliScope system, Pacific Biosciences' SMRT system, and Oxford Nanopore Technologies' GridION and MinION. The sequencing type may be single-end or paired-end, and the read length may be 30 bp, 40 bp, 50 bp, 100 bp, 300 bp, or any other length greater than 30 bp.
In some embodiments, for the offspring and related individuals, whole genome sequencing is performed at a sequencing depth of, for example, ≥20x; more preferably, for related individuals, the sequencing depth may be higher, for example at least 20x, at least 30x, at least 40x, at least 50x, at least 60x, at least 70x, at least 80x, or more.
In some preferred embodiments, the gDNA of related individuals is sequenced with a high-depth whole genome sequencing strategy in order to obtain high-density polymorphic molecular markers with high accuracy.
In some embodiments, because the nucleic acid amplification products of an unborn offspring are amplified unevenly, a low-depth whole genome sequencing method may be used for them, which helps control cost. Thus, in some embodiments, low-depth sequencing can be applied to samples containing trace offspring nucleic acid, such as embryo culture medium, to obtain genotype information at relatively low density, for example at a sequencing depth as low as 2x or even below 1x. Naturally, the higher the sequencing depth, the higher the accuracy of offspring genome reconstruction.
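To give a feel for how sparse low-depth data are, the sketch below estimates the fraction of sites covered by at least a given number of reads. It assumes an idealized model in which per-site read counts follow a Poisson distribution with mean equal to the sequencing depth; this model and the function name are assumptions made for illustration, and unevenly amplified products will cover fewer sites than the model predicts.

```python
import math


def fraction_covered(mean_depth, min_reads=1):
    """Return P(X >= min_reads) for X ~ Poisson(mean_depth).

    Under the idealized uniform-coverage assumption this approximates
    the fraction of genomic sites seen by at least `min_reads` reads.
    """
    if mean_depth < 0 or min_reads < 0:
        raise ValueError("mean_depth and min_reads must be non-negative")
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - below
```

Under this model, a depth of 1x leaves roughly 37% of sites without any read and a depth of 2x roughly 14%, which is consistent with the relatively low-density genotype information described above.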
In some embodiments, after the raw sequence information is obtained, quality control and filtering are performed on the sequence data to remove low-quality data. Any tool known in the art for such quality control and for cleaning noisy genetic data can be used, including but not limited to software that performs quality control and filtering on the raw fastq files produced by sequencing, for example, fastp.
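The kind of read-level filtering performed by such tools can be sketched as follows. This is not fastp's algorithm; it is a minimal illustration that assumes Phred+33 quality encoding, and the function names and the threshold of 20 are hypothetical choices made for this example.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_quality(quality, min_mean_q=20.0):
    """Keep a read only if its mean Phred score reaches the threshold."""
    return mean_phred(quality) >= min_mean_q
```

A Phred score of 20 corresponds to an estimated base-calling error probability of 1%, so a filter of this kind discards reads whose bases are, on average, less reliable than that.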
Genotype analysis
Many means of analyzing a subject's genotype from genomic sequence information are known in the art, including various algorithms and computer-executable programs. As those skilled in the art will understand, all of these means are applicable to the genotype analysis in the method of the present invention.
In some embodiments, genotype analysis in the method of the present invention includes determining the genotypes of the offspring and/or related individuals based on their genomic sequence information. In some preferred embodiments, the subject's SNP polymorphic sites and genotypes are determined, for example, from SNP chip data, or the gene variant sites and genotypes in the subject's genome are analyzed, for example, from a sequencing data set.
In some embodiments, genotype analysis includes:
- high-throughput sequencing or SNP chip detection of the whole genome amplification products;
- obtaining allele distribution information at gene variant sites or SNP polymorphic sites by alignment to a reference genome sequence;
- optionally, for sequencing data obtained from trace nucleic acid samples such as embryo culture medium, correcting the sequence information using a database built from whole genome amplification data of multiple trace nucleic acid samples (e.g., at least 200, 300, 400, or more).
The type of reference genome sequence is not particularly limited. For example, a known human reference genome, such as the hg19 or hg38 reference genome provided by UCSC, can be used as the reference sequence. As those skilled in the art will appreciate, different reference genome versions, such as hg19 and hg38, use different coordinate systems. During analysis, the detected sequence data (e.g., sequencing data or SNP chip data) must therefore be mapped to the specific reference genome in use so that the information remains consistent. If necessary, however, genomic coordinates can be converted between versions by means known in the art, for example with LiftOver.
In the present invention, the alignment method is not particularly limited. In a preferred embodiment, the BWA-MEM algorithm is used to align the sequences to a reference genome such as hg19. Preferably, after alignment, the resulting alignment file is sorted and indexed, duplicates are removed, and base quality scores are recalibrated.
在一些优选实施方案中,基因型分析包括:In some preferred embodiments, genotype analysis includes:
- 全基因组扩增产物的SNP芯片检测;- SNP chip detection of whole genome amplification products;
- 比对参考基因组序列,获取SNP位点的等位基因分布信息。-Compare the reference genome sequence to obtain the allele distribution information of the SNP locus.
在另一些优选实施方案中,还包括对核酸芯片位点之外的其它基因组区域的子代基因型信息获取的步骤。In some other preferred embodiments, the method further includes the step of obtaining the genotype information of the progeny of other genomic regions other than the nucleic acid chip site.
在一些实施方案中,使用平均覆盖了亚洲人、尤其是中国人的全基因组的高密度SNP芯片,以满足全基因组关联分析和基因分型的需要。优选地,采用包含至少500,000个(也称 为500K)SNP位点、至少600K SNP位点、或800K SNP位点或900K SNP位点甚至更多位点的芯片,对子代和其相关个体的全基因组扩增产物进行基因分型分析。In some embodiments, a high-density SNP chip that evenly covers the entire genome of Asians, especially Chinese, is used to meet the needs of genome-wide association analysis and genotyping. Preferably, a chip containing at least 500,000 (also referred to as 500K) SNP sites, at least 600K SNP sites, or 800K SNP sites or 900K SNP sites or even more sites is used to compare the data of offspring and related individuals. Genome-wide amplification products were subjected to genotyping analysis.
A variety of SNP genotype analysis tools are known in the art. For example, the Genotyping module of the Axiom™ Analysis Suite from Thermo Fisher Scientific can be used for SNP genotype analysis, and SNP sites whose genotype quality meets the PolyHighResolution, NoMinorHom, MonoHighResolution, or Hemizygous criteria are selected for the subsequent steps of the method of the present invention.
In other embodiments, the method of the present invention includes: obtaining whole-genome sequencing data sets of the offspring and related individuals, and detecting variant sites based on those data sets. In one embodiment, the sequencing data set is preferably one obtained after quality control and cleaning of the raw sequencing data, alignment to the reference genome, sorting, and duplicate removal, for example in BAM format. Quality control and cleaning of raw sequencing data are described in the prior art; see CN108573125A. Preferably, the sequencers used to acquire the data include the Illumina platform. In one embodiment, variant analysis follows the Genome Analysis Toolkit (GATK) Best Practices. In some embodiments, after variant calling, the variant sites obtained are filtered by quality control to retain sites that have genotype information in both parents and can be used for linkage analysis.
II. Removal of noisy genetic data
When genotyping is performed on samples that contain only trace amounts of offspring nucleic acid, for example cfDNA in embryo culture medium, a biopsy of embryonic tissue, or fetal cell-free DNA, the genotyping error rate is often high because the offspring nucleic acid in such samples is scarce and fragmented.
Therefore, to obtain a high-quality genome reconstruction of the offspring embryo and/or fetus, noisy genetic data must be removed from the raw genotyping data: sites with poor genotyping quality are removed through quality control and filtering. Sites with poor genotyping quality include sites with low amplification efficiency and sites where genotyping errors have occurred, for example allele dropout (ADO) sites, sites with random amplification errors, and sites with poor detection quality. In some embodiments, removing noisy genetic data includes identifying sites with genotyping errors in the genotyping data and marking those sites as missing data.
The inventors found it advantageous to remove noisy genetic data by quality control measures selected from: quality control of whole-genome amplification efficiency, quality control that identifies Mendelian inheritance errors, quality control that identifies violations of chromosomal interference theory, quality control by mutual deduction of the haplotypes of multiple offspring, and combinations thereof.
Quality control based on whole-genome amplification efficiency
The present invention uses quality control of whole-genome amplification efficiency to identify sites in the offspring genotyping data that are poorly genotyped because of low amplification efficiency, and marks those sites as missing data.
Non-uniform amplification efficiency across the genome is characteristic of the single-cell amplification techniques used for trace nucleic acid samples, and regions with low amplification efficiency yield poor genotyping quality at the base positions they contain. The inventors propose constructing a reference sequencing data set from the whole-genome amplification products of multiple samples in order to determine the genome-wide distribution of amplification efficiency for trace nucleic acids.
In one embodiment, quality control of the whole-genome amplification efficiency of trace nucleic acids is carried out as follows:
Multiple reference samples of the corresponding amplification products are sequenced at low depth (no higher than about 0.5X, 0.4X, 0.3X, 0.2X, or 0.1X; for example, about 0.06X), the sequencing data are aligned to the reference genome to produce BAM files, and the BAM files of the reference samples are merged into one large BAM library. For example, the BAM files of at least 300, 400, 500, 600, 700, or 800 reference samples are merged.
Further, the amplification efficiency of each site in the whole genome of the trace nucleic acid is calculated by the following formula:
Figure PCTCN2020121432-appb-000003
where
Figure PCTCN2020121432-appb-000004
where DP_i denotes the absolute depth at the i-th site, N denotes the number of sequencing reads, and L denotes the read length.
When DP_i is at least the genome average depth (DP_i ≥ 27), the amplification efficiency of the site is ≥ 1 and the site passes the quality control of whole-genome amplification efficiency; the genotypes of sites that fail this quality control are marked as missing data. This quality control measure is based on a finding from the inventors' research: compared with sites whose whole-genome amplification efficiency is ≥ 1, sites with amplification efficiency < 1 (DP_i < 27) have a Mendelian error rate nearly 6 times higher (Figure 2). Therefore, to minimize the effect of amplification efficiency on whole-genome genotyping of embryos, the inventors used next-generation sequencing data from the MALBAC whole-genome amplification products of 463 trace nucleic acid samples to draw a genome-wide map of amplification efficiency, used this map to identify genotype information at sites with low amplification efficiency (for example, amplification efficiency < 1), and marked those sites as missing data. Here, the Mendelian error rate is the proportion of all sites at which a Mendelian inheritance error occurs.
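As a non-limiting illustration, the depth-based quality control described above might be sketched as follows. The sketch assumes that the amplification efficiency of a site is its depth divided by the genome average depth, which is consistent with the statement that a site with DP_i at or above the average depth has efficiency ≥ 1; the exact formula is given in the figures referenced above and is not reproduced here. The function names, the `genome_length` parameter, and the use of `None` for missing data are hypothetical choices made for this example.

```python
from typing import Dict, Optional


def mean_depth_from_reads(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth estimated as N * L / genome length (genome_length is an assumption)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return n_reads * read_length / genome_length


def amplification_efficiency(site_depth: float, mean_depth: float) -> float:
    """Assumed definition: depth at the site relative to the genome average depth."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return site_depth / mean_depth


def mask_low_efficiency(
    genotypes: Dict[str, str],
    depths: Dict[str, float],
    mean_depth: float,
    min_efficiency: float = 1.0,
) -> Dict[str, Optional[str]]:
    """Return a copy of genotypes in which sites below min_efficiency are set to None."""
    masked: Dict[str, Optional[str]] = {}
    for site, genotype in genotypes.items():
        # A site absent from the depth map is treated as having zero depth.
        efficiency = amplification_efficiency(depths.get(site, 0.0), mean_depth)
        masked[site] = genotype if efficiency >= min_efficiency else None
    return masked


# Example with the average depth of 27 mentioned in the text.
embryo_genotypes = {"site_1": "A/C", "site_2": "C/C"}
reference_depths = {"site_1": 30.0, "site_2": 12.0}
cleaned = mask_low_efficiency(embryo_genotypes, reference_depths, mean_depth=27.0)
```

Here `site_2` has depth 12 in the merged reference library, below the average of 27, so its embryo genotype is set to missing while `site_1` is retained.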
In some embodiments, the present invention uses the Mendelian laws of inheritance and chromosomal interference theory to identify ADO sites and other sites with genotype errors in the offspring genotyping data, and marks them as missing data.
Quality control based on the Mendelian laws of inheritance
Here, the Mendelian laws of inheritance mean that if, at a given site, the father has genotype A/C and the mother has genotype C/C, their offspring must have genotype A/C or C/C unless a de novo mutation has occurred (which is extremely rare). If the offspring genotype at the site is A/A, this suggests that ADO may have occurred there, and the information at the site is marked as missing data.
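By way of example only, the Mendelian consistency check described above might be sketched as follows. Genotypes are represented as unordered pairs of alleles; the function names and the use of `None` for missing data are hypothetical and are not taken from the original text.

```python
from itertools import product
from typing import Optional, Tuple

Genotype = Tuple[str, str]


def mendelian_consistent(father: Genotype, mother: Genotype, child: Genotype) -> bool:
    """True if the child genotype can be formed from one paternal and one maternal allele."""
    possible = {tuple(sorted((p, m))) for p, m in product(father, mother)}
    return tuple(sorted(child)) in possible


def mask_mendelian_error(
    father: Genotype, mother: Genotype, child: Optional[Genotype]
) -> Optional[Genotype]:
    """Keep a consistent child genotype; mark an inconsistent one as missing (None)."""
    if child is None:
        return None
    return child if mendelian_consistent(father, mother, child) else None


# The example from the text: father A/C, mother C/C.
ok = mask_mendelian_error(("A", "C"), ("C", "C"), ("C", "A"))
suspect = mask_mendelian_error(("A", "C"), ("C", "C"), ("A", "A"))
```

With father A/C and mother C/C, the child genotype A/C is kept, whereas A/A is incompatible with the parents and is marked as missing.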
Quality control based on chromosomal interference theory
Chromosomal interference refers to the phenomenon in which, during meiosis, two adjacent single crossovers between non-sister chromatids of homologous chromosomes influence and suppress each other. The present invention applies this suppression principle: when two crossover or recombination events occur between molecular marker sites within a given genetic distance, the molecular markers within the recombinant segment are judged to have been genotyped incorrectly and are marked as missing data. For example, the genetic distance is any distance of 1 centimorgan (cM) or less. For example, when the paternal and maternal haplotypes of the offspring are constructed from the genome sequence information obtained, suppose two recombination events appear within a very small genetic distance in a constructed haplotype: molecular marker A (a SNP genotype) and all sites before it indicate that the paternal haplotype was inherited from the paternal grandfather, the next marker B indicates that it was inherited from the paternal grandmother, and marker C downstream of B and all sites after it again indicate inheritance from the paternal grandfather, with A, B, and C lying within a small genetic distance such as 1 cM. In this case the genotype at site B can be inferred to be erroneous, and B is marked as missing data.
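The A/B/C example above may be illustrated with the following minimal sketch. It assumes that each informative marker has already been assigned a genetic position in cM and a grandparental origin (0 for grandfather, 1 for grandmother); the function name, the input layout, and the choice to measure the distance between the two markers flanking the suspect segment are assumptions made for this example.

```python
from typing import List, Set, Tuple


def flag_double_recombinants(
    markers: List[Tuple[float, int]], max_cm: float = 1.0
) -> Set[int]:
    """Return indices of markers lying between two origin switches that are too close.

    markers: (position in cM, origin 0 or 1), sorted by position.
    A segment is flagged when the marker just before the first switch and the
    marker just after the second switch are at most max_cm apart.
    """
    # Index j is a switch point if the origin changes between marker j-1 and marker j.
    switches = [
        j for j in range(1, len(markers)) if markers[j][1] != markers[j - 1][1]
    ]
    flagged: Set[int] = set()
    for first, second in zip(switches, switches[1:]):
        left_flank = markers[first - 1][0]   # marker A in the text
        right_flank = markers[second][0]     # marker C in the text
        if right_flank - left_flank <= max_cm:
            flagged.update(range(first, second))  # marker(s) B in the text
    return flagged


# Marker at index 2 switches to the grandmother and back within 0.5 cM.
paternal_markers = [
    (10.0, 0), (10.3, 0), (10.5, 1), (10.8, 0), (11.5, 0), (30.0, 1), (45.0, 1),
]
suspect_indices = flag_double_recombinants(paternal_markers)
```

In this example only the marker at index 2 is flagged; the later switch at 30.0 cM is far from the preceding one and is left as a plausible recombination.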
Quality control based on mutual deduction of the haplotypes of multiple offspring
In the method of the present invention, multiple (preferably more than 2) offspring samples are used (preferably samples of unborn offspring, for example 2-4 or more blastocyst culture medium samples, or siblings of the embryo), and genotype data are obtained for the multiple offspring. Mutual deduction of the haplotypes of multiple offspring means using the haplotype phasing method of the present invention together with the genotype data obtained for the multiple offspring to deduce the most probable haplotype composition of the offspring, thereby identifying sites with genotype errors and marking those sites as missing data.
In some preferred embodiments, removing noisy genetic data includes performing the quality control of whole-genome amplification efficiency described above and at least one quality control selected from: quality control that identifies Mendelian inheritance errors, quality control based on chromosomal interference suppression, and quality control by mutual deduction of the haplotypes of multiple offspring.
In a preferred embodiment, removing noisy genetic data includes performing the quality control of whole-genome amplification efficiency and all three of the following: quality control that identifies Mendelian inheritance errors, quality control based on chromosomal interference suppression, and quality control by mutual deduction of the haplotypes of multiple offspring.
III. Haplotype phasing
After the genotyping information has been obtained and the noisy genetic data in it removed, the haplotypes of the offspring can be phased to determine the paternal and maternal haplotype composition of the offspring.
In some embodiments, haplotype phasing is preferably performed on the basis of the pedigree to obtain the paternal and maternal haplotype composition of the offspring.
In some embodiments, the haplotype phasing includes:
- distinguishing the two haplotypes of the father and the two haplotypes of the mother of the offspring (for example, an embryo);
- constructing the paternal and maternal haplotype composition of the offspring (for example, an embryo), thereby determining which parental haplotype each of the offspring's two haplotypes was inherited from.
In some embodiments, a multi-locus linkage analysis strategy based on the Mendelian laws of inheritance and linkage disequilibrium theory is used to construct chromosome-level haplotypes of the offspring. In some embodiments, haplotype phasing uses an algorithm selected from the Lander-Green algorithm, the Elston-Stewart algorithm, and the Idury-Elston algorithm.
In some embodiments, the haplotype phasing method of the present invention further includes using nucleic acid samples from the paternal and/or maternal grandparents of the (for example, unborn) offspring to construct the haplotypes of the unborn offspring and its parents.
In preferred embodiments, haplotype analysis uses a family-based multi-locus linkage analysis method. In a preferred embodiment, the haplotype analysis uses multiple, preferably more than 2, offspring samples. In another preferred embodiment, the paternal and/or maternal grandparents of the unborn offspring can also be used in the haplotype analysis to construct the haplotypes of the unborn offspring and its parents.
In a preferred embodiment, family-based haplotype analysis uses the Lander-Green algorithm, the Elston-Stewart algorithm, or the Idury-Elston algorithm.
In a preferred embodiment, haplotypes are constructed on the basis of pedigree information to obtain the most probable haplotype composition that the offspring inherited from the parents. Construction methods include, but are not limited to, likelihood-based strategies (finding the haplotype composition with maximum probability), genetic-rule strategies (finding the haplotype composition with the minimum number of recombinations), and the Expectation Maximisation (EM) algorithm. Preferably, the likelihood-based strategies include, but are not limited to, the Lander-Green algorithm and the Viterbi dynamic programming algorithm, the Elston-Stewart algorithm, and Bayesian network algorithms; the Lander-Green algorithm and the Viterbi dynamic programming algorithm are preferred. Preferably, the genetic-rule methods include the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy; available software includes, but is not limited to, ZAPLO and HAPLORE.
In a preferred embodiment, haplotype phasing includes carrying out the following steps after obtaining the genotyping information and removing the noisy genetic data from it:
i) Construct a binary gene flow vector V_i for each site based on the pedigree structure, V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, ..., P_{n,i}, M_{n,i}). Here i denotes the site and 1...n index the embryos, n being the number of embryos. P_{n,i} denotes the haplotype that the n-th embryo inherited from its paternal ancestors at the i-th site: P_{n,i} = 0 means the embryo inherited the paternal grandfather's haplotype at that site, and P_{n,i} = 1 means it inherited the paternal grandmother's haplotype. M_{n,i} denotes the haplotype that the n-th embryo inherited from its maternal ancestors at the i-th site: M_{n,i} = 0 means the embryo inherited the maternal grandfather's haplotype at that site, and M_{n,i} = 1 means it inherited the maternal grandmother's haplotype.
ii) Calculate, using a hidden Markov model, the maximum joint likelihood of the hidden haplotype states and the observed genotypes at the sites, according to the following formula:
Figure PCTCN2020121432-appb-000005
where m denotes the number of sites; P(V_1) is the initial probability of the paternal or maternal ancestral haplotype at the first site; P(V_i | V_{i-1}) is the transition probability of the haplotype state from site i-1 to the adjacent site i, obtained from the recombination rate between the two sites;
the recombination rate between sites is estimated from the genetic map of the 1000 Genomes Project Phase 3; P(G_i | V_i) is the probability of the genotype (G_i) given the ancestral haplotype state (V_i) at the site, calculated from the Mendelian laws of inheritance;
iii) Use the Viterbi algorithm to compute the most probable sequence of ancestral haplotype states over the m sites, V = (V_1, V_2, ..., V_m), of the hidden Markov model, thereby obtaining the most probable chromosome-level haplotype composition of each offspring.
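Steps i) to iii) may be illustrated with the following simplified sketch. It is not the full gene flow vector model: it tracks a single bit of V_i (one embryo, one parental side), so the hidden state at each site is 0 or 1. Several elements are assumptions made for this example and are not stated in the original text: the Haldane map function used to turn genetic distance into a recombination fraction, the fixed genotyping error rate used for the emission probability in place of a full Mendelian genotype likelihood, the uniform initial probability, and all function and parameter names.

```python
import math
from typing import List, Optional, Sequence


def haldane(distance_cm: float) -> float:
    """Recombination fraction from genetic distance in cM (Haldane map function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def viterbi_origin(
    positions_cm: Sequence[float],
    observations: Sequence[Optional[int]],
    error: float = 0.01,
) -> List[int]:
    """Most probable sequence of grandparental origins (0 or 1) along a chromosome.

    observations[i] is the origin suggested by the genotype at site i, or None
    when the site is uninformative or was marked as missing data.
    """
    if not observations:
        return []
    states = (0, 1)

    def log_emit(obs: Optional[int], state: int) -> float:
        if obs is None:              # missing data carries no information
            return 0.0
        return math.log(1.0 - error) if obs == state else math.log(error)

    # Initial probability P(V_1): both origins equally likely.
    score = [math.log(0.5) + log_emit(observations[0], s) for s in states]
    backpointers: List[List[int]] = []

    for i in range(1, len(observations)):
        theta = haldane(positions_cm[i] - positions_cm[i - 1])
        theta = min(max(theta, 1e-12), 0.5)      # keep the logarithms finite
        log_stay, log_switch = math.log(1.0 - theta), math.log(theta)
        new_score, pointers = [], []
        for s in states:
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in states
            ]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + log_emit(observations[i], s))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


positions = [0.0, 0.1, 0.2, 0.3, 20.0, 20.1, 20.2]
observed = [0, 0, 1, 0, 1, None, 1]
decoded = viterbi_origin(positions, observed)  # [0, 0, 0, 0, 1, 1, 1]
```

In this example the isolated observation at the third site is treated as a genotyping error because two switches within 0.2 cM are far less probable than one error, while the switch across the 19.7 cM gap is accepted as a recombination; the missing site is assigned the state of its neighbours.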
IV. Reconstruction of the offspring genome
The present invention provides a method for reconstructing the genome of an offspring. In one embodiment, haplotype phasing is performed before genotype imputation to infer the haplotypes of the sample. Alleles missing from the phased haplotypes are then filled in by genotype imputation.
Genotype imputation is performed because data are missing from the offspring genome. A missing genotype is a site whose genotype is unknown, that is, a region of the sample not covered by sequencing data or a site lacking sequence information; such sites are also called missing data. Missing genotype data can be divided into genetic missingness and detection-related missingness. Genetic missingness is a missing genotype caused by variation in the individual's genetic information (for example, a true deletion of the DNA fragment containing the site). Detection-related missingness is the loss of sequence information caused by limitations or errors of the detection technology. Every genotyping technology produces some detection-related missingness; for example, in chip probe hybridization, genotypes can be missing because of limited probe hybridization and capture efficiency. In this document, missing genotypes include both kinds, as well as sites with poor genotyping quality that were removed from the offspring genome sequence information during removal of noisy genetic data.
In the method of the present invention, after removal of noisy genetic data, the sites with missing data in the offspring (for example, embryo) genome will in some embodiments amount to, for example, at least 1/2 of the whole genome or more, for example 4/5 or more. Trace offspring nucleic acid samples, such as cfDNA from IVF embryo culture medium, show a high proportion of missing genotypes because the nucleic acid is scarce, fragmented, and unevenly amplified. In one embodiment, up to about 6/7 of the sites in the offspring may have missing genotypes before genotype imputation.
In one embodiment, the method of the present invention for reconstructing the offspring genome includes: combining pedigree genotype information from related individuals, and performing haplotype phasing and imputation of missing genotypes on the offspring genotypes after removal of noisy genetic data. The missing genotypes may be genotypes at sites that were not amplified in the offspring, sites marked as missing data during removal of noisy genetic data, or both.
In some preferred embodiments, the present invention provides a method for reconstructing the genome of an offspring subject, characterized by comprising the steps of:
(a) providing data sets for the analysis, the data sets including: a first data set from the offspring subject, a second data set from the father of the offspring subject, and/or a third data set from the mother of the offspring subject; wherein each data set is a genotyping information data set obtained by genetic testing and genotyping analysis of the nucleic acids or nucleic acid amplification products of the offspring subject and its parents, and the offspring subject is preferably an unborn offspring subject;
(b) removing noisy genetic data from the data sets to remove genotypes at sites with poor genotyping quality, wherein the data sets subjected to this quality control are one or more data sets including the first data set;
(c) performing haplotype phasing on the genotyping data obtained in step (b); and
(d) imputing the missing genotypes of the offspring, thereby obtaining a reconstruction of the whole-genome genotype information of the offspring subject.
In a preferred embodiment of the method, step (d) further includes: additionally using family-based and/or population-based genotyping data for genotype imputation, thereby obtaining a reconstruction of the whole-genome genotype information of the offspring subject. The family consists of genetically related relatives other than the parents of the offspring subject, such as siblings, paternal grandparents, and/or maternal grandparents. The population genotyping data may be reference haplotypes and haplotype frequency information from, for example, HapMap and the 1000 Genomes Project.
Those skilled in the art will understand that the denoising and genotype imputation of the present invention can be repeated for multiple different offspring chromosomes; preferably, the number of repetitions is determined by the number of chromosomes of the offspring so as to reconstruct the whole-genome information. For example, for a diploid offspring, the procedure is repeated 23 times (for a female) or 24 times (for a male).
In a preferred embodiment, pedigree-based genotype imputation includes: filling in the genotypes missing in the offspring using an identity by descent (IBD) strategy, based on the high-density polymorphic marker information of the parents and on the paternal and maternal haplotype composition of the offspring constructed by haplotype phasing.
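As a non-limiting sketch, IBD-based filling of missing offspring genotypes might look as follows once phasing has established, for each site, which haplotype of each parent the offspring carries. The data layout (one pair of phased alleles per parent per site, and an index 0 or 1 naming the transmitted haplotype), the function name, and the use of `None` for unresolved values are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple

Alleles = Tuple[str, str]


def impute_from_parents(
    child_genotypes: Sequence[Optional[Alleles]],
    father_haplotypes: Sequence[Alleles],
    mother_haplotypes: Sequence[Alleles],
    paternal_origin: Sequence[Optional[int]],
    maternal_origin: Sequence[Optional[int]],
) -> List[Optional[Alleles]]:
    """Fill missing child genotypes from the parental haplotypes shared IBD.

    father_haplotypes[i] holds the alleles on the father's haplotype 0 and 1 at
    site i; paternal_origin[i] says which of the two the child inherited.
    Observed child genotypes are never overwritten.
    """
    filled: List[Optional[Alleles]] = []
    for genotype, father, mother, pat, mat in zip(
        child_genotypes, father_haplotypes, mother_haplotypes,
        paternal_origin, maternal_origin,
    ):
        if genotype is not None:
            filled.append(genotype)            # keep the observed genotype
        elif pat is None or mat is None:
            filled.append(None)                # origin unresolved: leave missing
        else:
            filled.append((father[pat], mother[mat]))
    return filled


child = [("A", "C"), None, None]
father = [("A", "G"), ("T", "C"), ("G", "G")]
mother = [("C", "C"), ("T", "T"), ("A", "G")]
reconstructed = impute_from_parents(child, father, mother, [0, 1, None], [0, 0, 1])
```

In this example the second site is filled with the allele on the father's haplotype 1 and the allele on the mother's haplotype 0, while the third site stays missing because the paternal origin is unresolved; such sites are candidates for population-based imputation.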
In another preferred embodiment, population-based genotype imputation includes: analyzing the whole-genome genotype information of the offspring based on population linkage disequilibrium patterns and on reference haplotypes and haplotype frequency information such as HapMap and the 1000 Genomes Project. The analysis method may be selected from the expectation-maximization (EM) algorithm, hidden Markov models (HMM), Markov chain Monte Carlo (MCMC), coalescent theory, and combinations thereof. Genotype imputation tools based on population linkage disequilibrium include, but are not limited to, IMPUTE(2), MaCH, Beagle, and Minimac.
Preferably, genotype information that could not be imputed from family information is completed by population-based genotype imputation to fill in the offspring genome information.
In the present invention, the genotypes of an offspring sample can be imputed by pedigree analysis with reference to the genotypes of the paternal and maternal samples. In a preferred scheme, the genotypes of other family members can additionally be included in the pedigree analysis and the genotype imputation of the offspring subject, for example siblings of the offspring or embryos from the same father and mother (including IVF embryos and their culture medium), and/or the paternal and maternal grandparents of the offspring. For example, in one embodiment, mutual deduction of the haplotypes of sibling embryos can be used to complete the genotypes missing in the offspring.
Through extensive research, the inventors found that imputing the genotypes missing in the offspring by combining identification of identity by descent (IBD) regions with the genotype information of related individuals (in particular, the genotypes of the parents at high-density polymorphic sites (SNPs)) yields allele estimates of high accuracy, for example at least 90%, and even above 99%.
Therefore, in some preferred embodiments, imputing the missing genotypes of the offspring in step (d) of the method of the present invention includes: combining identification of IBD regions, that is, the determined composition of parental haplotypes carried by the embryo in a given region, with the genotype information of the parents at high-density polymorphic sites to fill in the genotypes missing in the offspring;
and optionally, for genotype information that could not be imputed from family information, using population reference haplotype information and population-level allelic linkage disequilibrium (LD) patterns to complete the genotype information at the whole-genome level;
preferably, the population reference haplotype information is from HapMap, the 1000 Genomes Project, or the HRC (Haplotype Reference Consortium);
preferably, the genotype imputation algorithm based on population-level allelic linkage disequilibrium (LD) is, for example, IMPUTE(2), MaCH, Beagle, or Minimac;
preferably, genotype imputation is carried out using the expectation-maximization (EM) algorithm, a hidden Markov model (HMM), Markov chain Monte Carlo (MCMC), coalescent theory, or a combination thereof.
In some embodiments, the reconstruction preferably includes:
- determining the embryo chromosome segment that needs to be completed;
- using population reference haplotype information to find the population haplotype segment most similar to the offspring's haplotype in that segment;
- completing the information missing in the embryo from the genotypes of the population haplotype in that segment.
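The three steps above can be illustrated with a deliberately simple nearest-match sketch. It is only a toy stand-in: tools such as IMPUTE(2), MaCH, Beagle, and Minimac rely on probabilistic models rather than a single best match, and the mismatch-count similarity, the tie-breaking by panel order, and all names below are assumptions made for this example.

```python
from typing import List, Optional, Sequence


def impute_from_reference(
    target: Sequence[Optional[str]],
    reference_panel: Sequence[Sequence[str]],
) -> List[str]:
    """Fill missing alleles in one phased haplotype segment from the closest reference.

    target: alleles along the segment, None where the allele is missing.
    reference_panel: complete reference haplotypes over the same sites.
    The closest reference is the one with the fewest mismatches at observed sites;
    ties are resolved in favour of the earliest haplotype in the panel.
    """
    if not reference_panel:
        raise ValueError("reference_panel must not be empty")

    def mismatches(reference: Sequence[str]) -> int:
        return sum(
            1 for observed, ref in zip(target, reference)
            if observed is not None and observed != ref
        )

    best = min(reference_panel, key=mismatches)
    # Observed alleles are kept; only the gaps are copied from the reference.
    return [ref if observed is None else observed for observed, ref in zip(target, best)]


embryo_segment = ["A", None, "G", None, "T"]
panel = [list("ACGTT"), list("ATGCA"), list("GCGTT")]
completed = impute_from_reference(embryo_segment, panel)  # ['A', 'C', 'G', 'T', 'T']
```

Here the first panel haplotype agrees with the embryo at all three observed sites, so its alleles fill the two gaps.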
In population-based genotype imputation, it is advisable to choose a reference population panel whose genetic background is close to that of the offspring subject. For example, when the offspring is Chinese, the Chinese population reference haplotypes of the 1000 Genomes Project Phase 3 may be used. Population-based genotype imputation software known in the art, including but not limited to the MaCH software package, can be used to reconstruct the whole-genome genotype of the offspring.
Population-based genotype imputation can further increase coverage of the offspring genome, but it also lowers the accuracy of genotype estimation. Therefore, in preferred embodiments, whether to perform population-based imputation can be decided on the basis of: (1) the desired imputation accuracy, (2) the desired genome coverage, and (3) whether the target region is expected to be covered.
In the present invention, the accuracy of family-based genome reconstruction of the offspring (for example, an embryo) is 90% or more, preferably 95% or more, more preferably 97% or more, and most preferably 99% or more. Preferably, after family-based genotype imputation, the genome coverage of the offspring is 60% or more, for example 70% or more or 80% or more.
In the present invention, after population-based genotype imputation is additionally applied, the accuracy of genome reconstruction of the offspring (for example, an embryo) is 90% or more, preferably 95% or more, more preferably 97% or more, and most preferably 99% or more. Preferably, after the family-based genotype imputation, the genome coverage of the offspring is further increased.
除了上述优选方案,应当理解,适用于基于家系样本的遗传特性的基因型填充和基于群体遗传特性的基因型填充的其它方法,也在本发明的考虑之中。In addition to the above-mentioned preferred solutions, it should be understood that other methods suitable for genotype filling based on the genetic characteristics of family samples and genotype filling based on population genetic characteristics are also considered in the present invention.
In general, genotype imputation comprises two steps:
(1) From the non-missing sites of the target site/region, summarize and classify the genotype patterns of the region; that is, analyze the haplotype composition of each region in the reference samples (for example, the family samples);
(2) From the non-missing sites flanking a missing site in the offspring sample, determine which haplotype the region belongs to, and then fill in the missing site of the offspring sample with the genotype of that haplotype.
For example, the genotype information at the non-missing sites of the sample to be imputed can first be used to determine which haplotype in the reference haplotype set is most similar to the sample; the most similar haplotype is then assigned to the sample, thereby reconstructing the complete genotype of the sample.
Genotype imputation based on the genetic characteristics of family samples can generally be performed by comparing the offspring sample to be imputed with, for example, the paternal and maternal haplotypes to find the haplotypes they share; the sites on the matched reference template can then be copied into the target data set of the offspring to reconstruct the genotype of the offspring sample.
Population-based genotype imputation can generally be performed by comparing the offspring sample to be imputed with, for example, reference population haplotypes to find the haplotypes they share; the sites on the matched reference template can then be copied into the target data set of the offspring to reconstruct the genotype of the offspring sample.
In some preferred embodiments, family-based genotype imputation of an embryo offspring comprises the following steps:
(1) Construction of the paternal-origin and maternal-origin haplotypes of the embryo:
As shown in Figure 1, for a given chromosome, chromosome-level haplotypes are constructed with a multi-locus linkage analysis strategy based on pedigree information (the parents), Mendel's laws of inheritance, and the theory of genetic linkage and crossing over. The two haplotypes of the embryo's father (and of the mother) are distinguished, and the paternal and maternal haplotype composition of the embryo is constructed at the same time; that is, it is determined which haplotype of each parent the embryo inherited. When some heterozygous sites in the offspring cannot be phased, genotype information from additional offspring, such as the embryo's siblings, or from the embryo's paternal or maternal grandparents can be added to increase the number of sites that can be phased.
(2) Completion of the embryo's genome information:
Genotyping information is obtained from the amplified embryo sample and the parental gDNA samples after detection of genomic sequence information. However, because whole-genome amplification is non-uniform, the trace embryonic DNA in the embryo culture medium (or in other samples containing trace amounts of offspring DNA) cannot be amplified across the entire embryonic genome. Taking an SNP chip as an example, only about one fifth of the sites on the chip pass the chip genotyping quality control; once the amplification-efficiency and genotyping-error quality control of the present invention is also applied, even more site genotypes are missing from the embryo genome. Therefore, on the basis of the haplotype construction, the identity by descent (IBD) method is combined with the parents' high-density polymorphic-site genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
Figure 1 shows an example of genotype imputation. As shown in 6.1 and 6.2 of Figure 1, haplotype construction establishes that the haplotypes the embryo inherited from the father and from the mother are "..A...C.G...T" and "..G...T.C...A", respectively. Based on the more complete allele information of the parents on these haplotypes, the first stage of genotype imputation is performed. The completed embryo haplotypes are "G.A.A.C.GA..T" and "C.G.T.T.CA..A", which fills in the genotypes of 3 sites missing in the embryo, namely G/C, A/T and A/A. By analogy, offspring haplotype information can be obtained for the entire chromosome. If the offspring under test has siblings in the family, as shown in 6.3 of Figure 1, a second stage of genotype imputation is performed based on the siblings' haplotypes. Here the embryo under test shares its paternal haplotype with its sibling but has a different maternal haplotype, so 3 further genotypes missing in the embryo can be filled in, namely A/C, C/T and C/A; the completed haplotypes are "G.A.AACCGAC.T" and "C.G.TCTTCAA.A".
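The two filling stages of the Figure 1 example can be traced with a minimal Python sketch. It assumes that haplotypes are written as equal-length strings with "." for a missing site, as in the text above. The function names and the template variables are hypothetical; the templates are simply set to strings consistent with the haplotypes quoted from Figure 1, since the figure itself is not reproduced here.

```python
MISSING = "."

def fill_from_template(target: str, template: str) -> str:
    """Copy template alleles into the missing sites of target.

    Sites already typed in the target are left unchanged.
    """
    if len(target) != len(template):
        raise ValueError("haplotypes must cover the same sites")
    return "".join(r if t == MISSING else t for t, r in zip(target, template))

def newly_filled(before_pat, before_mat, after_pat, after_mat):
    """List paternal/maternal allele pairs at sites typed only after filling."""
    return [
        f"{p}/{m}"
        for bp, bm, p, m in zip(before_pat, before_mat, after_pat, after_mat)
        if bp == MISSING and bm == MISSING and p != MISSING and m != MISSING
    ]

# Embryo haplotypes after phasing (6.1 and 6.2 of Figure 1).
embryo_pat = "..A...C.G...T"
embryo_mat = "..G...T.C...A"

# Hypothetical stage-1 templates: the transmitted parental haplotypes,
# typed at more sites than the embryo but still incomplete.
father_template = "G.A.A.C.GA..T"
mother_template = "C.G.T.T.CA..A"
stage1_pat = fill_from_template(embryo_pat, father_template)
stage1_mat = fill_from_template(embryo_mat, mother_template)

# Hypothetical stage-2 templates: the same parental haplotypes after the
# sibling's genotypes have resolved additional sites (6.3 of Figure 1).
father_template_sib = "G.A.AACCGAC.T"
mother_template_sib = "C.G.TCTTCAA.A"
stage2_pat = fill_from_template(stage1_pat, father_template_sib)
stage2_mat = fill_from_template(stage1_mat, mother_template_sib)

print(newly_filled(embryo_pat, embryo_mat, stage1_pat, stage1_mat))
print(newly_filled(stage1_pat, stage1_mat, stage2_pat, stage2_mat))
```

The first call lists the three genotypes gained in the first stage (G/C, A/T, A/A) and the second call lists those gained in the second stage (A/C, C/T, C/A), matching the values stated in the text.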
In one embodiment, the family-based genotype imputation is followed by population-based genotype imputation of the embryo, in order to fill in the genotype information that is missing in both the parents and the embryo. In a preferred solution, the imputation comprises:
Using the offspring haplotypes constructed according to the method of the present invention described above, and based on population linkage disequilibrium and on reference haplotypes and haplotype frequency information such as those of HapMap and 1000 Genomes, filling in the genotypes of sites on a given chromosome of the embryo that are also missing from the parental genome information. Specifically, the population reference haplotype information can be used to find the population haplotype segment most similar to the embryo haplotype, and the other genotype information of that haplotype segment in the population is then used to fill in the information missing in the embryo. As shown in 6.4 of Figure 1, the two haplotypes in the embryo are "G.A.AACCGAC.T" and "C.G.TCTTCAA.A"; the most similar and most frequent haplotypes matched in the population are "GTACAACCGACGT" and "CGGATCTTCAACA", which fills in the genotypes of the three missing sites as T/G, C/A and G/C. Estimation methods that can be used for this imputation include, but are not limited to, the expectation-maximization (EM) algorithm, hidden Markov models (HMM), Markov chain Monte Carlo (MCMC) and coalescent theory.
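As a rough illustration of "most similar and most frequent", the following Python sketch picks, for each embryo haplotype, the panel haplotype with the fewest mismatches at typed sites and breaks ties by frequency. This nearest-match rule is a deliberate simplification and stands in for the EM, HMM or MCMC estimators named above, which real imputation tools use. The panel contents and frequencies are invented for the example; only the two embryo haplotypes and the two matched haplotypes come from Figure 1.

```python
MISSING = "."

def mismatches(target: str, reference: str) -> int:
    """Count typed sites of target that disagree with reference."""
    return sum(1 for t, r in zip(target, reference) if t != MISSING and t != r)

def best_reference(target: str, panel: dict) -> str:
    """Return the panel haplotype with the fewest mismatches.

    panel maps haplotype string -> population frequency; ties on the
    mismatch count are broken in favour of the higher frequency.
    """
    return min(panel, key=lambda hap: (mismatches(target, hap), -panel[hap]))

def impute(target: str, panel: dict) -> str:
    """Fill missing sites of target from its best-matching panel haplotype."""
    ref = best_reference(target, panel)
    return "".join(r if t == MISSING else t for t, r in zip(target, ref))

# Hypothetical reference panel; the frequencies are illustrative only.
panel = {
    "GTACAACCGACGT": 0.30,
    "CGGATCTTCAACA": 0.25,
    "GTACAACCGACAT": 0.05,  # equally similar to the paternal haplotype, but rarer
    "CAGATCTTCAATA": 0.10,  # equally similar to the maternal haplotype, but rarer
}

embryo_pat = "G.A.AACCGAC.T"
embryo_mat = "C.G.TCTTCAA.A"

print(impute(embryo_pat, panel))  # GTACAACCGACGT
print(impute(embryo_mat, panel))  # CGGATCTTCAACA
```

Typed embryo sites are never overwritten, so a panel haplotype that disagrees at a typed site can contribute alleles only to the gaps.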
In some embodiments, to achieve better genotype imputation, the method of the present invention may comprise:
(1) Pre-imputation quality control of genotypes, to filter out low-quality variant sites and samples; the quality control methods include the method of the present invention for removing noisy genetic data, as described above.
(2) Determining the genome coordinate system used in the analysis, for example a UCSC version of the genome coordinate system (for example, "hg19").
(3) Selecting a reference panel, namely a family-based reference panel and/or a population-based reference panel, and performing genotype imputation of the offspring.
V. Exemplary embodiments of the method of the present invention
In a preferred embodiment, the method of the present invention comprises the following steps:
1. Non-invasive acquisition of an embryonic nucleic acid sample. The embryonic nucleic acid sample can be taken from cell-free DNA in the embryo culture medium.
2. Whole-genome amplification of the trace embryonic DNA in the culture medium. The amplification uses a single-cell amplification strategy; the specific method is not limited and includes, but is not limited to, primer extension preamplification PCR (PEP-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), multiple displacement amplification (MDA), multiple annealing and looping based amplification cycles (MALBAC), and library construction by blunt-end or sticky-end ligation.
3. Extraction of gDNA from the embryo's parents and other family members.
4. Genetic testing and analysis of the whole-genome amplified DNA product of the embryo culture medium and of the gDNA of the embryo's parents and other family members. Detection can use platforms such as nucleic acid chips or next-generation sequencing; the genetic analysis uses genotyping methods to obtain the genotype information of the parents and the embryo.
5. Quality control and filtering of the embryo's DNA analysis data.
1) Quality control of the whole-genome amplification efficiency of trace nucleic acids: Non-uniform amplification efficiency is characteristic of whole-genome amplification of trace nucleic acids (for example, trace nucleic acids from a single cell), and regions with low amplification efficiency degrade the genotyping quality of the base sites in those regions. In view of this, the inventors construct a reference sequencing data set from the whole-genome amplification products of multiple samples to determine the distribution pattern of the whole-genome amplification efficiency of trace nucleic acids. Specifically, multiple reference samples of the corresponding amplification product are sequenced at low depth (for example, about 0.06X), the sequencing data are aligned to the reference genome to obtain BAM files, and the BAM files of the reference samples are merged into one large BAM library.
Figure PCTCN2020121432-appb-000006
Here, N denotes the number of sequencing reads and L denotes the read length;
2)
Figure PCTCN2020121432-appb-000007
DP_i denotes the absolute depth (total depth) of the i-th site, that is, the number of reads at that site. In this example, the genome-wide average sequencing depth of the reference-sample BAM library is 27. The inventors found that, compared with sites whose whole-genome amplification efficiency is ≥1X (DP_i ≥ 27, that is, site absolute depth ≥ genome average depth), sites with amplification efficiency <1X (DP_i < 27) have a Mendelian inheritance error rate that is nearly 6 times higher (Figure 2). Therefore, to minimize the effect of amplification efficiency on whole-genome genotyping of the embryo, the inventors use next-generation sequencing data from multiple trace nucleic acid amplification products (each amplification method requires its own amplification reference samples) to draw a genome-wide map of amplification efficiency, identify on that basis the genotypes at sites with low amplification efficiency (<1X), and mark them as missing data.
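The depth-based filter can be sketched in Python as follows. Because the two formulas survive only as figure placeholders, the sketch makes two assumptions that should be checked against the original figures: that the genome average depth is N × L divided by the genome size, and that the amplification efficiency of site i is DP_i divided by that average depth (which is at least consistent with "≥1X" corresponding to DP_i ≥ 27 in the text). The function names, the genome size and the read counts are hypothetical.

```python
def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Assumed form of the average-depth formula: N * L / genome size."""
    return n_reads * read_length / genome_size

def amplification_efficiency(site_depth: int, avg_depth: float) -> float:
    """Assumed form of the efficiency formula: DP_i / average depth."""
    return site_depth / avg_depth

def mask_low_efficiency(genotypes: dict, depths: dict, avg_depth: float,
                        min_efficiency: float = 1.0) -> dict:
    """Set genotypes at sites below min_efficiency to None (missing data).

    A site with no depth record is treated as depth 0 and therefore masked.
    """
    return {
        site: gt
        if amplification_efficiency(depths.get(site, 0), avg_depth) >= min_efficiency
        else None
        for site, gt in genotypes.items()
    }

# Illustrative numbers chosen so that the average depth is 27, as in the text.
avg = mean_depth(n_reads=900_000_000, read_length=90, genome_size=3_000_000_000)

embryo_genotypes = {"site1": "AG", "site2": "CC", "site3": "TT"}
reference_depths = {"site1": 40, "site2": 27, "site3": 12}

print(avg)
print(mask_low_efficiency(embryo_genotypes, reference_depths, avg))
```

With these numbers, site3 (depth 12, efficiency below 1X) is marked as missing while site1 and site2 are kept.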
3) Identification of erroneous embryo genotype sites, which are marked as missing data: After whole-genome amplification of the trace embryonic DNA, some sites are not amplified or are genotyped poorly because of low amplification efficiency. In addition, at some sites amplification bias causes one of the two alleles to be preferentially amplified, or the other allele to fail to amplify at all, which leads to allele dropout (ADO) and thus affects genotyping at that site. Here, step 6.1 is first used to construct the paternal and maternal haplotypes of the offspring; Mendel's laws of inheritance and chromosomal interference are then used to identify the sites in the embryo where ADO or other genotyping errors have occurred, and these sites are marked as missing data.
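The Mendelian part of this check can be sketched in Python for a father-mother-embryo trio. The sketch only tests unphased genotypes site by site; it does not implement the haplotype-based and interference-based detection described above, so it misses dropouts that happen to leave a Mendelian-consistent genotype (for example, a true AG embryo read as AA when both parents are AG). Genotypes are assumed to be two-character strings, and all names and example sites are hypothetical.

```python
from itertools import product

def mendelian_consistent(child: str, father: str, mother: str) -> bool:
    """True if child can be formed from one paternal and one maternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible

def mask_mendelian_errors(child_gts: dict, father_gts: dict, mother_gts: dict) -> dict:
    """Mark embryo genotypes that violate Mendelian inheritance as missing (None).

    Sites that cannot be tested (a genotype is missing in any trio member)
    are passed through unchanged.
    """
    cleaned = {}
    for site, gt in child_gts.items():
        f, m = father_gts.get(site), mother_gts.get(site)
        if gt is None or f is None or m is None:
            cleaned[site] = gt
        elif mendelian_consistent(gt, f, m):
            cleaned[site] = gt
        else:
            cleaned[site] = None
    return cleaned

father = {"s1": "AA", "s2": "AA", "s3": "AA", "s4": None}
mother = {"s1": "GG", "s2": "AG", "s3": "AG", "s4": "CT"}
embryo = {"s1": "AA", "s2": "AG", "s3": "GG", "s4": "CC"}

print(mask_mendelian_errors(embryo, father, mother))
```

Site s1 is the typical ADO signature (the parents are AA and GG, so the embryo must be AG, but only A was observed); s3 is inconsistent for a different reason, and s4 is left untouched because the father's genotype is unavailable.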
6. Reconstruction of the embryo's whole-genome genotype information using statistical genetics and computational biology algorithms. The specific steps are as follows:
6.1 Construction of the paternal-origin and maternal-origin haplotypes of the embryo: Using algorithms such as the Lander-Green algorithm, the Elston-Stewart algorithm and the Idury-Elston algorithm, chromosome-level haplotypes are constructed for a given chromosome with a multi-locus linkage analysis strategy based on pedigree information (the parents), Mendel's laws of inheritance, and the theory of genetic linkage and crossing over (Figure 1). The two haplotypes of the embryo's father (and of the mother) are distinguished, and the parental-origin haplotype composition of the embryo is constructed at the same time, which establishes which haplotype of each parent the embryo inherited. Optionally, genotype information from additional offspring samples and/or other family members is added to the analysis. Specific methods or frameworks include, but are not limited to, likelihood-based strategies (finding the haplotype configuration with maximum probability), genetic-rule strategies (finding the haplotype configuration with the minimum number of recombinations) and the expectation-maximisation (EM) algorithm. Likelihood-based strategies include, but are not limited to, the Lander-Green algorithm with the Viterbi dynamic programming algorithm, the Elston-Stewart algorithm and Bayesian network algorithms; the preferred scheme is the Lander-Green algorithm with the Viterbi dynamic programming algorithm. Genetic-rule methods include the zero-recombination hypothesis strategy and the minimum-recombination hypothesis strategy; software implementations include, but are not limited to, ZAPLO and HAPLORE. If information is available for only one offspring, for example only one embryo, some heterozygous sites may not be phasable. If more offspring are available, such as the embryo's siblings, or if information from the embryo's paternal or maternal grandparents is available, more sites can be phased. Phasing is also possible with only one parent, but the number of phasable sites decreases, which affects the accuracy of genome information imputation.
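To make the likelihood-based idea concrete, the following Python sketch runs a two-state Viterbi decoder that decides, site by site, which of one parent's two phased haplotypes the embryo inherited. It is a toy stand-in and not the Lander-Green algorithm itself, which works on inheritance vectors over a whole pedigree. The inputs (already phased parental haplotypes and a string of transmitted alleles with "." for unknown), the function name, and the default recombination and error probabilities are all assumptions made for the illustration.

```python
import math

def viterbi_inheritance(hap_a: str, hap_b: str, transmitted: str,
                        recomb: float = 0.01, error: float = 0.01) -> list:
    """Most likely inherited haplotype per site: 0 for hap_a, 1 for hap_b.

    recomb is the assumed switch probability between adjacent sites and
    error the assumed probability that an observed allele is wrong.
    """
    n = len(transmitted)
    if n == 0:
        return []
    haps = (hap_a, hap_b)

    def emit(state: int, i: int) -> float:
        obs = transmitted[i]
        if obs == ".":
            return 0.0  # a missing site carries no information
        return math.log(1 - error) if obs == haps[state][i] else math.log(error)

    stay, switch = math.log(1 - recomb), math.log(recomb)
    score = [math.log(0.5) + emit(s, 0) for s in (0, 1)]
    back = []
    for i in range(1, n):
        new_score, pointers = [], []
        for s in (0, 1):
            cand = [score[p] + (stay if p == s else switch) for p in (0, 1)]
            best_prev = 0 if cand[0] >= cand[1] else 1
            new_score.append(cand[best_prev] + emit(s, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# One crossover between the fourth and fifth sites; two sites are missing.
print(viterbi_inheritance("AAAAAAAA", "GGGGGGGG", "AA.AGG.G"))
# A single discordant allele is explained as an error, not as two crossovers.
print(viterbi_inheritance("AAAAAAAA", "GGGGGGGG", "AAAGAAAA"))
```

In the second call the isolated G is cheaper to explain as a genotyping error than as a double recombination, which is the same reasoning that lets a haplotype-based check flag ADO sites.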
6.2 Completion of the embryo's genome information. On the basis of the haplotype construction in step 6.1, the identity by descent (IBD) method is combined with the parents' high-density polymorphic-site genotyping information to fill in the missing genotype information on a given chromosome of the embryo.
6.3 Haplotypes are reconstructed based on information from other family members (for example, siblings of the offspring under test), and the embryo's missing genotype information is further filled in based on Mendel's laws of inheritance and, for example, the haplotype information of the siblings of the offspring under test. In general, the more other family members there are (for example, siblings of the offspring under test), the more genotypes can be filled in and the higher the accuracy.
6.4 Optionally, the genotype information missing in both the parents and the embryo is filled in. Using the haplotypes constructed in step 6.1, and based on population linkage disequilibrium and on reference haplotypes and haplotype frequency information such as those of HapMap and 1000 Genomes, the genotypes of sites on a given chromosome of the embryo that are also missing from the parental genome information are filled in.
6.5 Steps 6.1, 6.2, 6.3 and 6.4 are repeated 24 times (22 autosomes + X chromosome + Y chromosome) to construct the whole-genome genotype information of the embryo.
In the above embodiments, if the DNA products are analyzed by whole-genome sequencing, the embryo's parents are preferably sequenced at relatively high depth to obtain accurate whole-genome genotype information, for example at a sequencing depth ≥20x. In the above embodiments, preferably, because single-cell amplification is non-uniform, the embryo can be sequenced either at relatively high depth to obtain the genotypes of the sites that were amplified, or at low depth to obtain genotype information of relatively low density; for example, the sequencing depth can be as low as 2x or even lower.
Computer products, systems and devices for implementing the method of the present invention
The present invention also provides computer products, systems and devices for removing noisy genetic data, phasing haplotypes and/or reconstructing the offspring genome.
Computer products
In one aspect, the present invention provides an apparatus for reconstructing the genome (in particular the haplotypes) of an offspring, the apparatus comprising:
a non-transitory computer-readable storage medium carrying instructions for executing the offspring genome information reconstruction method of the present invention, including:
- one or more instructions for receiving input, the input including genome sequence information of the offspring;
- one or more instructions for performing quality control and filtering on the input genome sequence information;
- one or more instructions for reconstructing the genome information of the offspring to determine the offspring haplotypes.
Preferably, the apparatus comprises the following modules:
(1) a sequence information data acquisition module, for acquiring the raw sequence information data of the offspring and/or related individuals;
(2) a genotype analysis module, for determining genotypes from the raw sequence information data of module (1);
(3) a quality control and filtering module, for performing quality control and filtering on the genotype information obtained by module (2);
(4) a haplotype phasing module, for phasing haplotypes from the quality-controlled and filtered genotypes of module (3);
(5) a genotype imputation module, for further reconstructing the offspring genotype from the phased haplotypes of module (4);
(6) optionally, a report output module, for processing and integrating the data obtained in (1)-(5) to generate a report.
In a preferred embodiment, the present invention provides an apparatus comprising:
at least one processor and at least one memory, the at least one memory having code stored thereon which, when executed by the at least one processor, enables the apparatus to perform the method of the present invention. Preferably, when executed by the at least one processor, the code causes the apparatus at least to:
- receive sequence information data, for example raw sequence information data of the offspring and/or related individuals;
- determine genotypes from the received raw sequence information data;
- perform quality control and filtering on the genotype information obtained from the analysis;
- phase haplotypes from the quality-controlled and filtered genotypes;
- further reconstruct the offspring genotype from the phased haplotypes.
In one embodiment, the present invention also provides a computer-readable storage medium having code stored thereon for use by an apparatus, the code, when executed by a processor, enabling the apparatus to perform the offspring genome information reconstruction method of the present invention. Preferably, when executed by the at least one processor, the code causes the apparatus at least to:
- receive sequence information data, for example raw sequence information data of the offspring and/or related individuals;
- determine genotypes from the received raw sequence information data;
- perform quality control and filtering on the genotype information obtained from the analysis;
- phase haplotypes from the quality-controlled and filtered genotypes;
- further reconstruct the offspring genotype from the phased haplotypes.
System
In one aspect, the present invention provides a system for reconstructing the genome (in particular the haplotypes) of an offspring, the system comprising an apparatus (device or module) configured to carry out the method of the present invention, for example configured to:
- receive input, the input including genome sequence information of the offspring and related individuals;
- perform quality control and filtering on the input genome sequence information;
- preferably, determine the offspring haplotypes to reconstruct the genome information of the offspring.
In a preferred embodiment, the apparatus is the apparatus of the present invention described above for reconstructing the offspring genome (in particular the haplotypes).
The system of the present invention may further comprise:
- an amplification device, for amplifying nucleic acid samples of the offspring and/or related individuals, preferably by whole-genome amplification;
- a sequence information detection device, for detecting sequence information in the amplification products, including but not limited to detection of polymorphic sites (for example SNPs) and detection by sequencing.
Device
In yet another aspect, the present invention provides a device for analyzing and processing offspring genome reconstruction, comprising:
an amplification unit, for performing whole-genome amplification on the sample under test and on the DNA samples of the offspring's parents and family;
a detection and analysis unit, for performing genetic detection and analysis on the amplification products obtained by the amplification unit;
a data processing unit, for performing quality control and filtering on the detection and analysis data of the amplification products obtained by the amplification unit, and removing the genotypes of sites that were not amplified or were genotyped incorrectly; and
an information reconstruction unit for the offspring whole-genome genotype, wherein the information reconstruction unit performs haplotype phasing and genotype imputation and outputs the resulting reconstruction of the offspring whole-genome genotype.
Preferably, the system of the present invention comprises a tool for querying genome sequence information, and a programmed memory or medium that enables a computer to analyze the resulting data. The sequence information query data (including, for example, sequencing data sets, SNP data sets, gene variant site data sets and genotype analysis data sets) may be stored data sets or may be generated "on the fly". As used herein, "data set" covers both types of data source.
The tool used for querying genome sequence information is not particularly limited. In one preferred solution, a high-density SNP chip is used. In another preferred solution, a high-throughput sequencing device is used to obtain relatively high-depth sequencing data of individuals related to the offspring.
The present invention can be carried out by a computer. Accordingly, the present invention also provides a computer programmed to perform the above method. A computer typically includes a CPU connected to a computer communication interface, system memory (RAM), non-transitory memory (ROM), and one or more other storage devices such as a hard disk, a floppy disk or a CD-ROM drive. The computer may also include a display device, such as a printer, a CRT monitor or an LCD display, and an input device, such as a keyboard, mouse, pen, touch screen or voice-activated system. The input device can receive data, for example directly from the sequence information query tool through an interface.
Applications of the method, computer product, system and device of the present invention
The method, computer product (in particular the apparatus of the present invention described above), system and device according to the present invention can be used for disease detection or disease susceptibility detection in pre-implantation embryos and/or fetuses in early pregnancy, including but not limited to aneuploidy detection, detection of monogenic genetic diseases, detection of chromosomal structural rearrangements, and genetic risk assessment for polygenic diseases.
In one embodiment, the use includes diagnosing susceptibility to common diseases or cancer, including, for example, comparing the offspring haplotypes reconstructed according to the method of the present invention with known disease-associated haplotypes. The associations between such haplotypes and diseases are still being established in the art. For example, the International HapMap Consortium mapped genome-wide variation in SNP haplotypes across human populations, which facilitates disease association studies (International HapMap Consortium, 2005). Combining the genome reconstruction method of the present invention with such haplotype analyses therefore also forms an aspect of the present invention.
Beneficial technical effects of the present invention
(1) The present invention provides, for the first time, a method for accurately and non-invasively obtaining whole-genome genotype information of an embryo, which makes non-invasive genetic risk rating of polygenic diseases in pre-implantation embryos possible.
(2) The present invention finds for the first time that, building on whole-genome amplification technology and in combination with chip or second- and third-generation sequencing technologies, the high-density polymorphic-site information of the embryo's parents and other family members, together with statistical genetics and computational biology algorithms, can be used to impute genotypes at sites in the embryo genome that were not amplified and at sites with genotyping errors such as ADO, thereby obtaining whole-genome information of the embryo.
(3) The present invention finds for the first time that quality control and filtering of the embryo's DNA analysis data can remove information from sites with poor genotyping quality, in particular sites with poor single-cell whole-genome amplification efficiency, thereby improving the accuracy of genome reconstruction.
(4) The present invention finds for the first time that the integrated application of family-based and population-based genotype imputation methods maximizes the site information obtained across the entire genome of the embryo.
Description of abbreviations
Figure PCTCN2020121432-appb-000008
Figure PCTCN2020121432-appb-000009
Examples
The following examples are described to aid understanding of the present invention. They are not intended to limit, and should not be construed in any way as limiting, the scope of protection of the present invention.
Example 1: Chip platform
1) Sample collection
The samples collected from one family were 1 ml of the father's peripheral blood and 1 ml of the mother's peripheral blood, each collected in EDTA anticoagulant tubes, together with blastocyst culture medium. The mother's oocytes were fertilized in vitro with the father's sperm by intracytoplasmic sperm injection (ICSI). The fertilized eggs were cultured in GM medium (Quinn's Advantage Protein Plus Cleavage Medium; manufacturer: SAGE, product number: ART-1526) in an incubator at 37°C, 5% CO2 and 5% O2. At about day 5, when the embryos had developed into blastocysts, about 20 µl of blastocyst culture medium was aspirated. Culture medium from 4 blastocysts of the same parents was prepared.
2) DNA extraction, amplification and quantification
Whole-genome DNA was extracted from the father's and the mother's peripheral blood samples using a conventional whole-blood genomic DNA extraction procedure. The kit used was the commercially available DNeasy Blood & Tissue Kit (50) (QIAGEN, catalog no. 69504), and extraction was performed according to the manufacturer's instructions.
Each culture medium sample from the 4 blastocysts of the same parents (about 10-15 µl each) was transferred into 5 µl of lysis buffer (30 mM Tris-Cl pH 7.8, 2 mM EDTA, 20 mM KCl, 0.2% Triton X-100), and the sample name was marked on each collection tube with a marker pen. The tubes were spun in a microcentrifuge for 30 seconds. Samples could proceed immediately to whole-genome amplification or be stored frozen at -20°C or -80°C. Whole-genome amplification of the blastocyst culture medium was performed according to the instructions of the MALBAC single-cell whole-genome amplification kit (catalog no. KT110700150) from Xukang Medical Technology (Suzhou) Co., Ltd.
The whole-genome amplification products of each blastocyst culture medium sample were quantified with the Qubit dsDNA HS Assay Kit (Invitrogen, Q32584). The total amount of nucleic acid in each sample was about 500-1000 ng.
3) Acquisition of raw data
A human whole-genome SNP chip, the CBC-PMRA (Capital Biotechnology-Precision Medicine Research Array) SNP chip 900K, was obtained from Beijing Boao Jingdian Biotechnology Co., Ltd. This high-density SNP chip contains about 900,000 SNP sites from the GWAS Catalog database, distributed evenly across the whole genome of Asian, and in particular Chinese, populations, and therefore meets the needs of genome-wide association analysis and genotyping.
The DNA obtained in step 2) was run on the SNP chip 900K according to the manufacturer's instructions to obtain the raw data for genotyping.
4) Genotyping analysis
The Axiom™ Analysis Suite software from Thermo Fisher Scientific was used to analyze the raw data obtained from the SNP chip 900K in step 3). Genotyping was performed with the Genotyping module of Axiom™ Analysis Suite, and sites whose genotype quality met the PolyHighResolution, NoMinorHom, MonoHighResolution or Hemizygous criteria were selected.
The number of sites in each sample whose genotype quality met the PolyHighResolution, NoMinorHom, MonoHighResolution or Hemizygous criteria is shown in Table 1.
Table 1
Figure PCTCN2020121432-appb-000010
As Table 1 shows, the embryonic SNP genotype information recovered from each blastocyst culture medium sample amounts to roughly 1/4 of the parental SNP genotype information.
5) Because whole-genome amplification of embryonic DNA from blastocyst culture medium with MALBAC, as described in step 2), suffers from non-uniform amplification efficiency, further quality control was applied to the SNP genotyping data of each blastocyst culture medium sample, as follows:
i) Embryonic DNA from blastocyst culture medium was amplified with MALBAC to obtain whole-genome amplification product samples, which were sheared by sonication into fragments of 200-800 bp. Second-generation sequencing libraries were then constructed by the Illumina ligation-based method using a commercially available kit according to its instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd.; product name: NGS Fast DNA Library Prep Set for Illumina; catalog no. CW2585M). Whole-genome sequencing was then performed on the Illumina NextSeq 550 platform at an average depth of 0.06X per sample;
ii) The sequencing reads were aligned to the reference genome with the BWA-MEM algorithm. In this example the hg19 reference genome was used. This yielded 463 BAM files, which were then merged into one large BAM file.
iii) The average sequencing depth of the genome was calculated by summing the absolute depth at every site in the genome and dividing by the number of sites. It can be computed with the following formula:
$$\text{average sequencing depth} = \frac{N \times L}{3 \times 10^{9}}$$
Here, N is the number of sequencing reads, L is the read length, and 3 × 10^9 is the size of the human genome in base pairs.
iv) If the absolute depth of a SNP site, i.e. the number of reads covering that site, is greater than or equal to the average sequencing depth of the genome, the site passes the amplification efficiency quality control; genotypes at sites that fail this control are marked as missing data.
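The depth-based quality control of step 5) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the dictionary-based data layout, the "./." missing-genotype code and the read counts in the example are all assumptions introduced here.

```python
GENOME_SIZE = 3e9  # approximate human genome size in bp, as given in the text


def mean_depth(n_reads, read_length, genome_size=GENOME_SIZE):
    """Average sequencing depth = N * L / genome size."""
    return n_reads * read_length / genome_size


def mask_low_efficiency_sites(genotypes, site_depth, avg_depth, missing="./."):
    """Set genotypes to missing at sites whose read depth is below the average.

    genotypes:  dict mapping site id -> genotype string (hypothetical layout)
    site_depth: dict mapping site id -> reads covering the site in the merged
                reference BAM; sites absent from this dict count as depth 0
    """
    return {
        site: (gt if site_depth.get(site, 0) >= avg_depth else missing)
        for site, gt in genotypes.items()
    }


# Hypothetical numbers: 600 million 150-bp reads in the merged reference data.
avg = mean_depth(n_reads=600_000_000, read_length=150)
calls = {"rs1": "A/G", "rs2": "C/C", "rs3": "T/T"}
depths = {"rs1": 35, "rs2": 12}
masked = mask_low_efficiency_sites(calls, depths, avg)
```

With these inputs only rs1 keeps its genotype; rs2 (depth 12) and rs3 (no coverage) fall below the average depth and are marked as missing.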
6) Erroneous genotype sites, such as those caused by ADO (allele dropout), were identified using the Mendelian inheritance rules described in the detailed description of the invention and marked as missing data.
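A Mendelian consistency check for one father-mother-embryo trio at a biallelic site might look like the sketch below. It assumes genotypes are stored as two-allele tuples, which is a representation chosen here for illustration. Note that a check of this kind catches only dropouts that produce an impossible genotype; an ADO event that leaves a genotype compatible with the parents passes undetected.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one paternal and one maternal allele.

    Each genotype is a tuple of two alleles, e.g. ("A", "G"); allele order is ignored.
    """
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mask_mendelian_errors(sites, missing=None):
    """Replace inconsistent child genotypes with `missing`.

    sites: dict mapping site id -> (father, mother, child) genotype tuples.
    Returns a dict mapping site id -> child genotype or `missing`.
    """
    return {
        site: (child if is_mendelian_consistent(father, mother, child) else missing)
        for site, (father, mother, child) in sites.items()
    }


# Father A/A and mother G/G can only produce A/G; an observed A/A in the
# embryo is consistent with dropout of the maternal G allele.
example = {
    "site1": (("A", "A"), ("G", "G"), ("A", "G")),
    "site2": (("A", "A"), ("G", "G"), ("A", "A")),
}
cleaned = mask_mendelian_errors(example)
```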
7) Sites with genotype errors in the blastocyst culture medium samples were further identified by mutual derivation of embryo haplotypes and chromosome interference theory, and marked as missing data.
Here, mutual derivation of embryo haplotypes means using the genotype phasing method of step 8), together with the genotype data of multiple embryos, to derive the most likely haplotype composition of each embryo. Then, based on the chromosome interference suppression strategy, sites at which two crossovers occur within a genetic distance of 1 cM were identified as erroneous genotypes.
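The interference-based check can be illustrated with a short sketch. It assumes that, for one parental chromosome of one embryo, the grandparental origin (0 or 1) has already been inferred at each informative marker and that marker positions are available in centimorgans; the function name and the way the flanking interval is measured are choices made here, not taken from the source.

```python
def flag_double_crossovers(positions_cm, origins, max_cm=1.0):
    """Return indices of markers enclosed by two origin switches within max_cm.

    positions_cm: genetic positions of the markers in cM, in ascending order
    origins:      inferred grandparental origin (0 or 1) at each marker
    A switch at index k means the origin changes between marker k-1 and marker k.
    """
    switches = [k for k in range(1, len(origins)) if origins[k] != origins[k - 1]]
    flagged = set()
    for first, second in zip(switches, switches[1:]):
        # Both apparent crossovers lie between marker first-1 and marker second.
        span = positions_cm[second] - positions_cm[first - 1]
        if span <= max_cm:
            flagged.update(range(first, second))
    return sorted(flagged)


positions = [0.0, 0.2, 0.4, 0.6, 5.0, 9.0]
origins = [0, 0, 1, 0, 0, 1]
suspect = flag_double_crossovers(positions, origins)
```

In this example the isolated origin change at the third marker implies two crossovers within 0.4 cM and is flagged, while the switch before the last marker is a plausible single recombination and is kept.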
The number of sites remaining after the quality control of steps 5), 6) and 7) is shown in Table 2.
Table 2
Figure PCTCN2020121432-appb-000012
Figure PCTCN2020121432-appb-000013
As Table 2 shows, after this further quality control the embryonic SNP genotype information from blastocyst culture medium fell to about 1/7 of the parental SNP genotype information.
8) Pedigree-based haplotype phasing and genotype imputation (ref. 1):
i) A binary gene flow vector V_i was constructed for each site based on the pedigree structure: V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, ..., P_{n,i}, M_{n,i}). Here i denotes the site and 1...n index the embryos. P_{n,i} denotes which paternal-side ancestral haplotype the n-th embryo inherited at site i: P_{n,i} = 0 means the embryo inherited the paternal grandfather's haplotype at that site, and P_{n,i} = 1 means it inherited the paternal grandmother's haplotype. M_{n,i} denotes which maternal-side ancestral haplotype the n-th embryo inherited at site i: M_{n,i} = 0 means the embryo inherited the maternal grandfather's haplotype at that site, and M_{n,i} = 1 means it inherited the maternal grandmother's haplotype.
ii) The maximum joint likelihood of the hidden haplotype states and the observed genotypes across sites was calculated with a hidden Markov model, using the following formula:
$$P(V, G) = P(V_1)\prod_{i=2}^{m} P(V_i \mid V_{i-1}) \prod_{i=1}^{m} P(G_i \mid V_i)$$
Here, m is the number of sites; P(V_1) is the initial probability of the paternal or maternal ancestral haplotype state at the first site; P(V_i | V_{i-1}) is the transition probability of the haplotype state from site i-1 to the adjacent site i, obtained from the recombination rate between the two sites.
In the present invention, recombination rates between sites are estimated from the 1000 Genomes Project Phase 3 genetic map; P(G_i | V_i) is the probability of the genotype (G_i) given the ancestral haplotype state (V_i) at the site, calculated from the rules of Mendelian inheritance.
iii) The Viterbi algorithm was used to compute the most likely sequence of ancestral haplotype states V = (V_1, V_2, ..., V_m) over the m sites, i.e. the most likely chromosome-level haplotype composition of each embryo;
iv) Identity-by-descent (IBD) regions were identified, i.e. it was determined which parental haplotypes each embryo carries in a given region, and this was combined with the parents' high-density polymorphic site genotypes to estimate, at chromosome level, the genotypes of embryo sites that were not amplified or were genotyped incorrectly;
v) Because there are 22 autosomes plus one X chromosome and/or one Y chromosome, steps i)-iv) were repeated 23 or 24 times in total to reconstruct chip-level whole-genome genotype information of the embryo (all sites on the chip).
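The decoding in steps ii) and iii) can be illustrated with a heavily simplified Viterbi sketch. Unlike the gene flow vector described above, which tracks 2n bits jointly for n embryos, this toy version follows a single bit: the grandparental origin of one parental chromosome in one embryo. The emission model (a fixed genotyping error rate in place of full Mendelian genotype probabilities), the uniform initial probability of 0.5 and all names are assumptions made here for illustration.

```python
import math


def viterbi_inheritance(obs, thetas, error=0.01):
    """Most likely grandparental-origin path for one parental chromosome.

    obs:    per-site observed origin, 0 or 1, or None where the genotype is missing
    thetas: recombination fraction between site i and site i+1 (each strictly
            between 0 and 1), e.g. derived from a genetic map
    error:  assumed probability that an observed origin is wrong
    """
    def emit(state, o):
        if o is None:
            return 0.0  # log(1): missing data carries no information
        return math.log(1 - error) if o == state else math.log(error)

    score = [math.log(0.5) + emit(s, obs[0]) for s in (0, 1)]
    backpointers = []
    for i in range(1, len(obs)):
        stay = math.log(1 - thetas[i - 1])
        move = math.log(thetas[i - 1])
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[p] + (stay if p == s else move) for p in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + emit(s, obs[i]))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


observed = [0, 0, 1, 0, None, 0, 1, 1, 1]
path = viterbi_inheritance(observed, thetas=[0.001] * 8)
```

For this input the decoded path is `[0, 0, 0, 0, 0, 0, 1, 1, 1]`: the isolated 1 at the third site is treated as a genotyping error rather than two nearby recombinations, the missing site is filled from its neighbors, and one crossover is placed before the final run. Filling the missing site in this way is the same idea as the imputation of step iv), since once the inherited parental haplotype is known the embryo's allele can be read off the parent's phased genotype.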
The polymorphic site information obtained after embryo genome reconstruction is shown in Table 3.
Table 3
Figure PCTCN2020121432-appb-000015
As Table 3 shows, after genome reconstruction the genome coverage of the embryo rose from the original 14% (1/7) to about 70%.
The results were compared with paired biopsy samples. The biopsy samples were blastocyst trophectoderm cells corresponding to each blastocyst culture medium sample. The procedure was: ① the blastocyst to be biopsied was transferred into a calcium- and magnesium-free in vitro handling medium (e.g. G-PGD containing 5% HSA); ② the embryo was held with a holding pipette under an inverted microscope (200X); ③ the zona pellucida was cut or drilled to make an opening of 35-40 μm; ④ a nucleated cell was aspirated with a needle of 35-40 μm inner diameter; ⑤ the embryo was transferred out of the handling medium, washed and cultured in blastocyst culture medium, and labeled with the patient's name and embryo number; ⑥ DNA extraction, amplification, quantification and genotyping were performed as for the blastocyst culture medium.
Sites that were successfully genotyped in the biopsy sample but not in the blastocyst culture medium sample were selected for validation. The comparison showed that the allele reconstruction accuracy in this example reached about 99.2% (Table 4).
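One way to compute such an allele-level accuracy is sketched below. The source does not give the exact definition used for Table 4, so the scoring rule (shared alleles out of two per site), the genotype string format and the function names are assumptions.

```python
from collections import Counter

MISSING = "./."


def shared_alleles(call, truth):
    """Number of alleles (0, 1 or 2) that two unphased genotypes have in common."""
    return sum((Counter(call.split("/")) & Counter(truth.split("/"))).values())


def reconstruction_accuracy(reconstructed, biopsy, typed_in_medium):
    """Allele concordance at sites typed in the biopsy but not in the culture medium.

    reconstructed:   dict site -> imputed genotype for the culture-medium sample
    biopsy:          dict site -> genotype from the paired biopsy (treated as truth)
    typed_in_medium: set of sites that were genotyped directly in the culture medium
    """
    matched = total = 0
    for site, truth in biopsy.items():
        call = reconstructed.get(site, MISSING)
        if truth == MISSING or call == MISSING or site in typed_in_medium:
            continue
        matched += shared_alleles(call, truth)
        total += 2
    return matched / total if total else float("nan")


biopsy = {"s1": "A/G", "s2": "C/C", "s3": "T/T", "s4": "G/G"}
imputed = {"s1": "G/A", "s2": "C/T", "s3": "T/T", "s4": "G/G"}
accuracy = reconstruction_accuracy(imputed, biopsy, typed_in_medium={"s4"})
```

Here s4 is excluded because it was typed directly, and the remaining three sites share 5 of 6 alleles with the biopsy.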
Table 4. Embryo genome reconstruction accuracy on the chip platform
Figure PCTCN2020121432-appb-000016
9) Genotypes at sites not on the CBC-PMRA SNP chip 900K were predicted with a genotype imputation strategy based on population information.
In this example, the Chinese population reference haplotypes from 1000 Genomes Phase 3 were further used: starting from the haplotypes already phased with pedigree information, a hidden Markov model (HMM) implemented in the MACH software package was used to predict embryo genotypes genome-wide (ref. 2).
Example 2: Second-generation sequencing
1) Sample collection
As in Example 1.
2) DNA extraction, amplification and quantification
As in Example 1.
3) Construction of second-generation sequencing libraries
The DNA samples from step 2) were sheared by sonication into fragments of 200-800 bp. Second-generation sequencing libraries were then constructed by the Illumina ligation-based method using a commercially available kit according to its instructions (manufacturer: Jiangsu Kangwei Century Biotechnology Co., Ltd.; product name: NGS Fast DNA Library Prep Set for Illumina; catalog no. CW2585M).
4) Whole-genome sequencing
The libraries from step 3) were whole-genome sequenced on the Illumina NovaSeq 6000 platform, at an average depth of 20X for the parental gDNA samples and 2X for the embryo culture medium samples.
The raw FASTQ read files were obtained and stored on a server.
5) Quality control, filtering, recalibration and genotyping
Variants discovered by whole-genome sequencing were filtered in multiple steps for quality control. Quality control, filtering and recalibration followed the Genome Analysis Toolkit (GATK) Best Practices for germline SNPs and indels on the GATK website (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145), with the following steps:
i) The raw FASTQ files were quality-filtered with fastp;
ii) The reads were aligned to the reference genome with the BWA-MEM algorithm. In this example the hg19 reference genome was used;
iii) The aligned files were sorted and indexed with Picard SortSam and Samtools, yielding BAM files;
iv) Duplicates were removed with Picard MarkDuplicates;
v) Base quality score recalibration (BQSR) was performed with GATK BaseRecalibrator and ApplyBQSR;
vi) Variants were called per sample with GATK HaplotypeCaller;
vii) Joint multi-sample variant calling was performed with GATK CombineGVCFs and GenotypeGVCFs;
viii) Variant quality score recalibration (VQSR) was performed with VariantRecalibrator and ApplyVQSR.
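Steps i) to vii) correspond roughly to the command lines built by the sketch below. It only assembles argument lists and does not run anything. All file names, the read-group string, the known-sites resource and the use of the `gatk` wrapper for the bundled Picard tools are assumptions; the source does not state the exact options used. Step viii) is left out because VariantRecalibrator needs training resources that the source does not specify.

```python
def per_sample_commands(sample, fq1, fq2, ref="hg19.fa", known_sites="dbsnp.vcf.gz"):
    """Build the per-sample command lines for steps i) to vi) as argv lists."""
    clean1, clean2 = f"{sample}.clean_R1.fq.gz", f"{sample}.clean_R2.fq.gz"
    sam, sorted_bam = f"{sample}.sam", f"{sample}.sorted.bam"
    dedup_bam, bqsr_bam = f"{sample}.dedup.bam", f"{sample}.bqsr.bam"
    recal_table = f"{sample}.recal.table"
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # i) read-level quality filtering
        ["fastp", "-i", fq1, "-I", fq2, "-o", clean1, "-O", clean2],
        # ii) alignment to the reference with BWA-MEM
        ["bwa", "mem", "-R", read_group, "-o", sam, ref, clean1, clean2],
        # iii) coordinate sort and index
        ["gatk", "SortSam", "-I", sam, "-O", sorted_bam, "--SORT_ORDER", "coordinate"],
        ["samtools", "index", sorted_bam],
        # iv) duplicate marking
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt", "--CREATE_INDEX", "true"],
        # v) base quality score recalibration
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
         "--bqsr-recal-file", recal_table, "-O", bqsr_bam],
        # vi) per-sample calling in GVCF mode
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", bqsr_bam,
         "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"],
    ]


def joint_commands(samples, ref="hg19.fa", prefix="family"):
    """Build the command lines for step vii): combine GVCFs, then joint genotyping."""
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for sample in samples:
        combine += ["-V", f"{sample}.g.vcf.gz"]
    combine += ["-O", f"{prefix}.g.vcf.gz"]
    genotype = ["gatk", "GenotypeGVCFs", "-R", ref,
                "-V", f"{prefix}.g.vcf.gz", "-O", f"{prefix}.vcf.gz"]
    return [combine, genotype]


commands = per_sample_commands("embryo1", "embryo1_R1.fq.gz", "embryo1_R2.fq.gz")
joint = joint_commands(["father", "mother", "embryo1"])
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the reference has been indexed for BWA and GATK; the lists are kept separate from execution so they can be inspected or logged first.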
The site-level quality filtering criteria were: ① sites with a VQSR status of "PASS"; ② sites at which the parents' sequencing depth DP > 20 and the genotype is not "./."; and ③ sites at which the embryo's sequencing depth DP > 5 and the genotype is not "./.". Here "./." denotes a site that could not be genotyped.
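The three criteria can be expressed as a small filter over one variant site. This sketch reads criterion ③ as applying per embryo, so that an embryo failing it is set to missing while the site itself is retained; that reading, the data layout and the names are assumptions, since the source does not say whether a site is dropped when only some embryos fail.

```python
MISSING = "./."


def filter_site(vqsr_filter, parents, embryos, parent_min_dp=20, embryo_min_dp=5):
    """Apply the three site-level criteria to one variant site.

    vqsr_filter: value of the VCF FILTER column after VQSR
    parents:     dict sample -> (genotype, depth) for father and mother
    embryos:     dict sample -> (genotype, depth) for each embryo
    Returns None if the site is dropped, otherwise a dict of embryo genotypes
    with embryos that fail criterion 3 set to missing.
    """
    if vqsr_filter != "PASS":
        return None
    for genotype, depth in parents.values():
        if genotype == MISSING or depth <= parent_min_dp:
            return None
    return {
        name: (genotype if genotype != MISSING and depth > embryo_min_dp else MISSING)
        for name, (genotype, depth) in embryos.items()
    }


parents = {"father": ("0/1", 25), "mother": ("0/0", 31)}
embryos = {"embryo1": ("0/1", 7), "embryo2": ("0/0", 3)}
kept = filter_site("PASS", parents, embryos)
```

In this example the site is retained, embryo1 keeps its genotype and embryo2 (depth 3) is set to missing.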
6) Erroneous genotype sites, such as those caused by ADO, were identified using Mendelian inheritance rules and marked as missing data, as in Example 1.
7) Sites with genotype errors in the blastocyst culture medium samples were further identified by mutual derivation of embryo haplotypes and chromosome interference theory, and marked as missing data, as in Example 1.
The number of sites passing quality control is shown in Table 5; 1,608,593 sites had genotype information for both parents and could be used for linkage analysis.
Table 5. Total number of variants detected by whole-genome sequencing
Figure PCTCN2020121432-appb-000017
8) Pedigree-based haplotype phasing and genotype imputation:
As in Example 1.
The embryo polymorphic site information finally obtained is shown in Table 6.
Table 6
Figure PCTCN2020121432-appb-000018
Comparison of the results with paired biopsy samples (procedure as in Example 1) showed that the genome reconstruction accuracy was generally above 97% (Table 7), except for one sample (from blastocyst culture medium 6) whose allele prediction accuracy was slightly lower, possibly because its DNA concentration after the amplification of step 2) was too low (166 ng/μl, whereas all other samples were above 200 ng/μl). Compared with the chip data of Example 1, genotype imputation accuracy is somewhat lower because whole-genome sequencing includes more sites with low allele frequencies.
Table 7. Embryo genome reconstruction accuracy on the second-generation sequencing platform
Figure PCTCN2020121432-appb-000019
References
1. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin - rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30(1):97-101.
2. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, et al. (2007) A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316(5829):1341-1345.
All documents mentioned in the present invention are incorporated by reference in this application, as if each were individually incorporated by reference. Furthermore, it should be understood that, after reading the above teachings of the invention, those skilled in the art may make various changes or modifications to the invention, and such equivalents likewise fall within the scope defined by the claims appended to this application.

Claims (16)

  1. A method for removing noisy genetic data from an offspring, the method comprising the steps of:
    (a) providing genomic sequence information from the offspring, wherein the genomic sequence information is obtained from a trace nucleic acid sample of the offspring containing about 0.1 pg-40 ng DNA, for example 1-40 ng DNA, 20-40 ng DNA, 0.1-40 pg DNA, 1-40 pg DNA or 10-40 pg DNA; for example, the trace nucleic acid sample of the offspring is embryo culture medium, blastocyst culture medium, blastocoel fluid, fetal cell-free DNA in maternal plasma or other maternal body fluids, and/or blastocyst trophectoderm cells, cleavage-stage embryo cells, or fetal cells in maternal blood or other maternal body fluids;
    (b) performing quality control and filtering on the genomic sequence information of the offspring of step (a), wherein the quality control is selected from quality control of the whole-genome amplification efficiency of trace nucleic acid, quality control identifying Mendelian inheritance errors, quality control identifying chromosome interference suppression, quality control by mutual derivation of multiple offspring haplotypes, and combinations thereof.
  2. The method for removing noisy genetic data from an offspring according to claim 1, wherein the genomic sequence information from the offspring provided in step (a) covers no more than about 30% of its genome, for example about 30%, 25%, 20% or 15%; for example, wherein the genomic sequence information from the offspring in step (a) is sequence information obtained by subjecting the trace nucleic acid sample of the offspring to whole-genome amplification selected from primer extension preamplification PCR, degenerate oligonucleotide primed PCR, multiple displacement amplification, multiple annealing and looping-based amplification cycles (MALBAC), library construction by blunt-end or sticky-end ligation, or combinations thereof, preferably MALBAC, and then detecting the genomic sequence of the offspring by a technique selected from nucleic acid chips, amplification and/or sequencing.
  3. The method for removing noisy genetic data from an offspring according to claim 2, wherein the nucleic acid chip, amplification and/or sequencing technique is a single nucleotide polymorphism microarray chip, a MassARRAY time-of-flight mass spectrometry chip, MLPA (multiplex ligation-dependent probe amplification), second-generation sequencing, third-generation sequencing, or a combination thereof; for example, the single nucleotide polymorphism microarray chip is a SNP genotyping chip; for example, the second-generation sequencing includes whole-genome sequencing, whole-exome sequencing and sequencing of targeted genomic regions, preferably whole-genome sequencing, for example low-depth whole-genome sequencing, for example at a sequencing depth as low as 2X or even below 1X.
  4. The method for removing noisy genetic data from an offspring according to any one of claims 1-3, wherein the quality control of the whole-genome amplification efficiency of trace nucleic acid in step (b) is performed as follows: reference sequencing data from whole-genome amplification products of multiple trace nucleic acid samples are used to identify site genotypes with low amplification efficiency, which are marked as missing data; for example, whole-genome amplification products of multiple trace nucleic acid samples are sequenced at low depth as reference samples, for example at a depth of no more than about 0.5X, 0.4X, 0.3X, 0.2X or 0.1X, for example about 0.06X; the sequencing data from the reference samples are aligned to the human reference genome (for example hg19 or hg38), and the site amplification efficiency is calculated with the following formula:
    $$\text{site amplification efficiency}_i = \frac{DP_i}{\text{average genome depth}}$$
    where
    $$\text{average genome depth} = \frac{N \times L}{3 \times 10^{9}}$$
    其中DP i表示第i个位点的绝对深度,N表示测序read次数,L表示read长度, Where DP i represents the absolute depth of the i-th site, N represents the number of sequencing reads, and L represents the read length,
    当DP i≥基因组平均深度时,位点扩增效率≥1,则表示该位点通过微量核酸全基因组扩增效率的质控;将未符合这一质控的位点基因型标记为缺失数据。 When DP i ≥ the average depth of the genome, and the site amplification efficiency ≥ 1, it means that the site has passed the quality control of the whole genome amplification efficiency of trace nucleic acid; the genotype of the site that does not meet this quality control is marked as missing data .
  5. 根据权利要求1-4中任一项所述的清除来自子代的噪音遗传数据的方法,其中步骤(b)所述的染色体干涉抑制理论是当一段遗传距离内两个分子标记位点出现两次交换或重组,则判定这一重组区段内的分子标记发生基因型分型错误,并将所述分子标记位点标记为缺失数据,例如,其中所述一段遗传距离是1个厘摩(cM)以下的任一距离。The method for removing noisy genetic data from offspring according to any one of claims 1 to 4, wherein the chromosomal interference suppression theory of step (b) is that when two molecular marker sites appear within a genetic distance For the second exchange or recombination, it is determined that the molecular marker in this recombination segment has a genotyping error, and the molecular marker site is marked as missing data, for example, where the genetic distance is 1 centimeter ( cM) Any distance below.
  6. 子代的单体型定相的方法,所述方法包括权利要求1-3中任一项所述的步骤(a)和步骤(b)、以及如下步骤:A method for phasing the haplotypes of the progeny, the method comprising steps (a) and (b) according to any one of claims 1 to 3, and the following steps:
    (c)对质控和过滤后的子代基因组序列信息(例如,子代的基因型信息)基于系谱信息及孟德尔遗传规律和基因连锁与交换理论的多位点连锁分析策略来定相子代的单体型,例如染色体水平的子代单体型,其中所述系谱信息为所述子代的遗传学父亲的基因组序列信息(例如,遗传学父亲的基因型信息)和/或所述子代的遗传学母亲的基因组序列信息(例如,遗传学母亲的基因型信息),任选地,所述系谱信息还包括所述子代的其他家系个体的基因组序列信息(例如,基因型信息)。(c) The quality control and filtered progeny genome sequence information (for example, the genotype information of the progeny) is based on the pedigree information, Mendelian inheritance laws and the multi-locus linkage analysis strategy of gene linkage and exchange theory to phase the progeny The haplotype of the progeny, such as the haplotype of the progeny at the chromosome level, wherein the pedigree information is the genome sequence information of the genetic father of the progeny (for example, the genotype information of the genetic father) and/or the Genome sequence information of the genetic mother of the offspring (for example, the genotype information of the genetic mother), optionally, the pedigree information also includes the genome sequence information of other pedigree individuals of the offspring (for example, genotype information) ).
  7. The method for phasing the haplotypes of offspring according to claim 6, wherein the pedigree information is obtained from a nucleic acid sample of the family member containing at least about 100 ng DNA (for example, 100 ng-1000 ng DNA); for example, the nucleic acid sample of the family member is a nucleic acid sample from the blood, saliva, buccal swab, urine, nails, hair follicles, dander, cells, tissues, or body fluids of the family member, and the pedigree information covers not less than about 90% of the whole genome of the family member, for example about 90%, 95%, 98%, 99% or more of the whole genome; for example, wherein the pedigree information is data obtained by whole-genome sequencing of the genomic DNA of the family member (for example whole-blood gDNA, oral epithelial cell gDNA, urothelial cell gDNA, nail bed gDNA, hair follicle gDNA, and dander gDNA, preferably whole-blood gDNA); preferably, a high-depth whole-genome sequencing strategy is applied to the gDNA, for example a sequencing depth of at least 20X, at least 30X, at least 40X, at least 50X, at least 60X, at least 70X, or at least 80X.
  8. The method for phasing the haplotypes of offspring according to claim 6 or 7, wherein step (c) is carried out using statistical genetics and computational biology algorithms, for example, based on the pedigree information, using a method selected from a likelihood strategy (finding the haplotype composition with the maximum probability), a genetic rule strategy (finding the haplotype composition with the minimum number of recombinations), the Expectation Maximisation (EM) algorithm, and combinations thereof, to obtain the most likely haplotype composition of the offspring,
    for example, based on the genome sequence information of the genetic father and/or mother of the offspring, using a method selected from a likelihood strategy (finding the haplotype composition with the maximum probability), a genetic rule strategy (finding the haplotype composition with the minimum number of recombinations), the Expectation Maximisation (EM) algorithm, and combinations thereof, to obtain the most likely haplotype composition that the offspring inherited from the father and/or mother; preferably, the algorithm is further carried out using the genome sequence information of other family members to obtain more of the haplotype composition of the offspring,
    preferably, the likelihood strategy is the Lander-Green algorithm and the Viterbi dynamic programming algorithm, the Elston-Stewart algorithm and the Bayesian network algorithm, more preferably the Lander-Green algorithm and the Viterbi dynamic programming algorithm,
    preferably, the genetic rule strategy includes a zero-recombination hypothesis strategy and a minimum-recombination hypothesis strategy, and the software implementing the genetic rule strategy is, for example, ZAPLO or HAPLORE.
  9. The method for phasing the haplotypes of offspring according to any one of claims 6-8, wherein step (c) is carried out as follows:
    i) constructing, based on the pedigree structure, a binary gene flow vector V_i for each site, V_i = (P_{1,i}, M_{1,i}, P_{2,i}, M_{2,i}, ..., P_{n,i}, M_{n,i}), where i denotes the site; 1...n index the embryos; P_{n,i} denotes the haplotype that the n-th embryo inherited from its paternal ancestors at the i-th site, with P_{n,i} = 0 indicating that the embryo inherited the paternal grandfather's haplotype at that site and P_{n,i} = 1 indicating that it inherited the paternal grandmother's haplotype; M_{n,i} denotes the haplotype that the n-th embryo inherited from its maternal ancestors at the i-th site, with M_{n,i} = 0 indicating that the embryo inherited the maternal grandfather's haplotype at that site and M_{n,i} = 1 indicating that it inherited the maternal grandmother's haplotype;
    ii) calculating, based on a hidden Markov model, the maximum joint likelihood of the hidden haplotype states and the observed genotypes at each site, using the following formula:
    Figure PCTCN2020121432-appb-100003
    where m denotes the number of sites; P(V_1) is the initial probability of the paternal or maternal ancestral haplotype at the first site; P(V_i | V_{i-1}) is the transition probability of the haplotype state from site i-1 to its adjacent site i, obtained by calculating the recombination rate between the two sites;
    the recombination rate between sites is estimated from the genetic map of the 1000 Genomes Project Phase 3; P(G_i | V_i) is the probability of the genotype (G_i) given the ancestral haplotype state (V_i) at the site, calculated according to Mendel's laws of inheritance;
    iii) using the Viterbi algorithm to compute the most likely composition of the ancestral haplotype states over the m sites, that is, the hidden Markov state sequence V = (V_1, V_2, ..., V_m), thereby obtaining the most likely chromosome-level haplotype composition of each offspring.
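Steps i) to iii) can be illustrated with a small Viterbi sketch for a single embryo, whose hidden state at each site is the pair (P, M). Several points are assumptions made for illustration rather than details taken from the claims: the joint likelihood is taken in the standard hidden Markov form P(V_1) · ∏ P(V_i | V_{i-1}) · ∏ P(G_i | V_i) (the formula image itself is not reproduced in this text); the initial state probability is uniform; the emission term uses a simple genotyping-error rate `err`; the parental haplotypes are already phased by grandparental origin; and the recombination fractions `thetas` (each strictly between 0 and 1) are supplied directly rather than read from a genetic map. All function names are hypothetical.

```python
import math
from itertools import product

# Hidden states for one embryo: (P, M), each 0 (grandfather) or 1 (grandmother).
STATES = list(product((0, 1), repeat=2))

def emission(genotype, father_hap, mother_hap, state, err=0.01):
    """P(G_i | V_i) under a simple, assumed genotyping-error model."""
    if genotype is None:  # missing data carries no information
        return 1.0
    expected = sorted((father_hap[state[0]], mother_hap[state[1]]))
    return 1.0 - err if sorted(genotype) == expected else err

def transition(prev, cur, theta):
    """P(V_i | V_{i-1}): each parental meiosis switches with probability theta."""
    p = 1.0
    for a, b in zip(prev, cur):
        p *= theta if a != b else 1.0 - theta
    return p

def viterbi(genotypes, father_haps, mother_haps, thetas, err=0.01):
    """Return the most likely (P, M) state at each of the m sites."""
    def log_emit(i, s):
        return math.log(emission(genotypes[i], father_haps[i], mother_haps[i], s, err))

    # Uniform initial probability over the four states (assumption).
    score = {s: math.log(1.0 / len(STATES)) + log_emit(0, s) for s in STATES}
    back = []
    for i in range(1, len(genotypes)):
        new_score, ptr = {}, {}
        for s in STATES:
            cand = {r: score[r] + math.log(transition(r, s, thetas[i - 1])) for r in STATES}
            best = max(cand, key=cand.get)
            new_score[s] = cand[best] + log_emit(i, s)
            ptr[s] = best
        score = new_score
        back.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(score, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Here `father_haps[i]` is a pair (allele on the paternal grandfather's haplotype, allele on the paternal grandmother's haplotype) at site i, and `mother_haps[i]` is defined analogously. Extending the state to n embryos, as in the vector V_i of step i), would enlarge the state space to 2^(2n) states; the single-embryo case is shown only to keep the sketch short.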
  10. A method for reconstructing the genome of offspring, the method comprising step (a), step (b), and step (c) according to any one of claims 6-9, and the step of
    (d) filling in the missing genotypes of the offspring,
    for example, combining the identification of identical-by-descent regions, that is, the determined haplotype composition that the embryo inherited from its parents in a given region, with the genotype information of the parents at high-density polymorphic sites, to fill in the missing genotype site information of the offspring;
    and optionally, for genotype information that was not successfully filled in on the basis of the family information, using population reference haplotype information and population-level allelic linkage disequilibrium (LD) patterns to fill in genotype information at the genome-wide level;
    preferably, the population reference haplotype information is HapMap, 1000 Genomes, HRC (Haplotype Reference Consortium);
    preferably, the genotype filling algorithm based on population-level allelic linkage disequilibrium (LD) patterns is, for example, the IMPUTE(2), MaCH, Beagle, or Minimac algorithm;
    preferably, genotype filling is carried out using the Expectation Maximization (E-M) algorithm, a Hidden Markov Model (HMM), Markov chain Monte Carlo (MCMC), coalescent theory, or a combination thereof.
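The family-based part of step (d) can be sketched as follows. The sketch assumes that, for each site, the embryo's inheritance state is already known as a pair (P, M) in the sense of claim 9 (0 for the grandfather's haplotype, 1 for the grandmother's) and that the parents' haplotypes are phased by grandparental origin. The function name `fill_missing_genotypes` and the data layout are hypothetical, and population-based filling with tools such as IMPUTE2 or Beagle is not covered.

```python
def fill_missing_genotypes(genotypes, father_haps, mother_haps, path):
    """Fill missing embryo genotypes (None) from phased parental haplotypes.

    father_haps[i] / mother_haps[i]: (allele on the grandfather's haplotype,
    allele on the grandmother's haplotype) of the parent at site i.
    path[i]: inheritance state (P, M) of the embryo at site i.
    """
    filled = []
    for gt, f_hap, m_hap, (p, m) in zip(genotypes, father_haps, mother_haps, path):
        if gt is None:
            # Take the allele carried by each inherited parental haplotype.
            gt = (f_hap[p], m_hap[m])
        filled.append(gt)
    return filled
```

Observed genotypes are left unchanged; only sites marked as missing are replaced by the pair of alleles carried on the inherited paternal and maternal haplotypes.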
  11. An apparatus, system, and device, characterized by comprising a quality control filtering module capable of performing the removal of noisy genetic data from offspring according to any one of claims 1-5.
  12. The apparatus, system, and device according to claim 11, characterized by further comprising a haplotype phasing module capable of performing the haplotype phasing according to any one of claims 6-9.
  13. The apparatus, system, and device according to claim 11 or 12, characterized by further comprising a genotype filling module capable of performing the genotype filling according to claim 10.
  14. The apparatus, system, and device according to any one of claims 11-13, characterized by comprising
    an amplification unit capable of performing whole-genome amplification of DNA samples, for example, capable of performing whole-genome amplification of DNA samples of the offspring and/or of DNA samples of the genetic parents of the offspring;
    a raw genetic data acquisition unit capable of reading the genomic sequence information of the obtained whole-genome amplification products, for example, reading sequence information from a nucleic acid chip or from second-generation sequencing;
    a quality control filtering unit capable of performing quality control and filtering of the raw genetic data and removing data whose quality does not meet the requirements, for example, marking the genotypes of sites with low amplification efficiency as missing data, and capable of identifying erroneous genotype sites in the raw genetic data and marking them as missing data; and
    a haplotype phasing unit capable of performing haplotype phasing.
  15. The apparatus, system, and device according to claim 14, characterized by comprising a genotype filling unit capable of performing genotype filling.
  16. Use of the method according to any one of claims 1-10, or use of the device or system according to any one of claims 11-15, for an application selected from polygenic disease genetic risk rating, aneuploidy detection, monogenic disease detection, chromosomal structural rearrangement detection, and combinations thereof, in pre-implantation embryos and/or fetuses in early pregnancy.
PCT/CN2020/121432 2019-10-18 2020-10-16 Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof WO2021073604A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080005425.5A CN112840404A (en) 2019-10-18 2020-10-16 Methods, systems, and uses for eliminating noisy genetic data, haplotype phasing, and reconstructing progeny genomes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910995291 2019-10-18
CN201910995291.5 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073604A1 (en) 2021-04-22

Family

ID=75537714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121432 WO2021073604A1 (en) 2019-10-18 2020-10-16 Method and system for clearing noisy genetic data, phasing haplotype, and reconstructing offspring genome, and use thereof

Country Status (2)

Country Link
CN (1) CN112840404A (en)
WO (1) WO2021073604A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114606302A (en) * 2022-04-08 2022-06-10 复旦大学附属中山医院 Method for extracting oral mucosa nucleic acid to perform whole genome high-throughput sequencing

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023225951A1 (en) * 2022-05-26 2023-11-30 深圳华大生命科学研究院 Method for detecting fetal genotype on basis of haplotype
CN117230175B (en) * 2023-06-21 2024-05-28 广州序源医学科技有限公司 Embryo preimplantation genetics detection method based on third generation sequencing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335625A (en) * 2015-11-04 2016-02-17 和卓生物科技(上海)有限公司 Genetics detecting device of embryo before implantation
CN107723364A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of screening method of susceptibility gene of colorectal cancer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790731B (en) * 2007-03-16 2013-11-06 纳特拉公司 System and method for cleaning noisy genetic data and determining chromsome copy number

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
"Current Protocols in Human Genetics", 1 May 2011, JOHN WILEY & SONS, INC., Hoboken, NJ, USA, ISBN: 978-0-471-14290-4, ISSN: 1934-8266, article STEPHEN TURNER, ARMSTRONG LOREN L., BRADFORD YUKI, CARLSON CHRISTOPHER S., CRAWFORD DANA C., CRENSHAW ANDREW T., DE ANDRADE MARIZA: "Quality Control Procedures for Genome-Wide Association Studies", pages: 1.19.1 - 1.19.18, XP055241966, DOI: 10.1002/0471142905.hg0119s68 *
ANDERSON CARL A, PETTERSSON FREDRIK H, CLARKE GERALDINE M, CARDON LON R, MORRIS ANDREW P, ZONDERVAN KRINA T: "Data quality control in genetic case-control association studies", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, ENGLAND, 1 September 2010 (2010-09-01), England, pages 1564 - 1573, XP055801113, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025522/pdf/ukmss-33586.pdf> [retrieved on 20210504], DOI: 10.1038/nprot.2010.116 *
ANDRIES T. MAREES, HILDE DE KLUIVER, SVEN STRINGER, FLORENCE VORSPAN, EMMANUEL CURIS, CYNTHIA MARIE-CLAIRE, ESKE M. DERKS: "A tutorial on conducting genome-wide association studies: Quality control and statistical analysis", INTERNATIONAL JOURNAL OF METHODS IN PSYCHIATRIC RESEARCH, vol. 27, no. 2, 1 June 2018 (2018-06-01), pages e1608, XP055677508, ISSN: 1049-8931, DOI: 10.1002/mpr.1608 *
CATHY C. LAURIE, KIMBERLY F. DOHENY, DANIEL B. MIREL, ELIZABETH W. PUGH, LAURA J. BIERUT, TUSHAR BHANGALE, FREDERICK BOEHM, NEIL E: "Quality control and quality assurance in genotypic data for genome-wide association studies", GENETIC EPIDEMIOLOGY, LISS, NEW YORK, NY,, US, vol. 34, no. 6, 1 September 2010 (2010-09-01), US, pages 591 - 602, XP055470592, ISSN: 0741-0395, DOI: 10.1002/gepi.20516 *
ELEONORA PORCU, SERENA SANNA, CHRISTIAN FUCHSBERGER, LARS G FRITSCHE: "Genotype Imputation in Genome-Wide Association Studies.", CURRENT PROTOCOLS IN HUMAN GENETICS, no. supplement 78, 31 July 2013 (2013-07-31), XP009527334, DOI: 10.1002/0471142905.hg0125s78 *
HUANG ZHICONG, LIN HUANG, FELLAY JACQUES, KUTALIK ZOLTÁN, HUBAUX JEAN-PIERRE: "SQC: secure quality control for meta-analysis of genome-wide association studies", BIOINFORMATICS, OXFORD UNIVERSITY PRESS , SURREY, GB, vol. 33, no. 15, 1 August 2017 (2017-08-01), GB, pages 2273 - 2280, XP055801107, ISSN: 1367-4803, DOI: 10.1093/bioinformatics/btx193 *
VERMA SHEFALI S., DE ANDRADE MARIZA, TROMP GERARD, KUIVANIEMI HELENA, PUGH ELIZABETH, NAMJOU-KHALES BAHRAM, MUKHERJEE SHUBHABRATA,: "Imputation and quality control steps for combining multiple genome-wide datasets", FRONTIERS IN GENETICS, vol. 5, 11 December 2014 (2014-12-11), XP055801112, DOI: 10.3389/fgene.2014.00370 *
WINKLER THOMAS W, DAY FELIX R, CROTEAU-CHONKA DAMIEN C, WOOD ANDREW R, LOCKE ADAM E, MÄGI REEDIK, FERREIRA TERESA, FALL TOVE, GRAF: "Quality control and conduct of genome-wide association meta-analyses", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, GB, vol. 9, no. 5, 1 May 2014 (2014-05-01), GB, pages 1192 - 1212, XP055801109, ISSN: 1754-2189, DOI: 10.1038/nprot.2014.071 *

Also Published As

Publication number Publication date
CN112840404A (en) 2021-05-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20877266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20877266

Country of ref document: EP

Kind code of ref document: A1