CN114402392A

CN114402392A - System and method for validating copy number variation in human embryos using single nucleotide variation density

Info

Publication number: CN114402392A
Application number: CN202080058226.0A
Authority: CN
Inventors: J·伯克; B·里斯; J·D·布拉塞克; M·J·拉奇
Original assignee: CooperSurgical Inc
Current assignee: CooperSurgical Inc
Priority date: 2019-06-21
Filing date: 2020-06-19
Publication date: 2022-04-26
Also published as: EP3987522A1; JP2022537442A; WO2020257605A1; KR20220064951A; CA3143705A1; AU2020297585A1; US20200399701A1

Abstract

A method for validating a genomic variant region in an embryo is disclosed. Embryo sequencing data is received by one or more processors. The received embryo sequencing data is aligned to a reference genome by the one or more processors. The one or more processors identify a genomic variant region in the aligned embryo sequencing data. The number of single nucleotide variations (SNVs) in the identified genomic variant region is counted by the one or more processors. The SNV count in the identified genomic variant region is normalized, by the one or more processors, against a baseline SNV count for a reference region corresponding to the identified genomic variant region to generate a normalized SNV density for the genomic variant region. The identified genomic variant region is validated, by the one or more processors, if its normalized SNV density satisfies a tolerance criterion.

Description

System and method for validating copy number variation in human embryos using single nucleotide variation density

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 62/865,126, filed June 21, 2019, the entire contents of which are incorporated herein by reference.

INCORPORATION BY REFERENCE

The disclosures of any patents, patent applications, and publications cited herein are hereby incorporated by reference in their entirety.

Technical Field

Embodiments disclosed herein relate generally to systems and methods for identifying copy number variation (CNV) in human embryos. More specifically, the embodiments relate to optimized systems and methods for validating CNV calls made for human embryos prior to implantation in the mother.

Background

In Vitro Fertilization (IVF) is an assisted reproductive technology that is becoming increasingly popular with older women, couples with difficulty in conception, and as a means of assisting pregnancy. The fertilization process involves extracting an egg, taking a sperm sample, and then manually combining the egg and sperm in a laboratory environment. The embryo is then implanted into the uterus of the host to bring the embryo to term.

IVF procedures are expensive and can take a significant emotional and physical toll on patients, so genetic screening of embryos prior to implantation is becoming more common among patients undergoing IVF. For example, IVF embryos are now typically screened for genetic abnormalities (e.g., CNV, SNV, etc.) and other conditions that may affect implantation viability. As with any diagnostic test, the accuracy of the final diagnosis is critical and can be affected by many factors, such as the data acquisition and analysis techniques used. In particular, bioinformatics analysis of low-coverage (~0.1X) genomic sequencing data can lead to misidentification of segmental and mosaic aneuploidy and copy number variation (CNV) due to sequencing artifacts and noise in the sequencing data.

Therefore, there is a need for systems and methods that are capable of independently verifying genetic abnormalities identified in embryos.

Disclosure of Invention

The present specification describes various exemplary embodiments of systems and methods optimized to validate CNV calls made for a human embryo prior to implantation in the mother.

In one aspect, a method for validating a genomic variant region in an embryo is disclosed. Embryo sequencing data is received by one or more processors. The received embryo sequencing data is aligned to a reference genome by the one or more processors. A genomic variant region is identified in the aligned embryo sequencing data by the one or more processors. The number of single nucleotide variations (SNVs) in the identified genomic variant region is counted by the one or more processors. The SNV count in the identified genomic variant region is normalized, by the one or more processors, against a baseline SNV count for a reference region corresponding to the identified genomic variant region to generate a normalized SNV density for the genomic variant region. The identified genomic variant region is validated, by the one or more processors, if its normalized SNV density satisfies a tolerance criterion.

In another aspect, a system for validating a genomic variant region in an embryo is disclosed. The system includes a data store, a computing device, and a display. The data store is configured to store embryo sequencing data. The computing device is communicatively connected to the data store and hosts an alignment engine, a genomic variant caller, and a validation engine.

The alignment engine is configured to receive the embryo sequencing data and align it to a reference genome. The genomic variant caller is configured to identify a genomic variant region in the aligned embryo sequencing data. The validation engine is configured to count single nucleotide variations (SNVs) in the identified genomic variant region, normalize the SNV count in the identified genomic variant region against a baseline SNV count for a reference region corresponding to the identified genomic variant region to generate a normalized SNV density for the identified genomic variant region, and validate the identified genomic variant region if its normalized SNV density satisfies a tolerance criterion.

The display is communicatively connected to the computing device and configured to display a report containing the genomic variant region results from the validation engine.

Drawings

For a more complete understanding of the principles disclosed herein and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a graphical depiction of how the correlation of SNV density normalized by overall sequencing coverage is stronger for true biological changes in copy number (i.e., CNV) than for artifactual changes in sequencing coverage, according to various embodiments.

Fig. 2 is a graphical depiction of SNV density from a clinical embryo sample compared to the average SNV density of 100 normal (CNV-free) embryo samples, according to various embodiments.

Fig. 3 is an illustration of how count-based CNV calls are validated using SNV density, in accordance with various embodiments.

Fig. 4 is an exemplary flow diagram showing a method for validating a CNV call made to an embryo, in accordance with various embodiments.

Fig. 5 is a schematic diagram of a system for verifying CNV calls made to embryos, in accordance with various embodiments.

Fig. 6 is a block diagram illustrating a computer system for performing the methods provided herein, in accordance with various embodiments.

It should be understood that the drawings are not necessarily drawn to scale and that the objects in the drawings are not necessarily drawn to scale relative to each other. The accompanying drawings are included to provide a further understanding of the various embodiments of the apparatus, system, and method disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Further, it should be understood that the drawings are not intended to limit the scope of the present teachings in any way.

Detailed Description

The present disclosure is not limited to the exemplary embodiments and applications described herein, or to the manner in which the exemplary embodiments and applications operate or are described herein.

Moreover, the figures may show simplified or partial views, and the dimensions of elements in the figures may be exaggerated or otherwise not in proportion. Furthermore, when the terms "over," "attached," "connected," "coupled," or the like are used herein, an element (e.g., a material, a layer, a substrate, etc.) can be "over," "attached," "connected," or "coupled" to another element, whether the element is directly over, attached, connected, or coupled to the other element or one or more intervening elements may be present between the element and the other element. Further, when a list of elements (e.g., elements a, b, c) is referred to, such reference is intended to include any one of the listed elements by itself, any combination of less than all of the listed elements, and/or combinations of all of the listed elements. The divisions in the description are for convenience of examination only and do not limit any combination of the elements discussed.

Unless defined otherwise, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by those of ordinary skill in the art. Furthermore, unless the context requires otherwise, singular terms shall include the plural and plural terms shall include the singular. Generally, the terms and techniques used in connection with cell and tissue culture, molecular biology, and protein and oligonucleotide or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to the manufacturer's instructions or as commonly done in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The terminology used in connection with the laboratory procedures and techniques described herein, and those procedures and techniques themselves, are well known and commonly used in the art.

DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine). RNA (ribonucleic acid) is composed of 4 types of nucleotides: A, U (uracil), G, and C. Certain nucleotide pairs specifically bind to each other in a complementary manner (referred to as complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides complementary to those in the first strand, the two strands combine to form a duplex. As used herein, "nucleic acid sequencing data," "nucleic acid sequencing information," "nucleic acid sequence," "genomic sequence," "gene sequence," "fragment sequence," and "nucleic acid sequencing read" refer to any information or data indicative of the order of nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a DNA or RNA molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.). It should be understood that the present teachings contemplate sequence information obtained using any available technology, platform, or technique, including but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide recognition systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, and the like.

"Polynucleotide," "nucleic acid," or "oligonucleotide" refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleoside linkages. Typically, a polynucleotide comprises at least three nucleosides. Oligonucleotides typically range in size from a few monomeric units, e.g., 3-4, to several hundred monomeric units. Whenever a polynucleotide (e.g., an oligonucleotide) is represented by a sequence of letters (e.g., "ATGCCTG"), it is understood that the nucleotides are in 5'->3' order from left to right, and that "A" denotes deoxyadenosine, "C" denotes deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless otherwise specified. The letters A, C, G, and T can be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.

As used herein, the term "cell" is used interchangeably with the term "biological cell". Non-limiting examples of biological cells include eukaryotic cells, plant cells, animal cells, such as mammalian cells, reptile cells, avian cells, fish cells, and the like, prokaryotic cells, bacterial cells, fungal cells, protozoan cells, and the like, cells isolated from tissues, such as muscle, cartilage, fat, skin, liver, lung, neural tissue, and the like, immune cells, such as T cells, B cells, natural killer cells, macrophages, and the like, embryos (e.g., fertilized eggs), oocytes, ova, sperm cells, hybridomas, cultured cells, cells from cell lines, cancer cells, infected cells, transfected and/or transformed cells, reporter cells, and the like. Mammalian cells can be from, for example, humans, mice, rats, horses, goats, sheep, cows, primates, and the like.

A genome is the genetic material of a cell or organism, including an animal such as a mammal, e.g., a human. In humans, the genome comprises the total DNA, e.g., genes, non-coding DNA, and mitochondrial DNA. The human genome typically comprises 23 pairs of linear chromosomes: 22 pairs of autosomes plus the sex-determining X and Y chromosomes. Each of the 23 pairs includes one copy from each parent. The DNA constituting the chromosomes is called chromosomal DNA and is present in the nucleus of a human cell (nuclear DNA). Mitochondrial DNA is located in mitochondria as a circular chromosome, is inherited only from the mother, and is often referred to as the mitochondrial genome, in contrast to the nuclear genome, i.e., the DNA located in the nucleus.

The phrase "next generation sequencing" (NGS) refers to sequencing technologies with increased throughput compared to traditional Sanger- and capillary electrophoresis-based approaches, e.g., with the ability to generate hundreds of thousands of relatively small sequence reads at a time. Some examples of next generation sequencing technologies include, but are not limited to, sequencing-by-synthesis, sequencing-by-ligation, and sequencing-by-hybridization. More specifically, the MiSeq, HiSeq, and NextSeq systems of Illumina and the Personal Genome Machine (PGM) and SOLiD sequencing systems of Life Technologies Corp. provide massively parallel sequencing of whole or targeted genomes. The SOLiD system and associated workflows, protocols, chemistries, etc. are described in more detail in PCT Publication No. WO 2006/084132, entitled "Reagents, Methods, and Libraries for Bead-Based Sequencing," international filing date Feb. 1, 2006; U.S. Patent Application Ser. No. 12/873,190, entitled "Low-Volume Sequencing System and Method of Use," filed Aug. 31, 2010; and U.S. Patent Application Ser. No. 12/873,132, entitled "Fast-Indexing Filter Wheel and Method of Use," filed Aug. 31, 2010, the entire contents of each of which are incorporated herein by reference.

The phrase "sequencing run" refers to any step or portion of a sequencing experiment performed to determine some information about at least one biomolecule (e.g., a nucleic acid molecule).

The term "read" in reference to nucleic acid sequencing refers to the nucleotide sequence determined for a nucleic acid fragment that has been sequenced, e.g., by NGS. A read can be a sequence of any number of nucleotides; that number defines the read length.

The phrases "sequencing coverage" and "sequence coverage" are used interchangeably herein and generally refer to the relationship between sequence reads and a reference, such as the entire genome of a cell or organism, a locus in the genome, or a single nucleotide position in the genome. Coverage can be described in a variety of forms (see, e.g., Sims et al. (2014) Nature Reviews Genetics 15:121-132). For example, coverage can refer to how much of the genome is sequenced at the base pair level, and can be calculated as NL/G, where N is the number of reads, L is the average read length, and G is the length, i.e., the number of bases, of the genome (reference). For example, if the reference genome is 1000 Mbp and 100 million reads with an average length of 100 bp are sequenced, the redundancy of coverage is 10X. Such coverage may be expressed as a "fold," e.g., 1X, 2X, 3X, etc. (or 1-, 2-, 3-fold coverage, etc.). Coverage can also refer to the redundancy of sequencing relative to a reference nucleic acid, describing how often a reference sequence is covered by reads, e.g., the number of times a single base at any given locus is read during sequencing. Thus, some bases may be uncovered and have a depth of 0, while other bases are covered and have a depth anywhere between, for example, 1 and 50. Redundancy of coverage, also referred to as depth of coverage, provides an indication of the reliability of the sequence data. Redundancy of coverage can be described for "raw" reads that have not been aligned to a reference or for aligned (e.g., mapped) reads. Coverage can also be considered in terms of the percentage of a reference (e.g., a genome) that is covered by reads. For example, if the reference genome is 10 Mbp and the sequence reads map to 8 Mbp of the reference, the percentage coverage is 80%. Sequence coverage can also be described in terms of breadth of coverage, which refers to the percentage of reference bases that are sequenced a given number of times at a particular depth.
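The coverage measures defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the 3,000 Mbp genome size in the usage example is a round figure chosen for the arithmetic, not a value taken from this specification.

```python
from typing import Sequence


def coverage_redundancy(num_reads: int, mean_read_length_bp: float,
                        genome_length_bp: int) -> float:
    """Redundancy (fold) of coverage, computed as N * L / G."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return num_reads * mean_read_length_bp / genome_length_bp


def percent_covered(covered_bp: int, genome_length_bp: int) -> float:
    """Percentage of the reference to which at least one read maps."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return 100.0 * covered_bp / genome_length_bp


def breadth_at_depth(per_base_depth: Sequence[int], min_depth: int) -> float:
    """Percentage of reference bases read at least `min_depth` times."""
    if not per_base_depth:
        raise ValueError("per-base depth list is empty")
    hits = sum(1 for depth in per_base_depth if depth >= min_depth)
    return 100.0 * hits / len(per_base_depth)


if __name__ == "__main__":
    # Hypothetical low-coverage run: 3 million 100 bp reads, 3,000 Mbp genome.
    print(coverage_redundancy(3_000_000, 100, 3_000_000_000))  # 0.1 (i.e., 0.1X)
    # The percentage-coverage example from the text: 8 Mbp of a 10 Mbp reference.
    print(percent_covered(8_000_000, 10_000_000))  # 80.0
```

Keeping redundancy, percentage coverage, and breadth as separate functions mirrors the distinction the paragraph draws between the different senses of "coverage."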

As used herein, the phrase "low coverage" with respect to nucleic acid sequencing refers to a sequencing coverage of less than about 10X, or about 0.001X to about 10X, or about 0.002X to about 0.2X, or about 0.01X to about 0.05X.

As used herein, the phrase "low depth" with respect to nucleic acid sequencing refers to a sequencing depth of less than about 10X, or about 0.1X to about 10X, or about 0.2X to about 5X, or about 0.5X to about 2X.

The term "resolution" with respect to a genomic nucleic acid sequence refers to the quality or accuracy and scope of a genomic nucleic acid sequence (e.g., the sequence of the entire genome or a particular region or locus of the genome) obtained by nucleic acid sequencing of a cell (e.g., an embryo) or an organism. The resolution of a genomic nucleic acid sequence is largely determined by the depth and breadth of the sequencing process coverage and involves consideration of the number of unique bases read during the sequencing process and the number of times any one base is read during the sequencing process. The phrases "low resolution sequence," "low resolution sequence data," and "sparse sequence data" are used interchangeably herein with respect to a genomic nucleic acid sequence of a cell (e.g., an embryo) or an organism and refer to nucleotide base sequence information for genomic nucleic acid obtained by a low coverage, low breadth sequencing method.

As used herein, the phrase "genomic feature" may refer to a genomic region with some annotated function (e.g., a gene, protein coding sequence, mRNA, tRNA, rRNA, repeat sequence, inverted repeat, miRNA, siRNA, etc.) or to a genetic/genomic variation (e.g., single nucleotide polymorphism/variation, insertion/deletion sequence, copy number variation (CNV), inversion, etc.), meaning that a single gene or group of genes (in DNA or RNA) has undergone a change, relative to a particular species or to a subpopulation within a particular species, due to mutation, recombination/crossover, or genetic drift.

Genomic variations can be identified using a variety of techniques, including but not limited to: array-based methods (e.g., DNA microarrays, etc.), real-time/digital/quantitative PCR instrument methods, and complete or targeted nucleic acid sequencing systems (e.g., NGS systems, capillary electrophoresis systems, etc.). By nucleic acid sequencing, coverage data can be obtained at single base resolution.

The phrase "mosaic embryo" refers to an embryo comprising two or more cytogenetically distinct cell lines. For example, a mosaic embryo may contain cell lines with different types of aneuploidy, or a mixture of euploid cells and genetically abnormal cells whose DNA carries genetic variations that may be detrimental to the embryo's viability during pregnancy.

The phrase "SNV density" of a locus refers to a value obtained by dividing the number of SNVs identified within the locus by the total number of sequence counts identified in the same locus of a sample, where a locus is a dynamic region of interest within a chromosome.
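As a minimal sketch of this definition, assuming per-locus SNV counts and sequence counts are already available, the density can be computed as a simple ratio. The function names, the locus labels, and the numbers in the usage example are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Dict, Tuple


def snv_density(snv_count: int, sequence_count: int) -> float:
    """SNV density of one locus: SNVs in the locus / sequence counts in the locus."""
    if sequence_count <= 0:
        raise ValueError("locus has no sequence counts; density is undefined")
    return snv_count / sequence_count


def snv_density_by_locus(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each locus label to its SNV density.

    `counts` maps a locus label to (snv_count, sequence_count).
    """
    return {locus: snv_density(snvs, seqs) for locus, (snvs, seqs) in counts.items()}


if __name__ == "__main__":
    # Invented counts for two illustrative loci.
    sample = {"locus_A": (12, 400), "locus_B": (30, 500)}
    for locus, density in snv_density_by_locus(sample).items():
        print(locus, round(density, 4))
```

Raising an error for a locus with no sequence counts is a design choice made here so that empty loci are not silently reported with a density of zero.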

Nucleic acid sequence data generation

Some embodiments of the methods and systems provided herein for analyzing genomic nucleic acids and classifying genomic features include analyzing nucleotide sequences of a genome of a cell and/or organism. Nucleic acid sequence data can be obtained using a variety of methods described herein and/or known in the art. In one example, the genomic nucleic acid sequence of a cell (e.g., an embryonic cell) can be obtained by next generation sequencing (NGS) of a DNA sample extracted from the cell. NGS, also known as second-generation sequencing, is based on high-throughput, massively parallel sequencing technologies that sequence, in parallel, millions of nucleotides generated by nucleic acid amplification of a DNA sample (e.g., extracted from an embryo) (see, e.g., Kulski (2016) "Next-Generation Sequencing-An Overview of the History, Tools and 'Omic' Applications," in Next Generation Sequencing-Advances, Applications and Challenges, edited by j.

Nucleic acid samples that require sequencing by NGS can be obtained in a variety of ways, depending on the source of the sample. For example, human nucleic acids can be readily obtained via a buccal swab to collect cells from which the nucleic acids are then extracted. To obtain the optimal amount of DNA from an embryo for sequencing (e.g., for pre-implantation genetic screening), cells (e.g., 5-7 cells) are typically collected by trophectoderm biopsy at the blastocyst stage. Prior to sequencing via NGS, DNA samples require processing including, for example, fragmentation, amplification, and adaptor ligation. Manipulation of nucleic acids in such processes can introduce artifacts in the amplified sequence (e.g., GC bias associated with Polymerase Chain Reaction (PCR) amplification) and limit the size of sequence reads. Thus, NGS methods and systems are associated with error rates that may vary from system to system.

In addition, software used to identify bases in sequence reads (i.e., base calling) can affect the accuracy of sequence data from NGS. Such artifacts and limitations can make it difficult to sequence and map long repetitive regions of the genome and to identify polymorphic alleles and aneuploidies in the genome. For example, because about 40% of the human genome consists of repetitive DNA elements, short single reads of identical sequence that align to a repetitive element in a reference genome often cannot be accurately mapped to a particular region of the genome. One way to address, and possibly reduce, some of the effects of errors and/or imperfections in sequence determination is to increase sequencing coverage or depth. However, increased sequencing coverage is associated with increased sequencing time and cost. Paired-end sequencing can also be used, which improves the accuracy of sequence read placement when sequences are mapped to a genome or reference, e.g., in long repeat regions, and improves the resolution of structural rearrangements (e.g., gene deletions, insertions, and inversions). For example, in some embodiments of the methods provided herein, the use of data obtained from paired-end NGS of nucleic acids from embryos increases read mapping by an average of 15%. Paired-end sequencing methods are known in the art and/or described herein, and involve determining the sequence of a nucleic acid fragment in both directions (i.e., one read from one end of the fragment and a second read from the opposite end). Paired-end sequencing also effectively increases the redundancy of sequencing coverage, particularly in difficult genomic regions, by doubling the number of reads.

Analysis of nucleic acid sequences

In some embodiments of the methods and systems provided herein for analyzing genomic nucleic acids and classifying genomic features, nucleic acid sequences obtained from a cell (e.g., an embryonic cell) or organism are used to reconstruct the genome (or a portion thereof) of the cell/organism using a genome mapping approach. In general, genome mapping involves matching sequences to a reference genome (e.g., the human genome) in a process called alignment. Examples of human reference genomes that may be used in the mapping process include releases from the Genome Reference Consortium, such as GRCh37 (hg19), released in 2009, and GRCh38 (hg38), released in 2013 (see, e.g., https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19 and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39). Through alignment, sequence reads are assigned to genomic loci, typically using a computer program that performs the sequence matching. Many alignment programs are publicly available, including Bowtie (see, e.g., http://bowtie-bio.sourceforge.net/manual.shtml) and BWA (see, e.g., http://bio-bwa.sourceforge.net/). Sequences that have been processed (e.g., to remove PCR duplicates and low-quality sequences) and matched to a locus are often referred to as aligned sequences or aligned reads.

When sequence reads are mapped to a genomic reference, single nucleotide variations (SNVs) or single nucleotide polymorphisms (SNPs) can be identified. Although one of ordinary skill in the art may distinguish between the terms SNV and SNP, the two terms are used interchangeably in the various embodiments herein; use of either term therefore encompasses both as applied to the analysis of the received sequencing data. Single nucleotide variations/polymorphisms are the result of variation at a single nucleotide position in the genome. Several different NGS analysis programs for SNV detection are publicly available, known in the art, and/or described herein. The present method uses BCFtools (open source) to process the aligned sequencing data and generate SNV/genotype calls for downstream processes.

The detection and identification of genomic features such as chromosomal abnormalities (e.g., aneuploidy and CNV) by genome mapping of sequences from a nucleic acid sample of a cell or organism presents particular challenges when the sequence data is obtained by low-coverage, low-depth sequencing. The entire genome is not interrogated, and the positions that are interrogated are particularly susceptible to bias and error introduced by the methods used to generate the sequencing data, including but not limited to whole genome amplification, library preparation, and the choice of next generation sequencing system and method. Computer programs and systems are known in the art and/or described herein for increasing the ease and/or accuracy of sequence data interpretation when identifying certain genomic features. Systems and methods for automatically detecting chromosomal abnormalities, including segmental duplications/deletions, mosaic features, aneuploidies, and some forms of polyploidy, are described, for example, in U.S. Patent Application Publication No. 2020/0111573, which is incorporated herein by reference. Such methods include denoising/normalization (denoising raw sequence reads and normalizing genomic sequence information to correct for locus effects) as well as machine learning and artificial intelligence that interpret (or decode) locus scores into a karyotype. For example, after sequencing is complete, the raw sequence data is demultiplexed (attributed to a given sample), reads are aligned to a reference genome (e.g., hg19), and the total number of reads in each 100 kilobase pair bin is counted. The data is normalized for GC content and depth and tested against a baseline generated from samples of known outcome. Statistical deviations from a copy number of 2 are then reported as aneuploidy; if no deviation is present, the sample is reported as euploid. Using this approach, meiotic and mitotic aneuploidies can be distinguished from each other based on CNV metrics. Based on the deviation from normal, a karyotype is generated with the total number of chromosomes present, any aneuploidies present, and the mosaicism level (if applicable) of these aneuploidies.
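The read-counting step described above (total reads per 100 kilobase pair bin, followed by depth normalization) might be sketched as follows. This is not the implementation of the cited publication: the function names and the input format, a list of (chromosome, 0-based start position) pairs, are assumptions made for illustration; GC-content correction and baseline testing are omitted; and bins that receive no reads are simply absent from the result.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BIN_SIZE_BP = 100_000  # 100 kilobase pair bins, as described in the text

BinKey = Tuple[str, int]  # (chromosome, bin index)


def count_reads_per_bin(alignments: Iterable[Tuple[str, int]],
                        bin_size: int = BIN_SIZE_BP) -> Counter:
    """Count aligned reads per fixed-width bin.

    `alignments` yields (chromosome, 0-based start position) for each read.
    """
    counts: Counter = Counter()
    for chrom, start in alignments:
        counts[(chrom, start // bin_size)] += 1
    return counts


def depth_normalize(counts: Dict[BinKey, int]) -> Dict[BinKey, float]:
    """Scale bin counts so the mean over observed bins equals 1.0."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    n_bins = len(counts)
    return {key: n * n_bins / total for key, n in counts.items()}


if __name__ == "__main__":
    # Four invented alignments falling into three bins.
    reads = [("chr1", 50), ("chr1", 99_999), ("chr1", 100_000), ("chr2", 250_000)]
    binned = count_reads_per_bin(reads)
    for key, value in sorted(depth_normalize(binned).items()):
        print(key, value)
```

Dividing by the mean count makes samples sequenced to different depths comparable bin by bin, which is the purpose the text assigns to depth normalization.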

Artifacts, variations in coverage, and errors that can occur in NGS also present challenges to accurately identify genomic variations using low coverage sequencing data. Therefore, methods are needed that can verify whether genomic variations identified from low coverage sequencing data are actually genuine genomic variations to ensure that they are properly called.

Provided herein are improved, efficient, fast, and cost-effective methods and systems for validating genomic variant calls (particularly CNV calls) made using low-coverage sequencing data.

Validating CNV calls using SNV density

The systems and methods disclosed herein rest on the observation that true biological changes in copy number (i.e., CNVs) produce a stronger correlation between the CNV signal and SNV density normalized by total sequencing coverage than artifactual changes in sequencing coverage do. Historically, SNV density data has not been used to validate CNV calls at sequencing coverage levels below 15X. In its raw form, the variation in SNV density between different loci is typically greater than the variation due to copy number changes. This drawback is addressed by incorporating a normalization step that eliminates the variation in SNV density between different loci, thereby making SNV density useful for validating CNV calls made using low-coverage genomic sequencing data. This is a significant improvement over traditional methods, which require data at sequencing coverage levels of 15X or higher, because the higher the required sequencing coverage, the more costly and time-consuming (lower throughput) the analysis.

As shown in Fig. 1, the red circle 102 marks the correlation of SNV density normalized by total sequencing coverage when a true biological variation is present in the embryo (a variation also observed in the CNV profile; see the red arrows pointing to CNV profile 104). Where a real biological change is present, the normalized CNV bin scores (Y-axis) and SNV density scores (X-axis) of the individual bins are more strongly correlated, as shown by the quasi-linear relationship represented by line 106, than where the signal indicated by the CNV bins is an artifact or noise, as shown by the SNV densities found in circle 108 and the reduced slope of the corresponding trend line 110. Thus, the method uses these correlation values between CNV bin scores and SNV density scores when determining whether changes identified by the CNV method are validated by the methods described in this disclosure.
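One way to quantify the relationship depicted in Fig. 1 is sketched below, assuming a Pearson correlation coefficient and an ordinary least-squares trend-line slope. The specification does not state which correlation statistic is used, so both choices, along with the function names and the toy values, are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def _centered_sums(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Return (sum of cross products, sum of squares of x, sum of squares of y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(snv_scores: Sequence[float], cnv_bin_scores: Sequence[float]) -> float:
    """Pearson correlation between SNV density scores (x) and CNV bin scores (y)."""
    sxy, sxx, syy = _centered_sums(snv_scores, cnv_bin_scores)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def trend_slope(snv_scores: Sequence[float], cnv_bin_scores: Sequence[float]) -> float:
    """Least-squares slope of CNV bin score as a function of SNV density score."""
    sxy, sxx, _ = _centered_sums(snv_scores, cnv_bin_scores)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    return sxy / sxx


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(pearson_r(x, [2.0, 4.0, 6.0, 8.0]))    # quasi-linear case, r close to 1
    print(pearson_r(x, [1.0, -1.0, -1.0, 1.0]))  # uncorrelated case, r close to 0
```

A strong correlation with a steep slope corresponds to the quasi-linear case of line 106, while a weak correlation with a shallow slope corresponds to the artifact case of trend line 110.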

Fig. 2 is a graphical depiction of SNV density from a clinical embryo sample 204 compared to the average SNV density of 100 normal (CNV-free) embryo samples 202, according to various embodiments.

The normalization operation disclosed herein takes advantage of the fact that: the SNV density in samples without CNV calls follows a consistent pattern that can be used to normalize the SNV density. Thus, as shown in fig. 2, normalization of SNV density may involve dividing the SNV density 204 (from the clinical embryo sample) of the locus by the average SNV density 202 in a normal sample baseline group (i.e., 100 normal female embryos). The normalization function is shown in equation 1.

Equation 1:

D_norm(locus, baseline sample) = (Sample SNV Density at Locus) / (Average Baseline SNV Density at Locus)

The resulting normalized SNV density can then be used to validate count-based CNV calls.
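A minimal sketch of the normalization in equation 1 follows. The data layout (dictionaries keyed by a locus label), the function name, and the numeric values are assumptions made for illustration; the disclosure only specifies the ratio itself.

```python
def normalize_snv_density(sample_density, baseline_densities):
    """Apply equation 1 at every locus.

    sample_density: dict mapping locus label -> SNV density in the test sample.
    baseline_densities: list of dicts (one per normal baseline sample) with the
        same locus labels.
    Returns a dict mapping locus label -> normalized SNV density, or None where
    the baseline average is zero and the ratio is undefined.
    """
    normalized = {}
    for locus, value in sample_density.items():
        baseline_values = [baseline[locus] for baseline in baseline_densities]
        baseline_mean = sum(baseline_values) / len(baseline_values)
        normalized[locus] = value / baseline_mean if baseline_mean else None
    return normalized


# Hypothetical densities: the two loci differ a lot in raw SNV density,
# but both carry the same relative excess over the baseline.
sample = {"chr1:0-1Mb": 30.0, "chr1:1-2Mb": 12.0}
baselines = [
    {"chr1:0-1Mb": 18.0, "chr1:1-2Mb": 7.0},
    {"chr1:0-1Mb": 22.0, "chr1:1-2Mb": 9.0},
]
normalized = normalize_snv_density(sample, baselines)
print(normalized)
```

The example is chosen so that the raw densities (30 and 12) differ between loci while the normalized values agree, which is the locus-to-locus variation the normalization step is meant to remove.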

As shown in fig. 3, potential CNV calls were made using a count-based approach on chromosome 1 (deletion) 302, chromosome 7 (duplication) 304, chromosome 14 (duplication) 306, and chromosome 21 (duplication) 308. These CNV calls are validated against a normalized SNV density map, which includes a preset confidence interval for verifying whether each potential CNV call is in fact authentic. In this case, all four CNV calls were verified as true CNV calls because the graph shows that the SNV density at the chromosomal location of each CNV call falls outside of the preset confidence interval.

Fig. 4 is an exemplary flow diagram illustrating a method for verifying CNV calls made to an embryo in accordance with various embodiments.

In step 402, embryo sequencing data is received by one or more processors. In various embodiments, the embryo may be a human embryo. In various embodiments, the embryo is a non-human embryo.

In step 404, the received embryo sequencing data is aligned to a reference genome by one or more processors. In various embodiments, the reference genome can be a whole genome obtained from a single individual. In various embodiments, the reference genome can be a composite whole genome from multiple individuals. Examples of reference genomes that can be used for the alignment process include, but are not limited to, genomes published by the Genome Reference Consortium, such as GRCh37 (hg19) published in 2009 and GRCh38 (hg38) published in 2013 (see, e.g., https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19 and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39).

In step 406, genomic variant regions in the aligned embryo sequencing data are identified by the one or more processors. In various embodiments, the genomic variant region is a CNV region identified using a count-based CNV calling method. In various embodiments, the genomic variation region is an aneuploidy region. In various embodiments, the genomic variant region is a polyploidy region. In various embodiments, the genomic variant region comprises a sequence segment representing the entire chromosome. In various embodiments, the genomic variant region comprises a sequence segment that represents only a portion of a chromosome.

In step 408, the number of SNVs in the identified genomic variant region is counted by the one or more processors.
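The counting in step 408 can be sketched as follows, assuming SNV calls are available as a sorted list of positions on one chromosome and that the region is split into fixed-size windows. The coordinate convention (inclusive start and end), the window size, and the function names are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right


def count_snvs_in_region(sorted_positions, start, end):
    """Count SNV positions p with start <= p <= end (inclusive coordinates)."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


def snv_counts_per_window(sorted_positions, region_start, region_end, window_size):
    """Split a region into consecutive windows and count SNVs in each window."""
    counts = []
    window_start = region_start
    while window_start <= region_end:
        window_end = min(window_start + window_size - 1, region_end)
        counts.append(count_snvs_in_region(sorted_positions, window_start, window_end))
        window_start += window_size
    return counts


# Hypothetical SNV positions on one chromosome, sorted in ascending order.
positions = [105, 250, 260, 999, 1000, 1500, 2100]
total = count_snvs_in_region(positions, 100, 2099)
per_window = snv_counts_per_window(positions, 100, 2099, window_size=1000)
print(total, per_window)
```

Binary search keeps each count logarithmic in the number of SNV calls, which matters little for one region but helps when many candidate regions are checked against the same call set.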

In step 410, the one or more processors normalize the count of SNVs in the identified genomic variant region against a baseline count of SNVs for a reference region corresponding to the identified genomic variant region, generating a normalized SNV density for the genomic variant region. In various embodiments, the baseline count of SNVs is obtained from sequencing data derived from one or more normal (non-CNV) samples. In various embodiments, the identified variant region and the reference region cover the same corresponding genomic segment (or genomic location). In various embodiments, the identified genomic variant region and the reference region comprise sequence segments that represent an entire chromosome. In various embodiments, the identified genomic variant region and the reference region comprise sequence segments that represent only a portion of a chromosome.

In step 412, the identified genomic variant region is validated by the one or more processors if the normalized SNV density score in the identified genomic variant region satisfies the tolerance criterion. In various embodiments, the identified genomic variant region is validated (as a true copy number variation) if its SNV density falls outside a preset confidence interval of the average SNV density under the null hypothesis that there is no true copy number variation. In various embodiments, the preset confidence interval is about 90%. In various embodiments, the preset confidence interval is about 95%. In various embodiments, the preset confidence interval is about 96%, about 97%, about 98%, or about 99%.

A duplication is validated if the SNV density is above the upper limit of the preset confidence interval, and a deletion is validated if the SNV density is below the lower limit of the preset confidence interval. The preset confidence interval is defined under a normality assumption as C ± Z × sigma/sqrt(N), where C is the center or expected value of the average SNV density under the null hypothesis, N is the number of windows overlapping the identified genomic variant region, sigma is the global standard deviation of the normalized SNV density across all autosomes, and Z is the X-th percentile of the standard normal distribution. The "+" sign gives the upper limit of the confidence interval, and the "-" sign gives the lower limit.
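The null-hypothesis check can be sketched as below. Several points are assumptions rather than statements from the disclosure: the center C is taken as 1.0 (the expected normalized density with no copy number change), Z is taken as the two-sided quantile for the chosen confidence level, a duplication is treated as validated above the upper limit and a deletion below the lower limit (consistent with the fig. 3 discussion, where validated calls fall outside the interval), and all function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def null_confidence_interval(sigma, n_windows, confidence=0.95, center=1.0):
    """Return (lower, upper) = C -/+ Z * sigma / sqrt(N) under the null hypothesis."""
    # Assumption: two-sided interval, so Z is the (0.5 + confidence/2) quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / math.sqrt(n_windows)
    return center - half_width, center + half_width


def validate_call(mean_density, call_type, sigma, n_windows, confidence=0.95):
    """Validate a CNV call when its mean normalized SNV density leaves the interval."""
    lower, upper = null_confidence_interval(sigma, n_windows, confidence)
    if call_type == "duplication":
        return mean_density > upper
    if call_type == "deletion":
        return mean_density < lower
    raise ValueError("call_type must be 'duplication' or 'deletion'")


# Hypothetical inputs: sigma over all autosomes = 0.2, region spans 100 windows.
lower, upper = null_confidence_interval(sigma=0.2, n_windows=100)
print(f"95% interval under the null: ({lower:.4f}, {upper:.4f})")
print(validate_call(1.45, "duplication", sigma=0.2, n_windows=100))
print(validate_call(1.02, "duplication", sigma=0.2, n_windows=100))
```

Because the half-width shrinks with sqrt(N), calls spanning many windows (for example, a whole chromosome) are tested against a tighter interval than short segmental calls.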

In various embodiments, the tolerance criterion is an expected SNV density from a reference region of the mosaic embryo.

In various embodiments, the identified genomic variant region is validated (as a true copy number variation with mosaic level percentage m) if its SNV density is above the lower limit (for duplications) or below the upper limit (for deletions) of the preset confidence interval under the mosaic-embryo alternative hypothesis. In various embodiments, the preset confidence interval is about 90%. In various embodiments, the preset confidence interval is about 95%. In various embodiments, the preset confidence interval is about 96%, about 97%, about 98%, or about 99%.

The preset confidence interval for the alternative hypothesis is defined under a normality assumption as C ± Z × sigma/sqrt(N), where C is the center or expected value of the average SNV density under the alternative hypothesis, C = E(SNV density | m) = 1.0 ± 0.5 × m/100, N is the number of windows overlapping the identified genomic variant region, sigma is the global standard deviation of the normalized SNV density across all autosomes, and Z is the X-th percentile of the standard normal distribution. In the confidence interval, the "+" sign gives the upper limit and the "-" sign gives the lower limit.
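A sketch of the mosaic check under stated assumptions: in the expression for C, the "+" branch is read as applying to duplications and the "-" branch to deletions (an interpretation, since a gained copy raises the expected density and a lost copy lowers it), Z is taken as the two-sided quantile for the chosen confidence level, and the function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def expected_density(mosaic_percent, call_type):
    """C = E(SNV density | m) = 1.0 +/- 0.5 * m / 100."""
    shift = 0.5 * mosaic_percent / 100.0
    if call_type == "duplication":
        return 1.0 + shift  # assumed: "+" branch for gains
    if call_type == "deletion":
        return 1.0 - shift  # assumed: "-" branch for losses
    raise ValueError("call_type must be 'duplication' or 'deletion'")


def validate_mosaic_call(mean_density, call_type, mosaic_percent,
                         sigma, n_windows, confidence=0.95):
    """Validate against the interval centered on the mosaic expectation."""
    center = expected_density(mosaic_percent, call_type)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # assumed two-sided
    half_width = z * sigma / math.sqrt(n_windows)
    if call_type == "duplication":
        return mean_density > center - half_width  # above the lower limit
    return mean_density < center + half_width      # below the upper limit


# Hypothetical 40% mosaic duplication: C = 1.2, lower limit is about 1.1608.
print(validate_mosaic_call(1.18, "duplication", 40, sigma=0.2, n_windows=100))
print(validate_mosaic_call(1.10, "duplication", 40, sigma=0.2, n_windows=100))
```

Setting m to 100 recovers the full (non-mosaic) expectation of 1.5 for a duplication and 0.5 for a deletion, and m of 0 recovers the null center of 1.0.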

In various embodiments, the identified genomic variant region is validated if it includes a number of SNVs that exceeds a preset number of variances (variance number) of SNVs above or below the baseline count of SNVs for the reference region.

The system 500, shown in fig. 5, includes a genomic sequence analyzer 502, a data storage unit 504, a computing device/analysis server 506, and a display 514.

The genomic sequence analyzer 502 may be communicatively connected to the data storage unit 504 via a serial bus (if both are configured as an integrated instrument platform) or via a network connection (if both are distributed/separate devices). The genomic sequence analyzer 502 may be configured to process and analyze one or more genomic sequence data sets obtained from an embryo sample, including a plurality of fragment sequence reads. In various embodiments, the genomic sequence analyzer 502 can process and analyze genomic sequence data from a next-generation sequencing platform or sequencer, for example, a MiSeq™, NextSeq™ 500/550 (High Output), HiSeq 2500™ (Rapid Run), HiSeq™ 3000/4000, or NovaSeq sequencer.

In various embodiments, the processed and analyzed genome sequence data set may then be stored in the data storage unit 504 for subsequent processing. In various embodiments, one or more sets of raw genomic sequence data may also be stored in data storage unit 504 prior to processing and analysis. Thus, in various embodiments, the data storage unit 504 is configured to store one or more sets of genomic sequence data. In various embodiments, the set of genomic sequence data processed and analyzed may be fed in real-time to computing device/analysis server 506 for further downstream analysis.

In various embodiments, data storage unit 504 is communicatively connected to computing device/analytics server 506. In various embodiments, the data storage unit 504 and the computing device/analytics server 506 may be part of an integrated apparatus. In various embodiments, the data storage unit 504 may be hosted by a device different from the computing device/analytics server 506. In various embodiments, the data storage unit 504 and the computing device/analytics server 506 may be part of a distributed network system. In various embodiments, computing device/analytics server 506 may be communicatively connected to data storage unit 504 via a network connection, which may be a "hardwired" physical network connection (e.g., the internet, a LAN, a WAN, a VPN, etc.) or a wireless network connection (e.g., Wi-Fi, WLAN, etc.). In various embodiments, the computing device/analytics server 506 may be a workstation, mainframe computer, distributed computing node (part of a "cloud computing" or distributed network system), personal computer, mobile device, or the like.

In various embodiments, the computing device/analytics server 506 may be configured to host an alignment engine 508, a genomic variant invoker 510, and a verification engine 512.

Alignment engine 508 can be configured to receive embryo sequencing data and align it with a reference genome. In various embodiments, the reference genome can be a whole genome obtained from a single individual. In various embodiments, the reference genome can be a composite whole genome from multiple individuals. Examples of reference genomes that can be used for the alignment process include, but are not limited to, genomes published by the Genome Reference Consortium, such as GRCh37 (hg19) published in 2009 and GRCh38 (hg38) published in 2013 (see, e.g., https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19 and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39).

The genomic variation invoker 510 may be configured to identify genomic variation regions in the aligned embryo sequencing data. In various embodiments, the genomic variant region is a CNV region identified using a count-based CNV calling method. In various embodiments, the genomic variation region is an aneuploidy region. In various embodiments, the genomic variant region is a polyploidy region. In various embodiments, the genomic variant region comprises a sequence segment representing the entire chromosome. In various embodiments, the genomic variant region comprises a sequence segment that represents only a portion of a chromosome.

The verification engine 512 can be configured to count a number of Single Nucleotide Variations (SNVs) in the identified genomic variation region and normalize the SNV counts against a baseline count of SNVs for a reference region corresponding to the identified genomic variation region to generate a normalized SNV density for the identified genomic variation region, and verify the identified genomic variation region if the SNV density in the identified genomic variation region meets a tolerance criterion.

In various embodiments, the baseline count of SNVs is obtained from sequencing data derived from one or more normal (non-CNV) samples. In various embodiments, the identified variant region and the reference region cover the same corresponding genomic segment (or genomic location). In various embodiments, the identified genomic variant region and the reference region comprise sequence segments that represent an entire chromosome. In various embodiments, the identified genomic variant region and the reference region comprise sequence segments that represent only a portion of a chromosome.

In various embodiments, the identified genomic variant region is validated (as a true copy number variation) if its SNV density falls outside a preset confidence interval of the average SNV density under the null hypothesis that there is no true copy number variation. In various embodiments, the preset confidence interval is about 90%. In various embodiments, the preset confidence interval is about 95%. In various embodiments, the preset confidence interval is about 96%, about 97%, about 98%, or about 99%.

A duplication is validated if the SNV density is above the upper limit of the preset confidence interval, and a deletion is validated if the SNV density is below the lower limit of the preset confidence interval. The preset confidence interval is defined under a normality assumption as C ± Z × sigma/sqrt(N), where C is the center or expected value of the average SNV density under the null hypothesis, N is the number of windows overlapping the identified genomic variant region, sigma is the global standard deviation of the normalized SNV density across all autosomes, and Z is the X-th percentile of the standard normal distribution. The "+" sign gives the upper limit of the confidence interval, and the "-" sign gives the lower limit.

In various embodiments, the tolerance criteria are derived from the expected SNV density of the reference region of the mosaic embryo.

In various embodiments, the identified genomic variant region is validated if the identified genomic variant region includes a number of SNVs that exceeds a preset number of variances of SNVs above or below the baseline count of SNVs for the reference region.

After verification of the identified genomic variant region has been performed, the results may be displayed as a result or summary on a display or client 514 that is communicatively connected to the computing device/analysis server 506. In various embodiments, the display or client 514 may be a thin client computing device. In various embodiments, the display or client 514 may be a web browser (e.g., INTERNET EXPLORER™, FIREFOX™, SAFARI™, etc.) that may be used to control the operation of the genomic sequence analyzer 502, the data store 504, the alignment engine 508, the genomic variation invoker 510, and the validation engine 512.

Results of the experiment

TABLE 1

	True positive	True negative	False positive	False negative
Total number	51	338	11	19

As shown in table 1 above, a total of 70 triploid samples and 349 diploid samples with known truth (established by SNP array) were interrogated for the presence of female triploidy by the methods disclosed herein. In the results above, a "true positive" is a correct call of the disease state (polyploid), a "true negative" is a correct call of the euploid state, a "false positive" is an incorrect call of the disease state in a euploid embryo, and a "false negative" is an incorrect call of the euploid state in a disease-state embryo.

The table clearly shows the high accuracy of the disclosed method in verifying the presence of authentic CNV in embryos.
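The counts in table 1 can be turned into standard summary metrics with a few lines of arithmetic. The counts are taken from table 1; the choice of metrics and the variable names are illustrative and are not part of the disclosure.

```python
# Counts from table 1 (70 triploid and 349 diploid samples).
true_positive, true_negative = 51, 338
false_positive, false_negative = 11, 19

total = true_positive + true_negative + false_positive + false_negative

# Fraction of triploid samples correctly called as the disease state.
sensitivity = true_positive / (true_positive + false_negative)
# Fraction of diploid samples correctly called euploid.
specificity = true_negative / (true_negative + false_positive)
# Fraction of all samples called correctly.
accuracy = (true_positive + true_negative) / total

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

With these counts, sensitivity is about 72.9% (51 of 70), specificity is about 96.8% (338 of 349), and overall accuracy is about 92.8% (389 of 419).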

Computer-implemented system

In various embodiments, the method of verifying CNV in an embryo using SNV density may be implemented via computer software or hardware. That is, as shown in fig. 5, the methods disclosed herein may be implemented on a computing device/analytics server 506 that includes an alignment engine 508, a data store 504, a genomic variant invoker 510, and a validation engine 512. In various embodiments, computing device/analytics server 506 may be communicatively connected to display device 514 via a direct connection or through an internet connection.

It should be appreciated that the various engines depicted in FIG. 5 may be combined or collapsed into a single engine, component, or module as desired for a particular application or system architecture. Further, in various embodiments, the alignment engine 508, data store 504, genomic variant invoker 510, and verification engine 512 can include additional engines or components as required by a particular application or system architecture.

Fig. 6 is a block diagram illustrating a computer system according to various embodiments. In various embodiments of the present teachings, computer system 600 may include a bus 602 or other communication mechanism for communicating information, and a processor 604 coupled with bus 602 for processing information. In various embodiments, computer system 600 may also include a memory, which may be a Random Access Memory (RAM)606 or other dynamic storage device, coupled to bus 602 for determining instructions to be executed by processor 604. The memory may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 604. In various embodiments, computer system 600 may also include a Read Only Memory (ROM)608 or other static storage device coupled to bus 602 for storing static information and instructions for processor 604. A storage device 610, such as a magnetic disk or optical disk, may be provided and coupled to bus 602 for storing information and instructions.

In various embodiments, computer system 600 may be coupled via bus 602 to a display 612, such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), for displaying information to a computer user. An input device 614, including alphanumeric and other keys, may be coupled to bus 602 for communicating information and command selections to processor 604. Another type of user input device is cursor control 616, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 604 and for controlling cursor movement on display 612. This input device 614 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), which allows the device to specify positions in a plane. However, it should be understood that input devices 614 that allow 3-dimensional (x, y, and z) cursor movement are also contemplated herein.

Consistent with certain embodiments of the present teachings, the results may be provided by computer system 600 in response to processor 604 executing one or more sequences of one or more instructions contained in memory 606. Such instructions may be read into memory 606 from another computer-readable medium or computer-readable storage medium, such as storage device 610. Execution of the sequences of instructions contained in memory 606 may cause processor 604 to perform processes described herein. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement the teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" (e.g., data store, etc.) or "computer-readable storage medium" as used herein refers to any medium that participates in providing instructions to processor 604 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media may include, but are not limited to, optical, solid-state, and magnetic disks, such as storage device 610. Examples of volatile media may include, but are not limited to, dynamic memory, such as memory 606. Examples of transmission media may include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 602.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.

In addition to computer readable media, instructions or data may be provided as signals on transmission media included in a communication device or system to provide a sequence of one or more instructions to processor 604 of computer system 600 for execution. For example, the communication device may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein. Representative examples of data communication transmission connections may include, but are not limited to, telephone modem connections, Wide Area Networks (WANs), Local Area Networks (LANs), infrared data connections, NFC connections, and the like.

It should be understood that the flow charts, diagrams, and accompanying disclosed methods described herein may be implemented using computer system 600 as a standalone device or over a distributed network of shared computer processing resources, such as a cloud computing network.

The methods described herein may be implemented in a variety of ways depending on the application. For example, the methods may be implemented in hardware, firmware, software, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.

In various embodiments, the methods of the present teachings may be implemented as firmware and/or as software programs and applications written in conventional programming languages, such as C, C++, Python, and the like. If implemented as firmware and/or software, the embodiments described herein may be implemented on a non-transitory computer-readable medium in which a program for causing a computer to perform the above-described methods is stored. It should be understood that the various engines described herein may be provided on a computer system, such as computer system 600, whereby processor 604 will perform the analysis and determinations provided by these engines, subject to instructions provided by any one or combination of memory components 606/608/610 and user input provided via input device 614.

While the present teachings are described in conjunction with various embodiments, the present teachings are not intended to be limited to these embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those skilled in the art.

In describing various embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described, and one skilled in the art can readily appreciate that the sequence may be varied and still remain within the spirit and scope of the various embodiments.

Claims

1. A method of validating a genomic variant region in an embryo, comprising:

receiving, by one or more processors, embryo sequencing data;

aligning, by one or more processors, the received embryo sequencing data to a reference genome;

identifying, by one or more processors, genomic variant regions in the aligned embryo sequencing data;

counting, by one or more processors, a number of single nucleotide variations in the identified genomic variation region;

normalizing, by one or more processors, the count number of single nucleotide variations in the identified genomic variation region to a baseline count of single nucleotide variations for a reference region corresponding to the identified genomic variation region to generate a normalized density of single nucleotide variations for the genomic variation region; and

verifying, by the one or more processors, the identified genomic variation region if the normalized single nucleotide variation density in the identified genomic variation region satisfies a tolerance criterion.

2. The method of claim 1, wherein the genomic variant region is a copy number variant region.

3. The method of claim 1, wherein the genomic variant region is an aneuploidy region.

4. The method of claim 1, wherein the genomic variant region is a polyploidy region.

5. The method of claim 1, wherein the reference region is the exact length of the identified genomic variation region.

6. The method of claim 1, wherein the reference region is from a euploid sample.

7. The method of claim 1, wherein the tolerance criterion is an expected single nucleotide variation density from a reference region of a euploid embryo.

8. The method of claim 7, wherein the identified genomic variation region is validated if its normalized single nucleotide variation density is greater than or less than a preset confidence interval for the expected single nucleotide variation density for the reference region.

9. The method of claim 8, wherein the lower preset confidence interval is 95%.

10. The method of claim 1, wherein the tolerance criterion is an expected single nucleotide variation density from a reference region of a mosaic embryo.

11. The method of claim 10, wherein the identified genomic variation region is validated if the normalized single nucleotide variation density of the identified genomic variation region is above a preset confidence interval of the expected single nucleotide variation density of the reference region.

12. The method of claim 11, wherein the preset confidence interval is 95%.

13. The method of claim 1, wherein the tolerance criterion is a predetermined number of variances of the single nucleotide variation above or below a baseline count of single nucleotide variations for the reference region.

14. A non-transitory computer-readable medium storing computer instructions for validating a genomic variant region in an embryo, comprising:

receiving, by one or more processors, embryo sequencing data;

15. A system for validating a region of genomic variation in an embryo, comprising:

a data store for storing embryo sequencing data;

a computing device communicatively coupled to the data store, including,

an alignment engine configured to receive embryo sequencing data and align it with a reference genome,

a genomic variation invoker configured to identify genomic variation regions in the aligned embryo sequencing data, and

a verification engine configured to:

counting single nucleotide variations in the identified genomic variation region and normalizing the count of single nucleotide variations in the identified genomic variation region to a baseline count of single nucleotide variations for a reference region corresponding to the identified genomic variation region to generate a normalized single nucleotide variation density for the identified genomic variation region, and

validating the identified genomic variation region if the normalized single nucleotide variation density in the identified genomic variation region meets a tolerance criterion; and

a display communicatively connected to the computing device and configured to display a report containing the genomic variant region result from the validation engine.

16. The system of claim 15, wherein the genomic variant region is a copy number variant region.

17. The system of claim 15, wherein the genomic variant region is an aneuploidy region.

18. The system of claim 15, wherein the genomic variant region is a polyploidy region.

19. The system of claim 15, wherein the reference region is an exact length of the identified genomic variation region.

20. The system of claim 15, wherein the reference region is from a euploid sample.

21. The system of claim 15, wherein the tolerance criterion is an expected single nucleotide variation density from a reference region of a euploid embryo.

22. The system of claim 21, wherein the identified genomic variation region is validated if its normalized single nucleotide variation density is greater than or less than a preset confidence interval for the expected single nucleotide variation density for the reference region.

23. The system of claim 22, wherein the lower preset confidence interval is 95%.

24. The system of claim 15, wherein the tolerance criterion is an expected single nucleotide variation density from a reference region of a mosaic embryo.

25. The system of claim 24, wherein the identified genomic variation region is validated if the normalized single nucleotide variation density of the identified genomic variation region is above a preset confidence interval of the expected single nucleotide variation density of the reference region.

26. The system of claim 25, wherein the preset confidence interval is 95%.

27. The system of claim 15, wherein the tolerance criterion is a preset number of variances of a single nucleotide variation above or below a baseline count of single nucleotide variations for the reference region.