CN106715711B - Method for determining probe sequence and method for detecting genome structure variation

Publication number
CN106715711B
Authority
CN
China
Prior art keywords
region
candidate
probe
target sample
snp
Prior art date
Legal status
Active
Application number
CN201480080426.0A
Other languages
Chinese (zh)
Other versions
CN106715711A (en)
Inventor
李剑
王煜
李尉
李金良
赵霞
陈仕平
张现东
刘赛军
Current Assignee
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Application filed by BGI Shenzhen Co Ltd
Publication of CN106715711A
Application granted
Publication of CN106715711B

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Abstract

The invention provides a method for determining a probe sequence based on a reference sequence and a method for detecting genome structural variation. Wherein the method for determining the probe sequence based on the reference sequence comprises the following steps: constructing a first candidate probe set based on the plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, and each of the plurality of candidate probes contains at least one discrete high-frequency SNP; comparing a plurality of candidate probes in the first candidate probe set with the reference sequence so as to obtain comparison results; performing first screening on the first candidate probe set based on the comparison result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows with preset lengths, and respectively allocating the plurality of candidate probes in the second candidate probe set to the matched windows to determine the position information of the plurality of candidate probes; a second screening of a second set of candidate probes is performed based on said positional information and the allele frequencies of the discrete high frequency SNPs to determine said probe sequences.

Description

Method for determining probe sequence and method for detecting genome structure variation
PRIORITY INFORMATION
None
Technical Field
The invention relates to the technical field of genomics and bioinformatics, in particular to a method for determining a probe sequence and a method for detecting genome structural variation.
Background
DNA copy number variation (CNV) and loss of heterozygosity (LOH) are different types of genomic variation. CNV is a common genomic structural variation involving fragments from 1 kb to several Mb, mainly in the form of deletions and duplications at the sub-microscopic level. LOH means that a gene on one chromosome of a homologous pair is lost while the gene on the paired chromosome remains, so that only homozygous SNPs are observed over a long stretch of DNA. When LOH occurs without a change in copy number, that is, when both copies are inherited from one parent, it is called uniparental disomy (UPD). CNV, LOH and UPD are associated with many common genetic diseases, cancers and other complex diseases. Establishing a method that detects CNV, LOH and UPD accurately, comprehensively, efficiently, quickly, simply and economically is of great value for studying chromosomal variation events, determining the causes of related diseases and choosing corresponding treatment schemes.
Several detection technologies exist. PCR-based technologies include real-time fluorescence quantitative PCR and multiplex ligation-dependent probe amplification (MLPA): real-time fluorescence PCR analyzes one or a few targets at a time, while MLPA can analyze 40 sequences at a time with high sensitivity, but the detection range of both is limited to the chromosomes and regions targeted by the probes. FISH is generally used to detect specific chromosomes and cannot detect unknown regions. Chip-based technologies, including array-based comparative genomic hybridization (aCGH) and SNP arrays, can detect CNVs across the whole genome but cannot detect polyploidy and have a high miss rate for small deletions. Sequencing technologies detect structural variation across the whole genome by whole genome sequencing (WGS), or variation in a target region by target-region sequencing; four main methods are used to analyze CNV: paired-end mapping, read-depth analysis, split-read strategies, and sequence assembly comparison.
With the development of sequencing technology, it is necessary to research means for discovering genomic structural abnormalities based on sequencing results, particularly local region sequencing results, including means for discovering chromosomal aneuploidy, CNV, insertion-deletion (indel), LOH, UPD, and SNP.
Disclosure of Invention
One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the steps of: constructing a first candidate probe set based on the plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, and each of the plurality of candidate probes contains at least one discrete high-frequency SNP; comparing a plurality of candidate probes in the first candidate probe set with the reference sequence so as to obtain comparison results; performing first screening on the first candidate probe set based on the comparison result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows with preset lengths, and respectively allocating the plurality of candidate probes in the second candidate probe set to the matched windows to determine the position information of the plurality of candidate probes; a second screening of a second set of candidate probes is performed based on said positional information and the allele frequencies of the discrete high frequency SNPs to determine said probe sequences. Wherein, the allele frequency of the discrete high-frequency SNP locus is more than 10 percent, preferably not more than 90 percent, and the physical distance between the discrete high-frequency SNP locus and any other discrete high-frequency SNP locus on the reference genome is not less than the length of the candidate probe, and the length of the candidate probe is 50-250 mers.
The probes obtained by this method are used to capture the genome by hybridization, yielding a plurality of local genomic regions. Together, the captured local regions represent the whole genome, reflect genome-wide variation information, and are used to discover structural variation across the whole genome.
In another aspect of the present invention, a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation and indels, comprises the following steps: sequencing genomic nucleic acid of the target sample to obtain a genome sequencing result consisting of a plurality of reads, the sequencing optionally comprising capture with a probe obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the invention (the genome sequencing result can be obtained by extracting genomic DNA, constructing a library according to the instruction manual of a conventional high-throughput platform and sequencing on the instrument, or by capturing the genome of the target sample with such a probe and sequencing the captured fragments); dividing a reference genome into m regions, and calculating the coverage depth TD_i of region i of the target sample genome from the reads in the genome sequencing result that fall into region i, wherein m and i are natural numbers, 1 ≤ i ≤ m, and m > 10; and judging whether structural variation occurs in region i of the target sample based on the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of k reference samples, wherein k is a natural number and k ≥ 2. The coverage depth of region i of each reference sample can be obtained in the same way as the coverage depth of region i of the target sample.
By merging adjacent regions with structural variation, it is further detected whether the merged regions have large structural variation or whether the structural variation occurring in the region i spans several regions.
In a further aspect of the invention, there is provided a method suitable for detecting loss of heterozygosity, another type of genomic structural variation, comprising the steps of: obtaining a genome sequencing result of the target sample, the genome sequencing result optionally being obtained by capturing the genome of the target sample with a probe and sequencing, the probe being obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the invention; dividing the genome into m' regions; based on the reads in the genome sequencing result that fall into region i and on the population data for region i, obtaining the set of SNPs shared by region i of the target sample genome and region i of the population; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment of each SNP site in the shared SNP set, to obtain the heterozygosity set U_i of region i of the target sample genome and the heterozygosity set U_0i of region i of the population; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in region i of the target sample. The allele frequency of each SNP in the shared SNP set is greater than 0.1; the fragment of an SNP site in the shared SNP set is delimited by the two SNPs immediately upstream and downstream of that SNP; m' and i are natural numbers, 1 ≤ i ≤ m', and m' ≥ 6. The number of samples drawn should be large enough to represent the population and can be determined according to the accuracy required for detection, the statistical method, the distribution of the sample data, and the like. The population data consist of data from a plurality of samples of the same species and can be obtained by whole genome sequencing, by the method used to obtain the target sample data, or from a public database or website, such as the 1000 Genomes Project data.
In another aspect of the present invention, a computer-readable storage medium is provided for storing a program for execution by a computer, and it will be understood by those skilled in the art that when the program is executed, all or part of the steps of the methods for detecting genomic structural variation described above can be performed by instruction-related hardware. The storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.
According to a final aspect of the present invention, there is provided an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including an executable program; and a processor, which is connected with the data input unit, the data output unit and the storage unit in a data mode and is used for executing the executable program stored in the storage unit, wherein the execution of the program comprises all or part of the steps of the method for detecting the genome structural variation.
When the probes obtained by the method for determining probe sequences based on a reference sequence, or a solid-phase or liquid-phase chip containing them, are used to capture and sequence target regions, structural variation can be detected across the whole genome at low sequencing cost. CNV, LOH and UPD can be detected across all 23 pairs of human chromosomes, and the detection resolution can be adjusted by changing the average spacing of the probes, that is, by adding or removing SNP sites as required. The target-region capture sequencing and bioinformatic analysis methods provided by the invention achieve high-resolution, high-accuracy, high-throughput and low-cost detection of CNV, LOH and UPD across the whole genome. The method for detecting genomic structural variation provided by the invention is also suitable for detecting chromosomal aneuploidy, SNPs and indels, and for structural variation analysis based on whole genome sequencing data.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram showing the characteristics of SeTR probes on the whole genome in one embodiment of the present invention: (A) length distribution of the SeTR probe sequences; (B) distribution of the physical distance between two probes in the SeTR probe set.
FIG. 2 is a graph showing the results of a test using SeTR probes according to an embodiment of the present invention: (A) distribution of the coverage depth of the target regions; (B) distribution of reads supporting the ref base type and the non-ref base type.
Fig. 3 is a schematic diagram of the detection process of CNV, LOH and UPD in one embodiment of the present invention.
FIG. 4 shows the R_i baseline in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the genomic structural variation of a sample (GM50275) detected in the present invention; the circles from outside to inside are: I) chromosome information; II) change in r_i value (wavy lines); III) change in the P value corresponding to R_het; IV) change in R_het value (dots).
Detailed Description
The following describes embodiments of the present invention in detail. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
According to an embodiment of the present invention, there is provided a method for determining a probe sequence based on a reference sequence, including the steps of:
Step one: constructing a first candidate probe set
The method comprises the steps of constructing a first candidate probe set by utilizing discrete high-frequency SNP sites distributed in a genome, wherein each candidate probe in the first candidate probe set comprises at least one discrete high-frequency SNP site, the allele frequency of each discrete high-frequency SNP site is greater than 10%, the physical distance between each discrete high-frequency SNP site and any other discrete high-frequency SNP site on a reference genome is not less than the length of the candidate probe, and the length of the candidate probe is 50-250 mers.
In one embodiment of the invention, the discrete high-frequency SNPs are obtained from 1000 Genomes Project data or from other published genome data, sites with an allele frequency of less than 90% are further selected from the discrete high-frequency SNP sites obtained, and the length of the candidate probe is set to 100 mers.
In one embodiment of the present invention, each candidate probe comprises a discrete high frequency SNP site, and the discrete high frequency SNP site is located in the middle of the candidate probe. Thus, each candidate probe only contains one high-frequency SNP site, and adjacent candidate probes may or may not have overlap. The term "middle section" is used herein in relation to the term "front section" and "rear section", and can be conventionally understood, for example, a sequence with upstream and downstream 1/3 being respectively designated as "front section" and "rear section", and the central 1/3 being "middle section"; further, the discrete high frequency SNP site is located at the midpoint of the candidate probe, where the "midpoint" position, for example, a sequence contains 2n +1 nucleotides, the midpoint is the position of the n +1 th nucleotide, and when a sequence contains 2n nucleotides, the midpoint of the sequence is the position of the n or n +1 th nucleotide, so as to enhance the capture efficiency of the probe to the target discrete high frequency SNP site.
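The selection rules in step one can be sketched in Python as follows. This is only an illustrative sketch: the `Snp` record, the function names and the 0-based coordinates are assumptions, and "discrete" is read here as "at least one probe length away from the nearest other high-frequency SNP on the same chromosome".

```python
from dataclasses import dataclass

@dataclass
class Snp:
    chrom: str
    pos: int    # 0-based position on the reference (assumed convention)
    af: float   # allele frequency

def select_discrete_snps(snps, probe_len=100, min_af=0.10, max_af=0.90):
    """Keep SNPs with min_af < AF < max_af that lie at least probe_len
    away from every other high-frequency SNP on the same chromosome."""
    high = sorted((s for s in snps if min_af < s.af < max_af),
                  key=lambda s: (s.chrom, s.pos))
    kept = []
    for idx, s in enumerate(high):
        prev_ok = (idx == 0 or high[idx - 1].chrom != s.chrom
                   or s.pos - high[idx - 1].pos >= probe_len)
        next_ok = (idx == len(high) - 1 or high[idx + 1].chrom != s.chrom
                   or high[idx + 1].pos - s.pos >= probe_len)
        if prev_ok and next_ok:
            kept.append(s)
    return kept

def probe_window(snp_pos, probe_len=100):
    """Half-open (start, end) interval that puts the SNP at the probe midpoint.
    For an even length 2n the SNP is the n-th base; for 2n+1 it is base n+1."""
    start = snp_pos - (probe_len - 1) // 2
    return start, start + probe_len
```

With a 100-mer probe, an SNP at position 5000 yields the interval (4951, 5051), so the SNP is the 50th base of the probe.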
In one embodiment of the invention, the first candidate probe set is prescreened based on the GC content and/or the single-base repetition degree of the candidate probe sequences, retaining candidate probes with a GC content of 35% to 65% and/or a single-base repetition degree of less than 7. The single-base repetition degree is the number of consecutive occurrences of one base type in a sequence; for example, in TGAAAAAAAAGC the base A occurs 8 times in a row, so the A-base repetition degree of the sequence is 8. PCR and hybrid capture of a sequence are easily affected by high or low GC content and by high repetitiveness, which introduce GC bias and reduce capture specificity. The candidate probes retained by the prescreening do not hybridize with such sequences, which avoids the influence of GC bias or low-specificity capture on the result.
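A minimal Python sketch of this prescreen is given below. The function names are hypothetical, and the sketch applies both criteria together, although the text allows either one alone ("and/or").

```python
from itertools import groupby

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def max_single_base_run(seq):
    """Longest run of one repeated base (the single-base repetition degree)."""
    return max(len(list(group)) for _, group in groupby(seq.upper()))

def passes_prescreen(seq, gc_min=0.35, gc_max=0.65, max_run=7):
    """True if GC content is within [gc_min, gc_max] and the longest
    single-base run is shorter than max_run."""
    return (gc_min <= gc_content(seq) <= gc_max
            and max_single_base_run(seq) < max_run)
```

For the example sequence in the text, `max_single_base_run("TGAAAAAAAAGC")` returns 8, so the sequence fails the prescreen.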
Step two: aligning the first candidate probe set with the reference sequence to obtain an alignment result
The first candidate probe set is aligned with the reference sequence to obtain an alignment result, which gives the position of each candidate probe on the reference sequence. The reference sequence is a known sequence and may be any previously obtained reference template of the biological category to which the target sample belongs. For example, if the target sample is human, HG18 or HG19 provided by the National Center for Biotechnology Information (NCBI) can be selected as the reference sequence. A resource library containing more reference sequences can also be configured in advance, and before sequence alignment a closer reference sequence can be selected according to factors such as the sex, ethnicity and geographic origin of the target sample, which helps to obtain a more targeted probe sequence.
Step three: performing a first screening on the first candidate probe set to obtain a second candidate probe set
In one embodiment of the present invention, the candidate probes retained by the first screening satisfy either of the following two conditions: 1) candidate probes in the first candidate probe set that align to a unique position in the reference sequence; 2) candidate probes in the first candidate probe set that align to a plurality of positions in the reference sequence, wherein the mismatch ratio of the alignment is less than 10% at at least two of the plurality of positions. For example, for a candidate probe of 100 mers, 10 mismatched bases give a mismatch ratio of 10%. With a low mismatch ratio, the probe pairs almost fully complementarily with the target region during hybridization, giving a good capture effect and high specificity.
Step four: dividing the reference sequence into a plurality of windows, assigning a second set of candidate probes to the respective matching windows
The reference sequence is divided into a plurality of windows of preset length, and the candidate probes in the second candidate probe set are assigned to their matching windows according to the alignment, to obtain the position of each candidate probe within its window.
The length of the windows with the preset length can be consistent or not consistent, and can be overlapped or not overlapped, in one embodiment of the invention, the reference sequence is a reference genome, the reference genome is divided into a plurality of windows with consistent length, the window length is 10Kb, and two adjacent windows are connected but not overlapped.
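The window assignment can be sketched in Python as follows, for the embodiment with equal-length, adjacent, non-overlapping 10 Kb windows. The tuple layout of a probe and the choice to assign each probe by its midpoint are assumptions made for illustration.

```python
from collections import defaultdict

def assign_to_windows(probes, window_len=10_000):
    """Group probes by (chromosome, window index).

    probes: iterable of (chrom, start, end, allele_frequency) tuples with
    half-open coordinates. A probe is assigned to the window that contains
    its midpoint (an assumption; the text does not fix this rule).
    """
    windows = defaultdict(list)
    for probe in probes:
        chrom, start, end, _af = probe
        midpoint = (start + end) // 2
        windows[(chrom, midpoint // window_len)].append(probe)
    return dict(windows)
```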
Step five: performing a second screening on the second candidate probe set based on said positional information and the allele frequencies of the discrete high-frequency SNPs to determine the probe sequence
In one embodiment of the invention, performing the second screening comprises two steps, (a) if there are multiple candidate probes located in the same window, determining the candidate probe with the highest allele frequency of the discrete high frequency SNP; (b) if only one candidate probe with the highest allele frequency of the discrete high-frequency SNPs exists, the candidate probe with the highest allele frequency of the discrete high-frequency SNPs is selected as the probe, and if a plurality of candidate probes with the highest allele frequency of the discrete high-frequency SNPs exist, the candidate probe closest to the center of the window among the candidate probes with the highest allele frequency of the discrete high-frequency SNPs is selected as the probe. The distance of the candidate probe from the center of the window may be the distance of the midpoint of the candidate probe from the center of the window. The target position is positioned at the central position of the probe sequence as far as possible, which is beneficial to improving the capture efficiency.
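The two-step selection within one window can be sketched in Python as follows. The dictionary keys and the function name are hypothetical, and allele frequencies are compared for exact equality, which assumes they are stored with consistent rounding.

```python
def pick_probe(window_probes, window_start, window_len=10_000):
    """Pick one probe for a window.

    Step (a): find the highest SNP allele frequency in the window.
    Step (b): if several probes share that frequency, take the one whose
    midpoint is closest to the window center.
    window_probes: list of dicts with keys "start", "end" and "af".
    """
    center = window_start + window_len / 2
    best_af = max(p["af"] for p in window_probes)
    tied = [p for p in window_probes if p["af"] == best_af]
    return min(tied, key=lambda p: abs((p["start"] + p["end"]) / 2 - center))
```

A window holding a single candidate probe simply returns that probe.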
In an embodiment of the present invention, after the second candidate probe set is subjected to the second screening, when the distance between two adjacent candidate probes in the second candidate probe set respectively falling into two adjacent windows on the reference genome is greater than the length of any one of the two adjacent windows, optionally, a short tandem repeat sequence or a part of the short tandem repeat sequence located between the two adjacent candidate probes on the reference genome is further added to the second candidate probe set subjected to the second screening to form a probe sequence together. Thus, when the probe sequences obtained by the design are used for capturing the whole genome, the intervals of the captured regions can be relatively uniformly distributed, and the information of the whole genome can be better and comprehensively reflected by the combination of the captured and determined regions.
According to another embodiment of the present invention, there is provided a method for detecting a genomic structural variation including at least one of a chromosomal aneuploidy, a copy number variation, and an indel, comprising the steps of:
(I) Sequencing genomic nucleic acid of the target sample to obtain a genome sequencing result consisting of a plurality of reads. The genome sequencing result can be obtained by whole genome sequencing, for example by extracting genomic DNA, constructing a library and sequencing on the instrument according to the instruction manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads. It can also be obtained by capturing the genome of the target sample with a probe and sequencing the captured fragments, wherein the probe can be designed by the method for determining a probe provided in one aspect of the invention and then synthesized or prepared by existing methods.
(II) Dividing the reference genome into m regions, and calculating the coverage depth TD_i of region i of the target sample genome from the reads in the sequencing result that fall into region i, wherein m and i are natural numbers, i is the region number, 1 ≤ i ≤ m, and m > 10.
In one embodiment of the present invention, the coverage depth of the area i is calculated by the formula
Figure BDA0001203857430000061
Or
Figure BDA0001203857430000062
Where i represents the number of the region. The genomic positions of the reads can be determined by sequence alignment using alignment software such as SOAP (Short Oligonucleotide Analysis Package), BWA (Burrows-Wheeler Aligner), SAMtools, GATK (Genome Analysis Toolkit), and the like.
(III) Judging whether structural variation occurs in region i of the target sample based on the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of k reference samples, wherein k is a natural number and k ≥ 2.
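The formulas for the coverage depth are given only as images in the source, so the following Python sketch uses a common definition as a stand-in, not the formula of the patent: the mean per-base depth of a region, that is, the number of aligned bases overlapping the region divided by the region length. Equal-length regions, half-open read intervals and the function name are all assumptions.

```python
def region_depths(reads, region_len, m):
    """Mean per-base coverage depth of m equal-length regions.

    reads: iterable of (start, end) half-open alignment intervals on one
    reference sequence. The depth of region i is the total number of
    aligned bases overlapping region i divided by region_len (an assumed
    definition, used only for illustration).
    """
    covered = [0] * m
    for start, end in reads:
        first = max(start // region_len, 0)
        last = min((end - 1) // region_len, m - 1)
        for i in range(first, last + 1):
            lo = max(start, i * region_len)
            hi = min(end, (i + 1) * region_len)
            covered[i] += hi - lo
    return [c / region_len for c in covered]
```

A read that spans a region boundary contributes to both regions in proportion to its overlap with each.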
In one embodiment of the present invention, the degree of difference between the coverage depth of genomic region i of the target sample and the coverage depths of genomic region i of the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples. Determining the coverage depth coefficient R_i of genomic region i of the target sample comprises: (a) performing a first correction on TD_i to obtain TD_ai, the first correction being performed by linear regression on the coverage depth values of 2n consecutive regions including region i, where n is a natural number and 10 < n ≤ m/2. In one embodiment of the invention, the first-correction linear regression gives
Figure BDA0001203857430000063
Wherein TD_j represents the coverage depth of the j-th region of the n consecutive regions, j is a natural number, and 1 ≤ j ≤ n; (b) after the first corrected coverage depth TD_ai of region i is obtained, TD_ai is further normalized to obtain
Figure BDA0001203857430000064
Thereby obtaining
Figure BDA0001203857430000065
In one embodiment of the invention, the first corrected coverage depth TD_ai of region i is normalized to obtain
Figure BDA0001203857430000066
In one embodiment of the invention, after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i:
Figure BDA0001203857430000067
Wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples,
$$R_{ai} = \frac{1}{k}\sum_{y=1}^{k} R_{i,y}$$
y is a natural number representing the reference sample number, and R_i,y denotes the coverage depth coefficient of genomic region i of reference sample y.
In another embodiment of the present invention, after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i:
Figure BDA0001203857430000072
Wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples and the one target sample,
$$R_{ai} = \frac{1}{k+1}\left(R_i + \sum_{y=1}^{k} R_{i,y}\right)$$
y is a natural number representing the reference sample number, and R_i,y denotes the coverage depth coefficient of genomic region i of reference sample y.
In calculating the coverage depth coefficient R_i of genomic region i of the target sample, the correction, normalization and similar processing of intermediate values reduce errors caused by fluctuations in experimental conditions, inherent differences among samples, and the like, so that the final r_i faithfully reflects R_i, fluctuates around 1 with a smaller amplitude than R_i, and follows a normal distribution across multiple samples. In the above embodiment, performing the first correction on TD_i and then normalizing the first corrected value amounts to a double averaging procedure: before the coverage depth of region i is represented by the average of the coverage depths of n consecutive regions including region i, the coverage depth value of each of those n regions is itself represented by the average of the coverage depths of the n consecutive regions that begin with that region. This is equivalent to correcting TD_i with the coverage depth values of 2n consecutive regions including the target region i, which keeps the coverage depth of consecutive regions stable. It should be noted that those skilled in the art may use other correction or averaging procedures to stabilize the coverage depth values of adjacent regions, such as correcting the coverage depth of the target region with the average coverage depth of several regions located at a certain distance from it. The coverage depth coefficient of genomic region i of a reference sample can be calculated in the same way as that of genomic region i of the target sample; the reference sample data may be calculated in advance and kept ready, or calculated at the same time as the target sample data.
In one embodiment of the present invention, the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of the k reference samples is determined by using a t-test to check whether the difference between the coverage depth coefficients is significant. In one embodiment of the invention, the t-test statistic for genomic region i of the target sample is calculated as
Figure BDA0001203857430000074
Wherein the content of the first and second substances,
Figure BDA0001203857430000075
r representing k reference samplesi,yAverage value of ri,yTo reference the second corrected depth of coverage coefficient for genomic region i of sample y,
Figure BDA0001203857430000076
for the standard deviation of the k reference samples,
Figure BDA0001203857430000077
t based on target sample genomic region iiValue, obtaining a significance level PiWhen P isi<0.05, judging that the structural variation of the region i occurs; otherwise, judging that the region i has no structural variation. In another embodiment of the invention, t is based on the genomic region i of the target sampleiValue and predetermined significance level Pi0Obtaining tiTheoretical value ti0When t isi≥ti0If the structural variation of the region i is judged to occur, otherwise, the structural variation of the region i is judged not to occur, and the predetermined P is determinedi0Less than or equal to 0.05. Predetermining P from a table of t values for t testsi0Then the corresponding t can be foundi0
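The single-region test can be sketched in Python as follows. The exact statistic appears in the source only as an equation image, so the one-against-many form t = (r - mean) / (S * sqrt(1 + 1/k)) used here is an assumption, as are the function names, the two-sided comparison on |t_i|, and the example numbers.

```python
import math
import statistics

def t_statistic(r_target, r_refs):
    """t statistic of one target value against k reference values.

    Assumed form: t = (r - mean) / (S * sqrt(1 + 1/k)), where S is the
    sample standard deviation of the k reference values.
    """
    k = len(r_refs)
    if k < 2:
        raise ValueError("need at least two reference samples")
    mean_ref = statistics.mean(r_refs)
    sd_ref = statistics.stdev(r_refs)  # divides by k - 1
    return (r_target - mean_ref) / (sd_ref * math.sqrt(1.0 + 1.0 / k))

def has_structural_variation(t_i, t_crit):
    """Call a variation when |t_i| reaches the tabulated critical value t_i0."""
    return abs(t_i) >= t_crit

# Hypothetical r_i values of one region in k = 5 reference samples.
refs = [0.98, 1.02, 1.00, 0.97, 1.03]
T_CRIT = 2.776  # two-sided critical t for P = 0.05 with 4 degrees of freedom
t_gain = t_statistic(1.50, refs)  # target region looks like a gain
t_flat = t_statistic(1.01, refs)  # target region looks normal
```

With k = 24 reference samples, as in the examples later in this document, the critical value would be read from the t table at 23 degrees of freedom under the same assumption.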
In one embodiment of the present invention, to detect larger CNVs or insertions/deletions, after step (three) is performed, W consecutive regions in the same direction are merged to obtain a primary merged region; two primary merged regions are merged to obtain a secondary merged region when they are in the same direction and the span between them does not exceed L regions; and structural variation of the secondary merged region is then detected. Here, regions in the same direction are regions whose coverage depth t statistics are all greater than 0 or all less than 0; W and L are natural numbers, W ≥ 2, and L - W ≤ 1. Still larger structural variations can be detected by analogy, for example by further merging eligible secondary merged regions; the merging condition can likewise be that two secondary merged regions are in the same direction and are separated on the reference genome by no more than L regions or L secondary merged regions.
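A minimal sketch of the two-stage merging, assuming each region has already been given a direction of +1, -1, or 0 (0 meaning the region is not a candidate). The function names, the tuple layout, and the defaults W = 4 and L = 5 are illustrative choices that satisfy W ≥ 2 and L - W ≤ 1.

```python
def primary_regions(directions, w=4):
    """Find runs of at least w consecutive regions with the same non-zero direction.

    Returns a list of (start_index, end_index, direction) tuples.
    """
    runs, start, n = [], 0, len(directions)
    while start < n:
        end = start
        while end + 1 < n and directions[end + 1] == directions[start]:
            end += 1
        if directions[start] != 0 and end - start + 1 >= w:
            runs.append((start, end, directions[start]))
        start = end + 1
    return runs

def merge_secondary(primaries, max_gap=5):
    """Merge neighbouring primary regions of the same direction.

    Two primary regions are joined when at most max_gap regions lie between them.
    """
    merged = []
    for start, end, sign in primaries:
        if merged and merged[-1][2] == sign and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], end, sign)
        else:
            merged.append((start, end, sign))
    return merged

# Hypothetical directions for 16 consecutive regions on one chromosome.
dirs = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, -1, -1, -1, -1, 0, 1]
primary = primary_regions(dirs)
secondary = merge_secondary(primary)
```

In this example the two gain runs are separated by two regions and are joined, while the loss run stays separate because its direction differs.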
In one embodiment of the present invention, the structural variation of the secondary pooling region is detected by determining whether the secondary pooling region has structural variation or not, or whether the structural variation occurring in the region i spans W regions, based on the degree of difference between the coverage depth of the secondary pooling region of the target sample genome and the coverage depth of the corresponding regions on the plurality of reference sample genomes. The obtaining of the coverage depth of the corresponding secondary merging region on the reference sample genome, the calculation of the t statistic of the coverage depth of the secondary merging region on the target sample genome, and the structural variation determination process can be referred to the calculation and determination process of the structural variation of the relatively small region i.
According to yet another embodiment of the present invention, there is provided a method for detecting loss of heterozygosity in a genomic structural variation, comprising the steps of:
(1) Sequencing the genomic nucleic acid of the target sample to obtain a genome sequencing result consisting of a plurality of reads. The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA, constructing a library, and sequencing on an existing high-throughput platform according to its instruction manual, such as Illumina Hiseq2000/2500, Roche 454, Life Technologies Ion Torrent, or single-molecule or nanopore sequencing platforms; or by capturing the target sample genome with probes and sequencing the captured fragments, where the probes can be designed by the probe determination method provided in one aspect of the invention and then synthesized or prepared by existing methods.
(2) Dividing a reference genome into m' regions; obtaining the set of SNPs shared by target sample genome region i and population region i, based on the reads in the sequencing result that fall in reference genome region i and on the population data for region i; calculating, separately for the target sample and the population, the heterozygosity of the fragment in which each SNP site of the shared SNP set is located, to obtain the heterozygosity set U_i of target sample genome region i and the heterozygosity set U_0i of population region i; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i. Here, the allele frequency of each SNP in the shared SNP set is greater than 0.1; the fragment in which an SNP site of the shared set is located is bounded by the two SNPs adjacent to it upstream and downstream; m' and i are natural numbers, m' ≥ i ≥ 1, and m' ≥ 6.
In one embodiment of the present invention, the heterozygosity of the fragment in which an SNP site is located is expressed by the allele frequency coefficient of that SNP site, R_het = MAF/(1-MAF), where MAF is the minor allele frequency of the high-frequency SNP.
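As a small illustration of the coefficient R_het = MAF/(1-MAF), the sketch below derives MAF from an observed allele fraction and evaluates the ratio. How the allele fraction is measured (for example from read counts) is not specified at this point in the text, so the input and the function name are assumptions.

```python
def r_het(allele_fraction):
    """Heterozygosity coefficient R_het = MAF / (1 - MAF).

    allele_fraction is the fraction of one allele at the SNP site, in [0, 1];
    the minor allele frequency is the smaller of the two allele fractions.
    """
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele fraction must lie between 0 and 1")
    maf = min(allele_fraction, 1.0 - allele_fraction)
    return maf / (1.0 - maf)

balanced = r_het(0.5)    # both alleles equally represented
homozygous = r_het(0.0)  # only one allele observed
skewed = r_het(0.2)      # MAF = 0.2
```

A fully balanced site gives 1 and a fully homozygous site gives 0, so a region that has lost heterozygosity yields a set of values clustered near 0.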
In one embodiment of the invention, comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i comprises using an F-test to determine whether the variance of U_i,
Figure BDA0001203857430000091
and the variance of U_0i,
Figure BDA0001203857430000092
differ significantly. If the variances of U_i and U_0i differ significantly, target sample region i is judged to have loss of heterozygosity; otherwise, it is judged to have no loss of heterozygosity.
In one embodiment of the invention, the F-test comprises calculating the variances of U_i and U_i0 separately, and using the variance of U_i of the target sample,
Figure BDA0001203857430000093
and the variance of U_i0 of the population,
Figure BDA0001203857430000094
to calculate two mutually reciprocal statistics F_upper and F_under. A significance level p_F is obtained from the two reciprocal statistics and compared with a predetermined significance level p_F0: when p_F ≤ p_F0, the difference between the two variances is significant; otherwise it is not. The F-test uses the formulas
Figure BDA0001203857430000095
Figure BDA0001203857430000096
Figure BDA0001203857430000097
p_F = p_upper + (1 - p_under), where v is the index of an SNP in the SNP set shared by target sample genome region i and population region i, q is the number of SNPs in that shared set, R_{het,i,v} is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for target sample genome region i,
Figure BDA0001203857430000098
is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set for target sample genome region i, R_{het,i0,v} is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for population genome region i,
Figure BDA0001203857430000099
is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set for population genome region i, p_upper and p_under are obtained from F_upper and F_under respectively, and p_F0 ≤ 0.05. p_F0 may be set to a commonly used value, or adjusted according to available prior information, the required detection accuracy, and the like.
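The variance comparison can be sketched as below. The formulas for F_upper and F_under are available only as equation images, so taking them as the larger variance over the smaller and its reciprocal is an assumption; the function name and sample values are hypothetical. Converting the statistics into p_upper and p_under requires the cumulative F distribution with (q - 1, q - 1) degrees of freedom, which is left out of this sketch.

```python
import math
import statistics

def reciprocal_f_statistics(u_sample, u_population):
    """Return (var_sample, var_population, f_upper, f_under).

    Assumed definitions: f_upper = larger variance / smaller variance and
    f_under = 1 / f_upper. A zero variance on one side (for example a fully
    homozygous target region) gives f_upper = inf and f_under = 0.
    """
    var_s = statistics.variance(u_sample)      # sample variance, divides by q - 1
    var_p = statistics.variance(u_population)
    big, small = max(var_s, var_p), min(var_s, var_p)
    if big == 0:
        raise ValueError("both variances are zero; the ratio is undefined")
    f_upper = math.inf if small == 0 else big / small
    f_under = 0.0 if small == 0 else small / big
    return var_s, var_p, f_upper, f_under

# Hypothetical R_het values for q = 5 shared SNPs in one region.
u_i = [0.0, 0.05, 0.0, 0.02, 0.0]   # target sample, almost no heterozygosity
u_i0 = [0.9, 0.3, 0.6, 0.2, 0.8]    # population
var_s, var_p, f_upper, f_under = reciprocal_f_statistics(u_i, u_i0)
```

Handling the zero-variance case explicitly matters here because a region with complete loss of heterozygosity would otherwise cause a division by zero.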
In one embodiment of the present invention, to detect larger LOH, after step (2), W' consecutive regions with loss of heterozygosity are merged to obtain a tertiary merged region; two tertiary merged regions are merged to obtain a quaternary merged region when the span between them does not exceed L' regions; the heterozygosity set of the quaternary merged region of the target sample and that of the same region of the population are obtained; and the two sets are compared to determine whether loss of heterozygosity has occurred in the quaternary merged region of the target sample. W' and L' are natural numbers, W' ≥ 2, and W'/2 ≥ L'. In one embodiment of the present invention, W' ≥ 4. LOH occurring in still larger regions can be detected by analogy, for example by further merging eligible quaternary merged regions under a similar condition: the distance between two quaternary merged regions on the reference genome does not exceed L' regions or L' tertiary merged regions.
According to still another embodiment of the present invention, there is provided a method for detecting uniparental disomy (UPD): when loss of heterozygosity is present in a genomic region of a target sample, the copy number of that region is calculated, and when it equals the copy number of the same region in a normal genome of the same species, UPD is judged to be present in that region of the target sample. The presence or absence of LOH in a genomic region can be detected by the LOH detection method according to one aspect of the present invention.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by hardware associated with program instructions, and the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.
According to a final embodiment of the present invention, there is also provided an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including an executable program; and the processor is in data connection with the data input unit, the data output unit and the storage unit and is used for executing the executable program stored in the storage unit, and the execution of the program comprises the completion of all or part of the steps of the various methods in the embodiment.
The operation results of the specific probe design method and the structural variation detection method according to the present invention will be described in detail below with reference to specific target individuals. The name definitions or specific parameter settings involved in the following process are selected as:
1. the designed probe is called a selected Target Region probe (SeTR);
2. hereinafter, "depth of coverage", "depth of sequencing" and "depth" may be used interchangeably; hereinafter, "area" and "target area" may be used interchangeably;
3. libraries are constructed and sequenced according to the small-fragment library construction instructions and the on-machine sequencing instructions provided for the Hiseq2000 platform; the library size is 300-350 bp, with paired-end sequencing and a read length of 91 bp (sequencing type PE91+8+91);
4. the reference genome or reference sequence selected for alignment is the human reference genome (hg19, Build 37).
Where the examples do not specify particular techniques or conditions, they are carried out according to techniques or conditions described in the literature in the art (for example, Molecular Cloning: A Laboratory Manual, 3rd edition, by J. Sambrook et al., translated by Huang Peitang et al., Science Press) or according to product instructions. Reagents and instruments for which no manufacturer is indicated are conventional commercially available products, for example from Illumina.
Example 1: chip design, preparation and test
In general, high (>60%) or low (<35%) GC content and high heterozygosity tend to adversely affect DNA fragments during PCR or probe capture. To avoid this, we designed a special probe, which we call SeTR, with the following properties: a) the probe sequence is highly unique and stable, with low heterozygosity and medium GC content (35%-60%); b) each probe contains a discrete high-frequency SNP marker with allele frequency 0.9 > AF > 0.1, for better detection of LOH across the whole genome; and c) the final target regions are relatively uniformly distributed.
The flow of SeTR probe design or target region selection is as follows:
1) Based on the 1000 Genomes Project database (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/release), a candidate SNP set with allele frequency (AF) of 10%-90% is selected; whenever two SNPs in the set are less than 100 bp apart, one of them is removed, giving SNP marker set 1.
2) With each SNP of SNP marker set 1 as the midpoint, 50 bp of reference genome sequence is taken upstream and downstream to form a set of 100 bp theoretical probe sequences, which are then aligned back to the reference sequence. If the best alignment of a probe sequence has no mismatches and the next-best alignment has less than 5% mismatches, the corresponding SNP is retained, giving SNP marker set 2.
3) From SNP marker set 2, we pick SNP markers that are evenly spaced in physical position on the reference genome as the final SNP marker set. In our study, we selected SNP markers about 10 kbp apart.
4) If two adjacent SNPs in the final SNP marker set are more than 10 kbp apart, short tandem repeat (STR) sequences between them are selected to fill the gap and keep the distribution uniform.
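Steps 1) and 3) of the selection flow, together with the GC-content requirement, can be sketched as follows. This is a simplified illustration on a single chromosome: the alignment-based uniqueness check of step 2) and the STR filling of step 4) are omitted, and the function names, the greedy thinning strategy, and the input layout are assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_in_range(seq, low=0.35, high=0.60):
    """True when the GC content is in the medium range required of probes."""
    return low <= gc_fraction(seq) <= high

def select_markers(snps, min_af=0.1, max_af=0.9, min_gap=100, spacing=10_000):
    """Pick SNP marker positions from (position, allele_frequency) pairs.

    Step 1: keep SNPs with min_af < AF < max_af and drop a SNP that lies
            closer than min_gap bp to the previously kept one.
    Step 3: thin the candidates greedily to about one marker per spacing bp.
    """
    candidates = []
    for pos, af in sorted(snps):
        if not (min_af < af < max_af):
            continue
        if candidates and pos - candidates[-1] < min_gap:
            continue
        candidates.append(pos)
    chosen = []
    for pos in candidates:
        if not chosen or pos - chosen[-1] >= spacing:
            chosen.append(pos)
    return chosen

# Hypothetical SNPs on one chromosome: (position in bp, allele frequency).
snps = [(1000, 0.5), (1050, 0.4), (5000, 0.05), (12000, 0.3), (30000, 0.6)]
markers = select_markers(snps)
```

Each chosen position would then define a 100 bp probe window (50 bp on either side) whose sequence can be screened with `gc_in_range`.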
After designing the SeTR probe, we entrusted Roche to produce SeTR liquid phase chips. The SeTR liquid phase chip contains 278800 probes, with a total size of 41,795,106bp, covering 1.45% of the effective whole genome (2.89G). The average length of the SeTR probes reached 149bp, and the average physical distance between two adjacent probes was 10.6kbp, as shown in Table 1 and FIG. 1.
TABLE 1 distribution of SeTR probes on each chromosome
Figure BDA0001203857430000111
Figure BDA0001203857430000121
The usability of the SeTR chip was tested with 3 quality-qualified DNA samples, YH (Yanhuang sample, genomic DNA of a Chinese individual), HG00537 (a sample from the 1000 Genomes Project) and GM50275 (a human fibroblast sample obtained from the Coriell Institute for Medical Research), to ensure that the probe chip could be used for subsequent detection studies. All three samples were captured with SeTR, built into libraries and sequenced to obtain reads. First, we removed reads contaminated by adapters and reads of low quality, such as those with an average quality value below 20; the remaining reads are called clean reads. When the clean reads were aligned to the hg19 reference sequence, 98.13%-99.29% of reads aligned to the reference genome and 67.43%-67.87% aligned to the target regions. In addition, 99.73%-99.95% of the target regions were covered by at least one read, and at least 99% were covered at least 10-fold, as shown in Table 2. These figures are all better than those of exome capture chips of the same type, such as the exome liquid-phase chip produced by Roche NimbleGen. Furthermore, the depth distribution of the target regions, shown in FIG. 4A, resembles a Poisson distribution, and FIG. 4B shows that, for most highly heterozygous sites in the target regions, the number of reads supporting the non-reference allele is almost equal to the number supporting the reference allele; that is, the numbers of positive and negative reads supporting a given highly heterozygous site are comparable (the positive and negative reads come from the two homologous chromosomes, respectively). Together these results show that the probes have no obvious capture bias toward one haplotype (usually the reference allele, ref type) and capture the target regions uniformly.
TABLE 2 alignment of the three samples
Figure BDA0001203857430000122
Figure BDA0001203857430000131
Example 2: target region library construction and sequencing
1. Test materials, reagents, and instruments
Samples: 15 target gDNA samples (human genomic DNA; sample numbers are listed in Table 3, and those beginning with "GM" are human fibroblasts) and 24 reference DNA samples.
Main reagents and instruments: PCR instrument, pipettes, centrifuge, comfort-type constant-temperature mixer, DNA shearing instrument, vortex oscillator, magnetic rack, electrophoresis apparatus, Hiseq2000 sequencer, Nanodrop ultraviolet spectrophotometer, etc. Reagents and instruments for which no manufacturer is indicated are conventional commercially available products.
Probe design and synthesis: as in Example 1, about 41M of target regions were selected across the whole human genome, and NimbleGen SeqCap EZ liquid-phase probes capable of capturing the designed target regions were custom-ordered from Roche.
2. Library construction
1) Genomic DNA extraction
About 3-5 μg of genomic DNA was extracted from the target sample using a QIAGEN DNA extraction kit (DNA Mini Kit) according to the kit instructions, for use in subsequent experiments. The extracted DNA (30-100 ng) was checked by electrophoresis for integrity and degree of degradation.
2) Genomic DNA disruption and purification
The genomic DNA was sheared using a Covaris E210 instrument (operated according to the instrument instructions) to fragments of 200-250 bp. The sheared DNA fragments were purified using the QIAquick PCR Purification Kit (250) according to the kit instructions, and electrophoresis was used to confirm that the main band was within the 200-250 bp range.
3) End repair, end-plus-A, adapter, pre-amplification
Following the library construction requirements and the steps, reagents, and reaction conditions listed in the paired-end indexed library construction instructions, the fragmented and purified DNA was end-repaired and purified; a base A was added to both ends of the end-repaired DNA fragments, and the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified with magnetic beads capable of complementary binding to the sequencing adapters. A PCR reaction system was prepared to amplify the adapter-ligated DNA fragments, the PCR product was purified with magnetic beads, and electrophoresis was used to check whether the main band of the amplified product was 300-350 bp. The DNA amount was measured with a Nanodrop ultraviolet spectrophotometer; the total amount is more than 1.0 μg.
4) Hybridization and elution of SeTR Probe, amplification
The hybridization and elution reagents were purchased or prepared according to the instructions of the commercial NimbleGen SeqCap EZ hybridization and elution kit. Cot-1 DNA, the universal blocking sequence (Block), the index blocking sequence (index N Block) and the DNA sample from step 3) were added to a 1.5 mL centrifuge tube. The tube was centrifuged for 1 min and vacuum-concentrated to dryness at 60 °C; hybridization buffer was added, and the tube was shaken and centrifuged, denatured in a 95 °C metal dry bath for 10 min, then shaken and centrifuged at high speed. Next, 4.5 μL of probe was added to the tube and hybridization was carried out on a PCR instrument (47 °C, 64-72 hours). Elution was performed after hybridization was complete. PCR was then carried out according to the final amplification step of the library construction instructions: a PCR reaction system was prepared as required by mixing the DNA obtained from hybridization and elution, polymerase, substrate, PCR reaction buffer, Flowcell primers (primers designed from the fixed sequences on the sequencer's sequencing chip) and other reactants. The PCR program was: pre-denaturation at 94 °C for 2 min; 15 cycles of denaturation at 94 °C for 15 s, annealing at 58 °C for 30 s, and extension at 72 °C for 30 s; and a final extension at 72 °C for 5 min. After PCR, the product was removed, centrifuged, and purified with magnetic beads to obtain the target region library. The library concentration was measured with a Nanodrop ultraviolet spectrophotometer in preparation for on-machine sequencing.
3. Hiseq2000 high throughput sequencing
The qualified DNA libraries were sequenced on the machine according to the Hiseq2000 operating instructions. The data volume for each sample was about 4.5G, giving an average sequencing depth of 100X; however, because the capture efficiency of the chip cannot reach 100%, analysis showed that the final effective sequencing depth of the target regions was 30X-45X.
Example 3: detection of CNV, LOH and UPD
The general flow is shown in FIG. 3. After sequencing is completed, the off-machine data are in fastq format. The filtered reads are aligned to the reference genome (hg19, Build 37) using BWA software to obtain an alignment file in SAM format; the SAM file is converted to a binary BAM file using samtools; the alignment results are de-duplicated and sorted; and the BAM file is then converted to PILEUP format, again using samtools. See the bioinformatics analysis strategy section for details.
First, filtering and alignment of sequencing data
The sequencing data obtained from the Illumina Hiseq2000 in the above example were first subjected to simple filtering to remove reads contaminated with adapters, reads with an N content of more than 5%, and reads with an average quality value below Q20. The filtered data were then aligned to the human reference genome (hg19, Build 37) using the BWA alignment software, and the alignment results were output as files in SAM (sequence alignment/map) format (SAM files for short). The SAM files were converted to binary BAM files using samtools, PCR-induced duplicates (PCR duplicates) were removed, the files were sorted, and the results were realigned and recalibrated using GATK software.
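The three filtering criteria can be illustrated with a small predicate applied to one read at a time. This is a sketch, not the pipeline actually used: adapter contamination is approximated by an exact substring match, the adapter string is only an illustrative choice, and the function name and input layout (sequence plus a list of Phred quality values) are assumptions.

```python
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix, an assumption of this sketch

def keep_read(seq, quals, adapter=ADAPTER, max_n_frac=0.05, min_mean_q=20):
    """Return True when a read passes all three filters.

    seq   : read sequence as a string
    quals : per-base Phred quality values, same length as seq
    """
    if not seq or len(seq) != len(quals):
        return False
    seq = seq.upper()
    if adapter in seq:                          # adapter contamination
        return False
    if seq.count("N") / len(seq) > max_n_frac:  # N content above 5%
        return False
    if sum(quals) / len(quals) < min_mean_q:    # mean quality below Q20
        return False
    return True

good = keep_read("ACGT" * 10, [30] * 40)
too_many_n = keep_read("N" * 5 + "A" * 35, [30] * 40)
low_quality = keep_read("ACGT" * 10, [10] * 40)
with_adapter = keep_read("ACGT" * 5 + ADAPTER + "ACGT" * 5, [30] * 53)
```

For paired-end data, a pair would typically be discarded if either mate fails, though the text does not state this.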
Second, calculating the corrected coverage depth coefficient r_i of the target regions and the fragment heterozygosity R_het
The r_i value and fragment heterozygosity R_het value of each target region are calculated from the information in the probe-region file obtained after the above filtering and alignment. CNV is predicted from the r_i values using a t-test, and LOH and UPD are predicted from the R_het values using an F-test.
Third, analysis for detecting CNV, LOH and UPD
1. CNV detection
1.1 Calculating the depth coefficient (R_i) of each target region
The depth of each target region is calculated and denoted TD_i (Formula 1). To keep the TD values of several consecutive target regions stable, TD_i is corrected by the method of Formula 2, i.e., using the depths of the n' regions following the i-th region, to obtain TDa_i. TDa_i is then normalized using Formulas 3 and 4 to obtain the depth coefficient R_i of each target region.
Formula 1: TD_i = T_ibase / T_ilen
Formula 2:
Figure BDA0001203857430000151
Formula 3:
Figure BDA0001203857430000152
Formula 4:
Figure BDA0001203857430000153
T_ibase: the number of bases aligned to target region i; T_ilen: the length of target region i.
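The depth calculation can be sketched as below. Only Formula 1 is legible in the source; Formulas 2 to 4 survive as images, so the forward moving average used for the correction and the division by the mean corrected depth used for normalization are assumptions, as are the function names.

```python
def region_depths(base_counts, lengths):
    """Formula 1: TD_i = T_ibase / T_ilen for every target region."""
    return [bases / length for bases, length in zip(base_counts, lengths)]

def forward_correct(td, n_following=1):
    """Assumed form of Formula 2: average TD_i with the n' regions that follow it.

    Windows near the end of the list are shorter because fewer regions follow.
    """
    corrected = []
    for i in range(len(td)):
        window = td[i:i + n_following + 1]
        corrected.append(sum(window) / len(window))
    return corrected

def depth_coefficients(tda):
    """Assumed form of Formulas 3 and 4: divide each TDa_i by the mean TDa."""
    mean_depth = sum(tda) / len(tda)
    return [value / mean_depth for value in tda]

# Hypothetical aligned base counts and lengths of four target regions.
td = region_depths([3000, 4500, 6000, 3000], [100, 150, 150, 100])
tda = forward_correct(td, n_following=1)
R = depth_coefficients(tda)
```

Under these assumptions the coefficients average to 1 within a sample, which makes samples sequenced to different depths comparable.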
1.2 Creating a baseline from the data of multiple reference samples (k = 24) and correcting R_i to obtain r_i
Because experimental conditions fluctuate and samples differ from one another, the efficiency of each capture fluctuates to some extent, and the resulting fluctuation of R_i can easily produce false CNV signals. It is therefore very useful to create a uniform baseline from the fluctuations observed across multiple samples. FIG. 4 illustrates how creating a baseline helps detection: the distribution of the precursor R_i (preR_i),
Figure BDA0001203857430000154
fluctuates greatly, as shown in the figure; R_i fluctuates relatively little; and r_i, obtained by correcting R_i against the baseline, fluctuates even less, so that CNVs are detected more sensitively and more easily. We assume that, in different samples, the R_i values of a given target region in which no CNV has occurred theoretically follow a Poisson distribution and fluctuate fairly stably around a characteristic value specific to that region. To keep these characteristic values stable, we examine the R_i values of the same region in multiple samples and use their mean (mean R_i) in place of the region-specific value, thereby constructing a robust baseline for each target region. On the assumption that the R_i of each target region fluctuates around mean R_i, we convert R_i to r_i by means of mean R_i, so that r_i follows a normal distribution fluctuating around 1.
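A minimal sketch of the baseline correction, assuming the conversion is a simple division r_i = R_i / mean R_i (the text states only that r_i fluctuates around 1 after conversion). The function names and the three-sample reference panel are hypothetical; the example in the text uses k = 24.

```python
def build_baseline(reference_R):
    """Per-region mean R_i across the reference samples.

    reference_R is a list of per-sample lists, all covering the same regions.
    """
    k = len(reference_R)
    return [sum(region_values) / k for region_values in zip(*reference_R)]

def to_r(sample_R, baseline):
    """Assumed conversion: r_i = R_i / mean R_i for every region."""
    return [value / base for value, base in zip(sample_R, baseline)]

# Hypothetical R_i values of three regions in three reference samples.
reference_panel = [
    [1.0, 0.8, 1.2],
    [1.2, 0.8, 1.0],
    [1.1, 0.8, 1.1],
]
baseline = build_baseline(reference_panel)
r_target = to_r([1.1, 1.2, 0.55], baseline)
```

In this example the second region of the target sample comes out near 1.5 and the third near 0.5, the values expected for a one-copy gain and a one-copy loss in a diploid genome.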
1.3 detection of CNV of target region
Theoretically, the r_i values of the same target region in multiple samples should all follow a normal distribution. Therefore, when examining target region i of a sample, the r_i values of that region can be compared across multiple samples using a t-test; the t statistic is calculated as follows to detect copy number variation in target region i of the sample.
Figure BDA0001203857430000155
In the formula, subscript 1 denotes the target sample and subscript 2 denotes the reference samples;
Figure BDA0001203857430000156
is the mean r_i of the n_1 samples to be tested,
Figure BDA0001203857430000161
is the mean r_i of the n_2 reference samples, μ_1 is the theoretical mean r_i of all samples to be tested, μ_2 is the theoretical mean r_i of all reference samples, S_1 and S_2 are the standard deviations of the samples to be tested and of the reference samples, and df is the degrees of freedom, df = n_1 + n_2 - 2.
When there is one sample to be tested, n_1 = 1, and the theoretical means of the sample to be tested and the reference samples are the same, so the formula simplifies to:
Figure BDA0001203857430000162
With this simplified formula, each target region has a t value for detecting CNV, from which a P value (confidence) is obtained; when the P value of a region is less than 0.05, the region is one in which CNV has occurred.
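For readers without access to the equation images, the simplification can be worked through under the assumption that the general formula is the standard pooled-variance two-sample t statistic, which matches the parameters listed above (n_1, n_2, μ_1, μ_2, S_1, S_2, df = n_1 + n_2 - 2). The symbols \bar{r}_1 and \bar{r}_2 for the two means are notation introduced here.

```latex
% Assumed general form (pooled-variance two-sample t statistic)
t = \frac{(\bar{r}_1 - \bar{r}_2) - (\mu_1 - \mu_2)}
         {\sqrt{\dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}
                \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad df = n_1 + n_2 - 2

% With a single target sample (n_1 = 1, so \bar{r}_1 = r_i and the S_1 term
% vanishes) and equal theoretical means (\mu_1 = \mu_2):
t = \frac{r_i - \bar{r}_2}{S_2 \sqrt{1 + \dfrac{1}{n_2}}},
\qquad df = n_2 - 1
```

With the n_2 = 24 reference samples used in this example, the test would have 23 degrees of freedom under this assumption.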
1.4 detection of Large CNV
Based on the P value of the single-region t-test, each region is given a pseudo signal value indicating whether it is to be considered in the subsequent joining of CNV regions; target regions that may share a consistent CNV are then joined into blocks along the chromosome to determine the final size and copy number of the CNV.
The pseudo signal values are assigned as follows. When the measured values of at least four consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (t all greater than 0 or all less than 0), and the P values of 3 of the regions are below a first threshold (e.g., 0.05, a commonly used significance threshold) while that of the fourth does not exceed a second threshold (0.2, four times the first threshold), the four regions are marked with the direction of deviation (e.g., + or -) and merged into one block. The number of consecutive same-direction regions and the first and second thresholds are adjustable. If the distance between one block and another does not exceed a span of 5 regions, the two blocks are merged into a larger block, and so on, until a final block is obtained. Following the method of 1.3 above, the r value of the block is the mean of the r_i values of all regions it contains; a t-test is performed on the r values of this block in the sample to be tested and in the reference samples, and the P value of the block is calculated. When the P value of the block is less than 0.05, the block is a CNV; its boundaries and size give the boundaries and size of the large CNV.
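The labeling rule for pseudo signal values can be sketched with a sliding window of four regions. The function name, the +1/-1/0 encoding, and the example t and P values are hypothetical; the thresholds are the ones quoted in the rule.

```python
def label_pseudo_signals(t_values, p_values, first=0.05, second=0.2, size=4):
    """Assign +1, -1 or 0 to every region.

    A window of `size` consecutive regions is labeled when all t values share
    one sign, at least size - 1 of the P values are below `first`, and no
    P value exceeds `second`.
    """
    labels = [0] * len(t_values)
    for start in range(len(t_values) - size + 1):
        ts = t_values[start:start + size]
        ps = p_values[start:start + size]
        same_direction = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        n_strong = sum(p < first for p in ps)
        if same_direction and n_strong >= size - 1 and all(p <= second for p in ps):
            sign = 1 if ts[0] > 0 else -1
            for j in range(start, start + size):
                labels[j] = sign
    return labels

# Hypothetical per-region t statistics and P values for six regions.
t_vals = [2.5, 3.0, 1.5, 2.8, -0.2, 0.1]
p_vals = [0.02, 0.01, 0.15, 0.03, 0.80, 0.90]
labels = label_pseudo_signals(t_vals, p_vals)
```

Here the first four regions form one gain block: three of them pass the first threshold and the remaining one (P = 0.15) is tolerated by the second threshold.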
By analyzing the 15 target samples, we obtained CNV results that are highly consistent with the known validation results (SNP-array results), with no false positives or false negatives; see Table 3. In addition, we simulated 8 sets of 30X whole-genome data, comprising 5 normal samples and 3 CNV-containing samples, and ran CNV detection on them with both our method and CONTRA, a previously reported CNV prediction tool for targeted regions (Li J, Lupat R, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics. 2012 May 15;28(10):1307-13). The results showed that the sensitivity and specificity of our method both reached 100%, the copy numbers were also detected accurately, and CNVs could be detected and accurately located at a resolution of 500 kb, whereas CONTRA had a sensitivity of 88.9% and a specificity of only 66.7% and did not report copy numbers, as shown in Table 4.
TABLE 3
Figure BDA0001203857430000163
Figure BDA0001203857430000171
TABLE 4
Figure BDA0001203857430000172
Figure BDA0001203857430000181
2. LOH detection
2.1 detection of heterozygous status in regions of the Whole genome
In a given region of the genome of the sample to be tested, SNP sites with an allele frequency (AF) of 0.1-0.9 in the 1000 Genomes data are identified, and the R_het values of the fragments in which these SNP sites are located are calculated for both the 1000 Genomes data and the sample to be tested according to the formula below. When region i in the sample to be tested is absolutely heterozygous, R_het = 1; when it is absolutely homozygous, R_het = 0.
R_het = MAF/(1-MAF), where MAF is the minor allele frequency.
In the sample to be tested, starting from any SNP site m in a region, n further consecutive SNP sites are taken downstream to form the heterozygosity set S_m of that region, i.e., S_m = {R_{het,sm}, R_{het,s(m+1)}, ..., R_{het,s(m+n)}}. In the same way, the SNP sites at the same positions are taken from the 1000 Genomes database to form the heterozygosity set P_m, i.e., P_m = {R_{het,pm}, R_{het,p(m+1)}, ..., R_{het,p(m+n)}}. An F-test is then used to check whether the variances of the two heterozygosity sets are equal. Specifically, we calculate the variance of the heterozygosity set in the region of the sample to be tested,
Figure BDA0001203857430000182
the variance of the heterozygosity set of the same region in the 1000 Genomes samples,
Figure BDA0001203857430000183
and the p value of the heterozygosity set S_m in the region of the sample to be tested.
Figure BDA0001203857430000191
Figure BDA0001203857430000192
Figure BDA0001203857430000193
H_0: σ_s = σ_p
H_A: σ_s ≠ σ_p
Figure BDA0001203857430000194
Figure BDA0001203857430000195
p = p_upper + (1 - p_under)
When p ≤ 0.01, we accept H_A and judge that the heterozygosity set S_m has lost heterozygosity relative to the population, i.e., LOH has occurred in the region where S_m is located.
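The construction of the heterozygosity sets can be sketched as a sliding window over the R_het values of consecutive SNP sites. The function name and the example values are hypothetical; each set holds the starting site m plus the n sites that follow it, as in the definition of S_m, and the same function applies to the population values to obtain P_m.

```python
def heterozygosity_sets(r_het_values, n):
    """Return the sets S_m for every start site m that has n following sites.

    Each set contains n + 1 consecutive values: site m and the next n sites.
    """
    return [r_het_values[m:m + n + 1] for m in range(len(r_het_values) - n)]

# Hypothetical R_het values at five consecutive SNP sites of the target sample.
sample_r_het = [1.0, 0.0, 1.0, 0.0, 0.0]
sample_sets = heterozygosity_sets(sample_r_het, n=2)
```

The variance of each S_m is then compared with the variance of the matching P_m by the F-test described above.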
2.2 detection of Large LOH
Building on the results of 2.1, and following the approach used in step 1.4 for detecting large CNVs, a run of 4 consecutive subsets in the loss-of-heterozygosity state is recorded as a minimum unit. If two units are separated by no more than 2 subsets, they are merged into a larger unit, and so on, until the units are joined into a block. An F-test is then performed on the R_het values of the block in the sample to be tested and in the thousand-person reference set, and the p value of the block is calculated. When p ≤ 0.01, the block is considered to be LOH; otherwise, it is a non-LOH block.
Alternatively, stricter merging conditions can be set for more accurate detection. For example, to avoid false positives caused by random errors, only regions larger than 5M may be regarded as possible true LOH. On this basis, consecutive subsets with p ≤ 0.01 surrounding a subset with p ≤ 0.01 from 2.1 are merged, with the block fault tolerance set to 1 (i.e., the p value of 1 subset in the block is allowed to be greater than 0.01). Finally, the R_het values in the merged region are subjected to the F-test again, and if the resulting p value is less than 0.01, the block is considered to be a true LOH.
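The first merging rule (minimum units of 4 consecutive LOH subsets, joined when separated by no more than 2 subsets) can be sketched as below. The function name, the boolean input encoding, and the inclusive index pairs are assumptions; the final F-test on each merged block is not shown.

```python
def merge_loh_subsets(is_loh, min_run=4, max_gap=2):
    """Return inclusive (start, end) index pairs of merged LOH blocks."""
    # Step 1: collect runs of at least `min_run` consecutive LOH subsets.
    units = []
    start = None
    for idx, flag in enumerate(list(is_loh) + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start >= min_run:
                units.append((start, idx - 1))
            start = None
    # Step 2: join neighbouring units separated by at most `max_gap` subsets.
    blocks = []
    for unit in units:
        if blocks and unit[0] - blocks[-1][1] - 1 <= max_gap:
            blocks[-1] = (blocks[-1][0], unit[1])
        else:
            blocks.append(unit)
    return blocks
```

Each returned block would then be re-tested as a whole before being reported as LOH.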
3. UPD detection
UPD detection is carried out by combining the genome-wide CNV and LOH detection results according to Mendel's laws of inheritance.
If a DNA region is heterozygous in the thousand-person data (R_het = 1) but its heterozygous state disappears in the actual test (R_het approaches 0), the region is judged to be LOH. If CNV detection further shows that the region has two copies (CN = 2), i.e., the copy number is unchanged (the samples of this example are diploid, and every region of a normal diploid genome is present in two copies), the region is judged to exhibit uniparental disomy (UPD).
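The decision rule in this paragraph reduces to a small function. This is a sketch only: the function name and the returned labels are hypothetical, and the LOH flag and copy number are assumed to come from the LOH and CNV detection steps described earlier.

```python
def classify_region(is_loh: bool, copy_number: int, normal_copy_number: int = 2) -> str:
    """Label a region from its LOH call and its copy number (labels are illustrative)."""
    if not is_loh:
        return "no LOH"
    if copy_number == normal_copy_number:
        # Heterozygosity is lost but the copy number is unchanged.
        return "UPD"
    return "LOH with copy-number change"
```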
In 13 of the 15 samples, 10 LOH regions larger than 5M and 4 UPD regions larger than 5M were detected; the results are shown in Table 5. LOH and UPD were detected without paired samples. (Typically, diseased tissue is compared with normal tissue; such samples are paired, i.e., related to each other. In this embodiment, by contrast, the target sample is compared with a set of multiple reference samples that are unrelated to it and are therefore not paired samples.) The LOH results of at least 5M are consistent with the CNV results of CN = 1, so the CNV results can be used to verify the accuracy of the LOH results. The detection of LOH and UPD by the present invention is therefore highly accurate, reaching the 5M level.
The Circos plot (fig. 5) shows the CNV, LOH and UPD measurements of the GM50275 sample in combination.
TABLE 5
Figure BDA0001203857430000201
Industrial applicability
The method for determining a probe sequence based on a reference sequence can be effectively used to determine probe sequences. The resulting probes are used to capture the genome by hybridization, yielding a plurality of local genomic regions. These captured local regions can represent the whole genome, reflect genome-wide variation information, and be used to discover structural variation across the whole genome.
Although specific embodiments of the invention have been described in detail, those skilled in the art will appreciate that various modifications and substitutions of those details may be made in light of the overall teachings of the disclosure, and such changes are intended to be within the scope of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims (32)

1. A method for determining a probe sequence based on a reference sequence, comprising:
(1) constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, each of the plurality of candidate probes corresponds to at least one discrete high-frequency SNP site, and the allele frequency of each of the plurality of discrete high-frequency SNP sites is at least 10 percent respectively;
(2) aligning the plurality of candidate probes in the first candidate probe set with a reference sequence to obtain an alignment result;
(3) performing a first screening on the first candidate probe set based on the comparison result to obtain a second candidate probe set consisting of a plurality of candidate probes,
wherein the first screening comprises retaining candidate probes that satisfy at least one of the following conditions:
a candidate probe that is uniquely aligned with the reference sequence;
candidate probes aligned to a plurality of positions of the reference sequence, and at least two of the plurality of positions each having a mismatch ratio of less than 10%;
(4) dividing the reference sequence into a plurality of windows with predetermined lengths respectively, and distributing a plurality of candidate probes in the second candidate probe set to the matched windows respectively so as to determine the position information of the candidate probes respectively;
(5) performing a second screening of the second candidate probe set based on the positional information and the allele frequencies of the discrete high frequency SNP sites to determine the probe sequences,
wherein the probe is determined according to the following steps:
(a) if a plurality of candidate probes are located in the same window, determining the candidate probe whose corresponding discrete high-frequency SNP site has the highest allele frequency;
(b) if only one candidate probe in the window has the highest allele frequency of its corresponding discrete high-frequency SNP site, selecting that candidate probe as the probe; and if a plurality of candidate probes in the window share the highest allele frequency of their corresponding discrete high-frequency SNP sites, selecting from among them the candidate probe closest to the center of the window as the probe.
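Steps (4) and (5) of claim 1 can be sketched in Python as follows. The `Candidate` type, its field names, the use of the probe midpoint as its position, and the 10 kb default window are illustrative assumptions rather than details fixed by this claim.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    position: int            # probe midpoint on the reference sequence (assumed 0-based)
    allele_frequency: float  # allele frequency of the corresponding SNP site


def select_probes(candidates, window_size=10_000):
    """Pick one probe per window: highest allele frequency, ties broken by
    distance to the window center."""
    by_window = defaultdict(list)
    for cand in candidates:
        by_window[cand.position // window_size].append(cand)
    selected = {}
    for window, members in by_window.items():
        best_af = max(c.allele_frequency for c in members)
        top = [c for c in members if c.allele_frequency == best_af]
        center = window * window_size + window_size / 2
        selected[window] = min(top, key=lambda c: abs(c.position - center))
    return selected
```

The result maps each window index to the single candidate retained for that window.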
2. The method of claim 1, wherein the allele frequency of each of the plurality of discrete high frequency SNP sites is no more than 90% respectively.
3. The method of claim 1, wherein the physical distance on the reference sequence between any two adjacent discrete high-frequency SNP sites among the plurality of discrete high-frequency SNP sites is not less than the length of the candidate probe.
4. The method of claim 1, wherein the candidate probe is 50-250 mers in length.
5. The method of claim 4, wherein the candidate probe is 100 mers in length.
6. The method of claim 1, wherein the candidate probe corresponds to one of the discrete high frequency SNP sites, and wherein the discrete high frequency SNP site corresponds to a mid-section of the candidate probe.
7. The method of claim 6, wherein the discrete high frequency SNP sites correspond to midpoints of the candidate probes.
8. The method of claim 1, wherein the candidate probe is truncated from the reference sequence.
9. The method of claim 1, wherein prior to performing the alignment, the first set of candidate probes is pre-screened in advance based on at least one of GC content and number of single base repeats of the candidate probes;
the prescreening includes retaining candidate probes that satisfy at least one of:
the GC content is 35 to 65 percent; and
the number of single-base repeats is less than 7.
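The two prescreening conditions of claim 9 can be sketched as below. Interpreting the "number of single-base repeats" as the longest run of one repeated base is an assumption, as are the function names and the `require_both` switch; sequences are assumed to be non-empty.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def longest_single_base_run(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max(len(list(group)) for _, group in groupby(seq.upper()))


def passes_prescreen(seq: str, require_both: bool = False) -> bool:
    gc_ok = 0.35 <= gc_content(seq) <= 0.65
    run_ok = longest_single_base_run(seq) < 7
    # The claim retains probes meeting at least one condition;
    # require_both=True applies the stricter conjunction instead.
    return (gc_ok and run_ok) if require_both else (gc_ok or run_ok)
```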
10. The method of claim 1, wherein in step (4), the reference sequence is divided into a plurality of windows each having the same predetermined length.
11. The method of claim 10, wherein the reference sequence is partitioned into a plurality of windows of 10Kb in length.
12. The method of claim 1, wherein after the second candidate probe set is subjected to the second screening, when the distance between two adjacent candidate probes in the second candidate probe set respectively falling into two adjacent windows on the reference genome is greater than the length of either of the two adjacent windows, the short tandem repeat sequence or a part of the short tandem repeat sequence located between the two adjacent candidate probes on the reference genome is further added to the second candidate probe set subjected to the second screening to form the probe sequence together.
13. The method of claim 1, wherein the reference sequence is a reference genome or a portion thereof.
14. A method of detecting a genomic structural variation comprising at least one of a chromosomal aneuploidy, a copy number variation, and an indel, for a non-diagnostic purpose, the method comprising,
(1) sequencing the genomic nucleic acid of the target sample to obtain a genomic sequencing result, the genomic sequencing result being comprised of a plurality of reads, wherein the sequencing comprises screening with a probe, wherein the probe is obtained by the method of any one of claims 1 to 13;
(2) dividing the reference genome into m regions, and calculating the coverage depth TD_i of region i using the number of reads falling into region i, wherein m and i are natural numbers, i is the region number, 1 ≤ i ≤ m, and m > 10;
(3) determining whether region i has a structural variation based on the degree of difference between the coverage depth of region i and the coverage depths of region i in k reference samples, wherein k is a natural number and k ≥ 2.
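Step (2) can be illustrated by counting reads per region. This is a sketch under stated assumptions: regions are taken to be equal-sized, reads are assigned by their 0-based start coordinate, and the depth formula (reads × read length / region length) is a common convention used here for illustration, not the formula of the patent, which is given only as a figure. All names are hypothetical.

```python
def region_read_counts(read_starts, genome_length, m):
    """Count reads per region for m equal-sized regions (assumed layout)."""
    region_size = -(-genome_length // m)  # ceiling division so m regions cover the genome
    counts = [0] * m
    for start in read_starts:
        if 0 <= start < genome_length:
            counts[start // region_size] += 1
    return counts


def coverage_depth(counts, read_length, region_size):
    """Illustrative depth per region: reads * read length / region length."""
    return [c * read_length / region_size for c in counts]
```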
15. The method of claim 14, wherein the depth of coverage of the region i is determined using the following equation:
Figure FDA0003082386560000021
or
Figure FDA0003082386560000022
Where i represents the number of the region.
16. The method of claim 14, wherein the test for the degree of difference between the depth of coverage of the genomic region i of the target sample and the depth of coverage of the region i of the k reference samples is performed by a t-test.
17. The method according to claim 14, wherein the degree of difference between the coverage depth of region i and the coverage depths of region i of the k reference samples is compared by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples, wherein determining the coverage depth coefficient R_i of region i comprises the steps of:
(a) performing a first correction on TD_i to obtain a first corrected coverage depth TD_ai, the first correction being implemented by performing linear regression on the coverage depth values of 2n consecutive regions including region i, where n is a natural number and 10 < n ≤ m/2;
(b) normalizing TD_ai to obtain
Figure FDA0003082386560000031
Thereby obtaining
Figure FDA0003082386560000032
18. The method of claim 17, wherein in step (a), the first corrected coverage depth TD_ai is determined based on the following formula:
Figure FDA0003082386560000033
wherein, in TD_j, j is a natural number and 1 ≤ j ≤ n.
19. The method of claim 18, wherein in step (b), TD_ai is normalized based on the following formula to obtain
Figure FDA0003082386560000034
Figure FDA0003082386560000035
20. The method of any one of claims 16 to 19, further comprising, after obtaining R_i of the target sample, performing a second correction on R_i to obtain r_i,
Figure FDA0003082386560000036
wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples,
Figure FDA0003082386560000037
y is a natural number representing the reference sample number, and R_i,y is the coverage depth coefficient of genomic region i of reference sample y.
21. The method of any one of claims 16 to 19, further comprising, after obtaining R_i of the target sample, performing a second correction on R_i to obtain r_i,
Figure FDA0003082386560000038
wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples and the one target sample,
Figure FDA0003082386560000039
y is a natural number representing the reference sample number, and R_i,y is the coverage depth coefficient of genomic region i of reference sample y.
22. The method of claim 20, wherein the t-test is performed such that the t-statistic for the genomic region i of the target sample is calculated as
Figure FDA00030823865600000310
wherein
Figure FDA00030823865600000311
represents the average of r_i,y over the k reference samples, r_i,y being the second-corrected coverage depth coefficient of genomic region i of reference sample y,
Figure FDA00030823865600000312
s is the standard deviation of k reference samples,
Figure FDA0003082386560000041
23. The method of claim 22, wherein a significance level P_i is obtained from the t_i value of genomic region i of the target sample; when P_i < 0.05, region i is judged to have a structural variation; otherwise, region i is judged to have no structural variation.
24. The method of claim 22, wherein a theoretical value t_i0 of t_i is obtained from the t_i value of genomic region i of the target sample and a predetermined significance level P_i0; when t_i ≥ t_i0, region i is judged to have a structural variation; otherwise, region i is judged to have no structural variation; and the predetermined P_i0 ≤ 0.05.
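The threshold comparison of claim 24 can be sketched as follows. The t-statistic formula appears in the source only as a figure, so the form used here (target value minus reference mean, divided by the reference standard deviation) is an assumption, as are the two-sided comparison on the absolute value and the function names.

```python
from statistics import mean, stdev


def t_statistic(target_r, reference_r):
    """Assumed form: (target - mean(reference)) / stdev(reference)."""
    if len(reference_r) < 2:
        raise ValueError("need k >= 2 reference samples")
    s = stdev(reference_r)  # sample standard deviation of the k reference values
    if s == 0:
        raise ValueError("reference values have zero spread")
    return (target_r - mean(reference_r)) / s


def has_structural_variation(target_r, reference_r, t_critical):
    """Flag the region when |t| reaches the critical value (two-sided, assumed)."""
    return abs(t_statistic(target_r, reference_r)) >= t_critical
```

The critical value `t_critical` is supplied by the caller, for example from a t-table at the chosen significance level and k - 1 degrees of freedom.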
25. The method according to any one of claims 14 to 19, wherein after step (3) is performed, W contiguous co-directional regions are merged to obtain a primary merged region; when two primary merged regions are co-directional and the span between them does not exceed L regions, the two primary merged regions are merged to obtain a secondary merged region; and structural variation of the secondary merged region is detected based on the degree of difference between the coverage depth of the secondary merged region of the target sample genome and the coverage depths of the corresponding regions of the plurality of reference sample genomes; wherein co-directional regions are regions whose t statistics are all greater than 0 or all less than 0, W and L are both natural numbers, W ≥ 2, and L - W ≤ 1.
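The two-stage merging of claim 25 can be sketched as below. Reading "span" as the number of regions lying between two primary merged regions is an assumption, as are the function name, the default values, and the (start, end, sign) output format; the final test on each secondary merged region is not shown.

```python
def merge_codirectional(t_values, w=2, max_gap=1):
    """Return (start, end, sign) triples with inclusive end indices."""
    signs = [(t > 0) - (t < 0) for t in t_values]
    # Stage 1: runs of at least `w` contiguous regions with the same non-zero sign.
    primary, start = [], 0
    for idx in range(1, len(signs) + 1):
        if idx == len(signs) or signs[idx] != signs[start]:
            if signs[start] != 0 and idx - start >= w:
                primary.append((start, idx - 1, signs[start]))
            start = idx
    # Stage 2: join neighbouring same-sign runs separated by at most `max_gap` regions.
    secondary = []
    for region in primary:
        if (secondary and secondary[-1][2] == region[2]
                and region[0] - secondary[-1][1] - 1 <= max_gap):
            secondary[-1] = (secondary[-1][0], region[1], region[2])
        else:
            secondary.append(region)
    return secondary
```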
26. A method for detecting loss of heterozygosity for non-diagnostic purposes comprising,
(1) sequencing the genomic nucleic acid of the target sample to obtain a genomic sequencing result, the genomic sequencing result being comprised of a plurality of reads, wherein the sequencing comprises screening with a probe, wherein the probe is obtained by the method of any one of claims 1 to 13;
(2) dividing a reference genome into m' regions; based on the read information falling in region i in the genome sequencing result and the data of population region i, obtaining the SNP sites shared by target sample genomic region i and population region i to form a shared SNP set; calculating the heterozygosity of the fragment containing each SNP site of the shared SNP set in the target sample genomic region i and in the population, respectively, to obtain a heterozygosity set U_i of target sample genomic region i and a heterozygosity set U_0i of population region i; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i; wherein the fragment containing an SNP site is bounded by the two SNPs adjacent to that SNP upstream and downstream, m' and i are natural numbers, m' is not less than 1 and not less than 6.
27. The method of claim 26, wherein each SNP in the common set of SNPs has an allele frequency greater than 0.1.
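Building the shared SNP set can be sketched as a filtered intersection. The data layout (a set of sample positions and a position-to-frequency mapping for the population) and the function name are assumptions; the 0.9 upper bound is taken from the embodiment's 0.1-0.9 range rather than from this claim.

```python
def shared_high_frequency_snps(sample_sites, population_af, min_af=0.1, max_af=0.9):
    """Sorted positions present in both data sets whose population allele
    frequency is above min_af and at most max_af."""
    return sorted(
        pos for pos in sample_sites
        if pos in population_af and min_af < population_af[pos] <= max_af
    )
```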
28. The method according to claim 26, wherein the heterozygosity of the fragment containing the SNP site is represented by the minor allele frequency coefficient R_het of the SNP site, R_het = MAF/(1 - MAF), where MAF is the minor allele frequency of the SNP.
29. The method of claim 28, wherein comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i comprises using an F-test to determine whether the variance of U_i
Figure FDA0003082386560000043
and the variance of U_0i
Figure FDA0003082386560000042
differ significantly; if the difference between the variances of U_i and U_0i is significant, it is determined that loss of heterozygosity has occurred in target sample region i; otherwise, it is determined that no loss of heterozygosity has occurred in target sample region i.
30. The method of claim 29, wherein the F-test comprises separately calculating the variances of U_i and U_0i, and using the obtained variance of the target sample U_i
Figure FDA0003082386560000051
and the variance of the population U_0i
Figure FDA0003082386560000052
to calculate two mutually reciprocal statistics F_upper and F_under, obtaining a significance level p_F from the reciprocal statistics, and comparing p_F with a predetermined significance level p_F0, using the following formulas:
Figure FDA0003082386560000053
Figure FDA0003082386560000054
Figure FDA0003082386560000055
p_F = p_upper + (1 - p_under), wherein v is the index of an SNP in the high-frequency SNP set shared by target sample genomic region i and population region i, q is the number of SNPs in the high-frequency SNP set shared by target sample genomic region i and population region i, and R_het,i,v is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of target sample genomic region i,
Figure FDA0003082386560000056
is the average of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of target sample genomic region i, R_het,i0,v is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of population sample genomic region i,
Figure FDA0003082386560000057
is the average of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of population sample genomic region i, p_upper and p_under are obtained from F_upper and F_under respectively, and p_F0 ≤ 0.05.
31. The method according to any one of claims 26 to 30, wherein after step (2), W' contiguous regions exhibiting loss of heterozygosity are merged to obtain a tertiary merged region; when the span between two tertiary merged regions does not exceed L' regions, the two tertiary merged regions are merged to obtain a quaternary merged region; a heterozygosity set of the quaternary merged region of the target sample and a heterozygosity set of the same region of the population are obtained; and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the quaternary merged region of the target sample, wherein W' and L' are both natural numbers, W' ≥ 2, and W'/2 ≥ L'.
32. A method for detecting uniparental disomy for non-diagnostic purposes, characterized in that when loss of heterozygosity is detected in a genomic region of a target sample, the copy number of the genomic region is calculated, and when that copy number is the same as the copy number of the corresponding genomic region in a normal genome of the same species, the genomic region of the target sample is determined to exhibit uniparental disomy; wherein the loss of heterozygosity in the genomic region of the target sample is determined by the method of any one of claims 26 to 30.
CN201480080426.0A 2014-07-04 2014-07-04 Method for determining probe sequence and method for detecting genome structure variation Active CN106715711B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/081686 WO2016000267A1 (en) 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation

Publications (2)

Publication Number Publication Date
CN106715711A CN106715711A (en) 2017-05-24
CN106715711B true CN106715711B (en) 2021-09-17

Family

ID=55018343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480080426.0A Active CN106715711B (en) 2014-07-04 2014-07-04 Method for determining probe sequence and method for detecting genome structure variation

Country Status (2)

Country Link
CN (1) CN106715711B (en)
WO (1) WO2016000267A1 (en)

Also Published As

Publication number Publication date
WO2016000267A1 (en) 2016-01-07
CN106715711A (en) 2017-05-24

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant