CN106715711B

CN106715711B - Method for determining probe sequence and method for detecting genome structure variation

Info

Publication number: CN106715711B
Application number: CN201480080426.0A
Authority: CN
Inventors: 李剑; 王煜; 李尉; 李金良; 赵霞; 陈仕平; 张现东; 刘赛军
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2021-09-17
Anticipated expiration: 2034-07-04
Also published as: WO2016000267A1; CN106715711A

Abstract

The invention provides a method for determining a probe sequence based on a reference sequence and a method for detecting genomic structural variation. The method for determining the probe sequence based on the reference sequence comprises the following steps: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes and each candidate probe contains at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of preset length, and assigning the candidate probes in the second candidate probe set to their matching windows to determine the position information of the candidate probes; and performing a second screening on the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs to determine the probe sequence.

Description

Method for determining probe sequence and method for detecting genome structure variation

PRIORITY INFORMATION

None

Technical Field

The invention relates to the technical field of genomics and bioinformatics, in particular to a method for determining a probe sequence and a method for detecting genome structural variation.

Background

DNA copy number variation (CNV) and loss of heterozygosity (LOH) are different types of genomic variation. CNV is a common genomic structural variation involving fragments from 1 kb to several Mb, and mainly takes the form of deletions and duplications at the sub-microscopic level. LOH means that a gene on one chromosome of a chromosome pair is deleted while the homologous chromosome is retained, so that only homozygous SNPs are observed over a long region of DNA. When LOH occurs without a change in copy number, i.e., both copies are inherited from one parent, it is called uniparental disomy (UPD). CNV, LOH, and UPD are associated with many common genetic diseases, cancers, and other complex diseases. Establishing a method that detects CNV, LOH, and UPD accurately, comprehensively, efficiently, quickly, simply, and economically is therefore of great value for studying chromosomal variation events, determining the causes of related diseases, and selecting appropriate treatment regimens.

Several detection technologies exist. PCR-based technologies include real-time fluorescence quantitative PCR and multiplex ligation-dependent probe amplification (MLPA); real-time fluorescence PCR analyzes one or a few targets at a time, while MLPA can analyze 40 sequences at a time with high sensitivity, but its detection range is limited to the chromosomes and regions targeted by the probes. FISH is generally used to detect specific chromosomes and cannot detect unknown regions. Chip-based technologies, including array-based comparative genomic hybridization (aCGH) and SNP chips, can detect CNVs across the whole genome, but cannot detect polyploidy and have a high miss rate for small-fragment deletions. Sequencing technologies detect structural variation across the whole genome based on whole genome sequencing (WGS), or variation in a target region based on target region sequencing, and mainly rely on four methods for analyzing CNV: paired-end mapping, read-depth analysis, split-read strategies, and sequence assembly comparisons.

With the development of sequencing technology, it is necessary to research means for discovering genomic structural abnormalities based on sequencing results, particularly local region sequencing results, including means for discovering chromosomal aneuploidy, CNV, insertion-deletion (indel), LOH, UPD, and SNP.

Disclosure of Invention

One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the steps of: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes and each candidate probe contains at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of preset length, and assigning the candidate probes in the second candidate probe set to their matching windows to determine the position information of the candidate probes; and performing a second screening on the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs to determine the probe sequence. Here, the allele frequency of a discrete high-frequency SNP site is greater than 10% and preferably not more than 90%, the physical distance on the reference genome between a discrete high-frequency SNP site and any other discrete high-frequency SNP site is not less than the length of the candidate probe, and the length of the candidate probe is 50 to 250 mers.

The probes obtained by the method for determining the probe sequence are used to hybridize with and capture the genome, yielding a plurality of local genomic regions. The captured local regions can represent the whole genome and reflect genome-wide variation information, and are used to discover structural variation across the whole genome.

In another aspect of the present invention, there is provided a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation, and indels, comprising the following steps: sequencing the genomic nucleic acid of a target sample to obtain a genome sequencing result composed of a plurality of reads, wherein the sequencing optionally comprises screening with a probe obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the invention. The genome sequencing result can be obtained by extracting genomic DNA, constructing a library according to the instruction manual of a conventional high-throughput platform, and sequencing on the instrument; it can also be obtained by capturing the genome of the target sample with a probe and then sequencing, wherein the probe can be obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the invention; dividing a reference genome into m regions, and calculating the coverage depth TD_i of the target sample genomic region i using the reads in the genome sequencing result that fall in region i, wherein m and i are natural numbers, 1 ≤ i ≤ m, and m > 10; and judging whether structural variation has occurred in region i of the target sample based on the degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of k reference samples, wherein k is a natural number and k ≥ 2. The coverage depth of region i of each reference sample can be obtained in the same way as the coverage depth of region i of the target sample.
By merging adjacent regions that show structural variation, it can further be detected whether the merged region carries a large structural variation, or whether the structural variation occurring in region i spans several regions.

In a further aspect of the invention, there is provided a method suitable for detecting loss of heterozygosity, another type of genomic structural variation, comprising the steps of: obtaining a genome sequencing result of the target sample, wherein, optionally, the genome sequencing result is obtained by capturing the genome of the target sample with a probe and sequencing, the probe being obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the invention; dividing the genome into m' regions, and, based on the reads in the genome sequencing result that fall in region i and on the data of population region i, obtaining the SNP set shared by the target sample genomic region i and the population region i; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment of each SNP site in the shared SNP set, to obtain a heterozygosity set U_i of the target sample genomic region i and a heterozygosity set U_0i of the population region i; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in region i of the target sample; wherein the allele frequency of each SNP in the shared SNP set is greater than 0.1, the fragment of an SNP site in the shared SNP set is delimited by the two SNPs immediately upstream and downstream of that SNP, m' and i are natural numbers, 1 ≤ i ≤ m', and m' ≥ 6. The population data are composed of data from a plurality of samples of the same species; the number of samples should be sufficient to represent the population faithfully and can be determined according to the required detection accuracy, the statistical method, the distribution of the sample data, and the like. The population data can be obtained by whole-genome sequencing, by the same method used to obtain the target sample data, or from a published database or website, such as the 1000 Genomes Project data.

In another aspect of the present invention, a computer-readable storage medium is provided for storing a program for execution by a computer. Those skilled in the art will understand that, when the program is executed, all or part of the steps of the methods for detecting genomic structural variation described above can be carried out by hardware under the control of the program instructions. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.

According to a final aspect of the present invention, there is provided an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including an executable program; and a processor, in data connection with the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, wherein execution of the program comprises performing all or part of the steps of the method for detecting genomic structural variation.

The probes obtained by the method for determining a probe sequence based on a reference sequence, or a solid-phase or liquid-phase chip containing these probes, are used to capture and sequence target regions, so that structural variation can be detected across the whole genome at low sequencing cost. CNV, LOH, and UPD can be detected across all 23 pairs of human chromosomes, and the detection resolution can be tuned by adjusting the average spacing of the probes, i.e., by adding or removing SNP sites as required. The target region capture sequencing and bioinformatic analysis methods provided by the invention achieve high-resolution, high-accuracy, high-throughput, and low-cost detection of CNV, LOH, and UPD across the whole genome. The method for detecting genomic structural variation provided by the invention is also suitable for detecting chromosomal aneuploidy, SNPs, and indels, and for structural variation analysis based on whole-genome sequencing data.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram showing the characteristics of the SeTR probes across the whole genome in one embodiment of the present invention: (A) length distribution of the SeTR probe sequences; (B) distribution of the physical distance between two adjacent SeTR probes.

FIG. 2 is a graph showing test results obtained with the SeTR probes in one embodiment of the present invention: (A) distribution of the coverage depth of the target regions; (B) distribution of reads supporting the ref base type and the non-ref base type.

FIG. 3 is a schematic diagram of the detection process for CNV, LOH, and UPD in one embodiment of the present invention.

FIG. 4 shows the R_i baseline in one embodiment of the present invention.

FIG. 5 is a schematic diagram of the genomic structural variation of a sample (GM50275) detected according to the present invention; the circles, from outside to inside, show I) chromosome information; II) the change in the r_i value (wavy lines); III) the change in the P value corresponding to R_het; IV) the change in the R_het value (dots).

Detailed Description

The following describes embodiments of the present invention in detail. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

According to an embodiment of the present invention, there is provided a method for determining a probe sequence based on a reference sequence, including the steps of:

Step one: constructing a first candidate probe set

A first candidate probe set is constructed using discrete high-frequency SNP sites distributed across the genome. Each candidate probe in the first candidate probe set comprises at least one discrete high-frequency SNP site; the allele frequency of each discrete high-frequency SNP site is greater than 10%; the physical distance on the reference genome between each discrete high-frequency SNP site and any other discrete high-frequency SNP site is not less than the length of the candidate probe; and the length of the candidate probe is 50 to 250 mers.
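The selection of discrete high-frequency SNP sites can be illustrated with a minimal sketch. It assumes SNPs are given as hypothetical `(position, allele_frequency)` pairs on a single chromosome, and it reads the spacing rule strictly: a site is kept only if no other high-frequency site lies closer than one probe length. The function name and default thresholds are illustrative choices, not part of the described method.

```python
def select_discrete_snps(snps, probe_len=100, min_af=0.10, max_af=0.90):
    """Return high-frequency SNPs that are at least probe_len away from
    every other high-frequency SNP on the same chromosome.

    snps: iterable of (position, allele_frequency) pairs (hypothetical input format).
    """
    # Keep sites with allele frequency above min_af and not above max_af.
    high = sorted((pos, af) for pos, af in snps if min_af < af <= max_af)
    kept = []
    for idx, (pos, af) in enumerate(high):
        # After sorting, only the immediate neighbours can violate the spacing rule.
        left_ok = idx == 0 or pos - high[idx - 1][0] >= probe_len
        right_ok = idx == len(high) - 1 or high[idx + 1][0] - pos >= probe_len
        if left_ok and right_ok:
            kept.append((pos, af))
    return kept


example = [(100, 0.5), (150, 0.3), (400, 0.2), (700, 0.05), (900, 0.95), (1000, 0.4)]
selected = select_discrete_snps(example)
```

In this example the sites at 100 and 150 are only 50 bases apart, so both are dropped, while the sites at 700 and 900 fail the frequency thresholds.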

In one embodiment of the invention, the discrete high-frequency SNPs are obtained from 1000 Genomes Project data or from other published genome data; from the discrete high-frequency SNP sites so obtained, those with an allele frequency of less than 90% are further selected, and the length of the candidate probe is set to 100 mers.

In one embodiment of the present invention, each candidate probe comprises one discrete high-frequency SNP site, and the discrete high-frequency SNP site is located in the middle section of the candidate probe. Thus, each candidate probe contains only one high-frequency SNP site, and adjacent candidate probes may or may not overlap. The term "middle section" is used relative to "front section" and "rear section" and can be understood conventionally; for example, the upstream third and the downstream third of a sequence are the "front section" and the "rear section", respectively, and the central third is the "middle section". Further, the discrete high-frequency SNP site is located at the midpoint of the candidate probe, which enhances the capture efficiency of the probe for the target discrete high-frequency SNP site. For a sequence of 2n+1 nucleotides, the midpoint is the position of the (n+1)-th nucleotide; for a sequence of 2n nucleotides, the midpoint is the position of the n-th or (n+1)-th nucleotide.
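The midpoint placement can be sketched as follows. This is an illustrative helper, not the described implementation: it assumes a 0-based SNP index into a reference string, returns a half-open interval, and for even probe lengths picks the n-th nucleotide (the text allows either the n-th or the (n+1)-th).

```python
def probe_centered_on_snp(reference, snp_index, probe_len=100):
    """Cut a probe of probe_len bases from reference with the SNP at its midpoint.

    snp_index is 0-based. Returns (start, end, sequence) with a half-open
    [start, end) interval, or None if the probe would run off the reference.
    """
    # For 2n+1 bases this leaves n bases upstream (SNP is the (n+1)-th base);
    # for 2n bases it leaves n-1 bases upstream (SNP is the n-th base).
    upstream = (probe_len - 1) // 2
    start = snp_index - upstream
    end = start + probe_len
    if start < 0 or end > len(reference):
        return None
    return start, end, reference[start:end]


ref = "ACGTACGTAC"
odd_probe = probe_centered_on_snp(ref, snp_index=4, probe_len=5)
even_probe = probe_centered_on_snp(ref, snp_index=4, probe_len=4)
```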

In one embodiment of the invention, the first candidate probe set is prescreened based on the GC content and/or the single-base repetition degree of the candidate probe sequences, retaining candidate probes with a GC content of 35% to 65% and/or a single-base repetition degree of less than 7. The single-base repetition degree is the number of consecutive occurrences of one base type in a sequence; for example, in TGAAAAAAAAGC, A occurs 8 times consecutively, so the A-base repetition degree of the sequence is 8. PCR and hybrid capture of a sequence are easily affected by high or low GC content and by high heterozygosity of the sequence, which introduces GC bias and reduces capture specificity. The candidate probes retained by the prescreening do not hybridize with such sequences, which avoids the influence of GC bias or low-specificity capture on the results.
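A small sketch of the prescreening criteria follows. The function names are hypothetical, and the sketch applies both criteria together, although the text allows either criterion to be used alone.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_single_base_repeat(seq):
    """Length of the longest run of one repeated base in a non-empty sequence."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def passes_prescreen(seq, gc_min=0.35, gc_max=0.65, repeat_limit=7):
    """Keep probes with GC content in [gc_min, gc_max] and repetition degree below repeat_limit."""
    return gc_min <= gc_content(seq) <= gc_max and max_single_base_repeat(seq) < repeat_limit


repeat_degree = max_single_base_repeat("TGAAAAAAAAGC")  # the example sequence from the text
```

For the example sequence from the text, the longest run is the stretch of eight A bases, so the probe fails the repetition criterion; its GC content of 3/12 also falls below 35%.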

Step two: aligning the first candidate probe set with the reference sequence to obtain an alignment result

The first candidate probe set is aligned with the reference sequence to obtain an alignment result, which gives the position information of the first candidate probe set on the reference sequence. The reference sequence is a known sequence and may be any previously obtained reference template of the biological category to which the target sample belongs. For example, if the target sample is human, HG18 or HG19 provided by the National Center for Biotechnology Information (NCBI) can be selected as the reference sequence. A resource library containing more reference sequences can also be configured in advance, and before sequence alignment a closer reference sequence can be selected according to factors such as the sex, ethnicity, and geographic region of the target sample, which helps to obtain a more targeted probe sequence.

Step three: performing a first screening on the first candidate probe set to obtain a second candidate probe set

In one embodiment of the present invention, the candidate probes retained by the first screening satisfy either of the following two conditions: 1) the candidate probe aligns to a unique position in the reference genome; 2) the candidate probe aligns to a plurality of positions of the reference sequence, and its mismatch rate is less than 10% for at least two of those positions. For example, for a candidate probe of 100 mers, 10 mismatched bases correspond to a mismatch rate of 10%. With a low mismatch rate, the probe pairs almost fully complementarily with the target region during hybridization, giving good capture performance and high specificity.

Step four: dividing the reference sequence into a plurality of windows, assigning a second set of candidate probes to the respective matching windows

The reference sequence is divided into a plurality of windows of preset length, and the candidate probes in the second candidate probe set are assigned to their matching windows by alignment, giving the position information of each candidate probe within its window.

The windows of preset length may be of equal or unequal length and may or may not overlap. In one embodiment of the invention, the reference sequence is a reference genome, the reference genome is divided into a plurality of windows of equal length, the window length is 10 kb, and adjacent windows are contiguous but do not overlap.
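Window assignment for contiguous, non-overlapping 10 kb windows can be sketched as below. The sketch assumes probes are given as hypothetical `(probe_id, start, end)` tuples with 0-based half-open coordinates on one chromosome, and it assigns each probe by its midpoint, which is one possible convention rather than something the text prescribes.

```python
from collections import defaultdict


def assign_to_windows(probes, window_len=10_000):
    """Group probes by the index of the fixed-length window containing their midpoint.

    probes: iterable of (probe_id, start, end) tuples, 0-based half-open coordinates.
    Returns {window_index: [probe_id, ...]}.
    """
    windows = defaultdict(list)
    for probe_id, start, end in probes:
        midpoint = (start + end) // 2
        windows[midpoint // window_len].append(probe_id)
    return dict(windows)


assigned = assign_to_windows(
    [("p1", 100, 200), ("p2", 9_950, 10_050), ("p3", 25_000, 25_100)]
)
```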

Step five: performing a second screening on the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs to determine the probe sequence

In one embodiment of the invention, the second screening comprises two steps: (a) if several candidate probes are located in the same window, determining the candidate probe whose discrete high-frequency SNP has the highest allele frequency; (b) if only one candidate probe has the highest allele frequency, selecting that candidate probe as the probe, and if several candidate probes share the highest allele frequency, selecting among them the candidate probe closest to the center of the window as the probe. The distance of a candidate probe from the center of the window may be taken as the distance between the midpoint of the candidate probe and the center of the window. Placing the target position as close as possible to the center of the probe sequence helps to improve capture efficiency.
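The two-step selection within one window can be sketched as follows. The tuple layout `(probe_id, start, end, allele_freq)` and the function name are assumptions made for illustration; allele frequencies are compared with exact equality, which is adequate for a sketch but may need a tolerance with real data.

```python
def pick_probe_for_window(candidates, window_start, window_len=10_000):
    """Pick one probe per window: highest SNP allele frequency first,
    then smallest distance between probe midpoint and window center.

    candidates: list of (probe_id, start, end, allele_freq) tuples (hypothetical format).
    Returns the chosen tuple, or None for an empty window.
    """
    if not candidates:
        return None
    best_af = max(c[3] for c in candidates)
    top = [c for c in candidates if c[3] == best_af]
    center = window_start + window_len / 2
    return min(top, key=lambda c: abs((c[1] + c[2]) / 2 - center))


tied = [("a", 1_000, 1_100, 0.4), ("b", 4_900, 5_000, 0.5), ("c", 8_000, 8_100, 0.5)]
chosen = pick_probe_for_window(tied, window_start=0)
```

Here probes "b" and "c" share the highest allele frequency, and "b" is selected because its midpoint (4,950) is closer to the window center (5,000).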

In one embodiment of the present invention, after the second screening of the second candidate probe set, when the distance on the reference genome between two adjacent candidate probes that fall into two adjacent windows is greater than the length of either of the two windows, a short tandem repeat sequence located between the two adjacent candidate probes on the reference genome, or a part of that sequence, is optionally added to the screened second candidate probe set, and together they form the probe sequences. In this way, when the designed probe sequences are used to capture the whole genome, the captured regions are spaced relatively uniformly, and the combination of captured regions reflects the information of the whole genome more completely.

According to another embodiment of the present invention, there is provided a method for detecting a genomic structural variation including at least one of a chromosomal aneuploidy, a copy number variation, and an indel, comprising the steps of:

(I) Sequencing the genomic nucleic acid of the target sample to obtain a genome sequencing result composed of a plurality of reads. The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and performing library construction and on-machine sequencing according to the instruction manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads. It can also be obtained by capturing the genome of the target sample with a probe and sequencing the captured material, wherein the probe can be designed and determined by the method for determining a probe provided by one aspect of the invention and then synthesized or prepared according to existing methods.

(II) Dividing the reference genome into m regions, and calculating the coverage depth TD_i of the target sample genomic region i using the reads in the sequencing result that fall in region i, wherein m and i are natural numbers, i is the region number, 1 ≤ i ≤ m, and m > 10.

In one embodiment of the present invention, the coverage depth of the area i is calculated by the formula

Or

Where i represents the number of the region. The positions of the reads on the genome can be determined by sequence alignment using software such as SOAP (Short Oligonucleotide Analysis Package), BWA (Burrows-Wheeler Aligner), SAMtools, GATK (Genome Analysis Toolkit), and the like.
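As an illustration only, a per-region coverage depth can be sketched as below. The formulas themselves are not reproduced in the text above, so the definition used here (aligned bases overlapping a region divided by the region length) is an assumption, as are the function name and the `(start, end)` half-open interval format for reads and regions.

```python
def coverage_depth(reads, regions):
    """Mean per-base depth of each region (assumed definition, see lead-in).

    reads:   list of (start, end) aligned intervals, 0-based half-open.
    regions: list of (start, end) region intervals, 0-based half-open.
    """
    depths = []
    for r_start, r_end in regions:
        # Sum the number of aligned bases of every read that overlaps the region.
        covered = sum(
            max(0, min(end, r_end) - max(start, r_start)) for start, end in reads
        )
        depths.append(covered / (r_end - r_start))
    return depths


depths = coverage_depth(
    reads=[(0, 100), (50, 150), (180, 220)],
    regions=[(0, 100), (100, 200)],
)
```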

(III) Judging whether structural variation has occurred in region i of the target sample based on the degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of k reference samples, wherein k is a natural number and k ≥ 2.

In one embodiment of the present invention, the degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of genomic region i of the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples. Obtaining the coverage depth coefficient R_i of the target sample genomic region i comprises: (a) performing a first correction on TD_i to obtain TD_ai, the first correction being a linear regression over the coverage depth values of 2n consecutive regions including region i, where n is a natural number and 10 < n ≤ m/2. In one embodiment of the invention, the first-correction linear regression gives

Wherein TD_j represents the coverage depth of the j-th region among the n consecutive regions, j being a natural number with 1 ≤ j ≤ n; (b) after the first-corrected coverage depth TD_ai of region i is obtained, TD_ai is further normalized to obtain

Thereby obtaining

In one embodiment of the invention, the first-corrected coverage depth TD_ai of region i is normalized to obtain

In one embodiment of the invention, after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i,

Wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples,

y is a natural number denoting the reference sample number, and R_i,y denotes the coverage depth coefficient of genomic region i of reference sample y.

In another embodiment of the present invention, after R_i of the target sample is obtained, the method further comprises performing a second correction on R_i to obtain r_i,

Wherein R_ai is the average of the coverage depth coefficients of genomic region i of the k reference samples and the one target sample,

Correcting and normalizing the intermediate values when calculating the coverage depth coefficient R_i of genomic region i of the target sample reduces errors caused by fluctuations in experimental conditions, inherent differences among samples, and the like, so that the final r_i truly reflects R_i, fluctuates around 1 with a smaller amplitude than R_i, and follows a normal distribution across multiple samples. In the above embodiment, performing the first correction on TD_i and then normalizing the first-corrected value corresponds to a double averaging procedure: before the coverage depth of region i is represented by the average coverage depth of n consecutive regions including region i, the coverage depth of each of those n regions is itself represented by the average coverage depth of the n consecutive regions that begin with that region. This is equivalent to correcting TD_i with the coverage depth values of 2n consecutive regions including the target region i, which keeps the coverage depth of consecutive regions stable. Those skilled in the art may use other correction or averaging procedures to stabilize the coverage depth values of adjacent regions, such as correcting the coverage depth of the target region with the average coverage depth of several regions located at a certain distance from the target region. The coverage depth coefficient of genomic region i of a reference sample may be calculated in the same way as that of the target sample; the reference sample data may be calculated in advance and stored for later use, or calculated alongside the target sample.
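The double averaging idea described in this paragraph can be sketched as two passes of a forward moving average. This is only an illustration of the averaging procedure as described in words; it does not reproduce the formulas of the embodiment, which are not shown in the text. The function names are hypothetical, windows are truncated at the end of the region list, and a small n is used for readability even though the text specifies n > 10.

```python
def forward_mean(values, n):
    """Average of up to n consecutive values starting at each index (truncated at the end)."""
    out = []
    for i in range(len(values)):
        window = values[i:i + n]
        out.append(sum(window) / len(window))
    return out


def double_average(depths, n):
    """Apply the forward moving average twice, so each output draws on up to 2n-1 regions."""
    return forward_mean(forward_mean(depths, n), n)


smoothed = double_average([1.0, 3.0, 5.0, 7.0], n=2)
```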

In one embodiment of the present invention, the degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of the k reference samples is determined by using a t-test to assess whether the difference between the coverage depth coefficients is significant. In one embodiment of the invention, the t-test statistic for genomic region i of the target sample is calculated as

Wherein

denotes the mean of r_i,y over the k reference samples, r_i,y being the second-corrected coverage depth coefficient of genomic region i of reference sample y, and

denotes the standard deviation over the k reference samples.

Based on the t_i value of the target sample genomic region i, a significance level P_i is obtained. When P_i < 0.05, region i is judged to have a structural variation; otherwise, region i is judged to have no structural variation. In another embodiment of the invention, a theoretical value t_i0 of t_i is obtained for a predetermined significance level P_i0 and compared with the t_i value of the target sample genomic region i: when t_i ≥ t_i0, region i is judged to have a structural variation; otherwise, region i is judged to have no structural variation. The predetermined P_i0 is less than or equal to 0.05. Once P_i0 is set, the corresponding t_i0 can be looked up in a table of t values for the t-test.
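The critical-value variant of the test can be sketched as below. The exact statistic is not reproduced in the text, so the form used here, (target value minus reference mean) divided by the reference sample standard deviation, is an assumption; the function names are hypothetical, the critical value is supplied by the caller from a t table, and the absolute value is used so that both gains and losses are flagged.

```python
from statistics import mean, stdev


def t_statistic(r_target, r_refs):
    """Assumed statistic: deviation of the target from the reference mean,
    in units of the reference sample standard deviation (needs >= 2 references)."""
    return (r_target - mean(r_refs)) / stdev(r_refs)


def has_structural_variation(r_target, r_refs, t_critical):
    """Flag region i when |t_i| reaches the caller-supplied critical value t_i0."""
    return abs(t_statistic(r_target, r_refs)) >= t_critical


refs = [0.9, 1.0, 1.1]
t_gain = t_statistic(1.5, refs)
```

The sign of the statistic is kept by `t_statistic` because it indicates the direction of the change, which the region-merging step relies on.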

In one embodiment of the present invention, to detect a larger CNV or indel, after step (III) is performed, W consecutive regions in the same direction are merged to obtain a primary merged region; two primary merged regions are merged into a secondary merged region when they are in the same direction and are separated by no more than L regions; and structural variation of the secondary merged region is then detected. Regions in the same direction are regions whose coverage-depth t statistics are all greater than 0 or all less than 0; W and L are natural numbers, W ≥ 2, and L-W ≤ 1. Still larger structural variations can be detected by analogy, for example by further merging eligible secondary merged regions; the merging condition can analogously be that two secondary merged regions are in the same direction and are separated on the reference genome by no more than L regions or L secondary merged regions.
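The two-level merging can be sketched as follows. The sketch assumes each region has already been reduced to a direction code: +1 or -1 for a region judged to have structural variation with a positive or negative t statistic, and 0 otherwise. That encoding, the function names, and the reading of "W consecutive regions" as "at least W" are illustrative assumptions.

```python
def primary_merge(directions, W=2):
    """Merge runs of at least W consecutive regions sharing a non-zero direction.

    directions: list of +1, -1 or 0 per region (assumed encoding, see lead-in).
    Returns a list of (first_index, last_index, direction), indices inclusive.
    """
    runs, start = [], 0
    for i in range(1, len(directions) + 1):
        if i == len(directions) or directions[i] != directions[start]:
            if directions[start] != 0 and i - start >= W:
                runs.append((start, i - 1, directions[start]))
            start = i
    return runs


def secondary_merge(runs, L=1):
    """Join neighbouring primary runs with the same direction and at most L regions between them."""
    merged = []
    for start, end, sign in runs:
        if merged and merged[-1][2] == sign and start - merged[-1][1] - 1 <= L:
            merged[-1] = (merged[-1][0], end, sign)
        else:
            merged.append((start, end, sign))
    return merged


calls = [1, 1, 0, 1, 1, 1, -1, -1, 0, 0, 0, -1, -1]
primary = primary_merge(calls, W=2)
secondary = secondary_merge(primary, L=1)
```

In this example the first two gain runs are separated by a single region and are joined, while the two loss runs are three regions apart and stay separate.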

In one embodiment of the present invention, structural variation of the secondary merged region is detected by determining, based on the degree of difference between the coverage depth of the secondary merged region of the target sample genome and the coverage depths of the corresponding regions of a plurality of reference sample genomes, whether the secondary merged region has a structural variation, or whether the structural variation occurring in region i spans W regions. The coverage depth of the corresponding secondary merged region of the reference sample genomes, the t statistic of the coverage depth of the secondary merged region of the target sample genome, and the structural variation judgment can all be obtained in the same way as the calculation and judgment for the smaller region i.

According to yet another embodiment of the present invention, there is provided a method for detecting loss of heterozygosity in a genomic structural variation, comprising the steps of:

(1) sequencing the genomic nucleic acid of a target sample to obtain a genome sequencing result composed of a plurality of reads. The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and then performing library construction and on-machine sequencing according to the instruction manual of an existing high-throughput platform, such as Illumina Hiseq2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform; or it can be obtained by capturing the genome of the target sample with probes and sequencing the captured genome, wherein the probes can be designed and determined by the method for determining probes provided in one aspect of the invention and then synthesized or prepared according to existing methods.

(2) Dividing a reference genome into m' regions; obtaining the SNP set shared by target sample genome region i and population region i, based on the reads in the sequencing result that fall in reference genome region i and on the population region i data; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment in which each SNP site of the shared SNP set is located, to obtain the heterozygosity set U_i of target sample genome region i and the heterozygosity set U_0i of population region i; and comparing the target sample U_i with the population U_0i to determine whether loss of heterozygosity has occurred in target sample region i; wherein the allele frequency of each SNP in the shared SNP set is greater than 0.1, the fragment in which an SNP site of the shared SNP set is located is bounded by the two SNPs adjacent to it upstream and downstream, m' and i are natural numbers, m' ≥ i ≥ 1, and m' ≥ 6.

In one embodiment of the present invention, the heterozygosity of the fragment in which an SNP site is located is expressed by the minor allele frequency coefficient of that SNP site, R_het = MAF/(1-MAF), where MAF is the minor allele frequency of the high-frequency SNP.
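The coefficient is simple enough to state as code. The sketch below is illustrative only; the helper that estimates MAF from allele-supporting read counts is an assumption about how MAF might be obtained for a sequenced sample, and both function names are hypothetical.

```python
def r_het_from_maf(maf: float) -> float:
    """R_het = MAF / (1 - MAF); MAF is a minor allele frequency, so 0 <= MAF <= 0.5."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    return maf / (1.0 - maf)

def r_het_from_reads(ref_reads: int, alt_reads: int) -> float:
    """ASSUMPTION: in a sequenced sample, MAF is estimated as the fraction of
    reads supporting the less frequent of the two alleles at the site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("site has no coverage")
    return r_het_from_maf(min(ref_reads, alt_reads) / total)

print(r_het_from_maf(0.5), r_het_from_maf(0.0))  # fully heterozygous vs. fully homozygous
```

A perfectly balanced site (MAF = 0.5) gives R_het = 1 and a site with only one allele gives R_het = 0, so the coefficient ranges over [0, 1].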

In one embodiment of the invention, comparing the target sample U_i with the population U_0i to determine whether loss of heterozygosity has occurred in target sample region i comprises using an F-test to determine whether the variance of U_i and the variance of U_0i differ significantly: if the difference between the two variances is significant, it is determined that target sample region i has loss of heterozygosity; otherwise, it is determined that target sample region i has no loss of heterozygosity.

In one embodiment of the invention, the F-test comprises separately calculating the variance of the target sample U_i and the variance of the population U_0i (each variance being computed over the q coefficients of the set about their mean), calculating from the two variances two mutually reciprocal statistics F_upper and F_under, obtaining the significance level p_F from these two statistics, and comparing p_F with a predetermined significance level p_F0: if p_F ≤ p_F0, the difference between the two variances is significant; otherwise it is not. The F-test includes the calculation formula

p_F = p_upper + (1 - p_under),

wherein v is the index of an SNP in the SNP set shared by target sample genome region i and population region i; q is the number of SNPs in that shared SNP set; R_het,i,v is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for target sample genome region i; $\bar{R}_{het,i}$ is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set for target sample genome region i; R_het,i0,v is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for population sample genome region i; $\bar{R}_{het,i0}$ is the mean of the minor allele frequency coefficients of the q SNPs in the shared SNP set for population sample genome region i; p_upper and p_under are obtained from F_upper and F_under, respectively; and p_F0 ≤ 0.05. p_F0 may be set to a commonly used value, or may be adjusted according to known information, the required detection accuracy, and the like.
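As an illustration of how p_F = p_upper + (1 - p_under) could be evaluated, the following standard-library sketch makes several assumptions that the text does not fix: variances are unbiased sample variances (divisor q-1), F_upper is the larger of the two variance ratios and F_under its reciprocal, and p_upper and p_under are upper-tail probabilities of an F distribution with (q-1, q-1) degrees of freedom. The regularized incomplete beta function is evaluated with the usual continued-fraction method; all function names are hypothetical.

```python
import math
from statistics import variance

def _betacf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b

def f_sf(f, d1, d2):
    """Upper-tail probability P(F' > f) for an F(d1, d2) distribution."""
    if f <= 0.0:
        return 1.0
    return 1.0 - betainc_reg(d1 / 2.0, d2 / 2.0, d1 * f / (d1 * f + d2))

def loh_f_test(u_target, u_population):
    """Two-sided variance-ratio test on two heterozygosity sets of equal size q."""
    q = len(u_target)
    if q != len(u_population) or q < 2:
        raise ValueError("both sets must contain the same q >= 2 SNPs")
    var_t, var_p = variance(u_target), variance(u_population)
    if var_t == 0 and var_p == 0:
        return 1.0          # identical (zero) variances: no evidence of a difference
    if var_t == 0 or var_p == 0:
        return 0.0          # ratio is 0 or infinite: maximally significant
    ratio = var_t / var_p
    f_upper, f_under = max(ratio, 1.0 / ratio), min(ratio, 1.0 / ratio)
    p_upper = f_sf(f_upper, q - 1, q - 1)   # ASSUMED: upper-tail probability at F_upper
    p_under = f_sf(f_under, q - 1, q - 1)   # ASSUMED: upper-tail probability at F_under
    return p_upper + (1.0 - p_under)

p_f = loh_f_test([0.0, 0.01, 0.0, 0.01], [0.1, 0.9, 0.2, 0.8])
print(p_f <= 0.05)  # compare with the preset significance level p_F0
```

Because both sets are built from the same shared SNP set, the two degrees of freedom are equal, which is why a single (q-1, q-1) distribution serves for both tails in this sketch.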

In one embodiment of the present invention, to detect a larger LOH, after step (2), W' consecutive regions with loss of heterozygosity are merged to obtain a tertiary merged region; two tertiary merged regions are merged into a quaternary merged region when the span between them does not exceed L' regions; the heterozygosity set of the quaternary merged region of the target sample and the heterozygosity set of the same region of the population are obtained respectively; and the two heterozygosity sets are compared to determine whether the quaternary merged region of the target sample has loss of heterozygosity, wherein W' and L' are both natural numbers, W' is not less than 2, and W'/2 is not less than L'. In one embodiment of the present invention, W' ≥ 4. LOH occurring in larger regions can be detected by analogy, e.g., by further merging eligible quaternary merged regions, where the merging condition can likewise be that the distance between two quaternary merged regions on the reference genome does not exceed L' regions or L' tertiary merged regions.

According to still another embodiment of the present invention, there is provided a method for detecting uniparental disomy (UPD), wherein, when loss of heterozygosity is present in a genomic region of a target sample, the copy number of that genomic region is calculated, and when that copy number is the same as the copy number of the genomic region in a normal genome of the same species, it is determined that UPD is present in the genomic region of the target sample; the presence or absence of LOH in a genomic region can be detected by the LOH detection method according to one aspect of the present invention.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by hardware associated with program instructions, and the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.

According to a final embodiment of the present invention, there is also provided an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including an executable program; and the processor is in data connection with the data input unit, the data output unit and the storage unit and is used for executing the executable program stored in the storage unit, and the execution of the program comprises the completion of all or part of the steps of the various methods in the embodiment.

The operation results of the specific probe design method and the structural variation detection method according to the present invention will be described in detail below with reference to specific target individuals. The name definitions or specific parameter settings involved in the following process are selected as:

1. the designed probe is called a selected Target Region probe (SeTR);

2. hereinafter, "depth of coverage", "depth of sequencing" and "depth" may be used interchangeably; hereinafter, "area" and "target area" may be used interchangeably;

3. libraries are constructed and sequenced according to the small-fragment library construction instructions and the on-machine sequencing instructions provided for the Hiseq2000 platform, wherein the library size is 300bp-350bp, sequencing is paired-end, and the read length is 91bp (the sequencing type is PE91+8+91);

4. the reference genome or reference sequence selected for alignment is the human reference genome (hg19, Build 37).

Where the examples do not specify particular techniques or conditions, they are carried out according to techniques or conditions described in the literature in the art (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, third edition, translated by Huang Peitang et al., Science Press) or according to the product instructions. Reagents or apparatus for which no manufacturer is indicated are conventional, commercially available products, for example from Illumina.

Example 1: chip design, preparation and test

In general, high (>60%) or low (<35%) GC content and high heterozygosity tend to adversely affect DNA fragments during PCR or probe capture. To avoid this, we designed a special probe, which we call SeTR, with the following properties: a) the probe sequence is highly unique and stable, which requires low heterozygosity and moderate GC content (35%-60%); b) each probe contains a discrete high-frequency SNP marker with an allele frequency of 0.1 < AF < 0.9, so that LOH can be better detected across the whole genome; and c) the final target regions are relatively uniformly distributed.

The flow of SeTR probe design or target region selection is as follows:

1) Based on the 1000 Genomes Project database (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/release), a candidate SNP set with allele frequency (AF) of 10%-90% is selected; then, wherever the physical distance between two SNPs is less than 100 bp, one of the two SNPs is removed from the set, thereby forming SNP marker set 1.

2) Taking each SNP of SNP marker set 1 as the midpoint, 50 bp of reference genome sequence is extracted upstream and 50 bp downstream, forming a set of theoretical 100 bp probe sequences, and the probe sequences are then aligned back to the reference sequence. If the best alignment of a probe sequence has no mismatches and the second-best alignment has less than 5% mismatches, the corresponding SNP is retained, thereby forming SNP marker set 2.

3) From SNP marker set 2, we pick SNP markers that are uniformly spaced in physical position on the reference genome as the final SNP marker set. In our study, we selected a set of SNP markers with a physical spacing of about 10 Kbp.

4) If the distance between two adjacent SNPs in the final SNP marker set is larger than 10 Kbp, short tandem repeats (STRs) located between the two adjacent SNPs are selected to fill the gap and keep the spacing uniform.
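Steps 1) to 3) can be sketched as below for the SNPs of a single chromosome. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the later of two SNPs closer than 100 bp is the one dropped, and the per-window choice (highest allele frequency, ties broken by distance to the window centre) follows the selection rule recited in claim 1 rather than anything stated in this example. Alignment-based screening (step 2) and STR gap filling (step 4) are not shown.

```python
from typing import Dict, List, Tuple

Snp = Tuple[int, float]  # (position in bp on one chromosome, allele frequency)

def filter_snps(snps: List[Snp], af_min: float = 0.1, af_max: float = 0.9,
                min_dist: int = 100) -> List[Snp]:
    """Step 1: keep SNPs with af_min < AF < af_max and at least min_dist bp apart.
    ASSUMPTION: of two SNPs that are too close, the downstream one is removed."""
    kept: List[Snp] = []
    for pos, af in sorted(snps):
        if not af_min < af < af_max:
            continue
        if kept and pos - kept[-1][0] < min_dist:
            continue
        kept.append((pos, af))
    return kept

def probe_interval(pos: int, flank: int = 50) -> Tuple[int, int]:
    """Step 2 (coordinates only): a 100 bp interval with the SNP at its midpoint."""
    return (pos - flank, pos + flank)

def pick_per_window(snps: List[Snp], window: int = 10_000) -> List[Snp]:
    """Step 3: keep one marker per window of the reference.
    ASSUMPTION: highest AF wins; ties go to the SNP nearest the window centre."""
    best: Dict[int, Tuple[Tuple[float, float], Snp]] = {}
    for pos, af in snps:
        w = pos // window
        centre = w * window + window / 2
        key = (-af, abs(pos - centre))
        if w not in best or key < best[w][0]:
            best[w] = (key, (pos, af))
    return [best[w][1] for w in sorted(best)]

candidates = [(1000, 0.3), (1050, 0.5), (5200, 0.5), (5000, 0.95), (12000, 0.2)]
markers = pick_per_window(filter_snps(candidates))
print(markers, [probe_interval(p) for p, _ in markers])
```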

After designing the SeTR probe, we entrusted Roche to produce SeTR liquid phase chips. The SeTR liquid phase chip contains 278800 probes, with a total size of 41,795,106bp, covering 1.45% of the effective whole genome (2.89G). The average length of the SeTR probes reached 149bp, and the average physical distance between two adjacent probes was 10.6kbp, as shown in Table 1 and FIG. 1.

TABLE 1 distribution of SeTR probes on each chromosome

The usability of the SeTR chip was tested with three quality-qualified DNA samples, YH (the Yanhuang sample, Chinese genomic DNA), HG00537 (a sample from the 1000 Genomes Project) and GM50275 (a human fibroblast sample obtained from the Coriell Institute for Medical Research), to ensure that the probe chip could be used for subsequent detection studies. All three samples were captured with SeTR, built into libraries and sequenced to obtain sequencing reads. First, we removed reads contaminated by adapters and reads of low quality (e.g., average quality value less than 20); the remaining reads are called clean reads. The clean reads were aligned to the hg19 reference sequence: 98.13%-99.29% of reads aligned to the reference genome, and 67.43%-67.87% aligned to the target region. In addition, 99.73%-99.95% of the target region was covered by at least one read, and at least 99% of the target region was covered at least 10 times, as shown in Table 2. These figures are all better than those of exome capture chips of the same type, such as the exome liquid-phase chip produced by Roche NimbleGen. Furthermore, the depth distribution of the target region, shown in FIG. 4A, is similar to a Poisson distribution, and FIG. 4B shows that, for most highly heterozygous sites in the target region, the number of reads supporting the non-reference allele is almost equal to the number of reads supporting the reference allele, i.e., the reads from the two homologous chromosomes at a highly heterozygous site are present in comparable numbers. Together these results show that the probes have no obvious capture bias toward one haplotype (usually the reference allele) and capture the target region uniformly.

TABLE 2 alignment of the three samples

Example 2: target region library construction and sequencing

1. Test materials, reagents, and instruments

Samples: 15 target gDNA samples (human genomic DNA; sample numbers are given in Table 3, and samples whose numbers begin with "GM" are human fibroblasts) and 24 reference DNA samples.

Main reagents and instruments: PCR instrument, pipettes, centrifuge, constant-temperature mixer (Thermomixer comfort), DNA shearing instrument, vortex mixer, magnetic rack, electrophoresis apparatus, Hiseq2000 sequencer, Nanodrop ultraviolet spectrophotometer, etc. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.

Probe design and synthesis: as described in Example 1, a target region of about 41M was selected across the whole human genome, and NimbleGen SeqCap EZ liquid-phase probes capable of capturing the designed target region were custom-ordered from Roche.

2. Library construction

1) Genomic DNA extraction

About 3-5 μg of genomic DNA was extracted from the target sample using a QIAGEN DNA extraction kit (DNA Mini Kit) according to the kit instructions, for use in subsequent experiments. An aliquot of the extracted DNA (30-100 ng) was checked by electrophoresis to assess its integrity and degree of degradation.

2) Genomic DNA disruption and purification

Genomic DNA was fragmented to 200-250 bp using a Covaris E210 instrument (operated according to the instrument instructions). The fragmented DNA was purified using the QIAquick PCR Purification Kit (250) according to the kit instructions, and electrophoresis was used to confirm that the main band was within the 200-250 bp range.

3) End repair, A-tailing, adapter ligation, pre-amplification

Following the paired-end index library construction instructions, including the reagents and reaction conditions listed there, the fragmented and purified DNA was end-repaired and purified; an A base was added to both ends of the end-repaired DNA fragments, and the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified with magnetic beads capable of binding complementarily to the sequencing adapters. A PCR reaction system was then prepared to amplify the adapter-ligated DNA fragments, the PCR product was purified with magnetic beads, and electrophoresis was used to check that the main band of the amplified product was 300-350 bp; the DNA quantity, measured with a Nanodrop ultraviolet spectrophotometer, was more than 1.0 μg in total.

4) Hybridization and elution of SeTR Probe, amplification

The hybridization and elution reagents were purchased or prepared according to the instructions of the commercial NimbleGen SeqCap EZ hybridization and elution kit. Cot-1 DNA, the universal blocking sequence (Block), the index blocking sequence (index N Block) and the DNA sample from step 3) were added to a 1.5 mL centrifuge tube. The tube was centrifuged for 1 min and vacuum-concentrated to dryness at 60 °C; hybridization buffer was added, and the tube was vortexed and centrifuged, denatured in a metal dry bath at 95 °C for 10 min, and then vortexed and centrifuged at high speed. Next, 4.5 μL of probe was added to the tube and hybridization was carried out on a PCR instrument (47 °C, 64-72 hours). Elution was performed after hybridization was completed. PCR was then performed according to the final amplification step of the library construction instructions: a PCR reaction system was prepared as required, and the reactants, including the DNA obtained by hybridization and elution, polymerase, substrate, PCR reaction buffer and Flowcell primers (primers designed according to the fixed sequences on the sequencing chip of the sequencer), were mixed thoroughly. The PCR program was: pre-denaturation at 94 °C for 2 min; 15 cycles of denaturation at 94 °C for 15 s, annealing at 58 °C for 30 s and extension at 72 °C for 30 s; and a final extension at 72 °C for 5 min. After PCR, the product was removed, centrifuged and purified with magnetic beads to obtain the target region library. The library concentration was measured with a Nanodrop ultraviolet spectrophotometer, and the library was then ready for on-machine sequencing.

3. Hiseq2000 high throughput sequencing

The qualified DNA libraries were sequenced on the machine according to the Hiseq2000 operating instructions. The data volume per sample was about 4.5G, for an average sequencing depth of 100X; however, because the efficiency of the capture chip cannot reach 100%, the final effective sequencing depth of the target region, as determined by analysis, was 30X-45X.

Example 3: detection of CNV, LOH and UPD

The general flow is shown in FIG. 3. After sequencing is completed, the raw output data are in fastq format. The filtered reads are then aligned to the reference genome (hg19, Build 37) with BWA to obtain an alignment file in SAM format; the SAM file is converted to a binary BAM file with samtools; the alignment result is de-duplicated and sorted; and the BAM file is then converted to PILEUP format, again with samtools. See the bioinformatics analysis strategy section below for details.

First, filtering and alignment of sequencing data

The sequencing data obtained from the Illumina Hiseq2000 in the above example were first subjected to simple filtering to remove reads contaminated with adapters, reads with an N content of more than 5%, and reads with an average quality value below Q20. The filtered data were then aligned to the human reference genome (hg19, Build 37) using the BWA alignment software, and the alignment results were output as alignment files in SAM (sequence alignment/map) format (SAM files for short). The SAM files were converted to binary BAM files using samtools, PCR-induced duplicates (PCR duplicates) were removed and the reads were sorted, and the results were then realigned and recalibrated using GATK.
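The N-content and quality criteria above can be expressed as a small per-read check. The sketch below is an illustration only: the function name is hypothetical, Phred+33 quality encoding is an assumption (the encoding is not stated in the text), and adapter detection is deliberately left out because the adapter sequences and matching rules are not given.

```python
def passes_filter(seq: str, quals: str, max_n_frac: float = 0.05,
                  min_mean_q: float = 20.0, phred_offset: int = 33) -> bool:
    """Return True if a read has at most 5% N bases and a mean quality of at least Q20.
    ASSUMPTION: quality characters use Phred+33 encoding."""
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    mean_q = sum(ord(ch) - phred_offset for ch in quals) / len(quals)
    return n_frac <= max_n_frac and mean_q >= min_mean_q

reads = [
    ("ACGT" * 5, "I" * 20),            # clean read, mean quality 40
    ("NN" + "A" * 18, "I" * 20),       # 10% N bases
    ("ACGT" * 5, "#" * 20),            # mean quality 2
]
print([passes_filter(seq, quals) for seq, quals in reads])
```

For paired-end data, one reasonable design choice is to discard a pair when either mate fails, so that the two fastq files stay synchronized.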

Second, calculating the corrected coverage depth coefficient r_i of each target region and the fragment heterozygosity R_het

The r_i of each target region and the fragment heterozygosity R_het are calculated from the information in the probe region file obtained after filtering and alignment. CNV is predicted from the r_i values using a t-test, and LOH and UPD are predicted from the R_het values using an F-test.

Third, analysis for detecting CNV, LOH and UPD

1. CNV detection

1.1 Calculating the depth coefficient (R_i) of each target region

The depth of target region i is calculated and denoted TD_i (Equation 1). To keep TD stable over several consecutive target regions, Equation 2 is used to correct TD_i, i.e., TD_i is corrected using the depths of the n' regions following the i-th region, giving TD_ai. Equations 3 and 4 are then used to normalize TD_ai, giving the depth coefficient R_i of each target region.

Equation 1: TD_i = T_ibase/T_ilen,

Equation 2:

equation 3:

equation 4:

T_ibase: the number of bases aligned to target region i; T_ilen: the length of target region i.
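The depth calculation can be sketched as follows. Only Equation 1 is taken from the text; the smoothing and normalization functions are assumptions standing in for Equations 2 to 4, whose exact form is not given here: smoothing is taken to be the mean of region i and up to n' following regions, and normalization is taken to be division by the mean smoothed depth over all regions. All function names are hypothetical.

```python
from typing import List

def region_depth(aligned_bases: int, region_len: int) -> float:
    """Equation 1: TD_i = T_ibase / T_ilen."""
    return aligned_bases / region_len

def smooth_depths(td: List[float], n_next: int) -> List[float]:
    """ASSUMED stand-in for Equation 2: average region i with up to n_next
    following regions to obtain TD_ai."""
    out = []
    for i in range(len(td)):
        window = td[i:i + n_next + 1]
        out.append(sum(window) / len(window))
    return out

def depth_coefficients(td_a: List[float]) -> List[float]:
    """ASSUMED stand-in for Equations 3 and 4: scale TD_ai by the mean smoothed
    depth of the sample so that R_i is comparable between samples."""
    mean_depth = sum(td_a) / len(td_a)
    return [x / mean_depth for x in td_a]

td = [region_depth(b, 100) for b in (1000, 2000, 3000)]
print(smooth_depths(td, 1), depth_coefficients(td))
```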

1.2 Creating a baseline using data from a plurality of reference samples (k = 24), and correcting R_i to obtain r_i

Because experimental conditions fluctuate from run to run and samples differ from one another, the efficiency of each capture fluctuates to some extent, which in turn causes R_i to fluctuate and is likely to produce false CNV signals. It is therefore very beneficial to create a uniform baseline from the fluctuations observed across a plurality of samples. FIG. 4 illustrates how creating a baseline facilitates detection: as shown in the figure, the distribution of the precursor R_i (preR_i) fluctuates greatly and R_i fluctuates relatively little, while r_i, obtained by correcting R_i against the baseline, fluctuates even less, making the occurrence of CNV more sensitive and easier to detect. In theory, for a target region in which no CNV has occurred, the R_i values of different samples follow a Poisson distribution and fluctuate fairly stably around a value characteristic of that region. To keep these characteristic values stable, the R_i values of the same region are examined across a plurality of samples, and their mean (mean R_i) is used in place of the characteristic value, so that a robust baseline is constructed for each target region. On the assumption that the R_i of each target region fluctuates around mean R_i, R_i is converted to r_i by means of mean R_i, so that r_i follows a normal distribution fluctuating around 1.
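A minimal sketch of the baseline correction follows. It assumes that the conversion is the ratio r_i = R_i / mean R_i, which is the simplest form consistent with r_i fluctuating around 1; the text does not spell out the formula, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence

def build_baseline(reference_R: Sequence[Sequence[float]]) -> List[float]:
    """mean R_i per target region, taken over the reference samples
    (one inner sequence of R_i values per reference sample)."""
    return [mean(region_values) for region_values in zip(*reference_R)]

def correct_with_baseline(R: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """ASSUMPTION: r_i = R_i / mean R_i, so that regions without CNV sit near 1."""
    if len(R) != len(baseline):
        raise ValueError("sample and baseline must cover the same regions")
    return [r / b for r, b in zip(R, baseline)]

references = [[1.0, 2.0, 0.5], [3.0, 2.0, 1.5]]   # two reference samples, three regions
baseline = build_baseline(references)
print(correct_with_baseline([2.0, 1.0, 1.0], baseline))
```

In the example of this document the baseline would be built from the k = 24 reference samples rather than the two toy samples used here.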

1.3 detection of CNV of target region

In theory, the r_i values of the same target region across a plurality of samples should follow a normal distribution. Therefore, when examining target region i of a sample, the r_i values of that region can be compared across a plurality of samples using a t-test, and the t statistic is calculated as follows to detect copy number variation in target region i of the sample.

In the formula, subscript 1 in each parameter denotes the target sample and subscript 2 denotes the plurality of reference samples; $\bar{r}_{i1}$ is the mean r_i of the n₁ samples to be tested; $\bar{r}_{i2}$ is the mean r_i of the n₂ reference samples; μ₁ is the theoretical mean r_i of all samples to be tested; μ₂ is the theoretical mean r_i of all reference samples; S₁ and S₂ are the standard deviations of the samples to be tested and of the reference samples, respectively; and df is the degrees of freedom, df = n₁+n₂-2.

When there is one sample to be tested, n₁ = 1 and the theoretical means of the sample to be tested and the reference samples are the same, so the formula simplifies as follows:

With the simplified formula, each target region has a t value for detecting CNV, from which a P value (confidence) is obtained; when the P value of a region is less than 0.05, the region is a region in which CNV has occurred.
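The statistic can be written out explicitly under one assumption: that the test is the standard pooled-variance two-sample t test, which is the form consistent with the stated degrees of freedom df = n₁+n₂-2 and with the parameters listed above. Here $\bar{r}_{i1}$ and $\bar{r}_{i2}$ denote the mean r_i of the samples to be tested and of the reference samples. The expressions below are a sketch under that assumption and may differ in form from the formula of the embodiment.

```latex
% General pooled-variance form
t = \frac{(\bar{r}_{i1}-\bar{r}_{i2})-(\mu_1-\mu_2)}
         {\sqrt{\dfrac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}
                \left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}},
\qquad df = n_1+n_2-2

% Single test sample: n_1 = 1 (so \bar{r}_{i1} = r_{i1} and the S_1 term vanishes), \mu_1 = \mu_2
t = \frac{r_{i1}-\bar{r}_{i2}}{S_2\sqrt{1+\dfrac{1}{n_2}}},
\qquad df = n_2-1
```

With the 24 reference samples used in this example, the simplified statistic would be referred to a t distribution with 23 degrees of freedom.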

1.4 detection of Large CNV

Based on the p value of the single-region t-test, a pseudo signal value is attached to each region to indicate whether the region is to be considered in the subsequent joining of CNV regions; target regions likely to share a consistent CNV are then joined into blocks along the chromosome in order to determine the final size and copy number of the CNV.

The labeling rule for the pseudo signal values is as follows. When the measured values of at least four consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (t is greater than 0 for all of them or less than 0 for all of them), and the P values of 3 of the regions are less than a first threshold (e.g., 0.05, a commonly used significance level) while that of the fourth does not exceed a second threshold (0.2, four times the first threshold), the four regions are labeled with the direction of deviation (e.g., + or -) and combined into one block; the number of consecutive same-direction regions and the first and second thresholds are adjustable. If the distance between one block and another does not exceed a span of 5 regions, the two blocks are combined into a larger block, and so on, until a final block is obtained. Following the formula in 1.3 above, the r value of the block is the mean of the r_i values of all the regions it contains; a t-test is performed on the r values of the block in the sample to be tested and in the reference samples, and the P value of the block is calculated. When the P value of the block is less than 0.05, the block is a CNV, which determines the boundaries and size of the block and hence of the large CNV.
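The labeling rule can be sketched as a sliding window over per-region t statistics and P values. This is an illustrative reading of the rule, with hypothetical names; it labels regions with +1, -1 or 0 and leaves block joining and the block-level t-test to later steps.

```python
from typing import List

def label_pseudo_signals(t_stats: List[float], p_values: List[float],
                         n_win: int = 4, p_first: float = 0.05,
                         p_second: float = 0.2) -> List[int]:
    """Label each region +1 (gain), -1 (loss) or 0 (not considered).
    A window of n_win consecutive regions is labeled when all t statistics share
    a sign, at least n_win - 1 P values are below p_first, and no P value
    exceeds p_second."""
    labels = [0] * len(t_stats)
    for i in range(len(t_stats) - n_win + 1):
        ts = t_stats[i:i + n_win]
        ps = p_values[i:i + n_win]
        same_direction = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        n_significant = sum(p < p_first for p in ps)
        if same_direction and n_significant >= n_win - 1 and all(p <= p_second for p in ps):
            sign = 1 if ts[0] > 0 else -1
            for j in range(i, i + n_win):
                labels[j] = sign
    return labels

t = [1.0, 2.0, 1.0, 3.0, -1.0, 1.0]
p = [0.01, 0.02, 0.15, 0.03, 0.50, 0.01]
print(label_pseudo_signals(t, p))
```

Because the windows overlap, a run longer than four regions is labeled as one contiguous stretch, which matches the "at least four consecutive target regions" wording.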

By analyzing the 15 target samples, we obtained CNV results that are highly consistent with the known validation results (SNP-array results), with no false positives and no false negatives; see Table 3. Furthermore, we simulated 8 sets of 30X whole-genome data, comprising 5 normal samples and 3 CNV-containing samples, and performed CNV detection on them with both our method and CONTRA, a previously reported target-region CNV prediction tool (Li J, Lupat R, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics. 2012 May 15;28(10):1307-13). The sensitivity and specificity of our method both reached 100%, the copy numbers were detected accurately, and CNVs could be detected and accurately located at a resolution of 500 Kb, whereas CONTRA had a sensitivity of 88.9% and a specificity of only 66.7% and did not report copy numbers, as shown in Table 4.

TABLE 3

TABLE 4

2. LOH detection

2.1 detection of heterozygous status in regions of the Whole genome

In a given region of the genome of the sample to be tested, SNP sites with an allele frequency (AF) of 0.1-0.9 in the 1000 Genomes data are identified, and the R_Het value of the fragment in which each SNP site is located is calculated, for both the 1000 Genomes data and the sample to be tested, according to the following formula. When region i of the sample to be tested is absolutely heterozygous, R_het = 1; when it is absolutely homozygous, R_het = 0.

R_het = MAF/(1-MAF), where MAF is the minor allele frequency.

In the sample to be tested, taking any SNP site m in a given region as the starting point, the following n consecutive SNP sites are taken to form a heterozygosity set S_m for that region, i.e., S_m = {R_het,sm, R_het,s(m+1), ..., R_het,s(m+n)}. In the same way, the SNP sites at the same positions are taken from the 1000 Genomes database to form a heterozygosity set P_m, i.e., P_m = {R_het,pm, R_het,p(m+1), ..., R_het,p(m+n)}. An F-test is then used to check whether the variances of the two heterozygosity sets are equal. Specifically, the variance of the heterozygosity set of the region in the sample to be tested and the variance of the heterozygosity set of the same region in the 1000 Genomes samples are calculated, and from them the p value of the heterozygosity set S_m of the region in the sample to be tested.

H₀:σ_s＝σ_p

H_A:σ_s≠σ_p

p＝p_upper+(1-p_under)

When p ≤ 0.01, we accept H_A and judge that the heterozygosity set S_m has lost heterozygosity relative to the population, i.e., LOH has occurred in the region in which S_m is located.

2.2 detection of Large LOH

Based on the results of 2.1 and following the large-CNV detection step in 1.4, 4 consecutive subsets in a loss-of-heterozygosity state are recorded as one minimum unit. If the span between two units does not exceed 2 subsets, the two units are combined into a larger unit, and so on, until they are joined into a block. An F-test is then performed on the R_Het values of the block in the sample to be tested and in the 1000 Genomes reference set, and the p value of the block is calculated; when p ≤ 0.01, the block is considered to be LOH, otherwise it is a non-LOH block.

Alternatively, the merging conditions can be made more stringent for more accurate detection. For example, to avoid false positives caused by random errors, only regions larger than 5M are regarded as possible true LOH. On this basis, the consecutive subsets with p ≤ 0.01 surrounding a subset with p ≤ 0.01 from 2.1 are merged, with the block fault tolerance set to 1 (i.e., the p value of one subset in the block is allowed to be greater than 0.01). Finally, the R_Het values in the merged region are subjected to the F-test again, and if the resulting p value is less than 0.01, the block is considered to be a true LOH.

3. UPD detection

And combining the CNV and LOH detection results of the whole genome, and carrying out UPD detection according to the Mendelian genetic rule.

If a DNA region is heterozygous in the 1000 Genomes data (R_Het = 1) but its heterozygosity disappears in the actual test (R_Het approaches 0), the region is judged to be LOH; if, in addition, CNV detection shows that the region has two copies (CN = 2), i.e., its copy number is unchanged (the samples of this example are diploid, and every region of a normal diploid genome has two copies), the region is judged to have uniparental disomy (UPD).
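The decision rule that combines the LOH and CNV results can be summarized in a few lines. The function name and the category labels other than "UPD" are illustrative choices, not terms defined in the text.

```python
def classify_region(has_loh: bool, copy_number: int, normal_copy_number: int = 2) -> str:
    """Combine per-region LOH and copy-number calls.
    UPD: heterozygosity is lost although the copy number is unchanged."""
    if has_loh and copy_number == normal_copy_number:
        return "UPD"
    if has_loh:
        return "LOH with copy-number change"
    if copy_number != normal_copy_number:
        return "CNV without LOH"
    return "normal"

print(classify_region(True, 2), "|", classify_region(True, 1))
```

The default of two copies reflects the diploid samples of this example; for regions whose normal copy number differs (for instance sex chromosomes in a male sample), `normal_copy_number` would need to be set accordingly.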

In 13 of the 15 samples, 10 LOH regions larger than 5M and 4 UPD regions larger than 5M were detected; the results are shown in Table 5. LOH and UPD were detected without a matched sample. (Detection usually compares diseased tissue with normal tissue, which are matched, i.e., related, samples; in this embodiment the target sample is instead compared with a set of reference samples that are unrelated to it.) The LOH results of 5M and above are consistent with the CNV results with CN = 1, so the CNV results can be used to verify the accuracy of the LOH results. The method of the invention therefore detects LOH and UPD with high accuracy, at a resolution on the order of 5M.

The Circos plot (fig. 5) shows the CNV, LOH and UPD measurements of the GM50275 sample in combination.

TABLE 5

Industrial applicability

The method for determining a probe sequence based on a reference sequence can be used effectively to determine probe sequences. The resulting probes are used to hybridize to and capture the genome, yielding a plurality of local genomic regions; these captured local regions are representative of the whole genome, reflect genome-wide variation information, and can be used to discover structural variation across the whole genome.

Although specific embodiments of the invention have been described in detail, those skilled in the art will appreciate that various modifications and substitutions of those details may be made in light of the overall teachings of the disclosure, and such changes are within the scope of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims

1. A method for determining a probe sequence based on a reference sequence, comprising:

(1) constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, each of the plurality of candidate probes corresponds to at least one discrete high-frequency SNP site, and the allele frequency of each of the plurality of discrete high-frequency SNP sites is at least 10 percent respectively;

(2) aligning the plurality of candidate probes in the first candidate probe set with a reference sequence to obtain an alignment result;

(3) performing a first screening on the first candidate probe set based on the alignment result to obtain a second candidate probe set consisting of a plurality of candidate probes,

wherein the first screening comprises retaining candidate probes that satisfy at least one of the following conditions:

a candidate probe that is uniquely aligned with the reference sequence;

candidate probes aligned to a plurality of positions of the reference sequence, and at least two of the plurality of positions each having a mismatch ratio of less than 10%;

(4) dividing the reference sequence into a plurality of windows with predetermined lengths respectively, and distributing a plurality of candidate probes in the second candidate probe set to the matched windows respectively so as to determine the position information of the candidate probes respectively;

(5) performing a second screening of the second candidate probe set based on the positional information and the allele frequencies of the discrete high frequency SNP sites to determine the probe sequences,

wherein the probe is determined according to the following steps:

(a) if a plurality of candidate probes are positioned in the same window, determining the candidate probe with the highest allele frequency of the corresponding discrete high-frequency SNP locus;

(b) if only one candidate probe in the same window has the highest allele frequency of the corresponding discrete high-frequency SNP site, selecting that candidate probe as the probe; and if a plurality of candidate probes in the same window share the highest allele frequency of the corresponding discrete high-frequency SNP site, selecting, from among those candidate probes, the candidate probe closest to the center of the window as the probe.
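As an illustration only, steps (4) and (5) of claim 1 can be sketched in Python as follows. The `Candidate` record, the function name `select_probes`, and the choice to assign a probe to the window containing its midpoint are assumptions made for this sketch; the claim itself does not specify how a probe is matched to a window.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str           # hypothetical probe identifier
    start: int          # 0-based start position on the reference sequence
    length: int         # probe length in bases
    allele_freq: float  # allele frequency of the probe's SNP site


def select_probes(candidates, window_len=10_000):
    """Keep one probe per window: the highest allele frequency wins, and ties
    are broken by distance from the probe midpoint to the window centre."""
    by_window = defaultdict(list)
    for c in candidates:
        midpoint = c.start + c.length // 2
        # Assumption: a probe belongs to the window that contains its midpoint.
        by_window[midpoint // window_len].append((c, midpoint))

    selected = {}
    for w, members in by_window.items():
        best_af = max(c.allele_freq for c, _ in members)
        top = [(c, mid) for c, mid in members if c.allele_freq == best_af]
        centre = w * window_len + window_len / 2
        selected[w] = min(top, key=lambda cm: abs(cm[1] - centre))[0]
    return selected


cands = [
    Candidate("a", 100, 100, 0.30),
    Candidate("b", 4950, 100, 0.45),
    Candidate("c", 9000, 100, 0.45),
    Candidate("d", 12000, 100, 0.20),
]
result = select_probes(cands)
```

In this toy input, probes "b" and "c" tie on allele frequency within the first 10 Kb window, so the one nearer the window centre is kept.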

2. The method of claim 1, wherein the allele frequency of each of the plurality of discrete high frequency SNP sites is no more than 90% respectively.

3. The method of claim 1, wherein the physical distance on the reference sequence between any two adjacent discrete high-frequency SNP sites in the plurality of discrete high-frequency SNP sites is not less than the length of the candidate probe.

4. The method of claim 1, wherein the candidate probe is 50-250 mers in length.

5. The method of claim 4, wherein the candidate probe is 100 mers in length.

6. The method of claim 1, wherein the candidate probe corresponds to one of the discrete high frequency SNP sites, and wherein the discrete high frequency SNP site corresponds to a mid-section of the candidate probe.

7. The method of claim 6, wherein the discrete high frequency SNP sites correspond to midpoints of the candidate probes.

8. The method of claim 1, wherein the candidate probe is excerpted from the reference sequence.
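Claims 5, 7 and 8 together describe a 100-mer cut from the reference sequence with the SNP site at its midpoint. A minimal sketch under those assumptions is given below; the function name `probe_around_snp` and the handling of SNPs too close to the sequence ends (returning `None`) are hypothetical choices, not taken from the claims.

```python
def probe_around_snp(reference, snp_pos, probe_len=100):
    """Return (start, sequence) of a probe cut from `reference` so that the
    0-based SNP position sits at the probe midpoint, or None if the probe
    would extend past either end of the reference."""
    start = snp_pos - probe_len // 2
    end = start + probe_len
    if start < 0 or end > len(reference):
        return None
    return start, reference[start:end]


reference = "ACGT" * 50           # toy 200-base reference
hit = probe_around_snp(reference, 100)
miss = probe_around_snp(reference, 10)
```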

9. The method of claim 1, wherein, prior to the alignment, the first candidate probe set is pre-screened based on at least one of the GC content and the number of single-base repeats of the candidate probes;

the prescreening includes retaining candidate probes that satisfy at least one of:

the GC content is 35 to 65 percent; and

the number of single-base repeats is less than 7.
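The two pre-screening conditions of claim 9 might be checked as sketched below. Interpreting "number of single-base repeats" as the length of the longest run of one repeated base is an assumption of this sketch, as are the function names; sequences are assumed to be non-empty.

```python
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_single_base_run(seq):
    """Length of the longest stretch of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def prescreen_flags(seq, gc_min=0.35, gc_max=0.65, max_run=7):
    """Return (gc_ok, run_ok); the caller decides whether one or both must hold."""
    gc_ok = gc_min <= gc_fraction(seq) <= gc_max
    run_ok = longest_single_base_run(seq) < max_run
    return gc_ok, run_ok


balanced = prescreen_flags("ACGT" * 25)
long_run = prescreen_flags("AAAAAAA" + "GCGCGCG")
```

Returning the two flags separately keeps the sketch neutral: the claim retains probes meeting at least one condition, while a stricter pipeline could require both.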

10. The method of claim 1, wherein in step (4), the reference sequence is divided into a plurality of windows each having the same predetermined length.

11. The method of claim 10, wherein the reference sequence is partitioned into a plurality of windows of 10Kb in length.

12. The method of claim 1, wherein after the second candidate probe set is subjected to the second screening, when the distance between two adjacent candidate probes in the second candidate probe set respectively falling into two adjacent windows on the reference genome is greater than the length of either of the two adjacent windows, the short tandem repeat sequence or a part of the short tandem repeat sequence located between the two adjacent candidate probes on the reference genome is further added to the second candidate probe set subjected to the second screening to form the probe sequence together.

13. The method of claim 1, wherein the reference sequence is a reference genome or a portion thereof.

14. A method of detecting a genomic structural variation comprising at least one of a chromosomal aneuploidy, a copy number variation, and an indel, for a non-diagnostic purpose, the method comprising,

(1) sequencing the genomic nucleic acid of the target sample to obtain a genomic sequencing result, the genomic sequencing result being comprised of a plurality of reads, wherein the sequencing comprises screening with a probe, wherein the probe is obtained by the method of any one of claims 1 to 13;

(2) dividing the reference genome into m regions, and calculating the depth of coverage TD_i of region i using the number of reads falling into region i, where m and i are natural numbers, i represents the number of the region, 1 ≤ i ≤ m, and 10 < m;

(3) determining whether region i has structural variation based on the degree of difference between the coverage depth of region i and the coverage depth of region i of k reference samples, where k is a natural number and k ≥ 2.
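Step (2) of claim 14 can be illustrated with a short sketch. It assumes equal-length regions, fixed-length reads, assignment of each read to the region containing its start position, and a depth estimate of read count times read length divided by region length; all of these are assumptions of the sketch and are not taken from the claims.

```python
def coverage_depth(read_starts, read_len, genome_len, region_len):
    """Estimate TD_i for each region as
    (reads starting in region i) * read_len / region_len.
    The last region may be shorter than region_len; this sketch ignores that."""
    m = -(-genome_len // region_len)  # ceiling division: number of regions
    counts = [0] * m
    for s in read_starts:
        counts[s // region_len] += 1
    return [c * read_len / region_len for c in counts]


depths = coverage_depth([0, 10, 999, 1000, 11500], read_len=100,
                        genome_len=12000, region_len=1000)
```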

15. The method of claim 14, wherein the depth of coverage of the region i is determined using the following equation:

or

Where i represents the number of the region.

16. The method of claim 14, wherein the test for the degree of difference between the depth of coverage of the genomic region i of the target sample and the depth of coverage of the region i of the k reference samples is performed by a t-test.

17. The method according to claim 14, wherein the degree of difference between the depth of coverage of region i and the depth of coverage of region i of the k reference samples is assessed by comparing the depth-of-coverage coefficients of genomic region i of the target sample and of the reference samples, wherein determining the depth-of-coverage coefficient R_i of region i comprises the steps of:

(a) performing a first correction on TD_i to obtain a first corrected coverage depth TD_ai, the first correction being performed by linear regression on the coverage depth values of 2n consecutive regions including region i, where n is a natural number and 10 < n ≤ m/2;

(b) normalizing TD_ai to obtain

thereby obtaining

18. The method of claim 17, wherein in step (a), the first corrected coverage depth TD_ai is determined based on the following formula:

where, for TD_j, j is a natural number and 1 ≤ j ≤ n.

19. The method of claim 18, wherein in step (b), TD_ai is normalized based on the following formula to obtain

20. The method of any one of claims 16 to 19, further comprising, after R_i of the target sample is obtained, performing a second correction on R_i to obtain r_i,

21. The method of any one of claims 16 to 19, further comprising, after R_i of the target sample is obtained, performing a second correction on R_i to obtain r_i,

22. The method of claim 20, wherein the t-test is performed such that the t-statistic for the genomic region i of the target sample is calculated as

wherein

represents the average value of r_i,y over the k reference samples, r_i,y is the second corrected depth-of-coverage coefficient of genomic region i of reference sample y,

s is the standard deviation of k reference samples,

23. The method of claim 22, wherein a significance level P_i is obtained based on the t_i value of genomic region i of the target sample; when P_i < 0.05, it is determined that structural variation exists in region i; otherwise, it is determined that structural variation does not exist in region i.

24. The method of claim 22, wherein a theoretical value t_i0 of t_i is obtained based on the t_i value of genomic region i of the target sample and a predetermined significance level P_i0; when t_i ≥ t_i0, it is determined that structural variation exists in region i; otherwise, it is determined that structural variation does not exist in region i; the predetermined P_i0 ≤ 0.05.
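The decision rule of claims 22 to 24 can be sketched as follows. Because the t-statistic formula of claim 22 is not reproduced in this text, the sketch substitutes a common form for comparing one observation with k reference values, t = (r_i - mean) / (s * sqrt(1 + 1/k)); this form, the function names, and the two-sided comparison with a caller-supplied critical value are all assumptions, not the patented formula. The reference values are assumed not to be all identical, so that s is non-zero.

```python
import math
import statistics


def t_statistic(target_r, reference_r):
    """t-like statistic of one target value against k >= 2 reference values.
    Assumed form: (target - mean) / (s * sqrt(1 + 1/k))."""
    k = len(reference_r)
    if k < 2:
        raise ValueError("at least two reference samples are required")
    mean = statistics.mean(reference_r)
    s = statistics.stdev(reference_r)  # sample standard deviation
    return (target_r - mean) / (s * math.sqrt(1 + 1 / k))


def has_variation(t_i, t_critical):
    """Two-sided comparison of t_i with a critical value supplied by the caller."""
    return abs(t_i) >= t_critical


t_low = t_statistic(0.5, [0.9, 1.0, 1.1, 1.0])
```

With four reference samples there are three degrees of freedom, for which the two-sided 5% critical value of Student's t distribution is about 3.182.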

25. The method according to any one of claims 14 to 19, wherein after step (3) is performed, W consecutive co-directional regions are merged to obtain a primary merged region; when two primary merged regions are co-directional and the span between them does not exceed L regions, the two primary merged regions are merged to obtain a secondary merged region; and structural variation of the secondary merged region is detected based on the degree of difference between the coverage depth of the secondary merged region of the target sample genome and the coverage depth of the corresponding regions of the plurality of reference sample genomes; wherein co-directional regions are regions whose t statistics are all greater than 0 or all less than 0, W and L are both natural numbers, W ≥ 2, and L - W ≤ 1.
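One possible reading of the two-stage merging in claim 25 is sketched below. Each region is encoded as +1 or -1 (a variant region with positive or negative t statistic) or 0 (no variation); this encoding, the function names, and the treatment of a primary merged region as a maximal run of at least W regions are assumptions of the sketch.

```python
def primary_regions(calls, w=2):
    """Maximal runs of at least `w` consecutive regions sharing a non-zero sign.
    Returns (start, end, sign) tuples with inclusive region indices."""
    runs, start = [], 0
    for i in range(1, len(calls) + 1):
        if i == len(calls) or calls[i] != calls[start]:
            if calls[start] != 0 and i - start >= w:
                runs.append((start, i - 1, calls[start]))
            start = i
    return runs


def secondary_regions(runs, max_gap=1):
    """Merge neighbouring primary runs of the same sign when at most
    `max_gap` regions lie between them."""
    merged = []
    for start, end, sign in runs:
        if merged and merged[-1][2] == sign and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], end, sign)
        else:
            merged.append((start, end, sign))
    return merged


calls = [1, 1, 0, 1, 1, 1, 0, 0, -1, -1, 1]
primary = primary_regions(calls, w=2)
secondary = secondary_regions(primary, max_gap=1)
```

The secondary merged regions would then be re-tested against the reference samples as a whole, as the claim describes.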

26. A method for detecting loss of heterozygosity for non-diagnostic purposes comprising,

(2) dividing a reference genome into m' regions; obtaining, based on information of the reads in the genome sequencing result that fall in region i and on data of population region i, the SNP sites shared by target sample genomic region i and population region i to form a shared SNP set; calculating, for the target sample genomic region i and for the population respectively, the heterozygosity of the fragments in which the SNP sites of the shared SNP set are located, to obtain a heterozygosity set U_i of target sample genomic region i and a heterozygosity set U_0i of population region i; and comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i; wherein the fragment in which an SNP site is located is the fragment whose boundary points are the two SNPs, upstream and downstream, adjacent to that SNP, m' and i are natural numbers, m' is not less than 1 and not less than 6.

27. The method of claim 26, wherein each SNP in the common set of SNPs has an allele frequency greater than 0.1.

28. The method according to claim 26, wherein the heterozygosity of the fragment containing the SNP site is represented by the sub-allele (minor allele) frequency coefficient R_het of the SNP site, R_het = MAF/(1-MAF), where MAF is the sub-allele frequency of the SNP.
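As a small worked example, the coefficient of claim 28 can be computed as below. The helper `maf_from_counts`, which estimates the MAF of a biallelic SNP in the target sample from read counts of its two alleles, is a hypothetical addition for illustration and is not described in the claims.

```python
def maf_from_counts(allele_a_reads, allele_b_reads):
    """Hypothetical helper: minor allele frequency of a biallelic SNP
    estimated from the read counts of its two alleles."""
    total = allele_a_reads + allele_b_reads
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return min(allele_a_reads, allele_b_reads) / total


def heterozygosity_coefficient(maf):
    """R_het = MAF / (1 - MAF) for a minor allele frequency 0 <= MAF <= 0.5."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    return maf / (1.0 - maf)


r_balanced = heterozygosity_coefficient(0.5)
r_skewed = heterozygosity_coefficient(maf_from_counts(30, 10))
```

R_het ranges from 0 (only one allele observed) to 1 (both alleles equally frequent), so values near 0 across a region are what one would expect where heterozygosity has been lost.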

29. The method of claim 28, wherein comparing U_i of the target sample with U_0i of the population to determine whether loss of heterozygosity has occurred in target sample region i comprises using an F-test to determine whether the variance of U_i and the variance of U_0i differ significantly; if the difference between the variances of U_i and U_0i is significant, it is determined that target sample region i has loss of heterozygosity; otherwise, it is determined that target sample region i has no loss of heterozygosity.

30. The method of claim 29, wherein the F-test comprises separately calculating the variances of U_i and U_0i, using the obtained variance of U_i of the target sample

and variance of U_0i of the population

to calculate two mutually reciprocal statistics F_upper and F_under, obtaining a significance level p_F from the reciprocal statistics, and comparing p_F with a predetermined significance level p_F0, the calculation formulas including

p_F = p_upper + (1 - p_under), where v is the index of an SNP in the high-frequency SNP set shared by target sample genomic region i and population region i, q is the number of SNPs in that shared high-frequency SNP set, R_het,i,v is the sub-allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of target sample genomic region i,

is the average value of the sub-allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of target sample genomic region i, R_het,i0,v is the sub-allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of population sample genomic region i,

is the average value of the sub-allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of population sample genomic region i, p_upper and p_under are obtained from F_upper and F_under respectively, and p_F0 ≤ 0.05.
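The variance-ratio part of claim 30 can be sketched as follows. The variance and F formulas are not reproduced in this text, so the sketch assumes the usual sample variance (denominator q - 1) and assumes that F_upper is the larger of the two reciprocal ratios; only the combination p_F = p_upper + (1 - p_under) is taken directly from the claim. Obtaining p_upper and p_under requires tail probabilities of an F distribution with (q - 1, q - 1) degrees of freedom, which this standard-library sketch leaves to the caller.

```python
import statistics


def f_statistics(u_target, u_population):
    """Return (F_upper, F_under), two mutually reciprocal variance ratios.
    Assumption: F_upper >= 1 >= F_under."""
    var_t = statistics.variance(u_target)      # sample variance of U_i
    var_p = statistics.variance(u_population)  # sample variance of U_0i
    if var_t == 0 or var_p == 0:
        raise ValueError("both variances must be non-zero to form the ratios")
    larger, smaller = max(var_t, var_p), min(var_t, var_p)
    return larger / smaller, smaller / larger


def combined_p(p_upper, p_under):
    """p_F = p_upper + (1 - p_under), as stated in claim 30."""
    return p_upper + (1.0 - p_under)


f_upper, f_under = f_statistics([0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 1.0, 1.0])
p_f = combined_p(0.01, 0.99)
```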

31. The method according to any one of claims 26 to 30, wherein after step (2), W' consecutive regions exhibiting loss of heterozygosity are merged to obtain a three-level merged region; when the span between two three-level merged regions does not exceed L' regions, the two three-level merged regions are merged to obtain a four-level merged region; a heterozygosity set of the four-level merged region of the target sample and a heterozygosity set of the same region of the population are respectively obtained, and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the four-level merged region of the target sample, wherein W' and L' are both natural numbers, W' ≥ 2, and W'/2 ≥ L'.

32. A method for detecting uniparental disomy (UPD), the method being used for non-diagnostic purposes, characterized in that when loss of heterozygosity is detected in a genomic region of a target sample, the copy number of the genomic region is calculated, and when the copy number of the genomic region is the same as that of the corresponding genomic region of a normal genome of the same species, the genomic region of the target sample is determined to exhibit uniparental disomy; the loss of heterozygosity in the genomic region of the target sample is determined by the method of any one of claims 26 to 30.
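The decision rule of claim 32 reduces to a two-condition check, sketched here. The function name, the returned labels, and the default normal copy number of 2 (a diploid autosomal region) are assumptions of the sketch.

```python
def classify_region(has_loh, copy_number, normal_copy_number=2):
    """Label a genomic region from its LOH call and its copy number."""
    if not has_loh:
        return "no LOH"
    if copy_number == normal_copy_number:
        return "UPD"  # heterozygosity lost while the copy number is normal
    return "LOH with copy-number change"


labels = [classify_region(True, 2), classify_region(True, 1),
          classify_region(False, 2)]
```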