WO2016000267A1 - Method for determining the sequence of a probe and method for detecting genomic structural variation - Google Patents

Method for determining the sequence of a probe and method for detecting genomic structural variation

Info

Publication number
WO2016000267A1
Authority
WO
WIPO (PCT)
Prior art keywords
region
probe
candidate
target sample
sample
Prior art date
Application number
PCT/CN2014/081686
Other languages
French (fr)
Chinese (zh)
Inventor
李剑
王煜
李尉
李金良
赵霞
陈仕平
张现东
刘赛军
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司
Priority to CN201480080426.0A (patent CN106715711B)
Priority to PCT/CN2014/081686 (WO2016000267A1)
Publication of WO2016000267A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The present invention relates to the field of genomics and bioinformatics, and in particular to methods for determining probe sequences and methods for detecting genomic structural variation.
  • Background art
  • CNV: DNA copy number variation
  • LOH: loss of heterozygosity
  • CNV is a common genomic structural variant involving fragments ranging from 1 kb to a few Mb, mainly characterized by submicroscopic deletions and duplications.
  • LOH refers to the loss of a gene on one chromosome of a chromosome pair while the paired chromosomes are still present, so that only homozygous SNPs appear across a long region of DNA.
  • UPD: uniparental disomy
  • CNV, LOH, and UPD are associated with many common genetic diseases, cancer, and other complex diseases. Establishing an accurate, comprehensive, efficient, fast, simple, and economical method for detecting CNV, LOH, and UPD is of great value for studying chromosomal variation events, identifying the etiology of related diseases, and adopting appropriate treatment options.
  • PCR technology includes real-time PCR and multiplex ligation-dependent probe amplification (MLPA).
  • MLPA: multiplex ligation-dependent probe amplification
  • Real-time fluorescent PCR analyzes one or several targets at a time.
  • MLPA can analyze more than 40 sequences with high sensitivity, but its detection range is limited to the chromosomes and regions targeted by the probes.
  • FISH is generally used to examine specific chromosomes and cannot detect unknown regions. Chip-based technology includes array-based comparative genomic hybridization (aCGH) and SNP-based arrays (SNP-array); aCGH can detect CNV genome-wide, but it cannot detect polyploidy and its miss rate for small fragments is high.
  • Sequencing technology includes whole genome sequencing (WGS), which detects genome-wide structural variation, and target region sequencing, which detects variation in the target region. There are four main methods for analyzing CNV, including paired-end mapping and read-depth analysis.
  • With sequencing technology, it is necessary to study means of discovering genomic structural abnormalities from sequencing results, especially local region sequencing results, including the discovery of chromosomal aneuploidy, CNV, insertion-deletion (indel), LOH, UPD and SNP.
  • One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the steps of: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, the first candidate probe set being composed of a plurality of candidate probes, each of which comprises at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; based on the alignment result, performing a first screening on the first candidate probe set to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length, and assigning the candidate probes in the second candidate probe set to their respective matching windows to determine the position information of each candidate probe; and performing a second screening on the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs to determine the probe sequence.
  • The discrete high-frequency SNP site has an allele frequency greater than 10%, preferably no more than 90%; its physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length; and the candidate probe length is 50-250 mer.
  • The probes obtained by the method for determining a probe sequence of the present invention are used to capture the genome by hybridization, yielding a plurality of local regions of the genome; the captured local regions can represent the whole genome and reflect genome-wide variation information, for use in discovering structural variation occurring anywhere in the genome.
  • Another aspect of the present invention provides a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation and insertion-deletion, comprising the steps of: sequencing the genomic nucleic acid of a target sample to obtain a genome sequencing result, the genome sequencing result being composed of a plurality of reads.
  • The sequencing may comprise capture using a probe, wherein the probe is obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention.
  • The genome sequencing result can be obtained by extracting genomic DNA and performing library construction and sequencing according to the instruction manual of an existing high-throughput platform; it can also be obtained by capturing the genome of the target sample with probes and sequencing the captured material, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention.
  • The reference genome is divided into m regions, and the coverage depth of the target sample genomic region i is calculated from the reads in the genome sequencing result that fall into region i, where m and i are natural numbers, 1 ≤ i ≤ m, 10 ≤ m. Based on the degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of k reference samples, the structural variation of the target sample region i is determined.
  • The coverage depth of region i of each reference sample can be obtained in the same way as the coverage depth of the target sample region i.
  • A further aspect of the invention provides a method for detecting another genomic structural variation, loss of heterozygosity, comprising the steps of: obtaining a genome sequencing result of a target sample; optionally, the genome sequencing result is obtained by capturing the genome of the target sample with probes and sequencing it, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention.
  • The reference genome is divided into m' regions. Based on the reads in the genome sequencing result that fall into region i and on the data of population region i, the SNP set shared by the target sample genomic region i and the population region i is obtained. The heterozygosity of the segment in which each SNP site of the shared SNP set is located is calculated for the target sample and for the population respectively, giving the heterozygosity set $U_i$ of the target sample genomic region i and the heterozygosity set $U_{i0}$ of the population region i. $U_i$ and $U_{i0}$ are compared to determine whether loss of heterozygosity occurs in the target sample region i.
  • The allele frequency of each SNP in the shared SNP set is greater than 0.1, and the segment in which an SNP site of the shared SNP set is located is bounded by the two SNPs immediately upstream and downstream of that SNP; m' and i are natural numbers, m' ≥ i ≥ 1, m' ≥ 6.
  • The number of samples taken should be sufficient to reflect the population, and can be determined according to the accuracy required for the detection, the statistical method, the distribution of the sample data, etc.
  • The population data is composed of data from multiple samples, which can be obtained by whole-genome sequencing, by the same method used to obtain the target sample data, or from a published database or website, such as the 1000 Genomes Project data.
  • A further aspect of the present invention provides a computer-readable storage medium for storing a program for execution by a computer. Those skilled in the art will understand that, when the program is executed, all or part of the steps of the above methods for detecting genomic structural variation can be completed by instructing the relevant hardware.
  • The storage medium may include: a read-only memory, a random access memory, a magnetic disk, or an optical disk.
  • apparatus for detecting genomic structural variation comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; a processor, coupled to the data input unit, the data output unit, and the storage unit, for executing an executable program stored in the storage unit, wherein the executing of the program includes completing all or part of the steps of the foregoing methods for detecting genomic structural variation.
  • The probes obtained by the method for determining a probe sequence based on a reference sequence of the present invention can be used, as probes or as a solid-phase/liquid-phase chip containing the probes, for target region capture sequencing, so that structural variation can be detected genome-wide at low sequencing cost, including CNV, LOH and UPD across the 23 pairs of human chromosomes. The detection resolution can be adjusted as needed by adjusting the average spacing of the probes, i.e. by increasing or decreasing the number of SNP sites.
  • By combining target region capture sequencing with bioinformatics analysis, the present invention achieves high-resolution, high-accuracy, high-throughput, low-cost detection of CNV, LOH and UPD across the genome. The method for detecting genomic structural variation of the present invention is also applicable to the detection of chromosomal aneuploidy, SNP and indel, and is suitable for structural variation analysis based on whole-genome sequencing data.
  • FIG. 1 is a schematic diagram showing the characteristics of the SeTR probes across the whole genome in one embodiment of the present invention: (A) length distribution of the SeTR probe sequences; (B) distribution of the physical distance between two SeTR probes.
  • Fig. 2 is a graph showing the test results of the SeTR probes in one embodiment of the present invention: (A) coverage depth distribution of the target regions; (B) distribution of reads supporting the ref base type and the non-ref base type.
  • Fig. 3 is a flow chart showing the detection of CNV, LOH and UPD in one embodiment of the present invention.
  • Fig. 4 is a diagram showing a reference line in an embodiment of the present invention.
  • Figure 5 is a schematic diagram showing the genomic structural variation detected in a sample (GM50275) in an embodiment of the present invention; the rings, from the outside to the inside, show I) chromosome information, II) change in value (wavy line), III) change in the P value corresponding to R_het, IV) change in the R_het value (points).
  • Detailed description of the invention
  • The terms “first” and “second” are used for descriptive purposes only, and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as “first” and “second” may explicitly or implicitly include one or more of the features. Further, in the description of the present invention, “multiple” means two or more unless otherwise stated.
  • a method for determining a probe sequence based on a reference sequence comprising the steps of:
  • Step 1 Construct the first candidate probe set
  • Each candidate probe in the first candidate probe set comprises at least one discrete high-frequency SNP site. The allele frequency of the discrete high-frequency SNP site is greater than 10%, its physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length, and the candidate probe length is 50-250 mer.
  • The discrete high-frequency SNPs are obtained from the 1000 Genomes Project data, and discrete high-frequency SNP sites with an allele frequency of less than 90% can be further selected using other published genomic data; the candidate probe length is set to 100 mer.
  • Each candidate probe comprises one discrete high-frequency SNP site, and the discrete high-frequency SNP site is located in the middle section of the candidate probe.
  • Each candidate probe contains only one high-frequency SNP site, and adjacent candidate probes may or may not overlap.
  • The "middle section" here is relative to the "front section" and the "back section" and can be understood in the usual way; for example, the first and last thirds of a sequence are defined as the "front section" and "back section" respectively.
  • The discrete high-frequency SNP site is located at the midpoint of the candidate probe. When a sequence contains 2n + 1 nucleotides, its midpoint is the position of the (n+1)-th nucleotide; when a sequence contains 2n nucleotides, its midpoint is the position of the n-th or (n+1)-th nucleotide. This enhances the probe's capture efficiency for the target discrete high-frequency SNP site.
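  • The construction of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the greedy left-to-right spacing rule, and the choice of the (n+1)-th nucleotide as the midpoint of an even-length probe are all hypothetical; only the thresholds (allele frequency above 10% and no more than 90%, spacing not less than the probe length, 100 mer probes) come from the text.

```python
def select_discrete_snps(snps, probe_len=100, min_af=0.10, max_af=0.90):
    """Keep high-frequency SNPs that are at least probe_len apart.

    snps: iterable of (position, allele_frequency) on one chromosome.
    Assumption: spacing is enforced greedily from left to right.
    """
    kept = []
    for pos, af in sorted(snps):
        if not (min_af < af <= max_af):
            continue  # not a high-frequency SNP under the stated thresholds
        if kept and pos - kept[-1][0] < probe_len:
            continue  # too close to the previously kept SNP
        kept.append((pos, af))
    return kept


def probe_interval(snp_pos, probe_len=100):
    """Return 1-based inclusive (start, end) of a probe centred on snp_pos.

    For a probe of 2n or 2n + 1 nucleotides the SNP is placed at the
    (n+1)-th nucleotide, one of the midpoints allowed by the text.
    """
    half = probe_len // 2
    start = snp_pos - half
    return start, start + probe_len - 1
```

  • With these definitions, a SNP at position 1000 yields the 100 mer probe interval 950-1049, in which the SNP is the 51st nucleotide.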
  • The first candidate probe set is pre-screened based on the GC content and/or single base repeatability of the candidate probe sequences in the first candidate probe set, and the candidate probes that pass are retained in the first candidate probe set.
  • Single base repeatability refers to the number of consecutive occurrences of one base type in a sequence. For example, in TGAAAAAAAAGC, A occurs 8 times consecutively, so the single base repeatability of A in this sequence is 8.
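  • The two pre-screening quantities can be computed as in the sketch below. The function names are hypothetical, and the sketch assumes a non-empty sequence; the text at this point does not state the cut-off values, so none are hard-coded here.

```python
def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_single_base_repeat(seq):
    """Length of the longest run of one repeated base in the sequence."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1  # extend or restart the run
        prev = base
        best = max(best, run)
    return best
```

  • For the example sequence TGAAAAAAAAGC, `max_single_base_repeat` returns 8 and `gc_content` returns 0.25.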
  • Step 2 Align the first candidate probe set with the reference sequence to obtain the comparison result
  • The reference sequence used is a known sequence, and may be any reference template, obtained in advance, for the biological category to which the target sample belongs.
  • For example, when the target sample is human, the reference sequence can be HG18 or HG19 provided by the National Center for Biotechnology Information (NCBI).
  • NCBI: National Center for Biotechnology Information
  • A resource library containing more reference sequences can also be configured in advance; before sequence alignment, a closer reference sequence is selected according to factors such as the sex, ethnicity and geographic origin of the target sample, which helps to obtain a more targeted probe sequence.
  • Step 3 Perform a first screening on the first candidate probe set to obtain a second candidate probe set
  • A candidate probe retained by the first screening needs to satisfy either of the following two conditions: 1) it aligns to a unique position of the reference genome; 2) it aligns to multiple positions of the reference sequence, and its mismatch ratio at at least two of those positions is less than 10%. For example, for a candidate probe length of 100 mer, 10 mismatched bases correspond to a mismatch ratio of 10%. When the mismatch ratio is low, the probe can be fully complementary to the target region, the capture effect is good and the specificity is high.
  • Step 4 Divide the reference sequence into multiple windows and assign the second candidate probe set to the respective matching windows
  • The reference sequence is divided into a plurality of windows of predetermined length, and, using the alignment result, the candidate probes in the second candidate probe set are assigned to their matching windows to obtain the position information of each candidate probe in its window.
  • The lengths of the windows may differ, and the windows may or may not overlap.
  • In one embodiment, the reference sequence is a reference genome, which is divided into windows of equal length; the window length is 10 kb, and adjacent windows adjoin but do not overlap.
  • Step 5 Perform a second screening of the second candidate probe set based on the position information and the allele frequency of the discrete high-frequency SNP to determine the probe sequence
  • The second screening comprises two steps: (a) if multiple candidate probes are located in the same window, determine the candidate probe or probes whose discrete high-frequency SNP has the highest allele frequency; (b) if only one candidate probe has the highest allele frequency, select it as the probe; if several candidate probes share the highest allele frequency, select from among them the candidate probe closest to the center of the window as the probe.
  • The distance of a candidate probe from the center of the window may be the distance between the midpoint of the candidate probe and the center of the window.
  • Keeping the target site as close as possible to the center of the probe sequence helps to improve the capture efficiency.
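  • Steps 4 and 5 can be sketched together as follows, assuming equal, adjoining, non-overlapping 10 kb windows on one chromosome. The function name, the dictionary keys, and the rule that a probe belongs to the window containing its midpoint are assumptions made for illustration.

```python
from collections import defaultdict


def pick_probes(candidates, window_len=10_000):
    """Pick one probe per window: highest allele frequency first,
    then the smallest distance between probe midpoint and window centre.

    candidates: list of dicts with 1-based inclusive 'start' and 'end'
    coordinates and the allele frequency 'af' of the probe's SNP.
    """
    by_window = defaultdict(list)
    for c in candidates:
        mid = (c["start"] + c["end"]) / 2
        by_window[int((mid - 1) // window_len)].append(c)

    chosen = []
    for w in sorted(by_window):
        # Window w covers positions w*window_len + 1 .. (w + 1)*window_len.
        center = w * window_len + (window_len + 1) / 2
        best_af = max(c["af"] for c in by_window[w])
        top = [c for c in by_window[w] if c["af"] == best_af]
        chosen.append(
            min(top, key=lambda c: abs((c["start"] + c["end"]) / 2 - center))
        )
    return chosen
```

  • When two candidates in a window tie on allele frequency, the one whose midpoint lies nearest the window center is returned, matching rule (b) above.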
  • In addition, a short tandem repeat sequence, or a portion of a short tandem repeat sequence, located between two adjacent candidate probes on the reference genome is further added.
  • The candidate probes remaining in the second candidate probe set after the second screening together form the probe sequences.
  • a method for detecting a genomic structural variation comprising at least one of chromosomal aneuploidy, copy number variation, and insertion deletion, comprising the steps of:
  • The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and performing library construction and on-machine sequencing according to the guidelines of an existing high-throughput platform, such as Illumina Hiseq2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads; or it can be obtained by capturing the genome of the target sample with probes and sequencing, where the probes can be designed by the method for determining a probe provided by one aspect of the present invention and then synthesized or prepared according to existing methods.
  • The reference genome is divided into m regions, and the coverage depth $TD_i$ of the target sample genomic region i is calculated from the reads in the sequencing result that fall into region i, where m and i are natural numbers, i is the region number, 1 ≤ i ≤ m, 10 ≤ m.
  • The coverage depth of region i is calculated as $TD_i = \frac{\text{number of reads falling into region } i \times \text{read length}}{\text{length of region } i} = \frac{\text{total number of bases in the reads of region } i}{\text{length of region } i}$, where i is the number of the region.
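  • The coverage depth calculation can be illustrated with the short sketch below. The function name and the rule that a read "falls into" a region when its aligned start position lies inside the region are assumptions; the text does not state how reads spanning a region boundary are treated.

```python
def coverage_depth(read_starts, read_len, region_start, region_end):
    """Coverage depth of one region: reads in region x read length / region length.

    read_starts: 1-based aligned start positions of the reads.
    Assumption: a read is counted for the region that contains its start.
    """
    n_reads = sum(region_start <= s <= region_end for s in read_starts)
    region_len = region_end - region_start + 1
    return n_reads * read_len / region_len
```

  • For example, three 91 bp reads starting inside a 10,000 bp region give a depth of 3 × 91 / 10,000 = 0.0273.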
  • The position of each read on the genome can be determined by sequence alignment; software such as SOAP (Short Oligonucleotide Analysis Package), bwa (Burrows-Wheeler Aligner), samtools and GATK (Genome Analysis Toolkit) can be used for the alignment.
  • The degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples.
  • Determining the coverage depth coefficient of the target sample genomic region i includes the following steps: (a) performing a first correction of the coverage depth of region i, the first correction using the coverage depths of 2n consecutive regions including region i,
  • where the coverage depth of the j-th region among n consecutive regions is used, j is a natural number, 1 ≤ j ≤ n;
  • the pair further includes
  • n y is a natural number indicating a reference sample number
  • R y is a reference depth coefficient of the reference sample y genomic region i.
  • the average of the degree coefficients, ai k+ l , y is a natural number indicating the reference sample number, and y is the coverage depth coefficient of the reference sample y genomic region i.
  • Correcting and averaging the intermediate values in this way reduces errors caused by fluctuations in experimental conditions, differences between samples and the like, so that the final coefficient reflects the true value, fluctuates only slightly around 1, and follows a normal distribution across multiple samples.
  • In the above embodiment, the first correction is performed and the first-corrected value is then normalized, which is equivalent to two averaging passes: before the mean coverage depth of the n consecutive regions including region i is used to represent the coverage depth of region i, the coverage depth of each of those n regions is itself represented by the mean coverage depth of the n consecutive regions that start with that region. This is equivalent to correcting the coverage depth of region i using the coverage depth values of 2n consecutive regions including the target region i, which stabilizes the coverage depth across consecutive regions.
  • Other correction or averaging procedures can also be used to stabilize the coverage depth values of adjacent regions, for example using the average coverage depth of several regions spaced apart from the target region to correct the coverage depth of the target region; such variants fall within the concept of the present invention.
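  • The idea of two averaging passes can be sketched as below. This is only an illustration of the principle: the exact formulas are not legible in the source, so the forward-looking window, its truncation at the end of the region list, and the function names are assumptions.

```python
def window_mean(values, n):
    """Replace each value by the mean of the n consecutive values that
    start at it (the window is truncated at the end of the list)."""
    out = []
    for i in range(len(values)):
        window = values[i:i + n]
        out.append(sum(window) / len(window))
    return out


def two_pass_smooth(depths, n):
    """Apply the n-region mean twice, so that each output value depends on
    a longer stretch of consecutive regions than a single pass would use."""
    return window_mean(window_mean(depths, n), n)
```

  • A constant depth profile is left unchanged by this smoothing, while isolated spikes are spread over neighbouring regions and damped.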
  • For the reference samples, reference may be made to the calculation of the coverage depth coefficient of the target sample genomic region i; the reference sample data may be computed in advance and stored, or obtained in parallel with the calculation for the target sample.
  • The degree of difference between the coverage depth of the target sample genomic region i and the coverage depths of region i of the k reference samples is determined by using a t-test to check whether the difference in coverage depth coefficients between the two is significant. In a specific embodiment of the invention, the t-test statistic of the target sample genomic region i is calculated from the average of the second-corrected coverage depth coefficients of genomic region i over the k reference samples, where y indexes the reference samples, and from S, the standard deviation over the k reference samples.
  • A significance level is then obtained.
  • The theoretical value $t_{i0}$ is obtained from the t-value table of the t-test for the predetermined significance level $P_{i0}$, with $P_{i0} \le 0.05$. When the t statistic of region i reaches $t_{i0}$, region i is determined to have undergone structural variation; otherwise, region i is determined not to have undergone structural variation.
  • W consecutive regions in the same direction are merged to obtain a primary merged region, and two primary merged regions are merged to obtain a secondary merged region when they are in the same direction and the span between them does not exceed L regions; the structural variation of the secondary merged region is then detected. Here, regions in the same direction are regions whose coverage depth t statistics are all greater than 0 or all less than 0, and W and L are both natural numbers, W ≥ 2, LW ⁇ l.
  • Merging can continue in the same way: two secondary merged regions are merged when they are in the same direction and their distance on the reference genome does not exceed L regions or L secondary merged regions.
  • Detecting structural variation of the secondary merged region is based on the difference between the coverage depth of the secondary merged region of the target sample genome and the coverage depths of the corresponding region on the reference sample genomes, in order to determine whether the secondary merged region has structural variation, or whether the structural variation occurring in region i spans W regions.
  • The coverage depth of the corresponding secondary merged region on the reference sample genomes, the t statistic of the coverage depth of the secondary merged region on the target sample genome, and the structural variation call are obtained in the same way as described above for the smaller region i.
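  • The two merging levels can be sketched as follows. The encoding of per-region calls as +1, -1 and 0, the reading of "W consecutive regions" as a run of at least W regions, the chaining of more than two primary regions, and the function names are all assumptions made for illustration.

```python
def primary_regions(calls, W):
    """Merge runs of at least W consecutive regions with the same direction.

    calls: per-region direction, +1 (t statistic > 0, variant),
    -1 (t statistic < 0, variant) or 0 (no variation called).
    Returns a list of (first_index, last_index, direction).
    """
    regions = []
    i = 0
    while i < len(calls):
        j = i
        while j < len(calls) and calls[j] == calls[i]:
            j += 1
        if calls[i] != 0 and j - i >= W:
            regions.append((i, j - 1, calls[i]))
        i = j
    return regions


def secondary_regions(primaries, L):
    """Merge neighbouring primary regions that share a direction and are
    separated by no more than L regions."""
    merged = []
    for start, end, sign in primaries:
        if merged and merged[-1][2] == sign and start - merged[-1][1] - 1 <= L:
            merged[-1] = (merged[-1][0], end, sign)
        else:
            merged.append((start, end, sign))
    return merged
```

  • Each secondary merged region returned here would then be re-tested against the reference samples, as described above for a single region i.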
  • a method suitable for detecting loss of heterozygosity in genomic structural variation comprising the steps of:
  • the genome sequencing result is composed of multiple reads
  • The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and performing library construction and on-machine sequencing according to the guidance manual of an existing high-throughput platform, such as Illumina Hiseq2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads; or it can be obtained by capturing the genome of the target sample with probes and sequencing, where the probes can be designed by the method for determining a probe provided by one aspect of the present invention and then synthesized or prepared according to existing methods.
  • The reference genome is divided into m' regions. Based on the reads in the sequencing result that fall into reference genome region i and on the data of population region i, the SNP set shared by the target sample genomic region i and the population region i is obtained. The heterozygosity of the segment in which each SNP site of the shared SNP set is located is calculated for the target sample and for the population respectively, giving the heterozygosity set $U_i$ of the target sample genomic region i and the heterozygosity set $U_{i0}$ of the population region i. $U_i$ and $U_{i0}$ are compared to determine whether loss of heterozygosity occurs in the target sample region i. The allele frequency of each SNP in the shared SNP set is greater than 0.1, and the segment in which an SNP site of the shared SNP set is located is bounded by the two SNPs immediately upstream and downstream of that SNP; m' and i are natural numbers, m' ≥ i ≥ 1, m' ≥ 6.
  • Comparing $U_i$ and $U_{i0}$ to determine whether loss of heterozygosity occurs in the target sample region i includes using an F test to check whether the variance of $U_i$ and the variance of $U_{i0}$ differ significantly. If the variances of $U_i$ and $U_{i0}$ differ significantly, it is determined that there is loss of heterozygosity in the target sample region i; otherwise, it is determined that there is no loss of heterozygosity in the target sample region i.
  • The F test includes separately calculating the variances of $U_i$ and $U_{i0}$, using the variance of the target sample $U_i$ and the variance of the population $U_{i0}$ to obtain two statistics that are reciprocals of each other, $F_{upper}$ and $F_{under}$, using these two statistics to obtain a significance level $p_F$, and comparing $p_F$ with a predetermined significance level $p_{F0}$, where $p_F \le p_{F0}$ indicates that the two variances differ significantly. In the calculation formulas:
  • v is the index of an SNP in the SNP set shared by the target sample genomic region i and the population region i;
  • q is the number of SNPs in the SNP set shared by the target sample genomic region i and the population region i;
  • $R_{het,i,v}$ is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for the target sample genomic region i, and $\bar{R}_{het,i}$ is the average of the minor allele frequency coefficients of the q SNPs in the shared SNP set for the target sample genomic region i;
  • $R_{het,i0,v}$ is the minor allele frequency coefficient of the v-th SNP in the shared SNP set for the population region i, and $\bar{R}_{het,i0}$ is the average of the minor allele frequency coefficients of the q SNPs in the shared SNP set for the population region i; $p_{upper}$ and $p_{under}$ are obtained from $F_{upper}$ and $F_{under}$ respectively; $p_{F0} \le 0.05$, and $p_{F0}$ can be set to a conventional value or adjusted according to known information, the required detection accuracy, and the like.
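  • The pair of reciprocal F statistics can be sketched as follows. The function name is hypothetical, and which of the two ratios the source calls "upper" is an assumption, since the original formulas are not legible. Because both sets contain the same q SNPs, the ratio is the same whether the variances are divided by q or by q − 1.

```python
import math
from statistics import variance


def f_statistics(target_het, population_het):
    """Return the variance ratio of two heterozygosity sets and its reciprocal.

    Assumption: the first value is var(target) / var(population) and the
    second is var(population) / var(target).
    """
    var_t = variance(target_het)
    var_p = variance(population_het)
    # A zero variance (e.g. a fully homozygous region) gives an infinite ratio.
    f_upper = var_t / var_p if var_p > 0 else math.inf
    f_under = var_p / var_t if var_t > 0 else math.inf
    return f_upper, f_under
```

  • The significance levels would then be read from the F distribution with (q − 1, q − 1) degrees of freedom, for example with `scipy.stats.f.sf`, which is not used here in order to keep the sketch dependency-free.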
  • In order to detect larger LOH, after step (2), W' consecutive regions with loss of heterozygosity are merged to obtain a tertiary merged region, and two tertiary merged regions are merged to obtain a quaternary merged region. The heterozygosity set of the quaternary merged region of the target sample and the heterozygosity set of the same region of the population are obtained respectively, and the two heterozygosity sets are compared to determine whether the quaternary merged region of the target sample has loss of heterozygosity, where W' and L' are both natural numbers, W' > 2, W'/2 ⁇ L'.
  • Merging can continue in the same way, the condition being that the distance between two quaternary merged regions on the reference genome does not exceed L' regions or L' tertiary merged regions.
  • A method for detecting uniparental disomy (UPD): when there is loss of heterozygosity in a genomic region of a target sample, the copy number of the region is calculated, and when that copy number is the same as the copy number of the region in a normal genome of the same species, it is determined that there is UPD in the genomic region of the target sample. Whether LOH is present in the genomic region can be determined by the LOH detection method of the aspect disclosed in the present invention.
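  • The UPD decision rule is simple enough to state as code. The function name and the default normal copy number of 2 (an autosomal region in a diploid genome) are assumptions for illustration.

```python
def has_upd(loh_detected, copy_number, normal_copy_number=2):
    """UPD is called when a region shows loss of heterozygosity while its
    copy number equals that of the same region in a normal genome."""
    return bool(loh_detected) and copy_number == normal_copy_number
```

  • A region with LOH and copy number 1 is therefore treated as a deletion rather than UPD, while LOH with the normal copy number is reported as UPD.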
  • An apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, connected to the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, wherein execution of the program includes completing all or part of the steps of the various methods in the above embodiments.
  • the specific probe design method and the structural mutation detection method according to the present invention will be described in detail below in conjunction with specific target individuals.
  • the name definitions or specific parameter settings involved in the following process are selected as:
  • Library construction and sequencing were performed according to the small-fragment library construction and sequencing instructions for the HiSeq 2000 platform.
  • the library size was 300-350 bp, sequencing was paired-end, and the read length was 91 bp (sequencing type PE91+8+91);
  • the reference genome or reference sequence selected for alignment is the human reference genome (hg19, Build 37).
  • Example 1 Chip Design, Preparation, Testing
  • SeTR The special probe
  • the probe sequences must be highly unique and stable, with low heterozygosity and moderate GC content (35% to 60%)
  • the selection process for the SeTR probe design, i.e. the target regions, is as follows:
  • the SeTR liquid-phase chip contains 278,800 probes with a total size of 41,795,106 bp, covering 1.45% of the effective whole genome (2.89 Gb).
  • the average length of the SeTR probes is 149 bp, and the average physical distance between adjacent probes is 10.6 kbp, as shown in Table 1 and Figure 1.
  • YH: the Yanhuang sample, Chinese genomic DNA
  • HG00537: a sample from the 1000 Genomes Project
  • GM50275: a human fibroblast sample obtained from the Coriell Institute for Medical Research
  • These samples were used to test the usability of the SeTR chip, to ensure that the probe chip could be used for subsequent detection studies. All three samples were sequenced after SeTR capture to obtain sequencing reads.
  • the remaining reads are clean reads
  • the clean reads were aligned to the reference sequence hg19
  • 98.13% to 99.29% of the reads aligned to the reference genome, and the proportion falling in the target region reached 67.43% to 67.87%.
  • 99.73% to 99.95% of the target region was covered by at least one read, and over 99% of the region was covered at least 10 times, as shown in Table 2; this performance is better than that of exome capture chips of the same type, such as the exome liquid chip produced by Roche NimbleGen.
  • the depth distribution of the target region is similar to the Poisson distribution as shown in Fig.
  • as shown in Fig. 4B, for most heterozygous sites in the target region, the number of reads supporting the non-reference allele is almost the same as the number of reads supporting the reference allele; that is, at a highly heterozygous site the reads derived from the two homologous chromosomes are comparable in number. This shows that the probes have no obvious capture bias toward one haplotype (commonly the reference base type, i.e. the ref type), and that the capture uniformity of the target region is good.
  • Samples: 15 target gDNA samples (human genomic DNA; sample numbers are shown in Table 3 below, and those beginning with "GM" are human fibroblasts) and 24 reference DNA samples.
  • Main reagents and instruments: PCR instrument, pipette, centrifuge, Thermomixer comfort, DNA shearing instrument, vortex shaker, magnetic stand, electrophoresis instrument, HiSeq 2000 sequencer, NanoDrop UV spectrophotometer, etc. Instruments for which no manufacturer is indicated are regular products available on the market.
  • Probe design and synthesis: following Example 1, a target region of approximately 41 Mb was selected from the human genome, and NimbleGen SeqCap EZ liquid-phase probes were custom-ordered from Roche. The probe set is able to capture the designed target region.
  • the DNA was fragmented to 200-250 bp.
  • the fragmented DNA was purified and checked by electrophoresis to confirm that the main band size met the requirement, i.e. 200-250 bp.
  • following the steps, reagents, and reaction conditions for constructing a paired-end index library, the fragmented DNA was end-repaired and purified, and an A base was added to the end-repaired product.
  • the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified using magnetic beads capable of complementary binding to the sequencing adapters.
  • a PCR reaction system was prepared, the adapter-ligated DNA fragments were amplified, the PCR product was purified with magnetic beads, and electrophoresis confirmed a main band of 300-350 bp for the amplified product; the amount of DNA was measured with a NanoDrop UV spectrophotometer, and the total amount needs to be greater than 1.0 μg.
  • PCR was then carried out: the reaction system was prepared as required by thoroughly mixing the DNA obtained by hybridization and elution, polymerase, substrate, PCR reaction buffer, and flowcell primers (primers designed from the fixed sequences on the sequencer's flowcell).
  • the PCR program was: pre-denaturation at 94 °C for 2 min; 15 cycles of denaturation at 94 °C for 15 s, annealing at 58 °C for 30 s, and extension at 72 °C for 30 s; and a final extension at 72 °C for 5 min.
  • the PCR product was taken out, centrifuged, and purified with magnetic beads to obtain the target-region library.
  • the concentration of the library was measured using a NanoDrop UV spectrophotometer, and the library was prepared for sequencing on the machine.
  • a quality-qualified DNA library was sequenced according to the Hiseq2000 operating instructions. The amount of data per sample is approximately
  • the Illumina HiSeq 2000 sequencing data from the above example first undergo simple filtering: reads contaminated by adapters, reads with an N proportion higher than 5%, and reads with an average quality value lower than Q20 are removed.
  • the BWA alignment software is then used to align the filtered data to the human reference genome (hg19, Build 37), producing alignment results in SAM (Sequence Alignment/Map) format (SAM files); the SAM files are then converted to binary BAM files using SAMtools, PCR duplicates are removed and the files sorted, and the alignments are realigned and recalibrated using GATK.
  • r_i is calculated from T_base_i and T_len_i, where T_base_i is the number of bases aligned to target region i and T_len_i is the length of target region i.
  • the r values of the same target region from multiple samples should follow a normal distribution, so when examining target region i of a sample, the r value of this region can be compared with those of multiple samples using the t test; the formula for the t statistic is as follows.
  • in the subscripts of the parameters in the formula, 1 represents the target sample and 2 represents the multiple reference samples; the means involved are the mean r of the reference samples and, theoretically, the mean r of all samples to be tested.
  • each target region thus has a t value for detecting CNV, from which a P value (confidence) is obtained; when P ≤ 0.05 for a region, that region is a region where CNV occurs.
  • a pseudo-signal value is attached to each region to indicate whether it is considered in the next step of connecting CNV regions; then, along the chromosome, target regions that may share a consistent CNV are connected into blocks to determine the final size and copy number of the CNV.
  • the marking rule for the pseudo-signal value is as follows: when the measured values of at least four consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (the t values are all greater than 0 or all less than 0), if the P values of three of the regions are smaller than the first threshold (such as 0.05, a commonly used significance level) and that of the fourth does not exceed the second threshold (0.2, four times the first threshold), then the four regions are marked with the direction of deviation (for example, + for larger and - for smaller) and merged into one block; the number of consecutive same-direction regions and the first and second thresholds are adjustable.
  • two blocks are merged into one larger block, and so on, until the final block is obtained; following the formula in 1.3 above, the r value of this block is the average of the r values of all the regions it contains, and a t test on the r values of the block for the sample to be tested and the reference samples gives the P value of the block.
  • when P ≤ 0.05 for the block, CNV occurs in this block; the boundary and size of the block are thereby determined, giving the boundary and size of the large CNV.
  • R_het = MAF / (1 - MAF), where MAF is the minor allele frequency.
  • following step 1.4 for detecting large CNVs, four consecutive subsets in the lost-heterozygosity state are recorded as a minimum unit. If two units are separated by no more than 2 subsets, they are merged into a larger unit, and so on, until they are finally connected into a block. The F test is then applied to the R_het values of the sample to be tested and of the 1000 Genomes reference set to calculate the p value of the block; when p ≤ 0.01, LOH is considered to occur in this block, and otherwise it is a non-LOH block.
  • a region of at least 5 Mb may be a true LOH.
  • the block fault tolerance is 1 (that is, the p value of one subset in the block is allowed to be greater than 0.01)
  • an additional F test is performed on the R_het values in the merged region. If the p value is less than 0.01, the block is considered to be a true LOH.
  • UPD (uniparental disomy)
  • the Circos diagram ( Figure 5) shows the CNV, LOH and UPD results of the GM50275 sample.
  • the method of the invention for determining a probe sequence based on a reference sequence can be used effectively to determine probe sequences; the resulting probes are used in hybridization capture to obtain a plurality of local regions of the genome, and these captured local regions can represent the whole genome and reflect genome-wide variation, so they can be used to discover structural variation across the entire genome.

Abstract

The present invention provides a method for determining the sequence of a probe based on a reference sequence and a method for detecting genomic structural variation. The method for determining the sequence of a probe based on a reference sequence comprises: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set is composed of a plurality of candidate probes, each of which contains at least one discrete high-frequency SNP; aligning the candidate probes of the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening of the first candidate probe set on the basis of the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe of the second candidate probe set to its matching window, so as to determine the positional information of each candidate probe; and performing a second screening of the second candidate probe set based on the positional information and the allele frequencies of the discrete high-frequency SNPs, in order to determine the probe sequence.

Description

Method for determining probe sequence and method for detecting genomic structural variation
Priority information
None
Technical field
The present invention relates to the technical field of genomics and bioinformatics, and in particular to a method for determining probe sequences and a method for detecting genomic structural variation.
Background
DNA copy number variation (CNV) and loss of heterozygosity (LOH) are different types of genomic variation. CNV is a common genomic structural variation involving fragments ranging from 1 kb to several Mb, and appears mainly as submicroscopic deletions and duplications. LOH means that a gene is lost from one chromosome of a chromosome pair while it is still present on the paired chromosome, so that only homozygous SNPs appear across a long stretch of DNA. When LOH occurs without a change in copy number, that is, when both copies are inherited from a single parent, it is called uniparental disomy (UPD). CNV, LOH, and UPD are associated with many common genetic diseases, cancers, and other complex diseases. Establishing an accurate, comprehensive, efficient, fast, simple, and economical method for detecting CNV, LOH, and UPD is of great value for studying chromosomal variation events, identifying the causes of related diseases, and choosing appropriate treatment.
Several detection techniques already exist. PCR-based techniques include real-time fluorescent quantitative PCR and multiplex ligation-dependent probe amplification (MLPA): real-time fluorescent PCR analyzes one or a few targets per assay, while MLPA can analyze more than 40 sequences at a time with high sensitivity, but its detection range is limited to the chromosomes and regions targeted by the probes. FISH is generally used to examine a few specific chromosomes and cannot detect unknown regions. Chip-based techniques include array-based comparative genomic hybridization (aCGH) and SNP-array techniques; aCGH can detect CNV genome-wide but cannot detect polyploidy and has a high miss rate for small deletions. Sequencing-based techniques detect genome-wide structural variation by whole genome sequencing (WGS), or variation in target regions by target-region sequencing, and analyze CNV by four main methods: paired-end mapping, read-depth analysis, split-read strategies, and sequence assembly comparison.
With the development of sequencing technology, it is necessary to develop means of discovering genomic structural abnormalities from sequencing results, especially sequencing results for local regions, including means of discovering chromosomal aneuploidy, CNV, insertions and deletions (indels), LOH, UPD, and SNPs.
Summary of the invention
One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the following steps: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, the first candidate probe set being composed of a plurality of candidate probes, each of which contains at least one discrete high-frequency SNP; aligning the candidate probes in the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening of the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe in the second candidate probe set to its matching window, to determine the positional information of each candidate probe; and performing a second screening of the second candidate probe set based on the positional information and the allele frequencies of the discrete high-frequency SNPs, in order to determine the probe sequence. Here, a discrete high-frequency SNP site is one whose allele frequency is greater than 10%, preferably not greater than 90%, and whose physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.
Probes obtained by the method of the present invention for determining a probe sequence are used for hybridization capture of the genome to obtain a plurality of local genomic regions; the captured local regions can represent the whole genome and reflect genome-wide variation information, and are used to discover structural variation occurring across the whole genome.
Another aspect of the present invention provides a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation, and insertions and deletions, comprising the following steps. The genomic nucleic acid of a target sample is sequenced to obtain a genome sequencing result composed of a plurality of reads; optionally, the sequencing includes screening with probes obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the present invention. The genome sequencing result can be obtained by extracting genomic DNA and performing library construction and sequencing according to the manual of an existing high-throughput platform; it can also be obtained by capturing the genome of the target sample with such probes and sequencing the captured fragments. The reference genome is divided into m regions, and the coverage depth TD_i of target sample genomic region i is calculated from the reads in the genome sequencing result that fall into region i, where m and i are natural numbers, 1 ≤ i ≤ m, and 10 < m. Based on the degree of difference between the coverage depth of target sample genomic region i and the coverage depths of region i in k reference samples, it is determined whether structural variation occurs in region i of the target sample, where k is a natural number and k ≥ 2; the coverage depth of region i of each reference sample can be obtained in the same way as that of the target sample. By merging adjacent regions in which structural variation occurs, it can further be detected whether a large structural variation occurs in the merged region, that is, whether the structural variation found in region i spans several regions.
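The depth comparison described above can be illustrated with a short sketch. It assumes that the coverage depth of a region is the number of aligned bases divided by the region length, and it uses one standard form of the t statistic for comparing a single observation with k reference values; the patent's own formula is not reproduced in this text, so both function names and the exact statistic are assumptions. Depth normalization between samples is not handled here.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def coverage_depth(bases_in_region: int, region_len: int) -> float:
    """Coverage depth of one region: aligned bases divided by region length."""
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    return bases_in_region / region_len


def t_statistic(target_depth: float, reference_depths: Sequence[float]) -> float:
    """t statistic for one target value against k reference values.

    Assumed form: (x - mean) / (s * sqrt(1 + 1/k)), with k - 1 degrees of
    freedom, where s is the sample standard deviation of the references.
    """
    k = len(reference_depths)
    if k < 2:
        raise ValueError("at least two reference samples are required (k >= 2)")
    ref_mean = mean(reference_depths)
    ref_sd = stdev(reference_depths)
    if ref_sd == 0:
        raise ValueError("reference depths have zero variance")
    return (target_depth - ref_mean) / (ref_sd * math.sqrt(1 + 1 / k))
```

A negative value suggests lower depth than the references (a possible deletion) and a positive value suggests higher depth (a possible duplication); the P value would then be looked up in a t distribution with k - 1 degrees of freedom.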
A further aspect of the present invention provides a method suitable for detecting another type of genomic structural variation, loss of heterozygosity, comprising the following steps. The genome sequencing result of a target sample is obtained; optionally, the genome sequencing result is obtained by capturing the genome of the target sample with probes and sequencing, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided in one aspect of the present invention. The genome is divided into m' regions. Based on the reads in the genome sequencing result that fall in region i and on the population data for region i, the SNP set shared by target sample genomic region i and population region i is obtained. The heterozygosity of the segment containing each SNP site in the shared SNP set is calculated for the target sample and for the population, giving the heterozygosity set of target sample genomic region i and the heterozygosity set of population region i, and the two sets are compared to determine whether loss of heterozygosity occurs in region i of the target sample. Here, the allele frequency of every SNP in the shared SNP set is greater than 0.1; the segment containing an SNP site in the shared SNP set is bounded by the two SNPs adjacent to it upstream and downstream; and m' and i are natural numbers, m' ≥ i ≥ 1, m' ≥ 6. The number of samples needed to represent the population faithfully can be determined according to the required detection accuracy, the statistical method, the distribution of the sample data, and so on. The population data consist of sample data from multiple individuals of the same species, and can be obtained by whole genome sequencing, by the same method used to obtain the target sample data, or from completed and published databases or websites, such as 1000 Genomes Project data.
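As a rough illustration of the shared SNP set and of the segment that surrounds each shared SNP, the sketch below assumes that SNPs are identified by position within one region, that the allele frequency filter (greater than 0.1) is applied to population frequencies, and that the region boundaries stand in for the missing neighbour at either end. These choices and the function names are hypothetical; the heterozygosity calculation itself is not shown.

```python
from typing import Dict, Iterable, List, Tuple


def shared_snp_set(target_positions: Iterable[int],
                   population_af: Dict[int, float],
                   min_af: float = 0.1) -> List[int]:
    """Positions seen in both target and population with frequency > min_af."""
    return sorted(p for p in set(target_positions)
                  if population_af.get(p, 0.0) > min_af)


def snp_segments(shared: List[int], region_start: int,
                 region_end: int) -> Dict[int, Tuple[int, int]]:
    """Map each shared SNP to the segment bounded by its neighbouring SNPs.

    Assumption: the first and last SNP use the region boundaries in place
    of the missing upstream or downstream neighbour.
    """
    segments = {}
    for j, pos in enumerate(shared):
        left = shared[j - 1] if j > 0 else region_start
        right = shared[j + 1] if j + 1 < len(shared) else region_end
        segments[pos] = (left, right)
    return segments
```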
A further aspect of the present invention provides a computer-readable storage medium for storing a program for execution by a computer. Those of ordinary skill in the art will understand that, when the program is executed, all or part of the steps of the above methods for detecting genomic structural variation can be completed by instructing the relevant hardware. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.
A final aspect of the present invention provides an apparatus for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor in data connection with the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, wherein execution of the program includes completing all or part of the steps of the above methods for detecting genomic structural variation.
With probes obtained by the method of the present invention for determining a probe sequence based on a reference sequence, target-region capture sequencing using the probes, or a solid-phase or liquid-phase chip containing them, makes it possible to detect structural variation genome-wide at low sequencing cost, including detection of CNV, LOH, and UPD across the 23 pairs of human chromosomes. The detection resolution can be adjusted as required by changing the average spacing of the probes, that is, by adding or removing SNP sites. Combining the target-region capture sequencing of the present invention with bioinformatic analysis achieves high-resolution, high-accuracy, high-throughput, low-cost detection of CNV, LOH, and UPD across the whole genome. The genomic structural variation detection method of the present invention is also applicable to the detection of chromosomal aneuploidy, SNPs, and indels, and to structural variation analysis based on whole genome sequencing data.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, will in part become apparent from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:
Fig. 1 is a schematic diagram of the genome-wide characteristics of the SeTR probes in one embodiment of the present invention: (A) length distribution of the SeTR probe sequences; (B) distribution of the physical distance between pairs of SeTR probes.
Fig. 2 shows the test results for the SeTR probes in one embodiment of the present invention: (A) coverage depth distribution of the target region; (B) distribution of reads supporting the ref base type and the non-ref base type.
Fig. 3 is a schematic flow chart of the detection of CNV, LOH, and UPD in one embodiment of the present invention.
Fig. 4 is a diagram of the reference line in one embodiment of the present invention.
Fig. 5 is a schematic diagram of the genomic structural variation detected in one sample (GM50275) in one embodiment of the present invention; the rings, from outside to inside, show I) chromosome information, II) change in r value (wavy line), III) change in the P value corresponding to R_het, and IV) change in R_het value (dots).
Detailed description of the invention
Embodiments of the present invention are described in detail below. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the invention and are not to be construed as limiting it.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of technical features indicated. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality" means two or more unless otherwise stated.
According to one embodiment of the present invention, a method for determining a probe sequence based on a reference sequence is provided, comprising the following steps:
Step 1: Construct the first candidate probe set
The first candidate probe set is constructed using discrete high-frequency SNP sites distributed across the genome. Each candidate probe in the first candidate probe set contains at least one discrete high-frequency SNP site. A discrete high-frequency SNP site is one whose allele frequency is greater than 10% and whose physical distance on the reference genome from any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.
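One way to picture the selection of discrete high-frequency SNP sites is the greedy scan sketched below. It is only an illustration under stated assumptions: SNPs are given as (chromosome, position, allele frequency) tuples, the upper bound of 90% is applied as a strict bound, and when two qualifying SNPs lie closer than one probe length the scan keeps the upstream one. The text defines the spacing condition but not how conflicts are resolved, so the greedy choice and the function name are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Snp = Tuple[str, int, float]  # (chromosome, position, allele frequency)


def select_discrete_high_freq_snps(snps: Iterable[Snp],
                                   probe_len: int = 100,
                                   min_af: float = 0.10,
                                   max_af: float = 0.90) -> List[Snp]:
    """Keep SNPs with min_af < AF < max_af that are at least probe_len apart."""
    selected: List[Snp] = []
    last_pos: Dict[str, int] = {}  # last kept position on each chromosome
    for chrom, pos, af in sorted(snps):
        if not (min_af < af < max_af):
            continue
        prev = last_pos.get(chrom)
        # Scanning in sorted order, checking the previous kept SNP is enough
        # to guarantee the spacing for every pair of kept SNPs.
        if prev is not None and pos - prev < probe_len:
            continue
        selected.append((chrom, pos, af))
        last_pos[chrom] = pos
    return selected
```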
In a specific embodiment of the present invention, the discrete high-frequency SNPs are obtained from 1000 Genomes Project data (they can also be obtained from other published genomic data); discrete high-frequency SNP sites with an allele frequency of less than 90% are further selected, and the candidate probe length is set to 100 mer.
In a specific embodiment of the present invention, each candidate probe contains one discrete high-frequency SNP site, located in the middle section of the candidate probe. Each candidate probe thus contains only one high-frequency SNP site, and adjacent candidate probes may or may not overlap. The "middle section" is defined relative to the "front section" and "rear section" and can be understood in the usual way: for a sequence, the upstream third and downstream third are the "front section" and "rear section", and the middle third is the "middle section". Further, the discrete high-frequency SNP site is located at the midpoint of the candidate probe. For a sequence of 2n+1 nucleotides the midpoint is the position of the (n+1)-th nucleotide, and for a sequence of 2n nucleotides the midpoint is the position of the n-th or (n+1)-th nucleotide. Placing the SNP there improves the efficiency with which the probe captures the target discrete high-frequency SNP site.
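The midpoint placement can be made concrete with a small helper that returns the probe interval on the reference for a given SNP position. This is a sketch assuming 1-based inclusive coordinates; for an even probe length it puts the SNP at the n-th of 2n nucleotides, which is one of the two positions the text allows. The function name is hypothetical.

```python
from typing import Tuple


def probe_interval(snp_pos: int, probe_len: int = 100) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) of a probe centred on snp_pos."""
    if probe_len < 1:
        raise ValueError("probe_len must be at least 1")
    upstream = (probe_len - 1) // 2  # bases of the probe before the SNP
    start = snp_pos - upstream
    if start < 1:
        raise ValueError("SNP is too close to the start of the sequence")
    end = start + probe_len - 1
    return start, end
```

For a 101 mer probe (2n+1 with n = 50) the SNP lands on nucleotide 51, and for a 100 mer probe (2n with n = 50) it lands on nucleotide 50.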
In a specific embodiment of the present invention, the first candidate probe set is pre-screened based on the GC content and/or single-base repeats of the candidate probe sequences, retaining the candidate probes with a GC content of 35%-65% and/or a single-base repeat degree of less than 7. The single-base repeat degree is the number of times one base type occurs consecutively in a sequence; for example, in TGAAAAAAAAGC the base A occurs 8 times in a row, so the A repeat degree of this sequence is 8. Sequences with a GC content that is too high or too low, or with high heterozygosity, readily affect PCR or hybridization capture and introduce GC bias and similar effects, which lowers capture specificity. The first candidate probe set retained by this pre-screening will not hybridize to such sequences, which removes the effect of GC bias or low-specificity capture on the results.
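The pre-screening criteria can be written as a short check. The sketch below applies both criteria together (GC content within 35%-65% and longest single-base run below 7); the text allows "and/or", so requiring both is an assumption, as are the function names.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_single_base_repeat(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    longest = run = 0
    previous = ""
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        previous = base
        longest = max(longest, run)
    return longest


def passes_prescreen(seq: str, gc_min: float = 0.35, gc_max: float = 0.65,
                     max_repeat: int = 7) -> bool:
    """True when GC content is in range and the longest run is below max_repeat."""
    return (gc_min <= gc_content(seq) <= gc_max
            and max_single_base_repeat(seq) < max_repeat)
```

The example sequence from the text, TGAAAAAAAAGC, has a repeat degree of 8 and a GC content of 25%, so it fails on both counts.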
Step 2: Align the first candidate probe set to the reference sequence to obtain an alignment result
The first candidate probe set is aligned to the reference sequence to obtain an alignment result, which gives the position of each candidate probe of the first set on the reference sequence. The reference sequence is a known sequence and may be any previously obtained reference template for the biological category to which the target sample belongs. For example, when the target sample is human, the reference sequence may be HG18 or HG19 provided by the National Center for Biotechnology Information (NCBI). Further, a repository containing more reference sequences may be configured in advance, and before alignment a closer reference sequence is selected according to factors such as the sex, ethnicity and geographic origin of the target sample, which helps to obtain more targeted probe sequences.
Step 3: Perform a first screening on the first candidate probe set to obtain a second candidate probe set
In one embodiment of the invention, a candidate probe retained by the first screening must satisfy either of the following two conditions: 1) it is a candidate probe of the first candidate probe set that aligns to a unique position in the reference genome; 2) it is a candidate probe of the first candidate probe set that aligns to multiple positions in the reference sequence and whose mismatch ratio at at least two of those positions is less than 10%. For example, for a candidate probe 100 nt long (a 100-mer), 10 mismatched bases correspond to a mismatch ratio of 10%. A probe with a low mismatch ratio pairs almost perfectly with the target region during hybridization, giving good capture performance and high specificity.
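The two retention conditions can be expressed as a small predicate. The sketch below is only an illustration under stated assumptions: the input format (one mismatch count per alignment position of a probe) and the function name are hypothetical, and the conditions are applied literally as worded above.

```python
def passes_first_screening(probe_length, hit_mismatches, max_ratio=0.10):
    """Decide whether a candidate probe survives the first screening.

    hit_mismatches holds one mismatch count per position at which the
    probe aligned to the reference (hypothetical input format).
    """
    # Condition 1: the probe aligns to exactly one position.
    if len(hit_mismatches) == 1:
        return True
    # Condition 2: several positions, and at least two of them have a
    # mismatch ratio strictly below the threshold.
    low_mismatch_hits = [m for m in hit_mismatches
                         if m / probe_length < max_ratio]
    return len(hit_mismatches) > 1 and len(low_mismatch_hits) >= 2
```

With a 100-mer, a hit carrying 10 mismatches has a ratio of exactly 10% and therefore does not count toward condition 2.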
Step 4: Divide the reference sequence into multiple windows and assign the second candidate probe set to the matching windows
The reference sequence is divided into multiple windows of predetermined length. Using the alignment, the candidate probes of the second candidate probe set are assigned to the windows they match, giving the position of each candidate probe within its window.
The windows of predetermined length may be of equal or unequal length and may or may not overlap. In one embodiment of the invention, the reference sequence is a reference genome, which is divided into windows of equal length; each window is 10 kb long, and adjacent windows are contiguous but do not overlap.
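With contiguous, non-overlapping 10 kb windows, window assignment reduces to an integer division. The following sketch assumes 0-based coordinates and assigns each probe by its midpoint; both choices, as well as the names and the tuple layout, are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_LEN = 10_000  # 10 kb windows, contiguous and non-overlapping


def window_index(position, window_len=WINDOW_LEN):
    """Index of the window that contains a 0-based reference position."""
    return position // window_len


def assign_to_windows(probes, window_len=WINDOW_LEN):
    """Group probes by window.

    probes: iterable of (probe_id, chrom, start, end), 0-based half-open.
    Returns {(chrom, window_index): [(probe_id, offset_in_window), ...]}.
    Assumption: a probe belongs to the window containing its midpoint.
    """
    windows = defaultdict(list)
    for probe_id, chrom, start, end in probes:
        midpoint = (start + end) // 2
        idx = window_index(midpoint, window_len)
        offset = midpoint - idx * window_len
        windows[(chrom, idx)].append((probe_id, offset))
    return dict(windows)
```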
Step 5: Perform a second screening on the second candidate probe set, based on the position information and the allele frequency of the discrete high-frequency SNP, to determine the probe sequences
In one embodiment of the invention, the second screening comprises two steps: (a) if multiple candidate probes are located in the same window, the candidate probe or probes whose discrete high-frequency SNP has the highest allele frequency are identified; (b) if there is only one such candidate probe, it is selected as the probe; if there are several, the one closest to the window center is selected as the probe. The distance between a candidate probe and the window center may be taken as the distance between the midpoint of the candidate probe and the center of the window. Keeping the target position as close as possible to the center of the probe sequence helps to improve capture efficiency.
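Steps (a) and (b) amount to a maximum over allele frequency followed by a tie-break on distance to the window center. A minimal sketch, assuming a hypothetical `Candidate` record and 0-based coordinates:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    probe_id: str
    midpoint: int        # probe midpoint on the reference (0-based)
    allele_freq: float   # allele frequency of the probe's SNP


def pick_probe(candidates, window_start, window_len=10_000):
    """Return the selected probe for one window, or None if the window is empty."""
    if not candidates:
        return None
    # Step (a): keep the candidates with the highest allele frequency.
    best_af = max(c.allele_freq for c in candidates)
    top = [c for c in candidates if c.allele_freq == best_af]
    # Step (b): among ties, take the one closest to the window center.
    center = window_start + window_len / 2
    return min(top, key=lambda c: abs(c.midpoint - center))
```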
In one embodiment of the invention, after the second screening, when two adjacent candidate probes of the second candidate probe set that fall into two adjacent windows are separated on the reference genome by a distance greater than the length of either of the two windows, a short tandem repeat located between the two adjacent candidate probes on the reference genome, or part of such a short tandem repeat, may optionally be added to the screened second candidate probe set; together they constitute the probe sequences. When probe sequences designed in this way are used for whole-genome capture, the captured regions are spaced relatively evenly, so that the set of captured regions better reflects the information of the whole genome.
According to another embodiment of the invention, a method for detecting genomic structural variation is provided, the genomic structural variation comprising at least one of chromosomal aneuploidy, copy number variation, and insertion/deletion. The method comprises the following steps:
(I) The genomic nucleic acid of the target sample is sequenced to obtain a genome sequencing result consisting of multiple reads. The sequencing result may be obtained by whole-genome sequencing, for example by extracting genomic DNA and carrying out library construction and sequencing according to the manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads. Alternatively, it may be obtained by capturing the genome of the target sample with probes and sequencing the captured fragments; the probes may be designed by the probe determination method provided in one aspect of the invention and then synthesized or prepared by existing methods.
(II) The reference genome is divided into m regions, and the coverage depth $TD_i$ of region i of the target sample genome is calculated from the reads in the sequencing result that fall into region i, where m and i are natural numbers, i is the region number, $1 \le i \le m$, and $m > 10$.
In one embodiment of the invention, the coverage depth of region i is calculated as $TD_i = \dfrac{\text{number of reads falling into region } i}{\text{length of region } i}$ or as $TD_i = \dfrac{\text{total number of bases in the reads falling into region } i}{\text{length of region } i}$, where i denotes the region number. The position of a read on the genome can be determined by sequence alignment using any of various alignment tools, such as SOAP (Short Oligonucleotide Analysis Package), BWA (Burrows-Wheeler Aligner), SAMtools, or GATK (Genome Analysis Toolkit).
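Both forms of the coverage depth can be computed in one pass over the aligned reads. The sketch below is an illustration only: it works on a single sequence with equal-length regions, takes reads as hypothetical `(start, length)` pairs, and assumes that a read "falls into" the region containing its start position, which the text does not specify.

```python
def coverage_depths(reads, region_len, n_regions, by_bases=True):
    """Coverage depth TD_i for each of n_regions equal-length regions.

    reads: iterable of (start, length) with a 0-based start position.
    by_bases=True  -> total bases of the reads in a region / region length
    by_bases=False -> number of reads in a region / region length
    Assumption: a read is counted in the region that contains its start.
    """
    totals = [0] * n_regions
    for start, length in reads:
        idx = start // region_len
        if 0 <= idx < n_regions:
            totals[idx] += length if by_bases else 1
    return [total / region_len for total in totals]
```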
(III) Whether structural variation has occurred in region i of the target sample is determined from the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i in k reference samples, where k is a natural number and $k \ge 2$.
In one embodiment of the invention, the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples. Determining the coverage depth coefficient $R_i$ of region i of the target sample genome comprises the following steps. (a) A first correction is applied to $TD_i$ to obtain $TD_{ai}$. The first correction is carried out by linear regression over the coverage depth values of 2n consecutive regions that include region i, where n is a natural number and $10 < n \le m/2$. In one embodiment of the invention, the first correction gives $TD_{ai} = \left(\sum_{j=1}^{n} TD_j\right)/n$, where $TD_j$ is the coverage depth of the j-th of n consecutive regions, j is a natural number, and $1 \le j \le n$. (b) After the first corrected coverage depth $TD_{ai}$ of region i has been obtained, $TD_{ai}$ is normalized to obtain $TD_{Ai}$, from which $R_i = TD_i / TD_{Ai}$ is obtained. In one embodiment of the invention, normalizing the first corrected coverage depth of region i gives $TD_{Ai} = \left(\sum_{j=1}^{n} TD_{aj}\right)/n$.
In one embodiment of the invention, after $R_i$ of the target sample has been obtained, a second correction is applied to $R_i$ to obtain $R_{ai}$ [formula image imgf000007_0001 not reproduced], where y is a natural number denoting the reference sample number and $R_{iy}$ is the coverage depth coefficient of genomic region i of reference sample y.
In another embodiment of the invention, after $R_i$ of the target sample has been obtained, a second correction is applied to $R_i$ to obtain $R_{ai}$ using $\bar{R}_i$, the mean of the coverage depth coefficients of genomic region i over the k reference samples and the one target sample, $\bar{R}_i = \left(R_i + \sum_{y=1}^{k} R_{iy}\right)/(k+1)$, where y is a natural number denoting the reference sample number and $R_{iy}$ is the coverage depth coefficient of genomic region i of reference sample y.
In the calculation of the coverage depth coefficient of region i of the target sample genome described above, the correction and normalization of the intermediate values reduce errors caused by fluctuations in experimental conditions and by differences between samples, so that the final corrected coefficient reflects the true coverage, fluctuates around 1 with a smaller amplitude than the uncorrected coefficient, and follows a normal distribution across multiple samples. In the embodiment above, applying the first correction to the coverage depth and then normalizing the corrected value amounts to averaging twice: before the mean coverage depth of n consecutive regions containing region i is taken to represent the coverage depth of region i, the coverage depth of each of those n regions is itself expressed as the mean coverage depth of the n consecutive regions that begin with that region. This is equivalent to correcting the coverage depth of region i with the coverage depth values of 2n consecutive regions containing the target region i, and it keeps the coverage depth of consecutive regions stable. It should be noted that those skilled in the art may use other correction or averaging procedures to keep the coverage depth values of neighboring regions stable, for example correcting the coverage depth of the target region with the mean coverage depth of several regions at a given spacing from it; all such variants fall within the concept of the invention. The coverage depth coefficient of genomic region i of a reference sample may be calculated in the same way as that of the target sample; the reference sample data may be calculated in advance and kept for later use, or calculated at the same time as the target sample data.
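The double averaging described above can be sketched as two moving means followed by a ratio. Several details are assumptions made only for this illustration and are not stated in the text: the first mean runs over the n regions beginning with region i, the second over the n regions ending with region i, windows are truncated at the ends of the sequence, the coefficient is taken as the raw depth divided by the smoothed depth, and the second correction is taken as a ratio to the mean over the k reference samples and the target sample. All function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def first_correction(td, n):
    """TD_a[i]: mean depth of the n consecutive regions that begin with region i."""
    m = len(td)
    return [_mean(td[i:min(i + n, m)]) for i in range(m)]


def normalise(td_a, n):
    """TD_A[i]: mean of TD_a over the n consecutive regions that end with region i."""
    return [_mean(td_a[max(0, i - n + 1):i + 1]) for i in range(len(td_a))]


def depth_coefficients(td, n):
    """R[i] = TD[i] / TD_A[i]; NaN where the smoothed depth is zero."""
    td_big_a = normalise(first_correction(td, n), n)
    return [t / a if a > 0 else float("nan") for t, a in zip(td, td_big_a)]


def second_correction(r_target, r_references):
    """Assumed form: R_a = R / mean(R over the k references and the target)."""
    r_bar = (r_target + sum(r_references)) / (len(r_references) + 1)
    return r_target / r_bar
```

With a perfectly flat depth profile every coefficient equals 1, and a region whose depth is doubled stands out with a coefficient well above 1 while its neighbors are only mildly affected.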
In one embodiment of the invention, the degree of difference between the coverage depth of region i of the target sample genome and the coverage depths of region i of the k reference samples is judged by using a t-test to determine whether their coverage depth coefficients differ significantly. In one embodiment of the invention, the t-test statistic $t_i$ of region i of the target sample genome is calculated from $R_{ai}$, $\bar{R}_{ai}$ and S, where $\bar{R}_{ai}$ is the mean of the $R_{ai}$ values of the k reference samples, $R_{aiy}$ is the second-corrected coverage depth coefficient of genomic region i of reference sample y, and S is the standard deviation of the k reference samples [formula images imgf000008_0001 and imgf000008_0002 not reproduced]. A significance level $p_i$ is obtained from the $t_i$ value of region i of the target sample genome; when $p_i \le 0.05$, region i is judged to have undergone structural variation, and otherwise it is judged not to have. In another embodiment of the invention, a theoretical t value $t_{i0}$ is obtained from the $t_i$ value of region i of the target sample genome and a predetermined significance level $p_{i0}$; when $t_i \ge t_{i0}$, region i is judged to have undergone structural variation, and otherwise not, with the predetermined $p_{i0} \le 0.05$. Once $p_{i0}$ has been set, the corresponding $t_{i0}$ can be looked up in a t-distribution table.
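The comparison of one target coefficient against k reference coefficients can be sketched as below. The exact expression for the statistic is not recoverable from the text, so this sketch substitutes a standard single-observation form, $t = (R_{target} - \bar{R}_{ref}) / (S\sqrt{1 + 1/k})$ with S the sample standard deviation and k-1 degrees of freedom; this form, the use of the absolute value, and the function names are all assumptions. The critical value is supplied by the caller from a t table, as in the second embodiment above.

```python
import math
import statistics


def t_statistic(r_target, r_references):
    """Assumed form: (target - reference mean) / (S * sqrt(1 + 1/k))."""
    k = len(r_references)
    if k < 2:
        raise ValueError("at least two reference samples are required")
    mean_ref = statistics.mean(r_references)
    s = statistics.stdev(r_references)  # sample SD (k - 1 in the denominator)
    if s == 0:
        raise ValueError("reference coefficients have zero variance")
    return (r_target - mean_ref) / (s * math.sqrt(1 + 1 / k))


def has_structural_variation(r_target, r_references, t_critical):
    """Call a variant when |t| reaches the tabulated critical value."""
    return abs(t_statistic(r_target, r_references)) >= t_critical
```

For example, with five reference samples there are 4 degrees of freedom, and the two-sided critical value at the 0.05 level is 2.776. The sign of the statistic indicates the direction of the change (positive for a gain in depth, negative for a loss).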
In one embodiment of the invention, to detect larger CNVs or insertions/deletions, after step (III) W consecutive regions with the same direction are merged to obtain a first-level merged region; two first-level merged regions are merged into a second-level merged region when they have the same direction and the span between them does not exceed L regions; and structural variation of the second-level merged region is then detected. Regions with the same direction are regions whose coverage depth t statistics are all greater than 0 or all less than 0; W and L are natural numbers, $W \ge 2$, and $L - W \le 1$. To detect still larger structural variations, the procedure can be repeated, for example by further merging second-level merged regions that satisfy the condition; the merging condition may similarly be that two second-level merged regions have the same direction and are separated on the reference genome by no more than L regions or L second-level merged regions.
In one embodiment of the invention, structural variation of the second-level merged region is detected by determining, from the degree of difference between the coverage depth of the second-level merged region in the target sample genome and the coverage depths of the corresponding region in the genomes of multiple reference samples, whether the second-level merged region has undergone structural variation, or in other words whether the structural variation occurring in region i spans W regions. The coverage depth of the corresponding second-level merged region in the reference sample genomes, the t statistic of the coverage depth of the second-level merged region in the target sample genome, and the structural variation call are obtained as described above for the relatively small region i.
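The two-level merging can be sketched as a run-length pass over the per-region t statistics followed by a gap-bridging pass. The interpretation used here is an assumption: W is treated as the minimum length of a same-direction run, the span between two runs is counted as the number of regions lying between them, and successive runs are chained rather than merged strictly pairwise. Names and the tuple layout are hypothetical.

```python
def _sign(t):
    return (t > 0) - (t < 0)


def first_level(t_stats, w):
    """Runs of at least w consecutive regions whose t statistics share a sign.

    Returns a list of (first_region, last_region, sign) tuples.
    """
    runs = []
    start = 0
    for i in range(1, len(t_stats) + 1):
        if i == len(t_stats) or _sign(t_stats[i]) != _sign(t_stats[start]):
            direction = _sign(t_stats[start])
            if direction != 0 and i - start >= w:
                runs.append((start, i - 1, direction))
            start = i
    return runs


def second_level(runs, max_gap):
    """Join neighbouring runs of the same sign separated by at most max_gap regions."""
    merged = []
    for run in runs:
        if (merged and merged[-1][2] == run[2]
                and run[0] - merged[-1][1] - 1 <= max_gap):
            merged[-1] = (merged[-1][0], run[1], run[2])
        else:
            merged.append(run)
    return merged
```

Each second-level merged region returned this way would then be tested as a whole, in the same manner as a single region.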
According to yet another embodiment of the invention, a method suitable for detecting loss of heterozygosity (LOH), a form of genomic structural variation, is provided, comprising the following steps:
(1) The genomic nucleic acid of the target sample is sequenced to obtain a genome sequencing result consisting of multiple reads. The sequencing result may be obtained by whole-genome sequencing, for example by extracting genomic DNA and carrying out library construction and sequencing according to the manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, to obtain reads. Alternatively, it may be obtained by capturing the genome of the target sample with probes and sequencing the captured fragments; the probes may be designed by the probe determination method provided in one aspect of the invention and then synthesized or prepared by existing methods.
(2) The reference genome is divided into m' regions. From the reads in the sequencing result that fall into region i of the reference genome and from the population data for region i, the set of SNPs shared by region i of the target sample genome and region i of the population is obtained. The heterozygosity of the fragment containing each SNP site of the shared SNP set is calculated separately for the target sample and for the population, giving the heterozygosity set $U_i$ of region i of the target sample genome and the heterozygosity set $U_{i0}$ of region i of the population. $U_i$ of the target sample and $U_{i0}$ of the population are compared to determine whether loss of heterozygosity has occurred in region i of the target sample. The allele frequency of every SNP in the shared SNP set is greater than 0.1; the fragment containing an SNP site of the shared SNP set is bounded by the two SNPs immediately upstream and downstream of that SNP; m' and i are natural numbers, $m' > i > 1$, and $m' \ge 6$.
In one embodiment of the invention, the heterozygosity of the fragment containing an SNP site is expressed by the minor allele frequency coefficient of that SNP site, $R_{het} = MAF/(1 - MAF)$, where MAF is the minor allele frequency of the high-frequency SNP.
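The coefficient is a simple odds-style transform of the minor allele frequency. In the sketch below, estimating the MAF of a site from the read counts supporting each allele is an assumption introduced for illustration; the text does not say how the MAF of the target sample is obtained. Function names are hypothetical.

```python
def minor_allele_frequency(ref_reads, alt_reads):
    """Assumed estimate: the smaller allele's share of the supporting reads."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(ref_reads, alt_reads) / total


def r_het(maf):
    """Minor allele frequency coefficient R_het = MAF / (1 - MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    return maf / (1.0 - maf)
```

A balanced heterozygous site (MAF 0.5) gives a coefficient of 1, while a site that has lost one allele (MAF 0) gives 0.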
In one embodiment of the invention, comparing $U_i$ of the target sample with $U_{i0}$ of the population to determine whether loss of heterozygosity has occurred in region i of the target sample includes using an F-test to determine whether the variance of $U_i$ and the variance of $U_{i0}$ differ significantly. If the variances differ significantly, loss of heterozygosity is judged to be present in region i of the target sample; otherwise, it is judged to be absent.
In one embodiment of the invention, the F-test comprises calculating the variances of $U_i$ and $U_{i0}$ separately; using the variance of $U_i$ of the target sample and the variance of $U_{i0}$ of the population to calculate two statistics that are reciprocals of each other, $F_{upper}$ and $F_{under}$; obtaining the significance level $p_F$ from these two statistics; and comparing $p_F$ with a predetermined significance level $p_{F0}$, where $p_F \le p_{F0}$ indicates that the two variances differ significantly. The calculation uses the following formulas [formula image imgf000010_0001 not reproduced], in which v is the index of an SNP in the SNP set shared by region i of the target sample genome and region i of the population; q is the number of SNPs in that shared set; $R_{het,i,v}$ is the minor allele frequency coefficient of the v-th SNP of the shared SNP set in region i of the target sample genome; $\bar{R}_{het,i}$ is the mean of the minor allele frequency coefficients of the q SNPs of the shared SNP set in region i of the target sample genome; $R_{het,i0,v}$ is the minor allele frequency coefficient of the v-th SNP of the shared SNP set in region i of the population sample genome; $\bar{R}_{het,i0}$ is the mean of the minor allele frequency coefficients of the q SNPs of the shared SNP set in region i of the population sample genome; $p_{upper}$ and $p_{under}$ are obtained from $F_{upper}$ and $F_{under}$ respectively; and $p_{F0} \le 0.05$. $p_{F0}$ may be set to a commonly used value or adjusted according to known information, the required detection accuracy, and so on.
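The variance comparison can be sketched as a two-sided variance-ratio test. The exact formulas are given only as an image in the source, so the following is an assumption-laden illustration: sample variances with q-1 in the denominator are used, $F_{upper}$ is taken as target variance over population variance and $F_{under}$ as its reciprocal, and the decision compares the larger of the two with a critical value that the caller looks up in an F table. Function names are hypothetical.

```python
import statistics


def f_statistics(u_target, u_population):
    """Return (F_upper, F_under), two variance ratios that are reciprocals."""
    var_target = statistics.variance(u_target)          # q - 1 denominator
    var_population = statistics.variance(u_population)
    if var_target == 0 or var_population == 0:
        raise ValueError("a zero variance makes the ratio undefined")
    return var_target / var_population, var_population / var_target


def variances_differ(u_target, u_population, f_critical):
    """Two-sided decision: the larger ratio must reach the critical value."""
    f_upper, f_under = f_statistics(u_target, u_population)
    return max(f_upper, f_under) >= f_critical
```

As a usage example, with five shared SNPs both variances have 4 degrees of freedom, and the upper 2.5% point of the F(4, 4) distribution is 9.60, which corresponds to a two-sided test at the 0.05 level.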
In one embodiment of the invention, to detect larger LOH, after step (2) W' consecutive regions showing loss of heterozygosity are merged to obtain a third-level merged region; two third-level merged regions are merged into a fourth-level merged region when the span between them does not exceed L' regions; the heterozygosity set of the fourth-level merged region of the target sample and the heterozygosity set of the same region of the population are obtained; and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the fourth-level merged region of the target sample, where W' and L' are natural numbers, $W' > 2$, and $W'/2 \ge L'$. In one embodiment of the invention, $W' \ge 4$. To detect LOH over still larger regions, the procedure can be repeated, for example by further merging fourth-level merged regions that satisfy the condition; the merging condition may similarly be that two fourth-level merged regions are separated on the reference genome by no more than L' regions or L' third-level merged regions.
According to yet another embodiment of the invention, a method for detecting uniparental disomy (UPD) is provided: when loss of heterozygosity is present in a genomic region of a target sample, the copy number of that region is calculated, and when it equals the copy number of the same region in a normal genome of the same species, UPD is judged to be present in that genomic region of the target sample. Whether LOH is present in a genomic region can be determined by the LOH detection method disclosed above in one aspect of the invention.
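The UPD rule combines the LOH call with the copy number of the same region. A minimal sketch, assuming a diploid species (normal copy number 2); the function name and the label for the third outcome are hypothetical and not taken from the text.

```python
def classify_region(has_loh, copy_number, normal_copy_number=2):
    """Combine an LOH call with the region's copy number.

    The "UPD" outcome follows the rule in the text; the other two labels
    are illustrative names only.
    """
    if not has_loh:
        return "no LOH"
    if copy_number == normal_copy_number:
        return "UPD"
    return "LOH with copy-number change"
```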
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
According to a final embodiment of the invention, a device for detecting genomic structural variation is also provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor in data connection with the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, execution of the program including carrying out all or some of the steps of the methods in the above embodiments.
The specific probe design method and structural variation detection method according to the invention, and the results of running them on specific target individuals, are described in detail below. The names and parameter settings used in the following procedures are:
1. The designed probes are called Selected Target Region Primers (SeTR);
2. Hereinafter, "coverage depth", "sequencing depth" and "depth" are used interchangeably, as are "region" and "target region";
3. Library construction and sequencing follow the small-fragment library construction instructions and the sequencing instructions provided for the HiSeq 2000 platform; the library size is 300-350 bp, sequencing is paired-end, and the read length is 91 bp (sequencing type PE91+8+91);
4. The reference genome or reference sequence used for alignment is the human reference genome (hg19, Build 37).
Where specific techniques or conditions are not indicated in the examples, the techniques or conditions described in the literature of the field (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions are followed. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products, obtainable for example from Illumina.
Example 1: Chip design, preparation and testing
In general, a high (>60%) or low (<35%) GC content and high heterozygosity tend to adversely affect a DNA fragment during PCR or probe capture. To avoid this, we designed special probes, which we call SeTR. The SeTR probes were designed according to the following principles: a) the probe sequences are highly unique and stable, with low heterozygosity and a moderate GC content (35%-60%); b) the probes contain discrete high-frequency SNP markers, with the allele frequency (AF) of each SNP satisfying 0.9 > AF > 0.1, so that genome-wide LOH can be detected more reliably; c) the final target regions are relatively evenly distributed.
The SeTR probe design, that is, the selection of target regions, proceeds as follows:
1) Based on the 1000 Genomes database (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/release), a candidate SNP set with allele frequencies (AF) of 10%-90% is selected; then, wherever two SNPs in the set are less than 100 bp apart, one of them is removed, giving the SNP maker1 set.
2) Taking each SNP of the SNP maker1 set as the midpoint, 50 bp of reference genome sequence is extracted on each side of it, giving a set of theoretical probe sequences of 100 bp, and this probe sequence set is aligned back to the reference sequence. If the best alignment of a probe sequence has no mismatches and its second-best alignment has fewer than 5% mismatches, the corresponding SNP is retained, giving the SNP maker2 set.
3 ) 基于 SNP maker2 集, 我们挑选出在参考基因组物理位置上均匀的 S P maker为最终的 S P maker集。在我们的研究中,我们选择了物理距离大约为 lOKbp 的 SNP maker集。  3) Based on the SNP maker2 set, we selected the S P maker that is uniform in the physical location of the reference genome as the final S P maker set. In our study, we chose a SNP maker set with a physical distance of approximately lOKbp.
4) 如果在最终的 SNP maker集中, 有两个临近的 S P 之间的距离大于了 lOKbp, 在选用他们之间的短重复序列 (short tandem repeat, STR) 来填补均匀。 设计完 SeTR探针后,我们委托 Roche来产出 SeTR液相芯片。 SeTR液相芯片含有 278800 个探针, 总大小为 41,795,106bp, 其覆盖了有效的全基因组 (2.89G) 1.45%的区域。 SeTR 探针平均长度达到了 149bp, 相邻两两探针之间的平均物理距离为 10.6kbp, 如表 1和图 1 所示。 表 1 SeTR探针在每条染色体上的分布 4) If in the final SNP maker set, there is a distance between two adjacent SPs greater than lOKbp, use a short tandem repeat (STR) between them to fill the uniformity. After designing the SeTR probe, we commissioned Roche to produce the SeTR liquid phase chip. The SeTR liquid phase chip contains 278,800 probes with a total size of 41,795,106 bp covering an effective whole genome (2.89G) 1.45% region. The average length of the SeTR probe was 149 bp, and the average physical distance between adjacent probes was 10.6 kbp, as shown in Table 1 and Figure 1. Table 1 Distribution of SeTR probes on each chromosome
[Table 1 is provided as an image in the original publication (imgf000012_0001).]
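The selection steps 1) to 3) above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used to build SeTR: the tuple layout, the function names and the greedy left-to-right thinning are assumptions, and the alignment-based uniqueness check of step 2) is not reproduced. The same thinning function is reused with a 100 bp gap (step 1) and a 10 kbp gap (step 3).

```python
from typing import List, Tuple

# Hypothetical record layout: (chromosome, 1-based position, allele frequency).
Snp = Tuple[str, int, float]


def filter_by_frequency(snps: List[Snp], lo: float = 0.1, hi: float = 0.9) -> List[Snp]:
    """Step 1: keep SNPs whose allele frequency satisfies lo < AF < hi."""
    return [s for s in snps if lo < s[2] < hi]


def thin_by_distance(snps: List[Snp], min_gap: int) -> List[Snp]:
    """Greedy thinning (an assumption): walk each chromosome in coordinate
    order and drop any SNP closer than min_gap bp to the last kept SNP."""
    kept: List[Snp] = []
    for snp in sorted(snps, key=lambda s: (s[0], s[1])):
        if kept and kept[-1][0] == snp[0] and snp[1] - kept[-1][1] < min_gap:
            continue
        kept.append(snp)
    return kept


def probe_window(snp: Snp, flank: int = 50) -> Tuple[str, int, int]:
    """Step 2: window of `flank` bp on each side of the SNP (inclusive
    coordinates). Whether the SNP base counts toward the 100 bp is not
    stated in the text; here the window spans 2 * flank + 1 positions."""
    chrom, pos, _ = snp
    return chrom, max(1, pos - flank), pos + flank


if __name__ == "__main__":
    demo = [("chr1", 100, 0.5), ("chr1", 150, 0.3), ("chr1", 400, 0.2), ("chr1", 900, 0.05)]
    marker1 = thin_by_distance(filter_by_frequency(demo), min_gap=100)
    print([probe_window(s) for s in marker1])
```

Calling `thin_by_distance(marker2, min_gap=10_000)` on the SNPs that survive step 2) would give a roughly 10 kbp spacing; gaps larger than that would still need to be filled as described in step 4).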
Three DNA samples that passed quality control were used to test the SeTR chip and confirm that it can be used in subsequent detection studies: YH (the Yanhuang sample, Chinese genomic DNA), HG00537 (a sample from the 1000 Genomes Project) and GM50275 (a human fibroblast sample obtained from the Coriell Institute for Medical Research). Libraries from all three samples were captured with SeTR and sequenced to obtain reads. Reads contaminated by adapters and low-quality reads (for example, reads with an average quality value below 20) were removed first; the remaining reads are called clean reads. The clean reads were aligned to the reference sequence hg19: 98.13%~99.29% of the reads aligned to the reference genome, and 67.43%~67.87% aligned to the target regions. In addition, 99.73%~99.95% of the target regions were covered by at least one read, and more than 99% were covered at least 10 times, as shown in Table 2. These figures are better than those of exome capture chips of the same type, such as the exome liquid-phase chip produced by Roche NimbleGen.
In addition, the depth distribution of the target regions resembles a Poisson distribution (Figure 4A). Figure 4B shows that, for the vast majority of highly heterozygous sites in the target regions, the number of reads supporting the non-reference allele is almost equal to the number supporting the reference allele; that is, the positive and negative read support obtained at a highly heterozygous site during alignment is comparable (the positive and negative reads come from the two homologous chromosomes, respectively). Together these results show that the probes have no obvious capture bias toward one haplotype (typically the reference allele, or ref type) and that capture across the target regions is fairly uniform.
Table 2. Alignment results for the three samples
[Table 2 is provided as an image in the original publication (imgf000013_0001); only its last row is recoverable as text.]
Fraction of target covered >=40X (%): 80.03, 79.01, 77.81
Example 2: Target region library construction and sequencing
1. Test materials, reagents and instruments
Samples: 15 target gDNA samples (human genomic DNA; sample numbers are listed in Table 3 below, and samples whose numbers begin with "GM" are human fibroblasts) and 24 reference DNA samples.
Main reagents and instruments: PCR instrument, pipettes, centrifuge, comfort-type thermomixer, DNA shearing instrument, vortex mixer, magnetic rack, electrophoresis apparatus, HiSeq 2000 sequencer, NanoDrop UV spectrophotometer, etc. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products.
Probe design and synthesis: as described in Example 1, target regions totaling about 41 Mb were selected across the whole human genome, and NimbleGen SeqCap EZ liquid-phase probes were custom-ordered from Roche. This probe set captures the designed target regions.
2. Library construction
1) Genomic DNA extraction
About 3-5 μg of genomic DNA was extracted from each target sample with the QIAGEN DNA extraction kit (DNA Mini Kit) according to the kit instructions and used in the subsequent experiments. An aliquot of the extracted DNA (30-100 ng) was run on a gel to check its integrity and degree of degradation.
2) Genomic DNA fragmentation and purification
Genomic DNA was sheared to 200-250 bp with a Covaris E210 instrument, operated according to the instrument instructions. The sheared DNA fragments were purified with the QIAquick PCR Purification Kit (250) according to the kit instructions, and electrophoresis was used to check that the main band was of the required size, 200-250 bp.
3) End repair, A-tailing, adapter ligation and pre-amplification
Following the steps, reagents and reaction conditions given in the paired-end index library construction instructions, the sheared and purified DNA fragments were end-repaired and purified; a single A base was added to both ends of the end-repaired fragments, and the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified with magnetic beads capable of binding complementarily to the sequencing adapters. A PCR reaction was prepared to amplify the adapter-ligated fragments, the PCR product was purified with magnetic beads, and electrophoresis was used to check that the main band of the amplified product was 300-350 bp. The amount of DNA was measured with a NanoDrop UV spectrophotometer; the total must exceed 1.0 μg.
4) SeTR probe hybridization, elution and amplification
Hybridization and elution followed the instructions of the commercially available NimbleGen SeqCap EZ hybridization and elution kit; the hybridization and elution reagents named in the instructions were purchased or prepared. Cot-1 DNA, the universal blocking sequence (Block), the index blocking sequence (index N Block) and the DNA sample from step 3) were added to a 1.5 mL centrifuge tube. The tube was centrifuged for 1 min and the contents were concentrated to dryness under vacuum at 60°C. Hybridization buffer and the other hybridization reagents were then added; the tube was vortexed and centrifuged, placed in a metal dry bath at 95°C for 10 min to denature the DNA, vortexed again and centrifuged at high speed. Then 4.5 μL of probe was added to the tube, and hybridization was carried out on a PCR instrument (47°C, 64-72 hours). Elution was performed after hybridization. PCR was then carried out following the final amplification step of the library construction instructions: the reaction was prepared as required by thoroughly mixing the DNA recovered from hybridization and elution with polymerase, substrates, PCR reaction buffer and flowcell primers (primers designed against the fixed sequences carried on the sequencer's flowcell). The PCR program was: initial denaturation at 94°C for 2 min; 15 cycles of denaturation at 94°C for 15 s, annealing at 58°C for 30 s and extension at 72°C for 30 s; and a final extension at 72°C for 5 min. After PCR, the product was collected, centrifuged and purified with magnetic beads to give the target region library. The library concentration was measured with a NanoDrop UV spectrophotometer in preparation for sequencing.
3. HiSeq 2000 high-throughput sequencing
DNA libraries that passed quality control were sequenced according to the HiSeq 2000 operating instructions. About 4.5 Gb of data was generated per sample, corresponding to an average sequencing depth of 100X; however, because the efficiency of a capture chip rarely reaches 100%, analysis showed that the effective sequencing depth of the final target regions was 30X~45X.
Example 3: Detection of CNV, LOH and UPD
The overall workflow is shown in Figure 3. After sequencing, the raw data are delivered in FASTQ format. The high-quality reads remaining after filtering are aligned to the reference genome (hg19, Build 37) with BWA to produce an alignment file in SAM format. samtools then converts the SAM file into a binary BAM file, and the alignments are deduplicated and sorted. samtools is then used again to convert the BAM file to PILEUP format. Details are given in the bioinformatics analysis sections below.
I. Filtering and alignment of sequencing data
The data produced by the Illumina HiSeq 2000 in the example above first undergo simple filtering: reads that are contaminated by adapters, that contain more than 5% N bases, or whose average quality value is below Q20 are removed. The filtered data are then aligned to the human reference genome (hg19, Build 37) with the BWA alignment software, which outputs the alignments in SAM (sequence alignment/map) format (the SAM file). samtools then converts the SAM file to a binary BAM file, removes PCR duplicates and sorts the alignments, and GATK is used to realign and recalibrate the alignment results.
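The three read-filtering criteria above (adapter contamination, more than 5% N bases, mean quality below Q20) can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the filter used in the study: the function names are hypothetical, quality strings are assumed to be Phred+33 encoded, and adapter contamination is approximated by an exact substring match.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes_filter(
    seq: str,
    quality: str,
    adapter: str = "",
    max_n_fraction: float = 0.05,
    min_mean_quality: float = 20.0,
) -> bool:
    """Return True if a read survives the three filters described above."""
    if not seq or len(seq) != len(quality):
        return False  # malformed record
    if adapter and adapter in seq:
        return False  # adapter contamination (exact match only, a simplification)
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return False  # more than 5% ambiguous bases
    return mean_phred(quality) >= min_mean_quality


if __name__ == "__main__":
    print(passes_filter("ACGT" * 5, "I" * 20))   # high-quality read
    print(passes_filter("ACGTN" * 4, "I" * 20))  # 20% N bases
```

Production pipelines normally allow mismatches when matching adapters and may trim rather than discard reads; those refinements are outside this sketch.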
II. Calculating the second-corrected coverage depth coefficient r_i and the fragment heterozygosity R_Het of the target regions
From the information in the probe region file obtained after the filtering and alignment above, the r_i value and the fragment heterozygosity R_Het are calculated for every target region. CNVs are predicted from the r_i values with a t-test, and LOH and UPD are predicted from R_Het with an F-test.
III. Detection of CNV, LOH and UPD
1. CNV detection
1.1 Calculating the depth coefficient of each target region
The depth of target region i is calculated and denoted TD_i (Equation 1). To keep TD stable across several consecutive target regions, TD_i is corrected with Equation 2, that is, by using the depths of the n' regions that follow region i, giving TDa_i. TDa_i is then normalized with Equations 3 and 4 to give the depth coefficient R_i of each target region.
Equation 1: $TD_i = T_i^{base} / T_i^{len}$
Equation 2: $TDa_i = \left( \sum_{j=i}^{i+n'} TD_j \right) / (n' + 1), \quad n' \ge 9$
[Equations 3 and 4 are illegible in the extracted text.]
Here $T_i^{base}$ is the number of bases aligned to target region i, and $T_i^{len}$ is the length of target region i.
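Equations 1 and 2 amount to a per-region mean depth followed by a forward moving average. The sketch below illustrates that reading; the function names are hypothetical, and the handling of the last regions of a chromosome (where fewer than n' following regions exist, so the window is simply truncated) is an assumption that the text does not address. The normalization of Equations 3 and 4 is not reproduced.

```python
from typing import List


def region_depth(aligned_bases: int, region_length: int) -> float:
    """Equation 1: depth of a target region = aligned bases / region length."""
    return aligned_bases / region_length


def smoothed_depth(td: List[float], n_prime: int = 9) -> List[float]:
    """Equation 2: average TD_i with the n' regions that follow it.

    Near the end of the list the window is truncated and the mean is taken
    over the regions that are available (an assumption of this sketch).
    """
    smoothed = []
    for i in range(len(td)):
        window = td[i : i + n_prime + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    depths = [region_depth(b, 100) for b in (3000, 3200, 2900, 3100)]
    print(smoothed_depth(depths, n_prime=1))
```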
1.2 Building a baseline from multiple reference samples (k=24) and correcting R_i to obtain r_i
Fluctuations in experimental conditions between runs and differences between the samples themselves cause the capture efficiency to vary from run to run, which in turn makes R_i fluctuate and can easily produce false CNV signals. A unified baseline is therefore built from the fluctuations observed across multiple samples. Figure 4 shows clearly how the baseline helps detection: the distribution of the precursor value (preR_i) fluctuates strongly, R_i fluctuates somewhat less, and r_i, obtained after baseline correction, fluctuates least, which makes detection more sensitive and CNVs easier to detect. In theory, in the absence of CNV, the R_i values of a given target region in different samples follow a Poisson distribution and fluctuate fairly stably around a value specific to that region. To keep this region-specific value stable, the R_i values of the same region are examined across multiple samples and their mean (mean R_i) is used in place of the region-specific value, giving each target region its own robust baseline. On the assumption that the R_i of each target region fluctuates around mean R_i, R_i is divided by mean R_i to give r_i, so that r_i follows a normal distribution fluctuating around 1.
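The baseline correction described in section 1.2 reduces to a per-region mean over the reference samples followed by a division. The sketch below shows that arithmetic under stated assumptions: the names are hypothetical, each sample is a list of R_i values in the same region order, and every region is assumed to have a non-zero baseline.

```python
from typing import List


def build_baseline(reference_R: List[List[float]]) -> List[float]:
    """Per-region mean of R_i across the reference samples (mean R_i)."""
    n_samples = len(reference_R)
    n_regions = len(reference_R[0])
    return [
        sum(sample[i] for sample in reference_R) / n_samples
        for i in range(n_regions)
    ]


def baseline_corrected(sample_R: List[float], baseline: List[float]) -> List[float]:
    """r_i = R_i / mean R_i, so that regions without CNV fluctuate around 1."""
    return [r / b for r, b in zip(sample_R, baseline)]


if __name__ == "__main__":
    refs = [[1.0, 2.0, 0.5], [3.0, 2.0, 0.5]]
    base = build_baseline(refs)
    print(baseline_corrected([1.0, 2.0, 0.25], base))
```

In the study the baseline was built from k=24 reference samples; the two-sample demo only keeps the example short.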
1.3 Detecting CNV in a target region
In theory, the r_i values of the same target region from multiple samples should follow a normal distribution. Target region i of a given sample can therefore be examined by comparing its r_i with the r_i values of that region in multiple samples using a t-test. The t statistic is calculated as follows:
[The equation for the t statistic is provided as an image in the original publication (imgf000016_0002).]
In the equation, subscript 1 denotes the target sample and subscript 2 denotes the reference samples. $\bar{r}_{i1}$ is the mean r_i of the $n_1$ samples under test, $\bar{r}_{i2}$ is the mean r_i of the $n_2$ reference samples, $\mu_1$ is the theoretical mean r_i of all samples under test, $\mu_2$ is the theoretical mean r_i of all reference samples, $S_1$ and $S_2$ are the standard deviations of the samples under test and the reference samples, respectively, and df is the number of degrees of freedom, $df = n_1 + n_2 - 2$.
When there is a single sample under test, that is, $n_1 = 1$, and the sample under test and the reference samples have the same theoretical mean, the equation above simplifies to:
[The simplified equation is provided as an image in the original publication (imgf000017_0001).]
With the simplified equation, every target region yields a t value for CNV detection and, from it, a P value. A region with P < 0.05 is called as a region in which a CNV has occurred.
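Because the equations in this section survive only as images, the sketch below uses the textbook pooled two-sample t statistic, which is consistent with the parameter definitions given above but is an assumption rather than a transcription of the patent's formula. With a single test sample the pooled variance reduces to the reference variance, giving t = (r_i1 - mean(r_i2)) / (S_2 * sqrt(1 + 1/n_2)) with n_2 - 1 degrees of freedom. The function name is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def single_sample_t(r_target: float, r_refs: List[float]) -> Tuple[float, int]:
    """t statistic and degrees of freedom for one test value against a set
    of reference values (pooled two-sample t-test with n1 = 1; assumed form)."""
    n2 = len(r_refs)
    s2 = stdev(r_refs)  # sample standard deviation of the references
    t = (r_target - mean(r_refs)) / (s2 * math.sqrt(1.0 + 1.0 / n2))
    return t, n2 - 1


if __name__ == "__main__":
    t, df = single_sample_t(0.5, [0.9, 1.0, 1.1, 1.0])
    print(t, df)
```

The two-sided P value would then be read from Student's t distribution with `df` degrees of freedom, for example with a statistics library; it is left out here to keep the sketch dependency-free.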
1.4 Detecting large CNVs
Based on the p value of the single-region t-test, each region is given a pseudo-signal value indicating whether it will be considered in the subsequent joining of CNV regions. Target regions that may share the same CNV are then joined into blocks along the chromosome, which determines the final size and copy number of each CNV.
The pseudo-signal values are assigned as follows. When the measured values of at least four consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (their t values are all greater than 0 or all less than 0), and three of the regions have P values below a first threshold (for example 0.05, the commonly used significance level) while the fourth does not exceed a second threshold (0.2, four times the first threshold), all four regions are marked with the direction of deviation (for example, + for higher and - for lower) and merged into one block. The number of consecutive same-direction regions and the values of the first and second thresholds are adjustable. If two blocks are separated by no more than 5 regions, they are merged into one larger block, and so on, until the final blocks are obtained. Following the method and equations of section 1.3, the r value of a block is the mean of the r_i values of all regions it contains; a t-test is applied to the block r values of the sample under test and the reference samples to calculate the block's P value. If P < 0.05, a CNV has occurred in the block; the boundaries and size of the block then give the boundaries and size of the large CNV.
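The marking and merging rules above can be sketched as two small functions. This is one possible reading of the rules, with hypothetical names: a sliding window of four regions is used to find seeds, "three regions below the first threshold" is read as "at least three", and only blocks with the same direction are merged across a gap (the text does not say whether direction must match). The final block-level t-test is not included.

```python
from typing import List, Tuple


def mark_regions(
    t: List[float], p: List[float], run: int = 4,
    p_first: float = 0.05, p_second: float = 0.2,
) -> List[int]:
    """Assign +1 / -1 / 0 to each region according to the pseudo-signal rule."""
    marks = [0] * len(t)
    for start in range(len(t) - run + 1):
        window = range(start, start + run)
        if all(t[i] > 0 for i in window):
            direction = 1
        elif all(t[i] < 0 for i in window):
            direction = -1
        else:
            continue  # regions do not deviate in the same direction
        strong = sum(1 for i in window if p[i] < p_first)
        if strong >= run - 1 and all(p[i] <= p_second for i in window):
            for i in window:
                marks[i] = direction
    return marks


def merge_blocks(marks: List[int], max_gap: int = 5) -> List[Tuple[int, int, int]]:
    """Join marked runs into (start, end, direction) blocks, merging
    same-direction blocks separated by at most max_gap unmarked regions."""
    blocks: List[Tuple[int, int, int]] = []
    i = 0
    while i < len(marks):
        if marks[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < len(marks) and marks[j + 1] == marks[i]:
            j += 1
        if blocks and blocks[-1][2] == marks[i] and i - blocks[-1][1] - 1 <= max_gap:
            blocks[-1] = (blocks[-1][0], j, marks[i])
        else:
            blocks.append((i, j, marks[i]))
        i = j + 1
    return blocks
```

Region indices in the returned blocks are zero-based and inclusive; mapping them back to genomic coordinates gives the candidate boundaries that the block-level test then confirms or rejects.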
Our CNV results for the 15 target samples agreed closely with known validation results (SNP-array results), with no false positives or false negatives (Table 3). We also simulated eight 30X whole-genome data sets, five from normal samples and three from samples containing CNVs, ran CNV detection on them, and compared our method with CONTRA, a published CNV prediction tool for exome regions (Li J, Lupat R, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics. 2012 May 15;28(10):1307-13). Our method reached 100% sensitivity and 100% specificity and determined each copy number accurately; it detects CNVs down to 500 kb and localizes them precisely. CONTRA had a sensitivity of 88.9% and a specificity of only 66.7% and did not report copy numbers (Table 4).
Table 3
[Tables provided as images in the original publication: imgf000017_0002, imgf000018_0002, imgf000018_0001.]
Table 4, surviving rows (CONTRA; sensitivity 88.9%, specificity 66.7%)
Sample  Chromosome  Start      End        Size (Mb), +/-   Copy number
1-5     Normal
6       chr20       15007645   15492763   0.49-            NA
        chr19       45003283   45496699   0.49+            NA
7       chr20       16009467   17992034   1.98-            NA
        chr19       15009149   15993267   0.98+            NA
        chr19       50000342   50998777   1-               NA
8       chr20       63704      9990568    9.93-            NA
        chr20       10007121   12995181   2.99-            NA
        chr20       19869958   35830028   (not legible)    NA
        chr20       42008751   42442770   0.43+            NA
        chr19       3430053    31917721   28.49+           NA
        chr19       35004532   35595304   0.59-            NA
Result: 8 true-positive CNVs, 3 false-positive CNVs
2. LOH detection
2.1 Detecting the heterozygosity state of regions across the genome
Within a region of the genome of the sample under test, the SNP sites whose allele frequency (AF) in the 1000 Genomes data is between 0.1 and 0.9 are identified, and the R_Het values of the region containing these SNP sites are calculated with the equation below for both the 1000 Genomes data and the sample under test. When region i of the sample under test is completely heterozygous, R_het = 1; when it is completely homozygous, R_het = 0.
$R_{het} = MAF / (1 - MAF)$, where MAF (minor allele frequency) is the frequency of the minor allele.
In the sample under test, starting from any SNP site m in a region, n consecutive SNP sites downstream are taken to form the heterozygosity set S_m of the region, $S_m = \{R_{het,s(m)}, R_{het,s(m+1)}, \ldots, R_{het,s(m+n)}\}$. In the same way, the SNP sites at the same positions in the 1000 Genomes database form the heterozygosity set P_m, $P_m = \{R_{het,p(m)}, R_{het,p(m+1)}, \ldots, R_{het,p(m+n)}\}$. An F-test is used to test whether the variances of the two heterozygosity sets are equal. Specifically, the variance $S_s^2$ of the heterozygosity set of the sample under test, the variance $S_p^2$ of the heterozygosity set of the 1000 Genomes samples for the same region, and the p value of the set S_m are calculated.
[The variance equations for $S_s^2$ and $S_p^2$ are provided as images in the original publication (imgf000019_0001, imgf000019_0002).]
$S_{max}^2 = \max\{S_s^2, S_p^2\}, \quad S_{min}^2 = \min\{S_s^2, S_p^2\}$
$H_0: \sigma_s = \sigma_p$
$H_A: \sigma_s \neq \sigma_p$ (provided as an image in the original publication, imgf000020_0001)
$F_{upper} = S_{max}^2 / S_{min}^2, \quad df_s = df_p = n - 1$
$F_{under} = S_{min}^2 / S_{max}^2, \quad df_s = df_p = n - 1$
$p = p_{upper} + (1 - p_{under})$
When p ≤ 0.01, we accept H_A and conclude that the heterozygosity set S_m has lost the heterozygosity seen in the population, that is, LOH has occurred in the region covered by S_m.
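The quantities entering the F-test of section 2.1 can be computed with the Python standard library. The sketch below is illustrative only: the function names are hypothetical, the variances are taken to be ordinary sample variances (the original variance equations are images), and a sample set with zero variance, which is what a fully homozygous stretch produces, is mapped to an infinite F_upper by assumption.

```python
import math
from statistics import variance
from typing import List, Tuple


def r_het(maf: float) -> float:
    """R_het = MAF / (1 - MAF); 1 at MAF = 0.5, 0 at MAF = 0."""
    return maf / (1.0 - maf)


def f_statistics(sample_set: List[float], population_set: List[float]) -> Tuple[float, float]:
    """Return (F_upper, F_under) = (S2_max / S2_min, S2_min / S2_max)."""
    s2_sample = variance(sample_set)
    s2_population = variance(population_set)
    s2_max = max(s2_sample, s2_population)
    s2_min = min(s2_sample, s2_population)
    if s2_max == 0.0:
        raise ValueError("both sets have zero variance; the F ratio is undefined")
    if s2_min == 0.0:
        return math.inf, 0.0  # e.g. a completely homozygous sample set
    return s2_max / s2_min, s2_min / s2_max


if __name__ == "__main__":
    sample = [r_het(m) for m in (0.0, 0.0, 0.02, 0.0)]
    population = [r_het(m) for m in (0.5, 0.3, 0.45, 0.2)]
    print(f_statistics(sample, population))
```

The p value of the text combines the tail probabilities of these two ratios under an F distribution with (n - 1, n - 1) degrees of freedom; computing those tails needs a statistics library and is not shown.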
2.2 Detecting large LOH regions
Using the results of section 2.1 and the same approach as for large CNVs in section 1.4, four consecutive subsets that have lost heterozygosity are recorded as one minimal unit. If two units are separated by no more than 2 subsets, they are merged into a larger unit, and so on, until the units are joined into blocks. An F-test is then applied to the R_Het values of the sample under test and the 1000 Genomes reference set to calculate the block's p value. If p ≤ 0.01, LOH is considered to have occurred in the block; otherwise the block is a non-LOH block.
Alternatively, stricter merging conditions can be set for more accurate detection. For example, to avoid false positives caused by random error, only regions larger than 5 Mb are accepted as possible true LOH. On this basis, with a block fault tolerance of 1 (one subset in the block is allowed a p value above 0.01), consecutive subsets with p ≤ 0.01 that lie near a subset with p ≤ 0.01 from section 2.1 are merged with it. Finally, an F-test is run once more on the R_Het values in the merged region; if its p value is below 0.01, the block is considered a true LOH.
3. UPD detection
UPD is detected by combining the genome-wide CNV and LOH results above with the rules of Mendelian inheritance.
If a DNA region is heterozygous in the 1000 Genomes data (R_Het = 1) but has lost heterozygosity in the sample under test (R_Het approaches 0), LOH is called for the region. If the CNV analysis of the same region shows two copies (CN=2), that is, the copy number is unchanged (the samples in this example are diploid, and every region of a normal diploid genome has two copies), uniparental disomy (UPD) is called for the region.
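The decision rule above combines two independent calls per region. A minimal sketch, with hypothetical names and labels, might look like this; it assumes a diploid sample and that the LOH call and the copy number have already been produced by sections 1 and 2.

```python
def classify_region(is_loh: bool, copy_number: int) -> str:
    """Combine an LOH call with a copy-number call for one region.

    Labels are illustrative: 'UPD' follows the rule in the text (LOH with
    an unchanged copy number of two in a diploid sample); the other labels
    are not defined in the source.
    """
    if not is_loh:
        return "no LOH"
    if copy_number == 2:
        return "UPD"
    return "LOH with copy-number change"


if __name__ == "__main__":
    print(classify_region(True, 2))
    print(classify_region(True, 1))
```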
In 13 of the 15 samples, 10 LOH regions larger than 5 Mb and 4 UPD regions larger than 5 Mb were detected (Table 5). LOH and UPD were detected without paired samples. (Paired samples are related samples, typically diseased tissue and normal tissue from the same individual; in this embodiment the target sample is instead compared with a set of reference samples that are unrelated to it.) The LOH results for regions larger than 5 Mb are consistent with the CNV results with CN=1, so the CNV results can be used to verify the accuracy of the LOH results. The method of the invention therefore detects LOH and UPD accurately, at a resolution of 5 Mb.
The Circos plot (Figure 5) summarizes the CNV, LOH and UPD results for sample GM50275.
Table 5
[Table 5 is provided as an image in the original publication (imgf000021_0001).]
Industrial applicability
The method of the invention for determining probe sequences based on a reference sequence can be used effectively to determine probe sequences. The resulting probes are used for hybridization capture of a genome to obtain multiple local genomic regions; the captured regions represent the whole genome and reflect genome-wide variation, and can be used to discover structural variation across the whole genome.
Although specific embodiments of the invention have been described in detail, those skilled in the art will understand that various modifications and substitutions can be made to those details in light of all the teachings disclosed, and that such changes fall within the scope of protection of the invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
In this specification, a description that refers to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims

1. A method for determining a probe sequence based on a reference sequence, characterized in that it comprises:
(1) constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set consists of a plurality of candidate probes, and each of the candidate probes contains at least one of the discrete high-frequency SNP sites;
(2) aligning the plurality of candidate probes in the first candidate probe set against the reference sequence to obtain an alignment result;
(3) performing a first screening of the first candidate probe set based on the alignment result, to obtain a second candidate probe set consisting of a plurality of candidate probes;
(4) dividing the reference sequence into a plurality of windows each having a predetermined length, and assigning each of the candidate probes in the second candidate probe set to its matching window, so as to determine position information for each of the candidate probes;
(5) performing a second screening of the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs, so as to determine the probe sequence.
2. The method according to claim 1, characterized in that the allele frequency of each of the plurality of discrete high-frequency SNP sites is at least 10%, and preferably no more than 90%.
3. The method according to claim 1, characterized in that the physical distance on the reference sequence between any two adjacent discrete high-frequency SNP sites among the plurality of discrete high-frequency SNP sites is not less than the length of the candidate probe.
4. The method according to claim 1, characterized in that the candidate probe has a length of 50 to 250 mer, preferably 100 mer.
5. The method according to claim 1, characterized in that the candidate probe contains one of the discrete high-frequency SNP sites, and the discrete high-frequency SNP site is located in the middle segment of the candidate sequence.
6. The method according to claim 5, characterized in that the discrete high-frequency SNP site is located at the midpoint of the candidate probe.
7. The method according to claim 1, characterized in that the candidate probe is excerpted from the reference sequence.
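By way of non-limiting illustration, excerpting a candidate probe from the reference sequence with the SNP at its midpoint (claims 5 to 7) might be sketched in Python as follows. The function name `probe_around_snp`, the 0-based coordinate convention, and the decision to discard probes that would run past the end of the reference are assumptions of this sketch, not features recited in the claims.

```python
from typing import Optional


def probe_around_snp(reference: str, snp_pos: int, probe_len: int = 100) -> Optional[str]:
    """Cut a probe of probe_len bases out of the reference so that the SNP
    at 0-based coordinate snp_pos sits at index probe_len // 2 of the probe.

    Returns None when the probe would extend past either end of the reference.
    """
    start = snp_pos - probe_len // 2
    end = start + probe_len
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


# Toy 200-base reference used only for illustration (not real genomic data).
reference = "ACGT" * 50
probe = probe_around_snp(reference, snp_pos=100)
```

With the default 100 mer length, the SNP base ends up at index 50 of the returned probe.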
8. The method according to claim 1, characterized in that, before the alignment is performed, the first candidate probe set is pre-screened based on at least one of the GC content and the single-base repeat count of the candidate probes.
9. The method according to claim 8, characterized in that the pre-screening comprises retaining candidate probes that satisfy at least one of the following:
a GC content of 35% to 65%; and
a single-base repeat count of less than 7.
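As a hedged illustration only, the pre-screening on GC content and single-base repeats might look like the following Python sketch. The helper names are hypothetical, "single-base repeat count" is interpreted here as the longest run of one repeated base, and the sketch requires both conditions together, which is stricter than the "at least one of" wording of the claim.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_single_base_run(seq: str) -> int:
    """Length of the longest run of one repeated base in a non-empty sequence."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def passes_prescreen(seq: str, gc_min: float = 0.35, gc_max: float = 0.65,
                     max_run: int = 7) -> bool:
    """Keep a probe whose GC content lies in [gc_min, gc_max] and whose
    longest single-base run is shorter than max_run (assumed interpretation)."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_single_base_run(seq) < max_run)
```

A probe such as `"ACGTACGT"` passes, while one containing seven consecutive `A` bases, or one with no G or C at all, is dropped.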
10. The method according to claim 1, characterized in that the first screening comprises retaining candidate probes that satisfy at least one of the following conditions:
candidate probes that align uniquely to the reference sequence;
candidate probes that align to a plurality of positions on the reference sequence, where the mismatch ratio at each of at least two of those positions is less than 10%.
11. The method according to claim 1, characterized in that, in step (4), the reference sequence is divided into a plurality of windows each having the same predetermined length.
12. The method according to claim 11, characterized in that the reference sequence is divided into a plurality of windows each 10 kb in length.
13. The method according to claim 1, characterized in that, in step (5), the probe is determined by the following steps:
(a) if a plurality of the candidate probes are located in the same window, identifying the candidate probe whose discrete high-frequency SNP has the highest allele frequency;
(b) if there is only one candidate probe whose discrete high-frequency SNP has the highest allele frequency, selecting that candidate probe as the probe; if there are a plurality of such candidate probes, selecting, from among them, the candidate probe closest to the center of the window as the probe.
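The window assignment and per-window selection described in claims 11 to 13 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Candidate` record, the use of the probe midpoint as its position, and the tie-breaking by distance to the window center computed from 0-based coordinates are all choices made for the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    position: int       # 0-based midpoint of the probe on the reference (assumed)
    allele_freq: float  # allele frequency of the SNP carried by the probe


def select_probes(candidates: Iterable[Candidate],
                  window_size: int = 10_000) -> Dict[int, Candidate]:
    """Pick one probe per fixed-length window: highest allele frequency first,
    then the candidate closest to the window center when frequencies tie."""
    by_window: Dict[int, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        by_window[cand.position // window_size].append(cand)

    selected = {}
    for window, group in by_window.items():
        best = max(c.allele_freq for c in group)
        top = [c for c in group if c.allele_freq == best]
        center = window * window_size + window_size / 2
        selected[window] = min(top, key=lambda c: abs(c.position - center))
    return selected


# Hypothetical candidates, for illustration only.
example = [
    Candidate("a", 1_000, 0.30),
    Candidate("b", 4_900, 0.45),
    Candidate("c", 9_000, 0.45),
    Candidate("d", 12_000, 0.20),
]
chosen = select_probes(example)
```

In this toy input, candidates `b` and `c` tie on allele frequency in the first window, and `b` is kept because it lies nearer the window center.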
14. The method according to claim 1, characterized in that, after the probes are determined, the method further comprises: determining, on the reference sequence, the distance between each two adjacent probes;
if the distance between two adjacent probes is greater than the maximum length of the two windows in which those adjacent probes are located, further selecting a short tandem repeat sequence, or a part of a short tandem repeat sequence, between the two windows as a probe.
15. The method according to claim 1, characterized in that the reference sequence is a reference genome or a part thereof.
16. A method for detecting genomic structural variation, the genomic structural variation comprising at least one of chromosomal aneuploidy, copy number variation, and insertion/deletion, characterized in that the method comprises:
(1) sequencing genomic nucleic acid of a target sample to obtain a genome sequencing result consisting of a plurality of reads, wherein, optionally, the sequencing comprises screening with probes, the probes being obtained by the method of any one of claims 1 to 15;
(2) dividing the reference genome into m regions and, using the number of reads falling in region i, calculating the coverage depth $TD_i$ of region i, where m and i are natural numbers, i is the region number, $1 \le i \le m$, and $10 < m$;
(3) determining whether structural variation exists in region i based on the degree of difference between the coverage depth of region i and the coverage depths of region i in k reference samples, where k is a natural number and $k \ge 2$.
17. The method according to claim 16, characterized in that the coverage depth of region i is determined using one of the following formulas:
$$TD_i = \frac{\text{number of reads falling in region } i}{\text{length of region } i}$$
or
$$TD_i = \frac{\text{total number of bases contained in the reads falling in region } i}{\text{length of region } i},$$
where i is the region number.
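For illustration, a per-region coverage depth might be computed as in the Python sketch below. It assumes that coverage depth is the read count, or alternatively the total number of read bases, in a region divided by the region length; the function name, equal-length regions, and the rule that a read belongs to the region containing its 0-based start coordinate are assumptions of the sketch.

```python
from typing import Iterable, List, Tuple


def coverage_depths(reads: Iterable[Tuple[int, int]], region_len: int,
                    n_regions: int) -> Tuple[List[float], List[float]]:
    """Return (read-count-based depth, base-count-based depth) per region.

    reads is an iterable of (start, length) pairs; each read is assigned to
    the region containing its start coordinate (assumed convention).
    """
    counts = [0] * n_regions
    bases = [0] * n_regions
    for start, length in reads:
        i = start // region_len
        if 0 <= i < n_regions:
            counts[i] += 1
            bases[i] += length
    by_count = [c / region_len for c in counts]
    by_bases = [b / region_len for b in bases]
    return by_count, by_bases


# Three hypothetical 100-base reads over two 1000-base regions.
by_count, by_bases = coverage_depths([(0, 100), (50, 100), (1200, 100)], 1000, 2)
```

The base-count variant is less sensitive to differences in read length than the read-count variant, which may matter when reads are trimmed unevenly.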
18. The method according to claim 16, characterized in that the degree of difference between the coverage depth of genomic region i of the target sample and the coverage depths of region i in the k reference samples is tested by a t-test.
19. The method according to claim 16, characterized in that the degree of difference between the coverage depth of region i and the coverage depths of region i in the k reference samples is assessed by comparing the coverage depth coefficients of genomic region i of the target sample and of the reference samples, wherein determining the coverage depth coefficient $R_i$ of region i comprises the following steps:
(a) performing a first correction on $TD_i$ to obtain a first corrected coverage depth $TD_{ai}$, the first correction being carried out by linear regression over the coverage depth values of 2n consecutive regions including region i, where n is a natural number and $10 < n \le m/2$;
(b) normalizing $TD_{ai}$ and thereby obtaining $R_i$ [the expression for $R_i$ is garbled in the source text: "Rl=TI TD ai"].
20. The method according to claim 19, characterized in that, in step (a), the first corrected coverage depth $TD_{ai}$ is determined based on the following formula: $TD_{ai} = \left(\sum_{j} TD_j\right)/n$, where $TD_j$ is the coverage depth of the j-th region among the n consecutive regions, j is a natural number, and $1 \le j \le n$.
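The averaging formula of claim 20 can be illustrated with the short Python sketch below. The claim does not say how the n consecutive regions are positioned relative to region i, so the sketch assumes a window roughly centered on region i that is shifted inward at the ends of the chromosome; that placement, and the function name, are assumptions.

```python
from typing import Sequence


def first_corrected_depth(depths: Sequence[float], i: int, n: int) -> float:
    """Mean coverage depth over n consecutive regions around 0-based region i.

    The window is centered on i where possible and shifted inward near the
    ends so that it always stays inside the list (assumed placement).
    """
    half = n // 2
    lo = max(0, i - half)
    hi = min(len(depths), lo + n)
    lo = max(0, hi - n)
    window = depths[lo:hi]
    return sum(window) / len(window)


depths = [1, 2, 3, 4, 5]  # toy coverage depths, not real data
```

For the toy list above, a window of three regions around the middle region averages the values 2, 3 and 4.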
21. The method according to claim 20, characterized in that, in step (b), $TD_{ai}$ is normalized based on the following formula: [formula supplied only as an image in the source: Figure imgf000024_0001].
22. The method according to any one of claims 18 to 21, characterized in that, after $R_i$ of the target sample is obtained, the method further comprises performing a second correction on $R_i$ to obtain $r_i$ [formula supplied only as an image in the source: Figure imgf000024_0002], where $R_{ai}$ is the mean of the coverage depth coefficients of genomic region i over the k reference samples, $R_{ai} = \frac{1}{k}\sum_{y=1}^{k} R_{iy}$, y is a natural number denoting the reference sample number, and $R_{iy}$ is the coverage depth coefficient of genomic region i of reference sample y.
23. The method according to any one of claims 18 to 21, characterized in that, after $R_i$ of the target sample is obtained, the method further comprises performing a second correction on $R_i$ to obtain $r_i$, where $R_{ai}$ is the mean of the coverage depth coefficients of genomic region i over the k reference samples and the one target sample, $R_{ai} = \frac{1}{k+1}\left(R_i + \sum_{y=1}^{k} R_{iy}\right)$, y is a natural number denoting the reference sample number, and $R_{iy}$ is the coverage depth coefficient of genomic region i of reference sample y.
24. The method according to claim 22 or 23, characterized in that, in performing the t-test, the statistic $t_i$ of genomic region i of the target sample is calculated by the following formula: [formula supplied only as an image in the source: Figure imgf000024_0003], in which the mean is the mean of the second-corrected coverage depth coefficients of genomic region i over the k reference samples, $r_{iy}$ is the second-corrected coverage depth coefficient of genomic region i of reference sample y, and the standard deviation is that of the k reference samples [defining formula supplied only as an image in the source: Figure imgf000025_0001].
25. The method according to claim 24, characterized in that a significance level $P_i$ is obtained based on the $t_i$ value of genomic region i of the target sample; when $P_i \le 0.05$, region i is determined to have structural variation; otherwise, region i is determined to have no structural variation.
26. The method according to claim 24, characterized in that a theoretical value $t_{i0}$ is obtained based on the $t_i$ value of genomic region i of the target sample and a predetermined significance level $P_{i0}$; when $|t_i| \ge t_{i0}$, region i is determined to have structural variation; otherwise, region i is determined to have no structural variation; the predetermined $P_{i0} \le 0.05$.
27. The method according to any one of claims 16 to 21, characterized in that, after step (3), W consecutive regions of the same direction are merged to obtain a primary merged region; two primary merged regions are merged to obtain a secondary merged region when the two primary merged regions are of the same direction and the span between them does not exceed L regions; and structural variation of the secondary merged region is detected based on the degree of difference between the coverage depth of the secondary merged region in the target sample genome and the coverage depths of the corresponding region in the genomes of a plurality of reference samples; wherein regions of the same direction are regions whose t statistics are all greater than 0 or all less than 0, W and L are both natural numbers, and $W \ge 2$ [a further constraint relating L and W is garbled in the source text: "L-W^ l"].
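The two-stage merging of claim 27 might be sketched in Python as follows. This is an illustration under assumptions: only regions already flagged in step (3) are merged, "at least W consecutive regions" is used as the run criterion, and the span between two merged regions is counted as the number of regions lying strictly between them. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Run = Tuple[int, int, int]  # (first region, last region, sign of t)


def primary_merged_regions(t_stats: Sequence[float], flagged: Sequence[bool],
                           w: int = 2) -> List[Run]:
    """Merge runs of at least w consecutive flagged regions whose t statistics
    share the same sign (all > 0 or all < 0)."""
    runs: List[Run] = []
    start, sign = None, 0
    for i, (t, is_flagged) in enumerate(zip(t_stats, flagged)):
        s = ((t > 0) - (t < 0)) if is_flagged else 0
        if s != 0 and s == sign:
            continue  # still inside the current run
        if start is not None and i - start >= w:
            runs.append((start, i - 1, sign))
        start, sign = (i, s) if s != 0 else (None, 0)
    if start is not None and len(t_stats) - start >= w:
        runs.append((start, len(t_stats) - 1, sign))
    return runs


def secondary_merged_regions(runs: Sequence[Run], max_gap: int) -> List[Run]:
    """Join neighbouring primary runs of the same sign that are separated by
    no more than max_gap regions."""
    merged: List[Run] = []
    for run in runs:
        if merged and merged[-1][2] == run[2] and run[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], run[1], run[2])
        else:
            merged.append(run)
    return merged


# Hypothetical t statistics and step (3) calls for seven regions.
t_stats = [2.5, 3.1, 0.2, 2.8, 2.9, -3.0, -2.7]
flagged = [True, True, False, True, True, True, True]
primary = primary_merged_regions(t_stats, flagged, w=2)
secondary = secondary_merged_regions(primary, max_gap=1)
```

In the toy data, the two positive runs are separated by a single unflagged region and are therefore joined, while the negative run stays separate because its direction differs.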
28. A method for detecting loss of heterozygosity, characterized in that it comprises:
(1) sequencing genomic nucleic acid of a target sample to obtain a genome sequencing result consisting of a plurality of reads, wherein, optionally, the sequencing comprises screening with probes, the probes being obtained by the method of any one of claims 1 to 15;
(2) dividing the reference genome into m' regions; based on the information of the reads in the genome sequencing result that fall in region i and on the data of population region i, obtaining the SNP sites shared by genomic region i of the target sample and population region i to form a shared SNP set; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment in which each SNP site of the shared SNP set is located, to obtain a heterozygosity set $U_i$ of genomic region i of the target sample and a heterozygosity set $U_{i0}$ of population region i; and comparing $U_i$ of the target sample with $U_{i0}$ of the population to determine whether loss of heterozygosity exists in region i of the target sample; wherein the fragment in which a SNP site is located is bounded by the two SNPs adjacent to that SNP upstream and downstream, m' and i are natural numbers, $m' \ge i \ge 1$, and $m' \ge 6$.
29. The method according to claim 28, characterized in that the allele frequency of each SNP in the shared SNP set is greater than 0.1.
30. The method according to claim 28, characterized in that the heterozygosity of the fragment in which the SNP site is located is expressed by the minor allele frequency coefficient of that SNP site, the minor allele frequency coefficient of the SNP site being $R_{het} = MAF/(1 - MAF)$, where MAF is the minor allele frequency of the SNP.
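As a small worked illustration of the coefficient in claim 30, the following Python sketch computes $R_{het} = MAF/(1 - MAF)$. The function name and the input check (a minor allele frequency cannot exceed 0.5) are assumptions of the sketch.

```python
def het_coefficient(maf: float) -> float:
    """Minor allele frequency coefficient R_het = MAF / (1 - MAF).

    A minor allele frequency lies between 0 and 0.5 by definition, so values
    outside that range are rejected (an added safeguard, not from the claim).
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    return maf / (1.0 - maf)
```

The coefficient is 1 for a perfectly balanced heterozygous site (MAF = 0.5) and falls toward 0 as one allele is lost, which is why a region with loss of heterozygosity tends to show coefficients clustered near 0.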
31. The method according to claim 30, characterized in that comparing $U_i$ of the target sample with $U_{i0}$ of the population to determine whether loss of heterozygosity has occurred in region i of the target sample comprises using an F-test to judge whether the variance of $U_i$ and the variance of $U_{i0}$ differ significantly; if the variances of $U_i$ and $U_{i0}$ differ significantly, region i of the target sample is determined to have loss of heterozygosity; otherwise, region i of the target sample is determined not to have loss of heterozygosity.
32. The method according to claim 31, characterized in that the F-test comprises calculating the variances of $U_i$ and $U_{i0}$ respectively; using the resulting variance of $U_i$ of the target sample and variance of $U_{i0}$ of the population to calculate two mutually reciprocal statistics $F_{upper}$ and $F_{lower}$; using the mutually reciprocal statistics to obtain a significance level $p_F$; and comparing $p_F$ with a predetermined significance level $p_{F0}$ [formulas supplied only as an image in the source: Figure imgf000026_0001]; where v is the index of a SNP in the high-frequency SNP set shared by genomic region i of the target sample and population region i; q is the number of SNPs in the high-frequency SNP set shared by genomic region i of the target sample and population region i; $R_{het,iv}$ is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of genomic region i of the target sample; $\bar{R}_{het,i}$ is the mean of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of genomic region i of the target sample; $R_{het,i_0v}$ is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of genomic region i of the population sample; $\bar{R}_{het,i_0}$ is the mean of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of genomic region i of the population sample; $p_{upper}$ and $p_{lower}$ are obtained from $F_{upper}$ and $F_{lower}$ respectively; and $p_{F0} \le 0.05$.
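The pair of mutually reciprocal F statistics used in claim 32 can be illustrated as follows. This sketch assumes the usual sample variance with a q − 1 denominator, which the garbled source formula does not confirm, and it stops at the two variance ratios; converting them to significance levels would additionally require the cumulative F distribution with (q − 1, q − 1) degrees of freedom, which is not shown here. The function name and the toy coefficients are hypothetical.

```python
from statistics import variance
from typing import Sequence, Tuple


def reciprocal_f_statistics(target_coeffs: Sequence[float],
                            population_coeffs: Sequence[float]) -> Tuple[float, float]:
    """Return (target variance / population variance,
               population variance / target variance).

    Both inputs hold minor allele frequency coefficients for the same shared
    SNPs; sample variance (q - 1 denominator) is an assumption of this sketch.
    A zero variance in either input raises ZeroDivisionError.
    """
    var_target = variance(target_coeffs)
    var_population = variance(population_coeffs)
    return var_target / var_population, var_population / var_target


# Toy coefficients: the target sample is far less variable than the population.
target = [0.05, 0.06, 0.05, 0.04]
population = [0.4, 0.9, 0.6, 0.3]
f_target_over_pop, f_pop_over_target = reciprocal_f_statistics(target, population)
```

Working with both ratios allows a two-sided comparison: an unusually small or unusually large variance in the target sample relative to the population shows up in one ratio or the other.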
33. The method according to any one of claims 28 to 32, characterized in that, after step (2), W' consecutive regions in which loss of heterozygosity has occurred are merged to obtain a tertiary merged region; two tertiary merged regions are merged to obtain a quaternary merged region when the span between the two tertiary merged regions does not exceed L' regions; the heterozygosity set of the quaternary merged region of the target sample and the heterozygosity set of the same region of the population are obtained respectively; and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the quaternary merged region of the target sample; where W' and L' are both natural numbers and $W' \ge 2$ [a further constraint relating W' and L' is garbled in the source text: "W' 12 L'"].
34. A method for detecting uniparental disomy, characterized in that, when loss of heterozygosity is detected in a genomic region of a target sample, the copy number of that genomic region is calculated, and when the copy number of that genomic region is the same as the copy number of the same region in a normal genome of the same species, the genomic region of the target sample is determined to be uniparental disomy; the loss of heterozygosity in the genomic region of the target sample being determined by the method of any one of claims 27 to 32.
PCT/CN2014/081686 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation WO2016000267A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480080426.0A CN106715711B (en) 2014-07-04 2014-07-04 Method for determining probe sequence and method for detecting genome structure variation
PCT/CN2014/081686 WO2016000267A1 (en) 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/081686 WO2016000267A1 (en) 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation

Publications (1)

Publication Number Publication Date
WO2016000267A1 true WO2016000267A1 (en) 2016-01-07

Family

ID=55018343

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/081686 WO2016000267A1 (en) 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation

Country Status (2)

Country Link
CN (1) CN106715711B (en)
WO (1) WO2016000267A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110079589A (en) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 A kind of accurate method for obtaining structure variation within the scope of full-length genome
CN110462063A (en) * 2017-05-23 2019-11-15 深圳华大生命科学研究院 A kind of mutation detection method based on sequencing data, device and storage medium
CN110600078A (en) * 2019-08-23 2019-12-20 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
CN110592208A (en) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110872618A (en) * 2018-09-04 2020-03-10 北京果壳生物科技有限公司 Method for judging sex of detected sample based on Illumina human whole genome SNP chip data and application
CN111383714A (en) * 2018-12-29 2020-07-07 安诺优达基因科技(北京)有限公司 Method for simulating target disease simulation sequencing library and application thereof
CN112522382A (en) * 2020-12-22 2021-03-19 广州深晓基因科技有限公司 Y chromosome sequencing method based on liquid phase probe capture
CN112662767A (en) * 2020-11-25 2021-04-16 深圳华大基因股份有限公司 Kit and probe for measuring genomic instability and application of kit and probe
CN112885410A (en) * 2021-01-28 2021-06-01 陈晓熠 Genotyping chip for CNV structural variation detection
CN113971986A (en) * 2021-10-12 2022-01-25 江苏先声医疗器械有限公司 Method for checking cross contamination of sequencing sample through sequence similarity
CN114220481A (en) * 2021-11-25 2022-03-22 深圳思勤医疗科技有限公司 Method, system and computer readable medium for performing karyotyping of a sample to be tested based on whole genome sequencing
CN114582427A (en) * 2022-03-22 2022-06-03 成都基因汇科技有限公司 Method for identifying introgression section and computer readable storage medium
CN115101128A (en) * 2022-06-29 2022-09-23 纳昂达(南京)生物科技有限公司 Method for evaluating off-target risk of hybridization capture probe
CN115631789A (en) * 2022-10-25 2023-01-20 哈尔滨工业大学 Pangenome-based group joint variation detection method
CN115713971A (en) * 2022-09-28 2023-02-24 上海睿璟生物科技有限公司 Method, system and terminal for selecting design strategy of target sequence capture probe of next generation sequencing
CN116144794A (en) * 2023-03-09 2023-05-23 华中农业大学 Bovine 12K SV liquid phase chip and design method and application thereof

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019237230A1 (en) * 2018-06-11 2019-12-19 深圳华大生命科学研究院 Method and system for determining type of sample to be tested
CN109584963A (en) * 2018-09-30 2019-04-05 南京派森诺基因科技有限公司 A kind of diversified abstracting method of high-flux sequence data
CN113593644B (en) * 2021-06-29 2024-03-26 广东博奥医学检验所有限公司 Method for detecting chromosome single parent dimer based on family low depth sequencing
WO2023030233A1 (en) * 2021-08-30 2023-03-09 广州燃石医学检验所有限公司 Copy number variation detection method and application thereof
CN114678067B (en) * 2022-03-21 2023-03-14 纳昂达(南京)生物科技有限公司 Method and device for constructing multi-population non-exon region SNP probe set
CN115713967B (en) * 2022-11-17 2023-08-29 纳昂达(南京)生物科技有限公司 Method for designing probe pool and related device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101712959A (en) * 2008-10-08 2010-05-26 中国人民解放军军事医学科学院放射与辐射医学研究所 Novel human cell growth inhibiting gene THAP11 and application thereof
CN101790731A (en) * 2007-03-16 2010-07-28 吉恩安全网络公司 Be used to remove the system and method that genetic data disturbed and determined the chromosome copies number
CN102127819A (en) * 2010-11-22 2011-07-20 深圳华大基因科技有限公司 Constructing method and application of nucleic acid library in MHC (Major Histocompatibility Complex) region

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020086289A1 (en) * 1999-06-15 2002-07-04 Don Straus Genomic profiling: a rapid method for testing a complex biological sample for the presence of many types of organisms
WO2005001091A1 (en) * 2003-06-27 2005-01-06 Olympus Corporation Probe set for detecting mutation and polymorphism in nucleic acid, dna array having the same immobilized thereon and method of detecting mutation and polymorphism in nucleic acid using the same
US20050042654A1 (en) * 2003-06-27 2005-02-24 Affymetrix, Inc. Genotyping methods
US20050136417A1 (en) * 2003-12-19 2005-06-23 Affymetrix, Inc. Amplification of nucleic acids
AU2006266251A1 (en) * 2005-06-30 2007-01-11 Syngenta Participations Ag Methods for screening for gene specific hybridization polymorphisms (GSHPs) and their use in genetic mapping and marker development
JP2009516516A (en) * 2005-11-21 2009-04-23 シモンズ ハプロミクス リミテッド Methods and probes for identifying nucleotide sequences
JP5237126B2 (en) * 2006-03-01 2013-07-17 キージーン ナムローゼ フェンノートシャップ Methods for detecting gene-related sequences based on high-throughput sequences using ligation assays
US7901882B2 (en) * 2006-03-31 2011-03-08 Affymetrix, Inc. Analysis of methylation using nucleic acid arrays
WO2011146788A2 (en) * 2010-05-19 2011-11-24 The Translational Genomics Research Institute Methods of assessing a risk of developing necrotizing meningoencephalitis
WO2012034251A2 (en) * 2010-09-14 2012-03-22 深圳华大基因科技有限公司 Methods and systems for detecting genomic structure variations
US20150337388A1 (en) * 2012-12-17 2015-11-26 Virginia Tech Intellectual Properties, Inc. Methods and compositions for identifying global microsatellite instability and for characterizing informative microsatellite loci

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462063A (en) * 2017-05-23 2019-11-15 深圳华大生命科学研究院 A kind of mutation detection method based on sequencing data, device and storage medium
CN110872618A (en) * 2018-09-04 2020-03-10 北京果壳生物科技有限公司 Method for judging sex of detected sample based on Illumina human whole genome SNP chip data and application
CN110872618B (en) * 2018-09-04 2022-04-19 北京果壳生物科技有限公司 Method for judging sex of detected sample based on Illumina human whole genome SNP chip data and application
CN111383714B (en) * 2018-12-29 2023-07-28 安诺优达基因科技(北京)有限公司 Method for simulating target disease simulation sequencing library and application thereof
CN111383714A (en) * 2018-12-29 2020-07-07 安诺优达基因科技(北京)有限公司 Method for simulating target disease simulation sequencing library and application thereof
CN110079589A (en) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 A kind of accurate method for obtaining structure variation within the scope of full-length genome
CN110600078B (en) * 2019-08-23 2022-03-18 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
CN110600078A (en) * 2019-08-23 2019-12-20 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
CN110592208B (en) * 2019-10-08 2022-05-03 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110592208A (en) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN112662767A (en) * 2020-11-25 2021-04-16 深圳华大基因股份有限公司 Kit and probe for measuring genomic instability and application of kit and probe
CN112662767B (en) * 2020-11-25 2021-08-06 深圳华大基因股份有限公司 Kit and probe for measuring genomic instability and application of kit and probe
CN112522382B (en) * 2020-12-22 2024-03-22 广州深晓基因科技有限公司 Y chromosome sequencing method based on liquid phase probe capture
CN112522382A (en) * 2020-12-22 2021-03-19 广州深晓基因科技有限公司 Y chromosome sequencing method based on liquid phase probe capture
CN112885410A (en) * 2021-01-28 2021-06-01 陈晓熠 Genotyping chip for CNV structural variation detection
CN113971986B (en) * 2021-10-12 2023-03-21 江苏先声医疗器械有限公司 Method for checking cross contamination of sequencing sample through sequence similarity
CN113971986A (en) * 2021-10-12 2022-01-25 江苏先声医疗器械有限公司 Method for checking cross contamination of sequencing sample through sequence similarity
CN114220481B (en) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 Method, system and computer readable medium for completing karyotyping of a sample to be tested based on whole genome sequencing
CN114220481A (en) * 2021-11-25 2022-03-22 深圳思勤医疗科技有限公司 Method, system and computer readable medium for performing karyotyping of a sample to be tested based on whole genome sequencing
CN114582427A (en) * 2022-03-22 2022-06-03 成都基因汇科技有限公司 Method for identifying introgression section and computer readable storage medium
CN115101128A (en) * 2022-06-29 2022-09-23 纳昂达(南京)生物科技有限公司 Method for evaluating off-target risk of hybridization capture probe
CN115101128B (en) * 2022-06-29 2023-09-15 纳昂达(南京)生物科技有限公司 Method for evaluating off-target risk of hybridization capture probe
CN115713971B (en) * 2022-09-28 2024-01-23 上海睿璟生物科技有限公司 Target sequence capture probe design strategy selection method, system and terminal
CN115713971A (en) * 2022-09-28 2023-02-24 上海睿璟生物科技有限公司 Method, system and terminal for selecting design strategy of target sequence capture probe of next generation sequencing
CN115631789A (en) * 2022-10-25 2023-01-20 哈尔滨工业大学 Pangenome-based group joint variation detection method
CN115631789B (en) * 2022-10-25 2023-08-15 哈尔滨工业大学 Group joint variation detection method based on pan genome
CN116144794A (en) * 2023-03-09 2023-05-23 华中农业大学 Bovine 12K SV liquid phase chip and design method and application thereof
CN116144794B (en) * 2023-03-09 2023-12-19 华中农业大学 Bovine 12K SV liquid phase chip and design method and application thereof

Also Published As

Publication number Publication date
CN106715711B (en) 2021-09-17
CN106715711A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
WO2016000267A1 (en) Method for determining the sequence of a probe and method for detecting genomic structural variation
US11031100B2 (en) Size-based sequencing analysis of cell-free tumor DNA for classifying level of cancer
Spencer et al. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens
Ankala et al. A comprehensive genomic approach for neuromuscular diseases gives a high diagnostic yield
Abel et al. Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches
Spencer et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data
US20200277661A1 (en) Methods And Systems For Detecting Genetic Mutations
Sakarya et al. RNA-Seq mapping and detection of gene fusions with a suffix array algorithm
Thind et al. Demystifying emerging bulk RNA-Seq applications: the application and utility of bioinformatic methodology
JP2015521028A (en) Non-invasive prenatal diagnosis of fetal trisomy by allelic ratio analysis using targeted massively parallel sequencing
Wei et al. Massively parallel sequencing reveals an accumulation of de novo mutations and an activating mutation of LPAR1 in a patient with metastatic neuroblastoma
WO2022105629A1 (en) Method for screening snp sites for detecting contamination level of sample and method for detecting contamination level of sample
Conlin et al. Long‐read sequencing for molecular diagnostics in constitutional genetic disorders
Wang et al. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing
Nassar et al. Epigenomic charting and functional annotation of risk loci in renal cell carcinoma
Qi et al. Reproducibility of variant calls in replicate next generation sequencing experiments
Biezuner et al. An improved molecular inversion probe based targeted sequencing approach for low variant allele frequency
CA2871582A1 (en) Method for determining read error of base sequence
Sorrentino et al. CNV analysis in a diagnostic setting using target panel
Wilson et al. svCapture: efficient and specific detection of very low frequency structural variant junctions by error-minimized capture sequencing
Chambers et al. Mutation detection by clonal sequencing of PCR amplicons and grouped read typing is applicable to clinical diagnostics
Meng Ethics statement
Miyabe et al. Genetic Analysis for Lynch Syndrome
Hsu Molecular Methods for Diagnosis of Genetic Diseases Involving the Immune System
Rieber Performance comparison of four human whole-genome sequencing technologies

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14896804

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 14896804

Country of ref document: EP

Kind code of ref document: A1