WO2013127049A1 - Method and device for detecting microdeletions in chromosomal STS regions - Google Patents

Method and device for detecting microdeletions in chromosomal STS regions

Info

Publication number
WO2013127049A1
WO2013127049A1 PCT/CN2012/071648 CN2012071648W WO2013127049A1 WO 2013127049 A1 WO2013127049 A1 WO 2013127049A1 CN 2012071648 W CN2012071648 W CN 2012071648W WO 2013127049 A1 WO2013127049 A1 WO 2013127049A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
region
ratio
sts
value
Prior art date
Application number
PCT/CN2012/071648
Other languages
English (en)
French (fr)
Inventor
刘晓
张俊杰
徐怀前
苏政
张瑞芳
王俊
汪建
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院
Priority to RU2014138794A (patent RU2610691C2, ru)
Priority to ES12870229T (patent ES2701775T3, es)
Priority to EP12870229.7A (patent EP2821501B1, en)
Priority to PCT/CN2012/071648 (publication WO2013127049A1, zh)
Priority to CN201280070387.7A (patent CN104145028B, zh)
Publication of WO2013127049A1 (zh)

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism

Definitions

  • The invention relates to the field of genetic engineering technology, in particular to a chromosomal sequence tagging site
  • Deletion is the partial loss of a chromosome or DNA molecule in a genome and is an important cause of gene mutation.
  • PCR is a technique for selectively amplifying a specific region of DNA in vitro by simulating DNA replication in vivo.
  • PCR detection is fast and convenient; however, primer design requires prior knowledge of the sequences at both ends of the target DNA region, so only microdeletions in chromosomal STS regions that have already been reported can be detected.
  • PCR detection therefore has great limitations; in addition, when many STS sites must be tested, especially when screening for deletions across an entire chromosome, traditional PCR cannot meet the demand, and a new technology is needed.
  • The technical problem to be solved by the present invention is to provide a method and device for detecting microdeletions in chromosomal sequence-tagged site (STS) regions that can detect a large number of microdeletions in chromosomal STS regions at limited cost and can also detect previously unreported microdeletions in STS regions.
  • A technical solution adopted by the present invention is to provide a method for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions, comprising: selecting STS regions on a chromosome and designing corresponding capture probes according to the DNA sequences of the STS regions; hybridizing the capture probes with a multi-sample DNA hybrid library to capture the DNA sequences of the STS regions in the multiple samples; sequencing the DNA sequences of the STS regions captured by the capture probes in the multiple samples to obtain sequencing data; and analyzing the sequencing data by mathematical statistics and obtaining, from the analysis, the microdeletion result for the chromosomal STS regions in each sample.
  • The step of obtaining the microdeletion result for the chromosomal STS regions in each sample comprises: normalizing the sequencing depth values of the sample's STS regions to obtain normalized depth values; and, based on the normalized depth values, using mathematical statistics to detect outlying depth values in the sample's STS regions, thereby obtaining the microdeletion result for the sample's STS regions.
  • The step of normalizing the sequencing depth values of the sample's STS regions includes: dividing the depth value of each region in a sample by the average depth value of that sample to obtain the normalized depth value.
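The normalization step can be sketched as follows. This is a minimal illustration, assuming (as the device description of the depth value normalization unit states) that each region's depth is divided by the mean depth of its sample. The function name `normalize_depth`, the dictionary layout, and the example numbers are hypothetical and not taken from the patent.

```python
from statistics import mean


def normalize_depth(depth_by_sample):
    """Divide each region's depth by the mean depth of its sample.

    depth_by_sample maps a sample name to a list of per-region sequencing
    depths, with the same region order for every sample.
    """
    normalized = {}
    for sample, depths in depth_by_sample.items():
        sample_mean = mean(depths)
        # Assumption: samples with zero mean depth were already removed by QC.
        normalized[sample] = [d / sample_mean for d in depths]
    return normalized


# Two samples sequenced at very different depths but with the same profile.
raw = {"sample_a": [100, 50, 150], "sample_b": [10, 5, 15]}
normalized = normalize_depth(raw)
```

After normalization the two samples become directly comparable even though one was sequenced ten times deeper than the other.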
  • The step of using mathematical statistics to detect outlying depth values in the sample's STS regions from the normalized depth values, thereby obtaining the microdeletion result, includes: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, the normal distribution curve of all non-outlier samples in the same region; calculating, from the normal distribution curve, the probability value of each sample at its specific depth value in each region; and setting a first probability value threshold according to those probability values, such that if the probability value of a sample's region at its specific depth value is smaller than the first probability value threshold, a result R1 that the sample region has a microdeletion is obtained.
  • the first probability value threshold is set according to the probability value of each sample at a specific depth value in the corresponding region, and if the probability value of the sample's region at a specific depth value is smaller than the first probability
  • The microdeletion result R1 is experimentally verified, and a second probability value threshold, smaller than the first probability value threshold, is set according to the verification result; if the probability value of the sample region at a specific depth value is smaller than the second probability value threshold, a result R2 that the sample region has a microdeletion is obtained.
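The two-threshold probability test can be sketched for a single STS region as below. Several points are assumptions rather than statements from the text: non-outlier samples are chosen with a Tukey (1.5 times IQR) rule before fitting the normal curve, the "probability value" is taken to be the lower-tail cumulative probability, and the threshold values `1e-3` and `1e-6` are placeholders (the text only says the second threshold is smaller than the first and is set after experimental verification). The function name `call_region` is hypothetical.

```python
from statistics import NormalDist, mean, quantiles, stdev


def call_region(depths, first_threshold=1e-3, second_threshold=1e-6):
    """Flag samples whose normalized depth in one region is improbably low.

    depths: normalized depth of the SAME region in every sample.
    Assumes at least two non-outlier samples with non-identical depths.
    """
    # Assumption: non-outlier samples are those inside the Tukey fences.
    q1, _, q3 = quantiles(depths, n=4, method="inclusive")
    iqr = q3 - q1
    inliers = [d for d in depths if q1 - 1.5 * iqr <= d <= q3 + 1.5 * iqr]
    curve = NormalDist(mean(inliers), stdev(inliers))

    calls = []
    for d in depths:
        p = curve.cdf(d)  # assumption: lower-tail probability at this depth
        calls.append({
            "depth": d,
            "p": p,
            "R1": p < first_threshold,   # candidate microdeletion
            "R2": p < second_threshold,  # stricter call after verification
        })
    return calls


region = [1.0, 1.05, 0.95, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 0.02]
calls = call_region(region)
```

Only the last sample, whose normalized depth is close to zero, falls below both thresholds in this toy data set.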
  • The step of using mathematical statistics to detect outlying depth values in the sample's STS regions from the normalized depth values, thereby obtaining the microdeletion result, includes: calculating, from the normalized depth values of the sample's STS regions, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; calculating the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; training on the ratio D/S with the ID3 algorithm to generate a first ratio threshold, and training on the ratio D/R with the ID3 algorithm to generate a second ratio threshold; if the ratio D/S of the sample region is greater than the first ratio threshold, obtaining the result that the sample region has no microdeletion; if the ratio D/S of the sample region is smaller than the first ratio threshold, and the ratio D/R
  • The step of using mathematical statistics to detect outlying depth values in the sample's STS regions from the normalized depth values, thereby obtaining the microdeletion result, includes: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, the normal distribution curve of all non-outlier samples in the same region; calculating, from the normal distribution curve, the probability value of each sample at its specific depth value in each region; setting a third probability value threshold according to those probability values, such that if the probability value of the sample region at its specific depth value is smaller than the third probability value threshold, a result R3 that the sample region has a microdeletion is obtained; calculating a ratio D/S of the normalized depth value of the sample region
  • if the ratio D/S of the sample region is smaller than the third ratio threshold and the ratio D/R of the sample region is greater than the fourth ratio threshold, the result that the sample region has no microdeletion is obtained;
  • if the ratio D/S of the sample region is smaller than the third ratio threshold and the ratio D/R of the sample region is smaller than the fourth ratio threshold, the result that the sample region has a microdeletion is obtained.
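The D/S and D/R decision rule can be sketched as below. This is an illustration under stated assumptions: D/S is read as the normalized depth divided by the median depth of that region across all samples, D/R as the normalized depth divided by the median depth of that sample across all regions, and the two thresholds are fixed placeholder values of 0.3 rather than thresholds trained with ID3 as the text describes. The function name `ratio_calls` is hypothetical, and the behaviour when a ratio exactly equals its threshold is not specified in the text.

```python
from statistics import median


def ratio_calls(norm, ds_threshold=0.3, dr_threshold=0.3):
    """Return a matrix of booleans: True where a microdeletion is called.

    norm: list of rows (one per sample), each a list of normalized depths
    (one per STS region, same region order in every row).
    Assumption: regions and samples with a median of 0 were removed by QC.
    """
    region_medians = [median(col) for col in zip(*norm)]
    calls = []
    for row in norm:
        sample_median = median(row)
        row_calls = []
        for depth, region_median in zip(row, region_medians):
            ds = depth / region_median  # depth vs. the same region in other samples
            dr = depth / sample_median  # depth vs. other regions in the same sample
            # D/S above its threshold, or D/R above its threshold: no microdeletion.
            row_calls.append(ds < ds_threshold and dr < dr_threshold)
        calls.append(row_calls)
    return calls


norm = [
    [1.0, 1.0, 1.1],
    [0.9, 1.1, 1.0],
    [1.0, 0.05, 1.0],  # second region of the third sample is nearly absent
    [1.1, 0.9, 0.9],
]
deletions = ratio_calls(norm)
```

Requiring both ratios to be low means a region that is poorly captured in every sample (low D/R but normal D/S) is not reported as a microdeletion.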
  • The step of selecting STS regions on the chromosome and designing and synthesizing the corresponding capture probes according to the DNA sequences of the STS regions comprises: searching the genome database for the DNA sequences of the STS regions on the chromosome; selecting, from the DNA sequences found, sequences that meet the capture probe design conditions; and designing and synthesizing the capture probes according to the selected sequences.
  • The step of preparing the multi-sample DNA hybrid library comprises: preparing a plurality of quality-controlled single-sample DNA libraries with different linkers; mixing the single-sample DNA libraries in a predetermined ratio; and verifying whether the quality of the mixed multi-sample DNA library is acceptable, in which case it is a prepared multi-sample DNA hybrid library.
  • The step of preparing the single-sample DNA library comprises: fragmenting the genomic DNA into DNA fragments of a predetermined size by physical or chemical means and recovering the fragments; end-repairing the recovered DNA fragments with an enzyme to form blunt-ended, terminally phosphorylated DNA fragments and recovering them; adding an "A" base to the 3' end of the recovered blunt-ended DNA fragments with an enzyme and recovering the DNA fragments bearing the 3' "A" base; ligating the recovered DNA fragments bearing the 3' "A" base to an Index Adapter (tag linker) with an enzyme and recovering the DNA fragments carrying the tag linker; amplifying the DNA fragments carrying the tag linker using primers that match the tag linker sequence and recovering the amplified product; and verifying whether the amplified product passes quality control, in which case it is a prepared single-sample DNA library.
  • The step of sequencing the DNA sequences of the STS regions captured by the capture probes in the multiple samples further comprises: performing quality control on the sequencing data of the DNA sequences of the STS regions in the multiple samples.
  • The step of performing quality control on the sequencing data of the DNA sequences of the STS regions in the multiple samples includes: filtering out the unqualified data in that sequencing data to obtain qualified multi-sample sequencing data; aligning the qualified multi-sample sequencing data to the reference genome sequence with short-read alignment software, and compiling statistics on the sequencing depth of each sample and on the sequencing depth of the same STS region across different samples; filtering out the sequencing data of unqualified samples according to the per-sample sequencing depth statistics to obtain the sequencing data of qualified samples; and filtering out the sequencing data of unqualified STS regions according to the statistics on the sequencing depth of the same STS region across different samples to obtain the sequencing data of qualified STS regions.
  • The step of filtering out the unqualified data in the sequencing data of the DNA sequences of the STS regions in the multiple samples to obtain qualified multi-sample sequencing data comprises: filtering by sequencing quality according to the proportion of low-quality bases in the sequencing data, wherein a read is judged unqualified if its number of low-quality bases exceeds a predetermined proportion of the number of bases in the whole read, and filtering out the unqualified sequencing data to obtain a preliminarily qualified first sequencing data set.
  • the unqualified sequencing data is filtered out to obtain a preliminarily qualified second sequencing data set; all sequencing data in the preliminarily qualified second sequencing data set are compared with a library of sequencing linker sequences, any data containing a sequencing linker sequence is judged unqualified, and the unqualified sequencing data is filtered out to obtain a preliminarily qualified third sequencing data set; all sequencing data in the preliminarily qualified third sequencing data set are compared with all exogenous sequences introduced in the experiment, any data containing an exogenous sequence is judged unqualified, and the unqualified sequencing data is filtered out to obtain the qualified multi-sample sequencing data.
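The read-level filters can be sketched as below. This is a simplified illustration, not the patent's pipeline: it covers only the low-quality-base filter and the linker/exogenous-sequence check (the intermediate filter that produces the second data set is elided in the text and is not reproduced here). The quality cutoff of 20, the 50% proportion, the adapter string, and the function name `passes_read_qc` are hypothetical placeholders.

```python
def passes_read_qc(seq, quals, contaminants, min_quality=20, max_low_fraction=0.5):
    """Return True if a read survives the quality and contaminant filters.

    seq: read bases; quals: per-base quality scores (same length as seq);
    contaminants: linker (adapter) and exogenous sequences to reject.
    """
    low_quality = sum(1 for q in quals if q < min_quality)
    # Reject reads whose share of low-quality bases exceeds the allowed proportion.
    if low_quality > max_low_fraction * len(quals):
        return False
    # Reject reads that contain any linker or exogenous sequence.
    return not any(c in seq for c in contaminants)


reads = [
    ("ACGTACGTAC", [30] * 10),           # clean read
    ("ACGTACGTAC", [5] * 6 + [30] * 4),  # 60% low-quality bases
    ("ACGTAGATCGGAAG", [30] * 14),       # contains the placeholder adapter
]
contaminants = ["AGATCGGAAG"]  # illustrative adapter fragment only
kept = [seq for seq, quals in reads if passes_read_qc(seq, quals, contaminants)]
```

A real pipeline would typically tolerate mismatches when matching linker sequences; exact substring matching is used here only to keep the sketch short.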
  • The step of filtering out the sequencing data of unqualified samples to obtain the sequencing data of qualified samples comprises:
  • sorting the sequencing depth values in ascending order, and using the quartile function to determine the lower quartile Q1, the upper quartile Q3, and the interquartile range IQR of the sorted sequencing depth values of all samples;
  • filtering out the sequencing data of unqualified samples, namely all samples whose sequencing depth values are below Q1 minus 1.5 times IQR or above Q3 plus 1.5 times IQR, to obtain the sequencing data of qualified samples.
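The sample-level filter is the standard Tukey fence rule and can be sketched as below. The function name `filter_samples`, the dictionary layout, and the depth values are hypothetical; the quartile interpolation method (`inclusive`) is also an assumption, since the text only refers to "the quartile function".

```python
from statistics import quantiles


def filter_samples(depth_by_sample):
    """Keep samples whose depth lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(depth_by_sample.values(), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {s: d for s, d in depth_by_sample.items() if low <= d <= high}


sample_depth = {"s1": 30, "s2": 32, "s3": 29, "s4": 31, "s5": 33, "s6": 2, "s7": 120}
qualified = filter_samples(sample_depth)
```

In this toy data set the under-sequenced sample `s6` and the over-sequenced sample `s7` fall outside the fences and are dropped.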
  • The step of filtering out the sequencing data of unqualified STS regions, according to the statistics on the sequencing depth of the same STS region across different samples, to obtain the sequencing data of qualified STS regions includes: sorting the sequencing depth values of the STS regions in ascending order, and using the quartile function to determine the median, the upper quartile Q3, and the interquartile range IQR of the sorted sequencing depth values of the same STS region across different samples; and filtering out the sequencing data of unqualified STS regions, namely those whose median sequencing depth across the different samples is 0 or greater than Q3 plus 1.5 times IQR, to obtain the sequencing data of qualified STS regions.
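The region-level filter can be sketched as below. The text is ambiguous about which distribution Q3 and IQR are taken from; this sketch assumes they are computed over the per-region medians (one median per STS region, taken across samples), since a region's median cannot exceed the upper quartile of its own depths. The function name `filter_regions` and all numbers are hypothetical.

```python
from statistics import median, quantiles


def filter_regions(depth_rows, region_names):
    """Drop regions whose median depth is 0 or above Q3 + 1.5*IQR.

    depth_rows: list of rows (one per sample) of per-region depths.
    Assumption: Q3 and IQR are taken over the per-region medians.
    """
    medians = [median(col) for col in zip(*depth_rows)]
    q1, _, q3 = quantiles(medians, n=4, method="inclusive")
    upper = q3 + 1.5 * (q3 - q1)
    return [name for name, m in zip(region_names, medians) if 0 < m <= upper]


depth_rows = [
    [30, 0, 28, 500, 31, 27],
    [31, 0, 30, 600, 33, 29],
    [29, 5, 32, 550, 32, 28],
]
regions = ["r1", "r2", "r3", "r4", "r5", "r6"]
qualified_regions = filter_regions(depth_rows, regions)
```

Here `r2` is removed because its median depth is 0 and `r4` because its median depth is far above the upper fence.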
  • A device for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions, comprising: a capture probe obtaining module, configured to select STS regions on a chromosome and design corresponding capture probes according to the DNA sequences of the STS regions; a hybridization module, configured to hybridize the capture probes with a multi-sample DNA hybrid library to capture the DNA sequences of the STS regions in the multiple samples; a sequencing data obtaining module, configured to sequence the DNA sequences of the STS regions captured by the capture probes in the multiple samples to obtain sequencing data; and a microdeletion result obtaining module, configured to analyze the sequencing data by mathematical statistics and obtain, from the analysis, the microdeletion result for the chromosomal STS regions in each sample.
  • The microdeletion result obtaining module includes: a depth value normalization unit, configured to normalize the sequencing depth values of the sample's STS regions to obtain normalized depth values; and a microdeletion result obtaining unit, configured to use mathematical statistics to detect outlying depth values in the sample's STS regions from the normalized depth values and to obtain the microdeletion result for the sample's STS regions.
  • The depth value normalization unit is specifically configured to divide the depth value of the same region in all samples by the average depth value of each sample to obtain the normalized depth value of the sample region.
  • The microdeletion result obtaining unit includes: an average variance obtaining unit, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit, configured to obtain, from that mean and variance, the normal distribution curve of all non-outlier samples in the same region; a unit, configured to calculate, from the normal distribution curve, the probability value of each sample at its specific depth value in each region; and a first determining unit, configured to set a first probability value threshold according to those probability values and, if the probability value of a sample's region at its specific depth value is smaller than the first probability value threshold, to obtain a result R1 that the sample region has a microdeletion.
  • The microdeletion result obtaining unit further includes: a probability value threshold determining unit, configured to experimentally verify the result R1 that the sample region has a microdeletion and to set, according to the verification result, a second probability value threshold that is smaller than the first probability value threshold; and a second determining unit, configured to obtain a result R2 that the sample region has a microdeletion if the probability value of the sample region at its specific depth value is smaller than the second probability value threshold.
  • The microdeletion result obtaining unit includes: a ratio D/S obtaining unit, configured to calculate, from the normalized depth values of the sample's STS regions, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; a ratio D/R obtaining unit, configured to calculate the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; a first and second ratio threshold obtaining unit, configured to train on the ratio D/S with the ID3 algorithm to generate a first ratio threshold and to train on the ratio D/R with the ID3 algorithm to generate a second ratio threshold; a first determining unit, configured to obtain the result that the sample region has no microdeletion if the ratio D/S of the sample region is greater than the first ratio threshold; and a second determining unit, configured to, if the ratio D/S of the sample region is smaller than the first ratio threshold and the ratio D/R of the sample region is
  • The microdeletion result obtaining unit includes: an average variance obtaining unit, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit, configured to obtain, from that mean and variance, the normal distribution curve of all non-outlier samples in the same region; a unit, configured to calculate, from the normal distribution curve, the probability value of each sample at its specific depth value in each region; a first determining unit, configured to set a third probability value threshold according to those probability values and, if the probability value of the sample region at its specific depth value is smaller than the third probability value threshold, to obtain a result R3 that the sample region has a microdeletion; a ratio D/S obtaining unit, configured to calculate the ratio D/S of the normalized depth value of the sample region
  • The capture probe obtaining module comprises: a region search unit, configured to search a DNA database for the DNA sequences of the STS regions on a chromosome; and a sequence selection unit, configured to select, from the DNA sequences found, sequences that meet the capture probe design conditions, from which the capture probes are designed and synthesized.
  • The device further comprises a multi-sample DNA hybrid library preparation module.
  • The multi-sample DNA hybrid library preparation module comprises: a single-sample DNA library preparation unit, configured to prepare a plurality of quality-controlled single-sample DNA libraries with different linkers; a single-sample library mixing unit, configured to mix the single-sample DNA libraries in a predetermined ratio; and a multi-sample DNA hybrid library obtaining unit, configured to verify whether the quality of the mixed multi-sample DNA library is acceptable, in which case it is a prepared multi-sample DNA hybrid library.
  • The device further includes a sequencing data quality control module, which includes: a qualified sequence obtaining unit, configured to filter out the unqualified data in the sequencing data of the DNA sequences of the STS regions in the multiple samples to obtain qualified multi-sample sequencing data; a sequencing depth statistics unit, configured to align the qualified multi-sample sequencing data to the reference genome sequence with short-read alignment software and to compile statistics on the sequencing depth of each sample and on the sequencing depth of the same STS region across different samples; a qualified sample obtaining unit, configured to filter out the sequencing data of unqualified samples according to the per-sample sequencing depth statistics to obtain the sequencing data of qualified samples; and a qualified region obtaining unit, configured to filter out the sequencing data of unqualified STS regions according to the statistics on the sequencing depth of the same STS region across different samples to obtain the sequencing data of qualified STS regions.
  • Still another technical solution adopted by the present invention is to provide a computer readable medium carrying a series of instructions to control a computer processor to perform the method as described above.
  • The beneficial effects of the present invention are as follows. Unlike the prior art, the method and device of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions design capture probes according to the DNA sequences of the STS regions, and the probes cover the STS regions across the whole chromosome, so that after hybridization with the multi-sample DNA hybrid library the DNA sequences of the STS regions in the multiple samples can be captured in large numbers and efficiently. The mathematical statistics analysis workflow of the present invention is scientific and stable, with high sensitivity and a low false-positive rate, and can analyze microdeletions effectively.
  • FIG. 1 is a flow chart of an embodiment of the method of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 2 is a flow chart showing another embodiment of the method for detecting microdeletion based on chromosomal sequence tagging site STS region of the present invention
  • Figure 3 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention
  • Figure 4 is a flow chart showing still another embodiment of the method for detecting microdeletion in the STS region based on the chromosomal sequence tagging site of the present invention
  • Figure 5 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention
  • Figure 6 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • Figure 7 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • FIG. 8 is a schematic diagram of filtering out depth value outliers in another embodiment of the method of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 9 is a schematic diagram of a sample of a method of micro-deletion detection based on a chromosome sequence tag site STS region in accordance with another embodiment of the present invention.
  • Figure 10 is a flow chart showing still another embodiment of the method for detecting microdeletion based on chromosome sequence tag site STS region of the present invention.
  • FIG. 11 is a schematic diagram of filtering out depth group outliers in another embodiment of the method of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 12 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • Figure 13 is a flow chart showing still another embodiment of the method for detecting microdeletion based on chromosome sequence tag site STS region of the present invention.
  • Figure 14 is a flow chart of an embodiment of the method of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 15 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • Figure 16 is a schematic diagram of a decision tree constructed using the ID3 algorithm in another embodiment of the method of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 17 is a partial flow chart of a decision tree analysis method in still another embodiment of the method for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • FIG. 18 is a flow chart of a decision tree analysis method in still another embodiment of the method for detecting micro-deletion based on chromosome sequence tag site STS region in the present invention.
  • Figure 19 is a flow chart showing still another embodiment of the method for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • Figure 20 is a bar graph and a box plot of the median depth of sample sequencing in another embodiment of the method based on the chromosomal sequence tag site STS region microdeletion detection;
  • Figure 21 is a bar graph and box plot of the median depth of region sequencing in another embodiment of the method for detecting microdeletion based on chromosome sequence tag site STS region in the present invention.
  • Figure 22 is a diagram showing the probability value of obtaining a region depth to a certain value in another embodiment of the method based on the chromosomal sequence tag site STS region microdeletion detection;
  • Figure 23 is a diagram showing the relationship between the probability value threshold and the expected and observed values in another embodiment of the method for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 24 is a schematic diagram showing the ratio of true and false positives in another embodiment of the method for detecting microdeletion in the STS region based on the chromosomal sequence tag site;
  • Figure 25 is a schematic view showing the structure of an apparatus for detecting micro-deletion based on a chromosome sequence tag site STS region according to the present invention.
  • Figure 26 is a schematic view showing the structure of still another embodiment of the apparatus for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • Figure 27 is a schematic view showing the structure of another embodiment of the apparatus of the present invention for detecting microdeletions based on chromosomal sequence-tagged site (STS) regions;
  • Figure 28 is a schematic view showing the structure of still another embodiment of the apparatus for detecting microdeletion based on the chromosome sequence tag site STS region of the present invention.
  • Figure 29 is a schematic view showing the structure of still another embodiment of the apparatus for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • FIG. 1 is a flow chart of an embodiment of a method for detecting micro-deletion based on a chromosome sequence tag site STS region according to the present invention. As shown in FIG. 1, the method includes:
  • Step 101 Select the STS region on the chromosome, and design a corresponding capture probe according to the DNA sequence of the STS region.
  • the STS (sequence-tagged site) sequence tag site is a short, single-copy DNA sequence that is well-defined on the genome, can be used as a landmark and can be amplified by PCR, and is used to generate mapping sites.
  • a sequence of STS can be used to map the genomic region.
  • the probe is a small stretch of single-stranded DNA or RNA fragments (approximately 20 to 500 bp) used to detect nucleic acid sequences complementary thereto.
  • Step 102 Hybridize the capture probe with a multi-sample DNA hybrid library to capture the DNA sequences of the STS region in the multiple samples.
  • A DNA library is a collection of recombinant DNA molecules carrying DNA fragments, conventionally produced by DNA cloning techniques and transformed into bacteria.
  • The DNA hybrid library here refers to a library in which the DNA fragments of all samples are mixed together.
  • Step 103 Sequence the DNA sequences of the STS region captured from the multiple samples by the corresponding capture probe to obtain sequencing data.
  • Step 104 The sequencing data is analyzed by a mathematical statistics method, and according to the analysis conclusion, the result of the micro-deletion of the chromosome STS region in each sample is obtained.
  • Mathematical statistics is a branch of mathematics that developed alongside probability theory. It studies how to effectively collect, collate, and analyze data affected by random factors, how to draw inferences or predictions about the problem under consideration, and how to provide evidence or advice for decisions.
  • the step of selecting the STS region on the chromosome and designing and synthesizing the corresponding capture probe according to the DNA sequence of the STS region includes:
  • Step 201 Find the DNA sequence of the STS region on the chromosome in the genome database.
  • Step 202 Select a sequence that matches the design conditions of the capture probe in the DNA sequence of the found STS region.
  • Sequence repetitiveness and GC content affect the capture efficiency of the chip and can even cause capture errors, so it is important to select sequences that meet the capture probe design conditions.
  • Step 203 Design and synthesize a capture probe according to the selected sequence that meets the design conditions of the capture probe.
  • The probes are typically 60-150 bp in length and have a GC content between 40% and 70%.
  • When designing probes for one DNA segment, multiple probes are required to cover the entire DNA sequence, and adjacent probes overlap; the overlapping sequence is generally 20 bp long.
  • The position coordinates of the DNA sequence of the STS region on the chromosome are looked up in the genome database and submitted to a company providing DNA capture services, which completes the design and synthesis of the capture probes.
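  • The probe design conditions above (fixed probe length, about 20 bp overlap between adjacent probes, GC content between 40% and 70%) can be illustrated with a minimal sketch. The function names `gc_fraction` and `tile_probes`, the half-open coordinate convention, and the choice to shift the last probe so that it ends at the region boundary are assumptions made for illustration; they are not the design procedure of any capture service.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def tile_probes(start, end, probe_len=80, overlap=20):
    """Tile the half-open region [start, end) with overlapping probes.

    Returns (probe_start, probe_end) intervals. Adjacent probes overlap by
    `overlap` bases; the last probe is shifted left so that it ends exactly
    at `end`, so its overlap with the previous probe may be larger.
    """
    if not 0 <= overlap < probe_len:
        raise ValueError("overlap must be in [0, probe_len)")
    step = probe_len - overlap
    probes = []
    pos = start
    while pos + probe_len < end:
        probes.append((pos, pos + probe_len))
        pos += step
    probes.append((max(start, end - probe_len), end))
    return probes


# A 200 bp region tiled with 80 bp probes that overlap by 20 bp.
print(tile_probes(0, 200))
# GC content check against the 40%-70% window quoted in the text.
print(0.40 <= gc_fraction("GGCCAATT") <= 0.70)
```

  • In a real design, probes whose GC fraction falls outside the window, or that lie in repetitive sequence, would be dropped or redesigned.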
  • the steps of preparing the multi-sample DNA hybrid library include:
  • Step 301 Prepare a plurality of quality-controlled single-sample DNA libraries with different linkers. Each sample carries a different tag linker. To distinguish libraries from different samples in sequencing, each sample library contains a different 6 bp or 8 bp Index base sequence at the DNA end.
  • Step 302 Mix the plurality of single-sample DNA libraries in a predetermined ratio.
  • The single-sample libraries can be mixed in equal DNA amounts or in a chosen ratio, as needed.
  • Step 303 Verify whether the quality of the mixed multi-sample DNA library is acceptable; if so, the multi-sample DNA hybrid library is obtained.
  • The mixed multi-sample DNA library is quantified and checked against quality control indicators such as the introduction of exogenous impurities.
  • the steps of preparing the single sample DNA library include:
  • Step 401 The genomic DNA is broken into a DNA fragment of a predetermined size by physical or chemical means, and the interrupted DNA fragment is recovered.
  • The interrupted DNA fragments are 200-300 bp in size.
  • The probes are generally about 80 bp long, and fragments of 200-300 bp give a high capture efficiency.
  • Paired-end (PE) sequencing is used after capture, and the sequenced length is also 200-300 bp.
  • Step 402 End-repair the recovered DNA fragments with an enzyme to form blunt-ended, terminally phosphorylated DNA fragments, and recover these fragments;
  • Step 403 Use an enzyme to add an "A" base to the 3' end of the recovered blunt-ended DNA fragments, and recover the DNA fragments carrying the 3'-terminal "A" base;
  • Step 404 Ligate the recovered DNA fragments carrying the 3'-terminal "A" base to the tag linker (Index Adapter) under the action of an enzyme, and recover the DNA fragments carrying the tag linker;
  • Step 405 Amplify the DNA fragments carrying the tag linker using primers that match the tag linker sequence, and recover the amplified product;
  • Step 406 Verify whether the quality of the amplified product is acceptable; if so, the single-sample DNA library is obtained.
  • The single-sample DNA library is quantified and checked against quality control indicators such as the introduction of exogenous impurities.
  • After the DNA sequences of the STS region captured from the multiple samples are sequenced and the sequencing data are obtained, the method further comprises: performing quality control on the sequencing data of the DNA sequences of the STS region in the multiple samples.
  • The steps of performing quality control on the sequencing data of the DNA sequences of the STS region in the multiple samples include:
  • Step 501 Filter out the unqualified data in the sequencing data of the DNA sequences of the STS region in the multiple samples to obtain qualified multi-sample sequencing data;
  • The high-throughput sequencing technology can be Illumina Hiseq 2000 sequencing technology; other available high-throughput sequencing technologies can of course also be used.
  • Step 502 Align the qualified multi-sample sequencing data to the reference genome sequence with short-sequence alignment software, and compute the parameters describing the sequencing depth of each sample and the parameters describing the sequencing depth of the same STS region across different samples;
  • Step 503 Filter out the sequencing data of unqualified samples according to the computed sequencing depth parameters of each sample, to obtain the sequencing data of the qualified samples;
  • Step 504 Filter out the sequencing data of unqualified STS regions according to the computed sequencing depth parameters of the same STS region across different samples, to obtain the sequencing data of the qualified STS regions.
  • In step 501, the step of filtering out the unqualified data in the sequencing data of the DNA sequences of the STS region in the multiple samples to obtain qualified multi-sample sequencing data includes:
  • Step 601 Perform sequencing quality filtering on the proportion of low-quality bases in the sequencing data: if the number of low-quality bases in a sequence exceeds a predetermined proportion of the number of bases in the entire sequence, the sequence is judged to be unqualified data; the unqualified sequencing data are filtered out to obtain a preliminarily qualified first sequencing data set. Quality values are calculated differently on different sequencing equipment; the standard for low-quality sequences can be obtained by consulting the company that provides the sequencing equipment or by referring to standards generally used in the field.
  • The quality value of a base is derived from the ASCII code of its quality character.
  • In this embodiment, a base with a quality value of less than 5 is defined as a low-quality base; if the number of low-quality bases exceeds 50% of the number of bases in the entire sequence, the sequence is judged to be unqualified data, and the unqualified sequencing data are filtered out to obtain the preliminarily qualified first sequencing data set.
  • Step 602 If the number of bases with undefined sequencing results in a sequence of the preliminarily qualified first sequencing data set exceeds 10% of the number of bases in the entire sequence, the sequence is judged to be unqualified data, and the unqualified sequencing data are filtered out to obtain a preliminarily qualified second sequencing data set;
  • Step 603 Align all the sequencing data in the preliminarily qualified second sequencing data set against the sequencing linker sequence library; if a sequence in the second sequencing data set contains a sequencing linker sequence, it is judged to be unqualified data, and the unqualified sequencing data are filtered out to obtain a preliminarily qualified third sequencing data set;
  • Step 604 Align all the sequencing data in the preliminarily qualified third sequencing data set against all the exogenous sequences introduced in the experiment; if a sequence in the third sequencing data set contains an exogenous sequence, it is judged to be unqualified data, and the unqualified sequencing data are filtered out to obtain the qualified multi-sample sequencing data.
  • the exogenous sequence introduced in the assay is a human reference genome sequence.
  • high-throughput sequencing technologies can be Illumina Hiseq 2000 sequencing technology or other existing high-throughput sequencing technologies.
  • Different sequencing instruments or conditions may use different criteria for unqualified sequences. One set of criteria available for Illumina Hiseq 2000 sequencing is: a sequence is unqualified if the number of bases whose sequencing quality is below a threshold exceeds 50% of the number of bases in the entire sequence, where the low-quality threshold is determined by the specific sequencing technology and sequencing environment; a sequence is unqualified if the number of bases with undefined sequencing results (such as N in Illumina Hiseq 2000 sequencing results) exceeds 10% of the number of bases in the entire sequence; and, in addition to the sample linker sequence, the sequence is aligned against other exogenous sequences introduced by the experiment, such as various linker sequences, and is unqualified if it contains such an exogenous sequence.
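  • The read-level filters described above can be sketched as a single predicate. This is a minimal illustration, not the filtering program used in the embodiment: the function name `is_qualified_read` is hypothetical, quality values are assumed to be already decoded to integers, and linker or exogenous sequences are detected by exact substring matching rather than by the alignment the text describes.

```python
def is_qualified_read(seq, quals, contaminants,
                      low_q=5, max_low_frac=0.5, max_n_frac=0.1):
    """Return True if a read passes the three filters described in the text.

    seq          -- read sequence (string of A/C/G/T/N)
    quals        -- per-base integer quality values, same length as seq
    contaminants -- linker/exogenous sequences to screen for (assumption:
                    exact substring match stands in for alignment)
    """
    n = len(seq)
    if n == 0 or len(quals) != n:
        return False
    # Filter 1: too many low-quality bases (quality value below low_q).
    low = sum(1 for q in quals if q < low_q)
    if low > max_low_frac * n:
        return False
    # Filter 2: too many undefined bases (N).
    if seq.upper().count("N") > max_n_frac * n:
        return False
    # Filter 3: read contains a linker or other exogenous sequence.
    if any(c in seq for c in contaminants):
        return False
    return True


reads = [
    ("ACGTACGTAC", [30] * 10),            # passes all filters
    ("ACGTACGTAC", [2] * 6 + [30] * 4),   # 60% low-quality bases
    ("ANNTACGTAC", [30] * 10),            # 20% undefined bases
    ("ACGTTTTTTC", [30] * 10),            # contains the example linker
]
kept = [s for s, q in reads if is_qualified_read(s, q, ["TTTTTT"])]
print(kept)
```

  • The thresholds default to the values quoted for this embodiment (quality below 5, more than 50% low-quality bases, more than 10% N); other platforms may need different values.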
  • the sequencing data of the unqualified sample is filtered out, and the steps of obtaining the sequencing data of the qualified sample include:
  • Step 701 Sort the sequencing depth values of all samples in ascending order, and use a quartile function to determine the lower quartile Q1, the upper quartile Q3, and the interquartile range IQR.
  • Quartiles: in statistics, all values are arranged from smallest to largest and divided into four equal parts; the values at the three division points are the quartiles.
  • The first quartile (Q1), also known as the lower quartile, is the value at the 25% position when all values in the sample are arranged from smallest to largest.
  • The second quartile (Q2), also known as the median, is the value at the 50% position when all values in the sample are arranged from smallest to largest.
  • The third quartile (Q3), also known as the upper quartile, is the value at the 75% position when all values in the sample are arranged from smallest to largest.
  • The difference between the third quartile and the first quartile is called the interquartile range (IQR).
  • Step 702 Filter out the sequencing data of unqualified samples whose sequencing depth lies outside the range from Q1 minus 1.5 times IQR to Q3 plus 1.5 times IQR, to obtain the sequencing data of the qualified samples.
  • The abscissa is the regional depth distribution, and the ordinate is the frequency of regions with the same depth.
  • In FIG. 8, the samples whose depth values are outliers are filtered out, leaving the samples with normal depth values; FIG. 9 is a schematic representation of the samples with normal depth values.
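  • Steps 701 and 702 amount to the usual 1.5 × IQR fence. The sketch below is one way to express them with the Python standard library; the function name `filter_samples_by_iqr`, the input layout (a mapping from sample name to sequencing depth), and the "inclusive" quantile definition are assumptions, since the text only says that a quartile function is used.

```python
from statistics import quantiles


def filter_samples_by_iqr(sample_depth):
    """Keep samples whose depth lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    sample_depth -- dict mapping sample name to its sequencing depth
    """
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = quantiles(sample_depth.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {s: d for s, d in sample_depth.items() if lower <= d <= upper}


depths = {"s1": 50, "s2": 52, "s3": 48, "s4": 51, "s5": 49, "s6": 200}
qualified = filter_samples_by_iqr(depths)
print(sorted(qualified))
```

  • With the illustrative depths above, the fence is 45.5 to 55.5, so only the sample with depth 200 is removed.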
  • the sequencing data of the unqualified STS region is filtered out, and the steps of obtaining the sequencing data of the qualified STS region include:
  • Step 1001 Sort the sequencing depth values of the same STS region across different samples in ascending order, and use a quartile function to determine the median, the upper quartile Q3, and the interquartile range IQR;
  • Step 1002 Filter out the sequencing data of unqualified STS regions whose median sequencing depth across different samples is 0 or is greater than Q3 plus 1.5 times IQR, to obtain the sequencing data of the qualified STS regions.
  • The abscissa is the depth distribution of the same region across different samples, and the ordinate is the frequency of regions with the same depth; FIG. 11 is a schematic diagram after the regions whose depth values are outliers have been filtered out.
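  • Steps 1001 and 1002 can be sketched in the same style. The text does not say over which values Q3 and IQR are taken; this sketch assumes they are computed over the per-region medians, because a region's own median can never exceed its own upper quartile. The function name `filter_regions` and the input layout are likewise hypothetical.

```python
from statistics import median, quantiles


def filter_regions(region_depths):
    """Drop regions whose median depth is 0 or above Q3 + 1.5*IQR.

    region_depths -- dict mapping region name to the list of depths of that
                     region across all samples
    Assumption: Q3 and IQR are taken over the medians of all regions.
    """
    medians = {r: median(d) for r, d in region_depths.items()}
    q1, _q2, q3 = quantiles(medians.values(), n=4, method="inclusive")
    upper = q3 + 1.5 * (q3 - q1)
    return {r: region_depths[r] for r, m in medians.items() if 0 < m <= upper}


regions = {
    "a": [50, 52, 48],
    "b": [0, 0, 3],        # median 0: not captured, removed
    "c": [55, 60, 58],
    "d": [45, 47, 46],
    "e": [52, 51, 53],
    "f": [500, 480, 510],  # far above the fence, removed
}
print(sorted(filter_regions(regions)))
```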
  • the sequencing data is analyzed by a mathematical statistics method, and according to the analysis conclusion, the steps of obtaining the result of the micro-deletion of the chromosome STS region in each sample include:
  • Step 1201 Normalize the sequencing depth value of the STS region of the sample to obtain a uniform depth value; wherein, the step of homogenizing the sequencing depth value of the STS region of the sample includes: dividing the depth value of all theizations.
  • Step 1202 Based on the normalized depth values of the STS regions of the samples, use a mathematical statistics method to detect outliers among the depth values of the STS regions of each sample, and obtain the result of the micro-deletion of the STS regions of the sample.
  • The specific steps of step 1202 include:
  • Step 1301 Calculate the mean and variance of the normalized depth values of the same region across all samples;
  • Step 1302 From that mean and variance, obtain the normal distribution curve of all non-outlier samples in the same region;
  • Step 1303 According to the normal distribution curve, calculate the probability value of each sample at its specific depth value in each region;
  • Step 1304 Set a first probability value threshold according to these probability values; if the probability value of a sample region at its specific depth value is smaller than the first probability value threshold, the result R1 that the sample region has a micro-deletion is obtained.
  • the method further includes:
  • Step 1401 Experimentally verify the micro-deletion result R1 of the sample regions and, according to the experimental verification result, set a second probability value threshold, wherein the second probability value threshold is smaller than the first probability value threshold;
  • Step 1402 If the probability value of a sample region at its specific depth value is smaller than the second probability value threshold, the result R2 that the sample region has a micro-deletion is obtained.
  • The depth value of a sample with a micro-deletion should be an outlier relative to the normal distribution of the depths of samples without the micro-deletion. An appropriate mathematical statistics method is therefore used to detect outliers in the depth of each region, and a region whose depth is an outlier is a micro-deleted region.
  • Whether the depth value of a region is an outlier is determined as follows: for each region remaining after outlier filtering of the depth values, take the data of the samples remaining after outlier filtering, and obtain from their mean and variance the normal distribution curve of all non-outlier sample data for the region. From this curve, obtain the probability value (p.value) of the region depth reaching a given value, and set an appropriate threshold (p.value-cutoff) for the probability values; a sample whose depth has a probability value below the threshold is judged to have a micro-deletion in this region.
  • The probability value threshold can be determined as follows: calculate, by the above method, the probability value of the depth of each region of each sample under the fitted normal distribution; adopt a relatively loose threshold so that more sample regions are first judged to have micro-deletions; then experimentally verify these sample regions to confirm whether they are micro-deletions.
  • The experimental verification can be done by designing suitable primers and performing a PCR reaction on the DNA of the sample region; whether a micro-deletion is present is judged from whether a normal PCR amplification product is obtained.
  • An appropriate probability value threshold can then be selected as the one giving the best false-positive and false-negative rates.
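  • The normal-distribution test can be sketched as follows for a single region. The names `call_by_normal`, `reference`, and `p_cutoff` are hypothetical, the probability value is taken to be the lower-tail probability of the fitted normal distribution, and the cutoff of 0.01 in the example is an arbitrary illustrative value; the text leaves both the exact tail and the cutoff to be chosen.

```python
from statistics import NormalDist, mean, stdev


def call_by_normal(depths, reference, p_cutoff):
    """Flag samples whose depth in one region is improbably low.

    depths    -- dict: sample name -> normalized depth in this region
    reference -- normalized depths of the non-outlier samples in this region,
                 used to fit the normal distribution
    p_cutoff  -- probability threshold (p.value-cutoff); assumed value
    Returns a dict of flagged samples and their probability values.
    """
    dist = NormalDist(mean(reference), stdev(reference))
    # Assumption: p.value is the lower-tail probability P(X <= depth).
    p_values = {s: dist.cdf(d) for s, d in depths.items()}
    return {s: p for s, p in p_values.items() if p < p_cutoff}


reference = [1.0, 1.1, 0.9, 1.05, 0.95]
depths = {"s1": 1.02, "s2": 0.05}
candidates = call_by_normal(depths, reference, p_cutoff=0.01)
print(sorted(candidates))
```

  • A looser cutoff flags more candidate regions for PCR verification; the stricter second threshold described in steps 1401 and 1402 would then be chosen from the verified calls.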
  • The step of using a mathematical statistics method to detect outliers among the depth values of the STS regions of the sample and obtaining the result of the micro-deletion of the STS regions of the sample includes:
  • Step 1501 Based on the normalized depth values of the STS regions of the samples, calculate the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples;
  • Step 1502 Based on the normalized depth values of the STS regions of the samples, calculate the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions;
  • Step 1503 Train on the ratio D/S with the ID3 algorithm to generate a first ratio threshold, and train on the ratio D/R with the ID3 algorithm to generate a second ratio threshold;
  • Step 1504 If the ratio D/S of the sample region is greater than the first ratio threshold, the result that the sample region has no micro-deletion is obtained;
  • Step 1505 If the ratio D/S of the sample region is smaller than the first ratio threshold, and the ratio D/R of the sample region is greater than the second ratio threshold, the result that the sample region has no micro-deletion is obtained;
  • Step 1506 If the ratio D/S of the sample region is smaller than the first ratio threshold, and the ratio D/R of the sample region is smaller than the second ratio threshold, the result that the sample region has a micro-deletion is obtained.
  • The ratio of the depth of a micro-deleted region to the median of the depths of all samples is generally small, and the ratio of the depth of the micro-deleted region to the median of the depths of all regions is generally also small, so, with appropriate parameters, depth outliers can be tested with a decision tree.
  • The specific steps of the decision tree test are as follows: calculate the ratio of the depth of the sample region under test to the median of the depths of all samples. If this ratio is greater than a threshold, the region is judged to have no micro-deletion. If it is smaller than the threshold, a further test is made: calculate the ratio of the depth of the sample region under test to the median of the depths of all regions. If this ratio is greater than a threshold, the region is judged to have no micro-deletion; if it is smaller than the threshold, the region is judged to have a micro-deletion.
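  • The two-level test can be written down directly. In this sketch the names `depth_ratios` and `has_microdeletion` are hypothetical, "the median of the depths of all samples" is read as the median of the same region across samples, "the median of the depths of all regions" as the median across the regions of the same sample, a ratio exactly equal to a threshold is treated as below it, and the thresholds of 0.5 are placeholders for values that would be trained as described.

```python
from statistics import median


def depth_ratios(depth, sample, region):
    """Return (D/S, D/R) for one sample region.

    depth -- nested dict: depth[sample][region] = normalized depth
    Assumes regions with a zero median were removed by earlier filtering.
    """
    d = depth[sample][region]
    across_samples = median(depth[s][region] for s in depth)
    across_regions = median(depth[sample].values())
    return d / across_samples, d / across_regions


def has_microdeletion(d_s, d_r, s_threshold, r_threshold):
    """Two-level decision tree: both ratios must be at or below threshold."""
    if d_s > s_threshold:
        return False  # depth is normal relative to other samples
    if d_r > r_threshold:
        return False  # depth is normal relative to other regions
    return True


depth = {
    "s1": {"r1": 1.0, "r2": 1.1, "r3": 0.9},
    "s2": {"r1": 0.05, "r2": 1.0, "r3": 1.0},
    "s3": {"r1": 1.0, "r2": 0.9, "r3": 1.1},
}
d_s, d_r = depth_ratios(depth, "s2", "r1")
print(d_s, d_r, has_microdeletion(d_s, d_r, 0.5, 0.5))
```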
  • The thresholds in the decision tree test are calculated by the Iterative Dichotomiser 3 (ID3) decision tree algorithm (Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997). The core of the ID3 algorithm is that, when an attribute is selected at a node of the decision tree, information gain is used as the selection criterion, so that the test at each non-leaf node yields the largest amount of category information about the record being tested.
  • The basic idea of the ID3 decision tree algorithm is as follows. In the first step, select an attribute as the root node of the decision tree, and create a branch of the tree for each value of the attribute.
  • In the second step, use this tree to classify the training data. If all instances at a leaf node belong to the same class, mark the node with that class. If all leaf nodes have class labels, the algorithm terminates.
  • In the third step, if there are leaf nodes without labels, select for such a node an attribute that has not yet appeared on the path from that node to the root, mark the node with this attribute, and create a branch of the tree for each value of the attribute; then repeat from the second step.
  • Different decision trees are generated when different attributes are selected in the first step, so selecting an appropriate attribute yields a simpler decision tree.
  • An information-based heuristic is usually adopted to decide which attribute to pick.
  • The heuristic selects the attribute that carries the most information, that is, the attribute that produces the decision tree with the fewest branches.
  • There are two attributes to choose from: T1 (D/S), the ratio of the depth of the sample region to the median of the depths of all samples, and T2 (D/R), the ratio of the depth of the sample region to the median of the depths of all regions.
  • The amount of information carried by an attribute is measured by the information gain, calculated as follows. Let D be the partition of the training tuples by class; the entropy of D is then $$\mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$ where $m$ is the number of classes and $p_i$ is the probability that the i-th class appears in the training tuples, estimated by dividing the number of elements in that class by the total number of training tuples. In practical terms, the entropy is the average amount of information required to identify the class label of a tuple in D.
  • Each time a split is needed, the ID3 algorithm calculates the information gain of each attribute and selects the attribute with the largest information gain for the split.
  • The data are first discretized. A simple method is to divide the values of an attribute Ai into those less than or equal to a split value and those greater than it. For any attribute, the number of distinct values in a data set is finite. If the attribute takes the values (v1, v2, ..., vn), there are n-1 candidate split values in total; the decision tree is then constructed from them.
  • the specific steps of determining the threshold by using the decision tree are:
  • Calculate the ratio T1, where T1 is the ratio of the normalized depth value of the sample region to the median of the depth values of all samples; sort the T1 values from smallest to largest to obtain V1, V2, V3, ..., Vn;
  • Calculate the ratio T2, where T2 is the ratio of the normalized depth value of the sample region to the median of the depth values of all regions; sort the T2 values from smallest to largest to obtain U1, U2, U3, ..., Un;
  • For each attribute A (T1 or T2), the information gain is the difference between the entropy of D before the split, info(D), and the expected entropy after splitting D on A, info_A(D): $$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$$
  • The maximum information gains of the two attributes are compared, and the attribute with the larger gain is taken as the root of the tree; the split value at which the gain is maximal is used as the threshold for classification, as shown in FIG. 17;
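  • Choosing a ratio threshold by information gain can be sketched for one continuous attribute as below. This is an illustrative reading of the procedure, not the patent's implementation: the function names `entropy` and `best_threshold` are hypothetical, candidate split values are taken as midpoints between adjacent sorted values (the text only says there are n-1 candidates), and the class labels are assumed to come from experimentally verified deletion status.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())


def best_threshold(values, labels):
    """Return (split value, information gain) of the best binary split.

    values -- attribute values, e.g. the D/S ratio of each training region
    labels -- class of each training region (True = verified micro-deletion)
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no split point between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        expected = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = base - expected
        if gain > best_gain:
            best_cut, best_gain = (lo + hi) / 2, gain
    return best_cut, best_gain


ratios = [0.05, 0.1, 0.9, 1.0, 1.1]
deleted = [True, True, False, False, False]
cut, gain = best_threshold(ratios, deleted)
print(cut, round(gain, 3))
```

  • Running the same function on the D/S and D/R ratios and comparing the two gains would give the attribute for the root of the tree and its threshold.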
  • The step of using a mathematical statistics method to detect outliers among the depth values of the STS regions of the sample and obtaining the result of the micro-deletion of the STS regions of the sample includes:
  • Step 1901 Calculate an average value and a variance of the normalized depth values of the same region of all the samples according to the uniformized depth values of the same region of all the samples obtained;
  • Step 1902 Obtain a normal distribution curve of all non-outlier samples in the same region according to an average value and a variance of the uniformized depth values of the same region of all the samples;
  • Step 1903 Calculate, according to the normal distribution curve, a probability value of each sample at a specific depth value in each region;
  • Step 1904 Set a third probability value threshold according to the probability value of each sample at its specific depth value in each region; if the probability value of a sample region at its specific depth value is smaller than the third probability value threshold, the result R3 that the sample region has a micro-deletion is obtained;
  • Step 1905 Calculate the ratio D/S of the normalized depth value of each sample region in the result R3 to the median of the depth values of all samples;
  • Step 1906 Calculate the ratio D/R of the normalized depth value of each sample region in the result R3 to the median of the depth values of all regions;
  • Step 1907 Train on the ratio D/S with the ID3 algorithm to generate a third ratio threshold, and train on the ratio D/R with the ID3 algorithm to generate a fourth ratio threshold;
  • Step 1908 If the ratio D/S of the sample region is greater than the third ratio threshold, the result that the sample region has no micro-deletion is obtained;
  • Step 1909 If the ratio D/S of the sample region is smaller than the third ratio threshold, and the ratio D/R of the sample region is greater than the fourth ratio threshold, obtaining a result that the sample region has no micro-deletion;
  • Step 1910 If the ratio D/S of the sample region is smaller than the third ratio threshold, and the ratio D/R of the sample region is smaller than the fourth ratio threshold, the result that the sample region has a micro-deletion is obtained.
  • A strategy combining the above two methods may also be adopted: first preset a relatively loose p.value threshold, then filter the resulting candidates with the decision tree to obtain the result of the micro-deletion of the chromosome STS regions.
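  • The combined strategy chains the two tests: a loose probability cutoff selects candidates, and the two ratio thresholds then confirm or reject each candidate. The sketch below is a compact illustration under the same assumptions as before (lower-tail probability, ratios at or below threshold count as low); the function name `combined_call` and all numeric values in the example are hypothetical.

```python
from statistics import NormalDist


def combined_call(depth, mu, sigma, d_s, d_r, loose_p, s_threshold, r_threshold):
    """Return True if a sample region is called as a micro-deletion.

    depth, mu, sigma -- normalized depth of the region and the mean and
                        standard deviation of the non-outlier samples
    d_s, d_r         -- the D/S and D/R ratios of the sample region
    loose_p          -- relatively loose probability cutoff (stage 1)
    s_threshold, r_threshold -- trained ratio thresholds (stage 2)
    """
    # Stage 1: loose normal-distribution screen.
    if NormalDist(mu, sigma).cdf(depth) >= loose_p:
        return False
    # Stage 2: decision-tree filter on the two ratios.
    return d_s <= s_threshold and d_r <= r_threshold


# Clear deletion, normal region, and a mild dip that stage 2 rejects.
print(combined_call(0.05, 1.0, 0.08, 0.05, 0.05, 0.05, 0.5, 0.5))
print(combined_call(1.00, 1.0, 0.08, 1.00, 1.00, 0.05, 0.5, 0.5))
print(combined_call(0.80, 1.0, 0.08, 0.80, 0.80, 0.05, 0.5, 0.5))
```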
  • In the method of the invention for detecting microdeletion based on the chromosomal sequence tag site STS region, capture probes are designed according to the DNA sequences of the STS regions so that the probes cover the STS regions of the whole chromosome. After hybridization with the multi-sample DNA hybrid library, the DNA sequences of the STS regions in multiple samples are captured, so that micro-deletions of reported or unreported STS-related regions on the chromosome can be detected at large scale, efficiently, and accurately.
  • The micro-deletion analysis may be performed by the normal distribution method, by the decision tree method, or by a combination of the two, and is finally verified by experiment.
  • The process is scientific, stable, and sensitive and has a low false-positive rate, so microdeletions can be analyzed effectively.
  • the following examples are intended to illustrate the invention and are not intended to limit the invention.
  • The operations in this embodiment are those understood by persons skilled in the art.
  • Reagents and consumables for which no manufacturer is indicated are conventional products available on the market.
  • This embodiment takes detection of deletions in the Y chromosome STS region as an example, but is not limited to the Y chromosome.
  • Ten samples from infertile individuals and one sample from a healthy individual are used together and, after library construction, hybridized to the same NimbleGen (Roche) chip.
  • The number of samples in this embodiment is illustrative and does not limit the number of samples in one experiment.
  • Genomic DNA fragmentation is a step of preparing a single-sample DNA library: physical or chemical methods are used to break genomic DNA into DNA fragments of a predetermined size, and the interrupted DNA fragments are recovered.
  • Using a Covaris-S2 ultrasonic interrupter (Covaris), 3 μg of YanHuang genomic DNA (http://yh.genomics.org.cn/) that was free of protein and RNA contamination and not degraded was fragmented.
  • After the fragmented DNA was checked by electrophoresis, it was recovered with the QIAquick PCR Purification Kit, and the sample was dissolved in 75 μL Elution Buffer.
  • Electrophoresis showed that the main band of the fragments was concentrated between 200 bp and 300 bp.
  • Fragment DNA end repair, that is, the step of preparing a single-sample DNA library in which the recovered DNA fragments are end-repaired with an enzyme to form blunt-ended, terminally phosphorylated DNA fragments, which are then recovered.
  • The above 100 μL reaction mixture was gently mixed and incubated at 20 °C for 30 minutes, then purified with the QIAquick PCR Purification Kit; the recovered DNA was finally dissolved in 32 μL ddH2O.
  • The "A"-addition reaction system for the end-repaired DNA fragments was prepared in a 1.5 ml centrifuge tube according to Table 5:
  • The above 50 μL reaction mixture was gently mixed and incubated at 37 °C for 30 minutes, then purified with the QIAquick PCR Purification Kit; the recovered DNA was finally dissolved in 15 μL ddH2O.
  • Ligation of the tag linker (Adapter), that is, the step of preparing the single-sample DNA library in which the recovered DNA fragments carrying the 3'-terminal "A" base are ligated to the tag linker (Index Adapter) under the action of an enzyme, and the DNA fragments carrying the tag linker are recovered.
  • Pre-hybridization PCR, that is, the step of preparing a single-sample DNA library in which the DNA fragments carrying the tag linker are amplified using primers matching the tag linker sequence, and the amplified product is recovered.
  • The DNA from step (4) is used as the amplification template, and primers containing the linker sequence are used for amplification.
  • the amplification system is shown in Table 7:
  • The PCR program was: 94 °C for 2 minutes; 4 cycles of 94 °C for 15 seconds, 62 °C for 30 seconds, and 72 °C for 30 seconds; then 72 °C for 5 minutes.
  • The PCR product was purified with the QIAquick PCR Purification Kit, with an elution volume of 30 μL.
  • Hybridization of the target region with the probe, that is, the step of the method for detecting microdeletion based on the chromosomal sequence tag site STS region in which the capture probe is hybridized with the multi-sample DNA hybrid library to capture the DNA sequences of the STS region in the multiple samples.
  • the hybridization method is described in NimbleGen Arrays User's Guide, Version 3.1, 7 Jul 2009, Roche NimbleGen, Inc.
  • The sample loading volume was 35 μL, and hybridization was carried out at 42 °C for 64-72 hours; the captured DNA was eluted with 90 ( ⁇ L 160 mM NaOH, the eluted product was purified with the MinElute PCR Purification Kit, and it was finally eluted with 80 μL Elution Buffer.
  • The captured library was amplified by PCR.
  • The reaction system contained 150 μL Phusion Mix and 4.2 μL each of the upstream and downstream primers (Multiplexing Sequencing Primers and PhiX Control Kit).
  • 85 μL ddH2O was added to the above 80 μL eluted sample, and the mixture was divided into 6 tubes for PCR with 16 cycles.
  • The 6 tubes were pooled, fragments of 300-450 bp were purified with QIAquick PCR Purification Kit magnetic beads, and the elution volume was 50 μL.
  • The Bioanalyzer analysis system (Agilent, Santa Clara, USA) was used to check the insert size and content of the library, and Q-PCR was used to accurately quantify the library concentration.
  • Sequencing and data analysis, that is, the steps of the method for detecting microdeletion based on the chromosomal sequence tag site STS region in which the DNA sequences of the STS region captured from the multiple samples by the corresponding capture probe are sequenced to obtain sequencing data, and the sequencing data are analyzed by a mathematical statistics method to obtain, from the analysis conclusion, the result of the micro-deletion of the chromosome STS region in each sample.
  • The library that passed quality control in step (9) is sequenced on the instrument.
  • The sequencing method follows the HiSeq 2000 User Guide, Catalog # SY-940-1001, Part # 15011190 Rev B, Illumina.
  • the information analysis process of the embodiment of the present invention is as follows:
  • Illumina Hiseq 2000 high throughput sequencing technology was employed.
  • the sequencing reads are filtered to remove unqualified reads.
  • unqualified reads include: a read in which the bases with a sequencing quality value below 5 make up more than 50% of all bases in the read; a read in which the number of N bases exceeds 10% of all bases in the read; and a read that, when aligned against the sequencing adapter sequences, is found to contain an adapter sequence.
  • Depth statistics for samples and regions, i.e. part of the quality-control step for the sequencing data of the STS-region DNA sequences in the multiple samples: the qualified multi-sample sequencing data are aligned to the reference genome sequence with short-read alignment software, and the sequencing-depth parameters of each sample and of the same STS region across different samples are computed.
  • the sequencing data obtained by high-throughput sequencing are aligned to the human reference genome sequence HG19 (http://genome.ucsc.edu/) with the SOAPaligner alignment program. After alignment, the depth of each region in each sample is computed.
  • regions whose median depth is 0 or greater than Q3 plus 1.5 times IQR (Q3 is the upper quartile and IQR is the interquartile range) are removed as outliers and are not tested for deletions.
  • the panel on the left is a histogram of the median region sequencing depth, and the box plot on the right summarizes the same median sequencing depths of the sample regions.
  • the histogram and the box plot show that the median depth of the sample regions lies mainly between 35X and 75X.
  • the step of detecting abnormal depth values of the sample STS regions by the mathematical-statistics method and obtaining the microdeletion result of the sample STS regions includes: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, a normal distribution curve of all non-outlier samples in the region; calculating, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; setting a first probability value threshold according to these probability values and, if the probability value of the region of a sample at its specific depth value is less than the first probability value threshold, obtaining the result R1 that the sample region has a microdeletion; experimentally verifying the microdeletion result R1 of the sample
  • the data of the samples remaining after outlier filtering are taken, and the normal distribution curve of all non-outlier sample data of the region is obtained from their mean and variance. From this curve, the probability value (p-value) that the depth of the region equals a given value is obtained.
  • the abscissa of the graph is the sample and the ordinate is -log10(p-value); the graph in Fig. 23 shows the relationship between the expected and the observed p-values.
  • in this way the probability value of the depth of each region in each sequenced sample is obtained.
  • 20 sample regions with low probability values were verified by PCR.
  • the threshold value of the micro-deletion test was determined according to the verification result, and the accuracy of the micro-deletion test was evaluated. The specific steps are as follows:
  • the step of detecting abnormal depth values of the sample STS regions by the mathematical-statistics method and obtaining the microdeletion result of the sample STS regions includes: calculating, from the normalized depth value of the STS region of the sample, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; calculating the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; training a first ratio threshold on the ratio D/S and a second ratio threshold on the ratio D/R with the ID3 algorithm; if the ratio D/S of the sample region is greater than the first ratio threshold, the result that the sample
  • obtaining the final microdeletion result for the STS regions of the sample includes: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, a normal distribution curve of all non-outlier samples in the region; calculating, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; setting a third probability value threshold according to these probability values; if the probability value of the sample
  • ratio D/R of the value to the median of the depth values of all regions; a third ratio threshold is trained on the ratio D/S and a fourth ratio threshold on the ratio D/R with the ID3 algorithm; if the ratio D/S of the sample region is greater than the third ratio threshold, the result that the sample region has no microdeletion is obtained; if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is greater than the fourth ratio threshold, the result that the sample region has no microdeletion is obtained; if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is less than the fourth ratio threshold, the result that the sample region has a microdeletion is obtained.
  • the present invention also provides a computer readable medium carrying a series of instructions to control a computer processor to perform the method as described above, and no further details are provided herein.
  • Fig. 25 is a view showing the configuration of an apparatus for detecting microdeletion based on the chromosomal sequence tag site STS region of the present invention.
  • the apparatus includes: a capture probe obtaining module 2501, a hybridization module 2502, a sequencing data obtaining module 2503, and a microdeletion result obtaining module 2504.
  • the capture probe acquisition module 2501 is configured to select an STS region on the chromosome, and a corresponding capture probe is designed according to the DNA sequence of the STS region.
  • the STS (sequence-tagged site) sequence tag site is a short, single-copy DNA sequence that is well-defined on the genome, can be used as a landmark and can be amplified by PCR, and is used to generate mapping sites.
  • a sequence of STS can be used to map the genomic region.
  • the probe is a small stretch of single-stranded DNA or RNA fragments (approximately 20 to 500 bp) used to detect nucleic acid sequences complementary thereto.
  • a hybridization module 2502 is used to hybridize the capture probe with a multi-sample mixed DNA library to capture the DNA sequences of the STS regions in the multiple samples.
  • a DNA library refers to recombinant DNA molecules that are constructed from DNA of a particular source by cell-based DNA cloning, contain the DNA fragments of interest, and are transformed into bacteria to form the library.
  • the mixed DNA library here refers to the DNA fragments of all samples mixed together.
  • the sequencing data obtaining module 2503 is configured to sequence the DNA sequence of the STS region in the plurality of samples of the captured respective capture probes to obtain sequencing data.
  • the microdeletion result obtaining module 2504 is configured to analyze the sequencing data by a mathematical-statistics method and to obtain the microdeletion result for the chromosomal STS regions in each sample from the analysis result.
  • the capture probe obtaining module 2501 includes: a region search unit for searching a genome database (GDB) for the DNA sequences of the STS regions on a chromosome; and a sequence selection unit, configured to select, from the DNA sequences found, the sequences that meet the capture-probe design conditions, from which the capture probe is designed and synthesized.
  • GDB genome database
  • the genomic database (GDB, http://www.gdb.org/) is a database for the preservation and processing of genomic maps by the Human Genome Project (HGP).
  • GDB is an encyclopedia that builds on the human genome.
  • methods for describing genomic content at the sequence level have been developed, including sequence variations and other descriptions of functions and phenotypes.
  • the GDB database stores data in an object model, providing a web-based data object retrieval service that allows users to search for various types of objects and graphically view the genome map.
  • the DNA sequence of the STS region on the Y chromosome can be found from the position coordinates.
  • the positional coordinates of the DNA sequence of the STS region on the chromosome are searched in the genome database, and the coordinates are submitted to Roche-NimbleGen or other companies providing DNA capture services, and the capture probe design is completed by these companies. And synthesis.
  • the device further comprises a multi-sample mixed DNA library preparation module.
  • the multi-sample mixed DNA library preparation module comprises: a single-sample DNA library preparation unit, configured to prepare a plurality of single-sample DNA libraries that carry different adapters and have passed quality control; a single-sample library mixing unit for mixing the plurality of single-sample DNA libraries in a predetermined ratio; and a multi-sample mixed DNA library obtaining unit for checking whether the quality of the mixed multi-sample DNA library is acceptable, in which case it is the prepared multi-sample mixed DNA library.
  • the device further includes a sequencing data quality control module, which includes: a qualified sequence obtaining unit, configured to filter out unqualified data from the sequencing data of the STS-region DNA sequences in the multiple samples to obtain qualified multi-sample sequencing data; a sequencing depth statistics unit for aligning the qualified multi-sample sequencing data to the reference genome sequence with short-read alignment software and computing the sequencing-depth parameters of each sample and of the same STS region across different samples; a qualified sample obtaining unit for filtering out the sequencing data of unqualified samples according to the per-sample sequencing-depth parameters to obtain the sequencing data of qualified samples; and a qualified region obtaining unit, configured to filter out the sequencing data of unqualified STS regions according to the sequencing-depth parameters of the same STS region across different samples to obtain the sequencing data of qualified STS regions.
  • the microdeletion result obtaining module includes: a depth value normalization unit, configured to normalize the sequencing depth values of the STS regions of the samples to obtain normalized depth values; and a microdeletion result obtaining unit, configured to detect, from the normalized depth values of the STS regions of the samples, abnormal depth values of the sample STS regions by a mathematical-statistics method and to obtain the microdeletion result of the sample STS regions.
  • the depth value normalization unit is specifically configured to divide the depth value of the same region in all samples by the mean of each sample's depth values to obtain the normalized depth value of the sample region.
  • the microdeletion result obtaining unit includes:
  • a mean and variance obtaining unit 2601, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit 2602, configured to obtain, from that mean and variance, a normal distribution curve of all non-outlier samples in the region; and a probability value calculation unit 2603, configured to calculate, from the normal distribution curve, the probability value of each sample in each region at its specific depth value;
  • a first determining unit 2604, configured to set a first probability value threshold according to the probability value of each sample in the corresponding region at its specific depth value and, if the probability value of the region of a sample at its specific depth value is less than the first probability value threshold, to obtain the result R1 that the sample region has a microdeletion.
  • the microdeletion result obtaining unit further includes: a probability value threshold determining unit 2701, configured to experimentally verify the result R1 that the sample region has a microdeletion and to set a second probability value threshold according to the verification result, the second probability value threshold being smaller than the first probability value threshold;
  • the second determining unit 2702 is configured to obtain a result R2 in which the sample region has a micro-deletion if the probability value of the sample region at a specific depth value is less than the second probability value threshold.
  • the microdeletion result obtaining unit includes: a ratio D/S obtaining unit 2801, configured to calculate, from the normalized depth value of the STS region of the sample, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; a ratio D/R obtaining unit 2802 for calculating, from the normalized depth value of the STS region of the sample, the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; a first and second ratio threshold obtaining unit 2803, configured to train a first ratio threshold on the ratio D/S and a second ratio threshold on the ratio D/R with the ID3 algorithm;
  • the first determining unit 2804 is configured to obtain a result that the sample region has no micro-deletion if the ratio D/S of the sample region is greater than the first ratio threshold;
  • the second determining unit 2805 is configured to obtain, if the ratio D/S of the sample region is smaller than the first ratio threshold, and the ratio D/R of the sample region is greater than the second ratio threshold, obtain a result that the sample region has no micro-deletion;
  • the third determining unit 2806 is configured to obtain a result that the sample region has a micro-deletion if the ratio D/S of the sample region is smaller than the first ratio threshold, and the ratio D/R of the sample region is smaller than the second ratio threshold.
  • the microdeletion result obtaining unit includes: a mean and variance obtaining unit 2901, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit 2902, configured to obtain, from that mean and variance, a normal distribution curve of all non-outlier samples in the region; a probability value calculation unit 2903, configured to calculate, from the normal distribution curve, the probability value of each sample in each region at its specific depth value;
  • a first determining unit 2904 configured to set a third probability value threshold according to a probability value of each sample at a specific depth value in each region, if the probability value of the sample region at a specific depth value is smaller than the third a probability value threshold, obtaining a result R3 of the sample region having a micro-deletion;
  • the ratio D/S obtaining unit 2905 is configured to calculate a ratio of the depth value of the sample region normalized in the result R3 to the median of the depth values of all samples D/S;
  • a ratio D/R obtaining unit 2906, configured to calculate the ratio D/R of the normalized depth value of the sample region in the result R3 to the median of the depth values of all regions;
  • the third and fourth ratio threshold obtaining unit 2907 is configured to train a third ratio threshold on the ratio D/S and a fourth ratio threshold on the ratio D/R with the ID3 algorithm;
  • a second determining unit 2908, configured to obtain a result that the sample region has no micro-deletion if the ratio D/S of the sample region is greater than the third ratio threshold;
  • the third determining unit 2909 is configured to obtain, if the ratio D/S of the sample region is smaller than the third ratio threshold, and the ratio D/R of the sample region is greater than the fourth ratio threshold, obtain a result that the sample region has no micro-deletion;
  • the fourth determining unit 2910 is configured to obtain a result that the sample region has a micro-deletion if the ratio D/S of the sample region is smaller than the third ratio threshold, and the ratio D/R of the sample region is smaller than the fourth ratio threshold.
  • with the device of the present invention for detecting microdeletions in chromosomal STS regions, capture probes are designed according to the DNA sequences of the STS regions and cover the STS regions of the whole chromosome; after hybridization with the multi-sample mixed DNA library, the captured DNA sequences of the STS regions in the multiple samples allow reported or unreported microdeletions of STS-related regions on the chromosome to be detected in large numbers, efficiently and accurately.
  • the mathematical-statistics analysis of the present invention can use the normal-distribution method, the decision-tree analysis method, or a combination of the two, with final verification by experiment.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

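A method and a device for detecting microdeletions in chromosomal sequence-tagged site (STS) regions. The method comprises: selecting STS regions on a chromosome and designing corresponding capture probes according to the DNA sequences of the STS regions; hybridizing the capture probes with a multi-sample mixed DNA library to capture the DNA sequences of the STS regions in the multiple samples; sequencing the captured DNA sequences of the STS regions in the multiple samples to obtain sequencing data; and analyzing the sequencing data by a mathematical-statistics method and, according to the analysis result, obtaining the result of microdeletions of the chromosomal STS regions in each sample.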

Description

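Method and device for detecting microdeletions in chromosomal STS regions
[Technical Field]
The present invention relates to the technical field of genetic engineering, and in particular to a method and a device for detecting microdeletions in chromosomal sequence-tagged site (STS) regions.
[Background Art]
A deletion is the loss of part of a chromosome or DNA molecule in the genome, and it is an important cause of gene mutation.
At present, microdeletions in chromosomal STS regions are detected mainly by PCR (Polymerase Chain Reaction). PCR selectively amplifies a specific region of DNA in vitro by mimicking DNA replication in vivo. When only a few sites are tested, PCR detection is fast and convenient. However, primer design requires prior knowledge of the sequences flanking the specific DNA region, so the microdeletions of these chromosomal STS regions must already have been reported.
PCR detection is therefore severely limited when faced with large numbers of samples or unreported microdeletions. Moreover, when many STS sites must be tested, and especially when deletions across an entire chromosome are to be examined, conventional PCR can no longer meet the need, and a new technique is required.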
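[Summary of the Invention]
The main technical problem solved by the present invention is to provide a method and a device for detecting microdeletions in chromosomal sequence-tagged site (STS) regions that can detect microdeletions in a large number of chromosomal STS regions at limited cost and can also detect unreported microdeletions in chromosomal STS regions.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for detecting microdeletions in chromosomal STS regions, comprising: selecting STS regions on a chromosome and designing corresponding capture probes according to the DNA sequences of the STS regions; hybridizing the capture probes with a multi-sample mixed DNA library to capture the DNA sequences of the STS regions in the multiple samples; sequencing the captured DNA sequences of the STS regions in the multiple samples to obtain sequencing data; and analyzing the sequencing data by a mathematical-statistics method and, according to the analysis result, obtaining the result of microdeletions of the chromosomal STS regions in each sample.
Wherein, the step of analyzing the sequencing data by a mathematical-statistics method and obtaining the microdeletion result comprises: normalizing the sequencing depth values of the STS regions of the samples to obtain normalized depth values; and, from the normalized depth values of the STS regions of the samples, detecting abnormal depth values of the sample STS regions by a mathematical-statistics method and obtaining the result of microdeletions of the sample STS regions.
Wherein, the step of normalizing the sequencing depth values of the STS regions of the samples comprises: dividing the depth value of the same region in all samples by the mean of each sample's depth values to obtain the normalized depth value of the sample region.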
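The normalization step can be illustrated with a minimal Python sketch. It assumes the raw depths are held in a nested dictionary (sample, then region), and the region names `STS_A`, `STS_B` and `STS_C` are hypothetical placeholders; the patent does not prescribe a data structure.

```python
from statistics import mean


def normalize_depths(depth):
    """Divide every region depth of a sample by that sample's mean depth.

    depth: dict mapping sample name -> dict mapping region name -> raw depth.
    Samples whose mean depth is 0 are skipped, since they cannot be normalized.
    """
    normalized = {}
    for sample, regions in depth.items():
        sample_mean = mean(regions.values())
        if sample_mean == 0:
            continue
        normalized[sample] = {r: d / sample_mean for r, d in regions.items()}
    return normalized


# Hypothetical example: S2 was sequenced twice as deeply as S1.
raw = {
    "S1": {"STS_A": 40.0, "STS_B": 60.0, "STS_C": 50.0},
    "S2": {"STS_A": 80.0, "STS_B": 120.0, "STS_C": 100.0},
}
norm = normalize_depths(raw)
```

After normalization the two samples become directly comparable even though their overall sequencing depths differ.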
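Wherein, the step of detecting abnormal depth values of the sample STS regions by a mathematical-statistics method from the normalized depth values and obtaining the microdeletion result comprises: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, the normal distribution curve of all non-outlier samples in the region; calculating, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; and setting a first probability value threshold according to the probability value of each sample in the corresponding region at its specific depth value, wherein if the probability value of the region of a sample at its specific depth value is less than the first probability value threshold, the result R1 that the sample region has a microdeletion is obtained.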
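A hedged sketch of the normal-distribution test follows. The text does not define exactly what "probability value at a specific depth value" means; the sketch assumes it is the lower-tail probability P(X <= depth) under the fitted normal curve, and the threshold `1e-3` is a hypothetical value, not one taken from the patent.

```python
from statistics import NormalDist, mean, stdev


def fit_region(non_outlier_values):
    """Fit a normal curve to the normalized depths of the non-outlier samples."""
    return NormalDist(mu=mean(non_outlier_values), sigma=stdev(non_outlier_values))


def depth_probability(dist, depth):
    """Assumed definition: lower-tail probability of observing a depth this low."""
    return dist.cdf(depth)


def call_r1(region_values, non_outlier_values, threshold):
    """Return the samples whose probability value falls below the threshold (result R1)."""
    dist = fit_region(non_outlier_values)
    return {s for s, d in region_values.items() if depth_probability(dist, d) < threshold}


# Hypothetical normalized depths of one STS region across six samples.
region_values = {"S1": 0.95, "S2": 1.00, "S3": 1.05, "S4": 0.98, "S5": 1.02, "S6": 0.10}
non_outliers = [v for s, v in region_values.items() if s != "S6"]
r1 = call_r1(region_values, non_outliers, threshold=1e-3)
```

Fitting on non-outlier samples only keeps a deleted sample from inflating the variance and hiding itself.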
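Wherein, after the step of setting the first probability value threshold according to the probability value of each sample in the corresponding region at its specific depth value and obtaining the result R1 that the sample region has a microdeletion if the probability value is less than the first probability value threshold, the method further comprises: experimentally verifying the result R1 that the sample region has a microdeletion and setting a second probability value threshold according to the verification result, the second probability value threshold being smaller than the first probability value threshold; and, if the probability value of the sample region at its specific depth value is less than the second probability value threshold, obtaining the result R2 that the sample region has a microdeletion.
Wherein, the step of detecting abnormal depth values of the sample STS regions by a mathematical-statistics method from the normalized depth values and obtaining the microdeletion result comprises: calculating, from the normalized depth value of the STS region of the sample, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; calculating, from the normalized depth value of the STS region of the sample, the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; training a first ratio threshold on the ratio D/S and a second ratio threshold on the ratio D/R with the ID3 algorithm; if the ratio D/S of the sample region is greater than the first ratio threshold, obtaining the result that the sample region has no microdeletion; if the ratio D/S of the sample region is less than the first ratio threshold and the ratio D/R of the sample region is greater than the second ratio threshold, obtaining the result that the sample region has no microdeletion; and if the ratio D/S of the sample region is less than the first ratio threshold and the ratio D/R of the sample region is less than the second ratio threshold, obtaining the result that the sample region has a microdeletion.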
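The two-ratio decision rule can be sketched as below. Two readings are assumed here: D/S divides by the median of the same region across all samples, and D/R divides by the median of all regions within the same sample. The thresholds `0.5` are hypothetical, and the text does not say how equality with a threshold is handled, so this sketch treats "not greater" as "less".

```python
from statistics import median


def ratios(norm, sample, region):
    """Return (D/S, D/R) for one sample region from normalized depths.

    norm: dict sample -> dict region -> normalized depth. Regions with a
    median of 0 are assumed to have been removed by earlier quality control.
    """
    d = norm[sample][region]
    across_samples = median(norm[s][region] for s in norm)  # same region, all samples
    across_regions = median(norm[sample].values())          # same sample, all regions
    return d / across_samples, d / across_regions


def classify(d_s, d_r, t1, t2):
    """Decision rule described in the text for the two trained thresholds."""
    if d_s > t1:
        return "no microdeletion"
    if d_r > t2:
        return "no microdeletion"
    return "microdeletion"


# Hypothetical normalized depths: region A is nearly absent in sample S3.
norm = {
    "S1": {"A": 1.0, "B": 1.0, "C": 1.0},
    "S2": {"A": 1.0, "B": 1.0, "C": 1.0},
    "S3": {"A": 0.05, "B": 1.0, "C": 1.0},
}
call_s3 = classify(*ratios(norm, "S3", "A"), t1=0.5, t2=0.5)
call_s1 = classify(*ratios(norm, "S1", "A"), t1=0.5, t2=0.5)
```

Requiring both ratios to be low guards against calling a deletion when a region is poorly captured in every sample or a sample is poorly covered in every region.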
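Wherein, the step of detecting abnormal depth values of the sample STS regions by a mathematical-statistics method from the normalized depth values and obtaining the microdeletion result comprises: calculating the mean and variance of the normalized depth values of the same region across all samples; obtaining, from that mean and variance, the normal distribution curve of all non-outlier samples in the region; calculating, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; setting a third probability value threshold according to these probability values and, if the probability value of a sample region at its specific depth value is less than the third probability value threshold, obtaining the result R3 that the sample region has a microdeletion; calculating the ratio D/S of the normalized depth value of each sample region in the result R3 to the median of the depth values of all samples; calculating the ratio D/R of the normalized depth value of each sample region in the result R3 to the median of the depth values of all regions; training a third ratio threshold on the ratio D/S and a fourth ratio threshold on the ratio D/R with the ID3 algorithm; if the ratio D/S of the sample region is greater than the third ratio threshold, obtaining the result that the sample region has no microdeletion; if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is greater than the fourth ratio threshold, obtaining the result that the sample region has no microdeletion; and if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is less than the fourth ratio threshold, obtaining the result that the sample region has a microdeletion.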
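The text says the ratio thresholds are trained with the ID3 algorithm but does not spell out how a continuous ratio is turned into a cut point. The sketch below therefore shows only the information-gain criterion that ID3 uses, applied to candidate cut points at the midpoints between sorted values; the midpoint strategy and the labeled training data are assumptions for illustration.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Return (cut, gain) for the cut point with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point between identical values
        cut = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain


# Hypothetical training set: D/S ratios of candidate regions, labeled True when
# an experiment confirmed the deletion.
d_s_values = [0.05, 0.10, 0.20, 0.90, 1.00, 1.10]
confirmed = [True, True, True, False, False, False]
cut, gain = best_threshold(d_s_values, confirmed)
```

The same function would be applied separately to the D/R ratios to obtain the second cut point of the two-level tree.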
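Wherein, the step of selecting STS regions on the chromosome and designing and synthesizing corresponding capture probes according to the DNA sequences of the STS regions comprises: searching a genome database for the DNA sequences of the STS regions on the chromosome; selecting, from the DNA sequences of the STS regions found, the sequences that meet the capture-probe design conditions; and designing and synthesizing the capture probes from the selected sequences.
Wherein, the preparation of the multi-sample mixed DNA library comprises: preparing multiple single-sample DNA libraries that carry different adapters and have passed quality control; mixing the multiple single-sample DNA libraries in a predetermined ratio; and checking whether the quality of the mixed multi-sample DNA library is acceptable, in which case it is the prepared multi-sample mixed DNA library.
Wherein, the preparation of a single-sample DNA library comprises: fragmenting genomic DNA into DNA fragments of a predetermined size by a physical or chemical method and recovering the fragments; end-repairing the recovered DNA fragments enzymatically to form blunt-ended, terminally phosphorylated DNA fragments and recovering them; enzymatically adding an "A" base to the 3' end of the recovered blunt-ended DNA fragments and recovering the A-tailed fragments; ligating the recovered A-tailed DNA fragments enzymatically to the index adapter (Index Adapter) and recovering the adapter-ligated DNA fragments; amplifying the adapter-ligated DNA fragments using primers for the index adapter sequence and recovering the amplification product; and checking whether the amplification product passes quality control, in which case it is the prepared single-sample DNA library.
Wherein, after the step of sequencing the captured DNA sequences of the STS regions in the multiple samples to obtain sequencing data, the method further comprises: performing quality control on the sequencing data of the DNA sequences of the STS regions in the multiple samples.
Wherein, the step of performing quality control on the sequencing data of the DNA sequences of the STS regions in the multiple samples comprises: filtering out unqualified data from the sequencing data to obtain qualified multi-sample sequencing data; aligning the qualified multi-sample sequencing data to the reference genome sequence with short-read alignment software, and computing the sequencing-depth parameters of each sample and of the same STS region across different samples; filtering out the sequencing data of unqualified samples according to the per-sample sequencing-depth parameters to obtain the sequencing data of qualified samples; and filtering out the sequencing data of unqualified STS regions according to the sequencing-depth parameters of the same STS region across different samples to obtain the sequencing data of qualified STS regions.
Wherein, the step of filtering out unqualified data from the sequencing data of the DNA sequences of the STS regions in the multiple samples to obtain qualified multi-sample sequencing data comprises: filtering by sequencing quality according to the proportion of low-quality bases, wherein a read is judged unqualified if its number of low-quality bases exceeds a predetermined proportion of all bases in the read, and the unqualified reads are filtered out to obtain a preliminarily qualified first sequencing data set; judging a read in the first sequencing data set unqualified if its number of undetermined bases exceeds 10% of all bases in the read, and filtering out such reads to obtain a preliminarily qualified second sequencing data set; aligning all reads in the second sequencing data set against a library of sequencing adapter sequences, judging a read unqualified if it contains a sequencing adapter sequence, and filtering out such reads to obtain a preliminarily qualified third sequencing data set; and aligning all reads in the third sequencing data set against all exogenous sequences introduced in the experiment, judging a read unqualified if it contains an exogenous sequence, and filtering out such reads to obtain the qualified multi-sample sequencing data.
Wherein, the step of filtering out the sequencing data of unqualified samples according to the per-sample sequencing-depth parameters to obtain the sequencing data of qualified samples comprises: sorting the sequencing depth values of all samples in ascending order and using a quartile function to determine the lower quartile Q1, the upper quartile Q3 and the interquartile range IQR of the sorted values; and filtering out the sequencing data of unqualified samples whose sequencing depth value lies outside the range from Q1 minus 1.5 times IQR to Q3 plus 1.5 times IQR, to obtain the sequencing data of qualified samples.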
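The sample-level outlier filter can be sketched with the Python standard library. The patent does not state which quartile convention its "quartile function" uses, so the `inclusive` method of `statistics.quantiles` is an assumption, as are the sample names and depth values.

```python
from statistics import quantiles


def iqr_fences(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a list of depth values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def filter_samples(sample_depth):
    """Keep only samples whose depth lies inside the IQR fences."""
    low, high = iqr_fences(list(sample_depth.values()))
    return {s: d for s, d in sample_depth.items() if low <= d <= high}


# Hypothetical per-sample sequencing depths; S7 is an obvious outlier.
sample_depth = {"S1": 40, "S2": 45, "S3": 50, "S4": 55, "S5": 60, "S6": 65, "S7": 300}
kept = filter_samples(sample_depth)
```

With these values Q1 is 47.5 and Q3 is 62.5, so the fences are 25 and 85 and only S7 is removed.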
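Wherein, the step of filtering out the sequencing data of unqualified STS regions according to the sequencing-depth parameters of the same STS region across different samples to obtain the sequencing data of qualified STS regions comprises: sorting the sequencing depth values of the same STS region across different samples in ascending order and using a quartile function to determine the median, the upper quartile Q3 and the interquartile range IQR of the sorted values; and filtering out the sequencing data of unqualified STS regions whose median sequencing depth across samples is 0 or greater than Q3 plus 1.5 times IQR, to obtain the sequencing data of qualified STS regions.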
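The region-level filter admits more than one reading. The sketch below assumes that each region is first summarized by its median depth across samples and that Q3 and IQR are then taken over those region medians; the region names and depths are hypothetical, and the `inclusive` quartile method is again an assumption.

```python
from statistics import median, quantiles


def filter_regions(region_depths):
    """Drop regions whose median depth is 0 or above Q3 + 1.5*IQR.

    region_depths: dict region -> list of depths of that region across samples.
    Returns a dict of the retained regions and their median depths.
    """
    medians = {r: median(d) for r, d in region_depths.items()}
    q1, _, q3 = quantiles(list(medians.values()), n=4, method="inclusive")
    upper = q3 + 1.5 * (q3 - q1)
    return {r: m for r, m in medians.items() if m != 0 and m <= upper}


# Hypothetical depths of seven regions in three samples.
region_depths = {
    "A": [50, 52, 48],
    "B": [55, 60, 58],
    "C": [0, 0, 0],        # never captured: median 0
    "D": [45, 47, 44],
    "E": [500, 480, 510],  # abnormally high depth
    "F": [62, 60, 61],
    "G": [53, 50, 51],
}
qualified = filter_regions(region_depths)
```

Only an upper fence is applied here, matching the text, because a low depth in a single sample is exactly the signal the later analysis looks for.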
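To solve the above technical problem, another technical solution adopted by the present invention is to provide a device for detecting microdeletions in chromosomal sequence-tagged site (STS) regions, comprising: a capture probe obtaining module, configured to select STS regions on a chromosome and design corresponding capture probes according to the DNA sequences of the STS regions; a hybridization module, configured to hybridize the capture probes with a multi-sample mixed DNA library to capture the DNA sequences of the STS regions in the multiple samples; a sequencing data obtaining module, configured to sequence the captured DNA sequences of the STS regions in the multiple samples to obtain sequencing data; and a microdeletion result obtaining module, configured to analyze the sequencing data by a mathematical-statistics method and, according to the analysis result, obtain the result of microdeletions of the chromosomal STS regions in each sample.
Wherein, the microdeletion result obtaining module comprises: a depth value normalization unit, configured to normalize the sequencing depth values of the STS regions of the samples to obtain normalized depth values; and a microdeletion result obtaining unit, configured to detect, from the normalized depth values of the STS regions of the samples, abnormal depth values of the sample STS regions by a mathematical-statistics method and to obtain the result of microdeletions of the sample STS regions.
Wherein, the depth value normalization unit is specifically configured to divide the depth value of the same region in all samples by the mean of each sample's depth values to obtain the normalized depth value of the sample region.
Wherein, the microdeletion result obtaining unit comprises: a mean and variance obtaining unit, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit, configured to obtain, from that mean and variance, the normal distribution curve of all non-outlier samples in the region; a probability value calculation unit, configured to calculate, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; and a first determining unit, configured to set a first probability value threshold according to the probability value of each sample in the corresponding region at its specific depth value and, if the probability value of the region of a sample at its specific depth value is less than the first probability value threshold, to obtain the result R1 that the sample region has a microdeletion.
Wherein, the microdeletion result obtaining unit further comprises: a probability value threshold determining unit, configured to experimentally verify the result R1 that the sample region has a microdeletion and to set a second probability value threshold according to the verification result, the second probability value threshold being smaller than the first probability value threshold; and a second determining unit, configured to obtain the result R2 that the sample region has a microdeletion if the probability value of the sample region at its specific depth value is less than the second probability value threshold.
Wherein, the microdeletion result obtaining unit comprises: a ratio D/S obtaining unit, configured to calculate, from the normalized depth value of the STS region of the sample, the ratio D/S of the normalized depth value of the sample region to the median of the depth values of all samples; a ratio D/R obtaining unit, configured to calculate, from the normalized depth value of the STS region of the sample, the ratio D/R of the normalized depth value of the sample region to the median of the depth values of all regions; a first and second ratio threshold obtaining unit, configured to train a first ratio threshold on the ratio D/S and a second ratio threshold on the ratio D/R with the ID3 algorithm; a first determining unit, configured to obtain the result that the sample region has no microdeletion if the ratio D/S of the sample region is greater than the first ratio threshold; a second determining unit, configured to obtain the result that the sample region has no microdeletion if the ratio D/S of the sample region is less than the first ratio threshold and the ratio D/R of the sample region is greater than the second ratio threshold; and a third determining unit, configured to obtain the result that the sample region has a microdeletion if the ratio D/S of the sample region is less than the first ratio threshold and the ratio D/R of the sample region is less than the second ratio threshold.
Wherein, the microdeletion result obtaining unit comprises: a mean and variance obtaining unit, configured to calculate the mean and variance of the normalized depth values of the same region across all samples; a normal distribution curve obtaining unit, configured to obtain, from that mean and variance, the normal distribution curve of all non-outlier samples in the region; a probability value calculation unit, configured to calculate, from the normal distribution curve, the probability value of each sample in each region at its specific depth value; a first determining unit, configured to set a third probability value threshold according to these probability values and, if the probability value of a sample region at its specific depth value is less than the third probability value threshold, to obtain the result R3 that the sample region has a microdeletion; a ratio D/S obtaining unit, configured to calculate the ratio D/S of the normalized depth value of each sample region in the result R3 to the median of the depth values of all samples; a ratio D/R obtaining unit, configured to calculate the ratio D/R of the normalized depth value of each sample region in the result R3 to the median of the depth values of all regions; a third and fourth ratio threshold obtaining unit, configured to train a third ratio threshold on the ratio D/S and a fourth ratio threshold on the ratio D/R with the ID3 algorithm; a second determining unit, configured to obtain the result that the sample region has no microdeletion if the ratio D/S of the sample region is greater than the third ratio threshold; a third determining unit, configured to obtain the result that the sample region has no microdeletion if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is greater than the fourth ratio threshold; and a fourth determining unit, configured to obtain the result that the sample region has a microdeletion if the ratio D/S of the sample region is less than the third ratio threshold and the ratio D/R of the sample region is less than the fourth ratio threshold.
Wherein, the capture probe obtaining module comprises: a region search unit, configured to search a genome database for the DNA sequences of the STS regions on the chromosome; a sequence selection unit, configured to, among the found [...] configured to design and synthesize the capture probes from the selected sequences that meet the capture-probe design conditions.
Wherein, the device further comprises a multi-sample mixed DNA library preparation module, which comprises: a single-sample DNA library preparation unit, configured to prepare multiple single-sample DNA libraries that carry different adapters and have passed quality control; a single-sample library mixing unit, configured to mix the multiple single-sample DNA libraries in a predetermined ratio; and a multi-sample mixed DNA library obtaining unit, configured to check whether the quality of the mixed multi-sample DNA library is acceptable, in which case it is the prepared multi-sample mixed DNA library.
Wherein, the device further comprises a sequencing data quality control module, which comprises: a qualified sequence obtaining unit, configured to filter out unqualified data from the sequencing data of the DNA sequences of the STS regions in the multiple samples to obtain qualified multi-sample sequencing data; a sequencing depth statistics unit, configured to align the qualified multi-sample sequencing data to the reference genome sequence with short-read alignment software and to compute the sequencing-depth parameters of each sample and of the same STS region across different samples; a qualified sample obtaining unit, configured to filter out the sequencing data of unqualified samples according to the per-sample sequencing-depth parameters to obtain the sequencing data of qualified samples; and a qualified region obtaining unit, configured to filter out the sequencing data of unqualified STS regions according to the sequencing-depth parameters of the same STS region across different samples to obtain the sequencing data of qualified STS regions.
To solve the above technical problem, a further technical solution adopted by the present invention is to provide a computer-readable medium carrying a series of instructions that control a computer processor to perform the method described above.
The beneficial effects of the present invention are as follows. In contrast to the prior art, in the method and device of the present invention for detecting microdeletions in chromosomal STS regions, capture probes are designed according to the DNA sequences of the STS regions and cover the STS regions of the whole chromosome; after hybridization with the multi-sample mixed DNA library, the captured DNA sequences of the STS regions in the multiple samples allow reported or unreported microdeletions of STS-related regions on the chromosome to be detected in large numbers, efficiently and accurately. The mathematical-statistics analysis workflow of the present invention is sound and stable, with high sensitivity and few false positives, and can analyze microdeletions effectively.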
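[Brief Description of the Drawings]
Fig. 1 is a flowchart of one embodiment of the method of the present invention for detecting microdeletions in chromosomal sequence-tagged site (STS) regions;
Fig. 2 is a flowchart of another embodiment of the method of the present invention;
Fig. 3 is a flowchart of a further embodiment of the method of the present invention;
Fig. 4 is a flowchart of a further embodiment of the method of the present invention;
Fig. 5 is a flowchart of a further embodiment of the method of the present invention;
Fig. 6 is a flowchart of a further embodiment of the method of the present invention;
Fig. 7 is a flowchart of a further embodiment of the method of the present invention;
Fig. 8 is a schematic diagram of filtering out samples with outlying depth values in a further embodiment of the method of the present invention;
Fig. 9 is a schematic diagram of samples with normal depth values in a further embodiment of the method of the present invention;
Fig. 10 is a flowchart of a further embodiment of the method of the present invention;
Fig. 11 is a schematic diagram of filtering out regions with outlying depth values in a further embodiment of the method of the present invention;
Fig. 12 is a flowchart of a further embodiment of the method of the present invention;
Fig. 13 is a flowchart of a further embodiment of the method of the present invention;
Fig. 14 is a flowchart of a further embodiment of the method of the present invention;
Fig. 15 is a flowchart of a further embodiment of the method of the present invention;
Fig. 16 is a schematic diagram illustrating, by way of example, a decision tree constructed with the ID3 algorithm in a further embodiment of the method of the present invention;
Fig. 17 is a partial flowchart of the decision-tree analysis in a further embodiment of the method of the present invention;
Fig. 18 is a flowchart of the decision-tree analysis in a further embodiment of the method of the present invention;
Fig. 19 is a flowchart of a further embodiment of the method of the present invention;
Fig. 20 shows a histogram and a box plot of the median sequencing depth of the samples in a further embodiment of the method of the present invention;
Fig. 21 shows a histogram and a box plot of the median sequencing depth of the regions in a further embodiment of the method of the present invention;
Fig. 22 is a schematic diagram of obtaining the probability value that the region depth equals a specific value in a further embodiment of the method of the present invention;
Fig. 23 is a schematic diagram of the probability value threshold and of the relationship between the expected and observed values of the probability value threshold in a further embodiment of the method of the present invention;
Fig. 24 is a schematic diagram of the proportions of true and false positives in a further embodiment of the method of the present invention;
Fig. 25 is a structural diagram of one embodiment of the device of the present invention for detecting microdeletions in chromosomal STS regions;
Fig. 26 is a structural diagram of a further embodiment of the device of the present invention;
Fig. 27 is a structural diagram of a further embodiment of the device of the present invention;
Fig. 28 is a structural diagram of a further embodiment of the device of the present invention;
Fig. 29 is a structural diagram of a further embodiment of the device of the present invention.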
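[Detailed Description of the Embodiments]
The present invention is described in detail below with reference to the drawings and embodiments.
Fig. 1 is a flowchart of one embodiment of the method of the present invention for detecting microdeletions in chromosomal sequence-tagged site (STS) regions. As shown in Fig. 1, the method comprises:
Step 101: Select STS regions on the chromosome and design corresponding capture probes according to the DNA sequences of the STS regions.
An STS (sequence-tagged site) is a short, single-copy DNA sequence with a well-defined position in the genome that serves as a landmark, can be amplified by PCR and can be uniquely manipulated. STSs are used to generate mapping sites: determining the order of a series of STSs yields a map of the genomic region. A probe is a short single-stranded DNA or RNA fragment (about 20 to 500 bp) used to detect the nucleic acid sequence complementary to it.
Step 102: Hybridize the capture probes with the multi-sample mixed DNA library to capture the DNA sequences of the STS regions in the multiple samples.
A DNA library refers to recombinant DNA molecules that are constructed from DNA of a particular source by cell-based DNA cloning, contain the DNA fragments of interest, and are transformed into bacteria to form the library. The mixed DNA library here refers to the DNA fragments of all samples mixed together.
Step 103: Sequence the captured DNA sequences of the STS regions in the multiple samples to obtain sequencing data.
Step 104: Analyze the sequencing data by a mathematical-statistics method and, according to the analysis result, obtain the result of microdeletions of the chromosomal STS regions in each sample.
Mathematical statistics is a branch of mathematics that developed alongside probability theory. It studies how to collect, organize and analyze data affected by random factors effectively, and how to make inferences or predictions about the problem under consideration, so as to provide a basis or recommendation for decisions and actions.
In one embodiment, as shown in Fig. 2, the step of selecting STS regions on the chromosome and designing and synthesizing corresponding capture probes according to the DNA sequences of the STS regions comprises:
Step 201: Search a genome database for the DNA sequences of the STS regions on the chromosome.
The position coordinates of the Y-chromosome STS sequences on the human reference genome Hg19 are looked up in the UCSC database (http://genome.ucsc.edu/); the DNA sequences of the STS regions on the Y chromosome can then be retrieved from the position coordinates.
Step 202: Select, from the DNA sequences of the STS regions found, the sequences that meet the capture-probe design conditions.
The repetitiveness and GC content of a sequence affect the capture efficiency of the chip and can even cause capture errors, so it is important to select sequences that meet the capture-probe design conditions.
Step 203: Design and synthesize the capture probes from the selected sequences that meet the capture-probe design conditions.
To capture DNA fragments, probes are generally 60-150 bp long with a GC content between 40% and 70%. When probes are designed for one stretch of DNA sequence, several probes are needed to cover the whole stretch, and adjacent probes share an overlapping sequence (overlap), which is generally 20 bp long.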
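The design constraints above (probe length, GC content, overlap) can be illustrated with a small Python sketch. In the described workflow probe design is delegated to a capture-service provider, so this is not the actual design procedure; the 80 bp probe length is one hypothetical choice within the stated 60-150 bp range, and the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def tile_probes(seq, probe_len=80, overlap=20):
    """Tile a target sequence with fixed-length probes sharing `overlap` bases.

    Returns a list of (start, probe_sequence). If the regular tiling leaves an
    uncovered tail, one extra probe is anchored at the end of the target; that
    last probe overlaps its neighbour by more than `overlap` bases.
    """
    step = probe_len - overlap
    probes = []
    start = 0
    while start + probe_len <= len(seq):
        probes.append((start, seq[start:start + probe_len]))
        start += step
    if probes and probes[-1][0] + probe_len < len(seq):
        start = len(seq) - probe_len
        probes.append((start, seq[start:]))
    return probes


def passes_gc(probe, low=0.40, high=0.70):
    """Check the 40%-70% GC window mentioned in the text."""
    return low <= gc_content(probe) <= high


# Hypothetical 200 bp target with 50% GC content.
target = "ACGT" * 50
probes = tile_probes(target)
usable = [(s, p) for s, p in probes if passes_gc(p)]
```

For the 200 bp target the probes start at positions 0, 60 and 120, so each neighbouring pair shares 20 bases.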
在具体实施例中,在基因组数据库中查找染色体上的 STS区域的 DNA序列 的位置坐标, 将此坐标提交给提供 DNA捕获服务的公司, 由这些公司完成捕获 探针的设计与合成。
其中, 如图 3所示, 多样本的 DNA混合文库的制备的步骤包括:
步骤 301 : 制备多个带有不同接头的质量控制合格的单样本的 DNA文库。 每个样本带有一个不同的标签接头, 为了在测序中区别来自不同样本的文 库, 每个样本文库的 DNA末端都含有不同的 6bp或 8bp的 Index碱基序列。 步骤 302: 将所述多个单样本的 DNA文库按照预定比例混合。
每个样本文库 DNA混合量可根据需要等量或按照一定比例混合。
步骤 303: 检验所述混合的多样本的 DNA文库的质量是否合格, 若是, 即 为制备的多样本的 DNA混合文库。
对混合的多样本的 DNA文库进行定量,检测是否引入外源杂质等质量控制 指标。
其中, 如图 4所示, 单样本的 DNA文库的制备的步骤包括:
步骤 401 : 利用物理或化学的方法将基因组 DNA打断成预定大小的 DNA 片段, 回收所述打断的 DNA片段。
一般选择打断的 DNA片段大小为 200~300bp,探针长度一般为 80bp左右, 片段长度在 200~300bp会有较高的捕获效率; 另外捕获之后采用 PE测序, 测通 长度也在 200~300bp。
步骤 402: 利用酶对所述回收的 DNA片段进行末端修复, 形成补平的末端 磷酸化的 DNA片段, 回收所述补平的末端磷酸化的 DNA片段;
步骤 403: 利用酶对所述回收的补平的 DNA片段的 3,末端加上" A"碱基, 回收所述 3,末端加上" A"碱基的 DNA片段;
步骤 404: 使所述回收的 3,末端加上" A"碱基的 DNA片段在酶的作用下与 标签接头 Index Adapter连接, 并回收带有标签接头的 DNA片段;
步骤 405: 以标签接头序列的引物做为引物, 对所述带有标签接头的 DNA 片段进行扩增, 回收所述扩增的产物;
步骤 406: 检验所述扩增的产物的质量控制是否合格, 若是, 即为制备的单 样本的 DNA文库。 对单样本的 DNA文库进行定量, 检测是否引入外源杂质等 质量控制指标。
其中, 将捕获的相应捕获探针的多样本中 STS区域的 DNA序列进行测序, 得到测序数据的步骤之后还包括:对所述对多样本中 STS区域的 DNA序列的测 序数据进行质量控制。如图 5所示,对多样本中 STS区域的 DNA序列的测序数 据进行质量控制的步骤包括:
步骤 501 : 对所述多样本中 STS区域的 DNA序列的测序数据中不合格的数 据进行过滤, 得到合格的多样本的测序数据; 其中高通量测序技术可以为 Illumina Hiseq 2000测序技术, 当然也可以采用现有的其他高通量测序技术。
步骤 502: 通过短序列对比软件, 将所述合格的多样本的测序数据与参考基 因组序列进行对比, 并统计每个样本的测序深度的相关参数以及不同样本之间 相同的 STS区域的测序深度的相关参数;
步骤 503: 根据所述统计得到的每个样本的测序深度的相关参数, 过滤掉不 合格的样本的测序数据, 得到合格的样本的测序数据;
步骤 504: 根据所述统计得到的不同样本之间相同的 STS 区域的测序深度 的相关参数, 过滤掉不合格 STS区域的测序数据, 得到合格 STS区域的测序数 据。
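As shown in Figure 6, step 501 (filtering the unqualified data out of the sequencing data of the STS-region DNA sequences of the multiple samples to obtain qualified multi-sample sequencing data) comprises:
Step 601: Filter by sequencing quality using the proportion of low-quality bases in each read. If the number of low-quality bases exceeds a predetermined proportion of the bases in the whole read, the read is judged unqualified and filtered out, giving a preliminarily qualified first sequencing data set. Different sequencing instruments compute quality values differently; the standard for a low-quality read can be obtained from the instrument vendor or taken from common practice in the field. For the Illumina HiSeq2000 sequencer used in this embodiment, the quality value is Q = A - 64, where Q is the sequencing quality value of a base and A is the ASCII code of that base's quality character in the FQ file output by the HiSeq2000. In this embodiment a base with a quality value below 5 is defined as a low-quality base; if low-quality bases make up more than 50% of the bases in a read, the read is judged unqualified and filtered out, giving the preliminarily qualified first sequencing data set.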
步骤 602:若所述初步合格的第一测序数据集合中测序结果不确定的碱基个 数超过整条序列碱基个数的 10%, 则判断为是不合格的数据, 将所述不合格的 测序数据过滤掉 , 获得初步合格的第二测序数据集合;
步骤 603:将所述初步合格的第二测序数据集合中所有测序数据与测序接头 序列库进行比对, 若所述初步合格的第二测序数据集合中存在测序接头序列, 则判断为是不合格的数据, 将所述不合格的测序数据过滤掉, 获得初步合格的 第三测序数据集合;
步骤 604:将所述初步合格的第三测序数据集合中所有测序数据与试验中引 入的所有外源序列比对, 若所述初步合格的第三测序数据集合中存在外源序列, 则判断为是不合格的数据, 将所述不合格的测序数据过滤掉, 获得合格的多样 本的测序数据。 在本发明实施例中, 试验中引入的外源序列为人参考基因组序 列。
在实际应用中, 高通量测序技术可以是 Illumina Hiseq 2000测序技术, 也可 以是现有的其它高通量测序技术。 不同的测序仪器或条件可有不同的不合格序 列的标准, 如 Illumina Hiseq 2000进行测序时可用的某个标准: 测序质量低于某 一阀值的碱基个数超过整条序列碱基个数的 50%则认为是不合格序列, 其中, 低质量阀值由具体测序技术及测序环境而定;序列中测序结果不确定的碱基(如 Illumina Hiseq 2000测序结果中的 N )个数超过整条序列碱基个数的 10%则认为 是不合格序列; 除样本接头序列外, 与其它实验引入的外源序列比对, 如各种 接头序列, 若序列中存在外源序列则认为是不合格序列。
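The read-level filters of steps 601 to 604 can be sketched as a single predicate. This is a hedged illustration under stated assumptions: the function name and parameters are hypothetical, adapter and exogenous sequences are detected by exact substring match (a real pipeline would align against the adapter library), and the quality offset of 64 is the one given for the HiSeq2000 FQ files in this embodiment.

```python
PHRED_OFFSET = 64  # Q = A - 64, as stated for the HiSeq2000 FQ files


def is_qualified(seq: str, qual: str, contaminants=(),
                 low_q: int = 5, max_low_frac: float = 0.50,
                 max_n_frac: float = 0.10) -> bool:
    """Return True if a read passes the quality, N-content and contaminant filters.

    contaminants: adapter or exogenous sequences to screen for
    (exact substring match is an assumption made for this sketch).
    """
    if not seq or len(seq) != len(qual):
        return False
    n = len(seq)
    # Step 601: too many bases with quality value below low_q.
    low = sum(1 for ch in qual if ord(ch) - PHRED_OFFSET < low_q)
    if low > max_low_frac * n:
        return False
    # Step 602: too many undetermined bases (N).
    if seq.upper().count("N") > max_n_frac * n:
        return False
    # Steps 603 and 604: read contains an adapter or exogenous sequence.
    if any(c and c in seq for c in contaminants):
        return False
    return True
```

The thresholds are strict inequalities ("exceeds 50%", "exceeds 10%"), so a read with exactly half of its bases below the quality cutoff is still kept.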
其中, 如图 7所示, 根据统计得到的每个样本的测序深度的相关参数, 过 滤掉不合格的样本的测序数据, 得到合格的样本的测序数据的步骤包括:
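Step 701: Sort the sequencing depth values of all samples in ascending order, and use the quartile function to determine the lower quartile Q1, the upper quartile Q3, and the interquartile range IQR of the sorted sequencing depth values.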
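Quartiles: in statistics, all values are sorted in ascending order and divided into four equal parts; the values at the three cut points are the quartiles. The first quartile (Q1), also called the lower quartile, is the value at the 25% position of the sorted data. The second quartile (Q2), also called the median, is the value at the 50% position. The third quartile (Q3), also called the upper quartile, is the value at the 75% position. The difference between the third and first quartiles is the interquartile range (IQR). Whatever the values of Q1, Q2, and Q3, each is treated as a cut point, so that together they divide the data into four equal parts; comparing Q1 and Q3 shows the trend of the data. Quartiles are also widely used in drawing box plots. A box plot is a figure consisting of a box and two line segments drawn from five characteristic values of a data set: the maximum, the minimum, the median, and the two quartiles. It shows the distribution of one data set at a glance and also allows several data sets to be compared.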
步骤 702: 将所有样本的测序深度值在 Q1减去 1.5倍 IQR和 Q3加上 1.5 倍 IQR范围之外的不合格的样本的测序数据过滤掉, 得到合格的样本的测序数 据。
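Steps 701 and 702 amount to a standard IQR fence on per-sample depth. The sketch below assumes one depth value per sample and uses the "inclusive" quartile definition from Python's `statistics` module; the source does not say which quartile convention its quartile function uses, so results near the fence may differ slightly. The function names are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k: float = 1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_samples(sample_depth: dict, k: float = 1.5) -> dict:
    """Keep only samples whose depth lies inside the IQR fence."""
    lo, hi = iqr_bounds(list(sample_depth.values()), k)
    return {s: d for s, d in sample_depth.items() if lo <= d <= hi}
```

For example, with depths 48, 49, 50, 51, 52 and 120, the fence is 45.5 to 55.5 and only the sample at 120 is removed.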
在本发明一实施例中, 如图 8、 9所示, 横坐标为区域深度分布, 纵坐标为 相同深度区域频数, 图 8是将深度值离群的样本过滤掉, 留下深度值正常的样 本, 如图 9所示, 为深度值正常的样本的示意图。
其中, 如图 10所示, 根据统计得到的不同样本之间相同的 STS区域的测序 深度的相关参数, 过滤掉不合格 STS区域的测序数据, 得到合格 STS区域的测 序数据的步骤包括:
步骤 1001 : 将不同样本之间相同的 STS区域的测序深度值按照从小到大的 顺序进行排序, 利用四分位函数确定所述排序后的不同样本之间相同的 STS区 域的测序深度值的中位数、 上四分位数 Q3以及四分位数间距 IQR;
步骤 1002: 将不同样本之间相同的 STS区域的测序深度值中位数为 0或者 中位数大于 Q3加上 1.5倍 IQR的不合格的 STS区域的测序数据过滤掉,得到合 格的 STS区域的测序数据。
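Steps 1001 and 1002 can be sketched as below. One point is an interpretation rather than something the text states outright: because a region's own median can never exceed its own Q3, the sketch assumes that Q3 and IQR are taken over the distribution of region medians (one median per region, computed across samples). The function name and data layout are hypothetical.

```python
from statistics import median, quantiles


def filter_regions(depth_by_region: dict, k: float = 1.5) -> dict:
    """Drop regions whose median depth is 0 or above Q3 + k*IQR.

    depth_by_region maps a region name to the list of depths of that
    region in every sample. Q3 and IQR are computed over the region
    medians (an assumption made for this sketch).
    """
    medians = {r: median(d) for r, d in depth_by_region.items()}
    q1, _, q3 = quantiles(list(medians.values()), n=4, method="inclusive")
    upper = q3 + k * (q3 - q1)
    return {r: depth_by_region[r] for r, m in medians.items() if 0 < m <= upper}
```

Only the upper fence is applied, matching the text: low-depth regions are kept because a low depth is exactly the signal a microdeletion produces.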
在本发明一实施例中, 如图 11所示, 横坐标为不同样品的相同区域的深度 分布, 纵坐标为相同深度区域频数, 图 11是将深度值离群的区域过滤掉后的示 意图。
其中, 如图 12所示, 采用数理统计方法对所述测序数据进行分析, 根据所 述分析结论, 获得每个样本中染色体 STS区域微缺失的结果的步骤包括:
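Step 1201: Normalize the sequencing depth values of the STS regions of the samples to obtain normalized depth values. The normalization divides the depth value of a given region in each sample by the mean depth value of that sample, giving the normalized depth value of that sample region.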
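The normalization in step 1201 divides each region's depth by the mean depth of the sample it belongs to, which removes differences in overall sequencing yield between samples. A minimal sketch, with a hypothetical nested-dictionary layout `depth[sample][region]`:

```python
def normalize_depths(depth: dict) -> dict:
    """Divide every region depth by the mean depth of its sample.

    depth[sample][region] holds the raw sequencing depth; the result has
    the same shape and holds the normalized depth.
    """
    normalized = {}
    for sample, regions in depth.items():
        mean_depth = sum(regions.values()) / len(regions)
        if mean_depth == 0:
            raise ValueError(f"sample {sample!r} has zero mean depth")
        normalized[sample] = {r: d / mean_depth for r, d in regions.items()}
    return normalized
```

Two samples sequenced at 20X and 200X mean depth give the same normalized profile if their relative coverage per region is the same.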
步骤 1202: 根据得到的样本的 STS区域的均一化的深度值, 采用数理统计 方法, 检测所述样本 STS区域的深度值异常值, 并获得所述样本 STS区域微缺 失的结果。
在一优选实施例中, 如图 13所示, 步骤 1202的具体步骤包括:
步骤 1301 : 根据得到的所有样本的同一区域的均一化的深度值, 计算所述 所有样本的同一区域的均一化深度值的平均值以及方差;
步骤 1302: 根据所述所有样本的同一区域的均一化深度值的平均值以及方 差, 获得所述同一区域所有非离群样本的正态分布曲线;
步骤 1303: 根据所述正态分布曲线, 计算每个样本在每个区域在特定深度 值时的概率值;
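Step 1304: Based on the probability value of each sample in the corresponding region at its particular depth value, set a first probability threshold; if the probability value of the sample's region at its depth value is below the first probability threshold, obtain result R1 that the sample region has a microdeletion.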
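Steps 1301 to 1304 can be sketched for a single region as follows. Several choices here are assumptions, since the text does not fix them: the normal curve is fitted only to samples inside the IQR fence (standing in for "non-outlier samples"), the probability value is taken as the lower-tail probability of the fitted normal at the observed depth, and the default cutoff of 1e-6 is a placeholder. The function name is hypothetical.

```python
from statistics import NormalDist, mean, quantiles, stdev


def deletion_candidates(norm_depth: dict, cutoff: float = 1e-6) -> dict:
    """Flag samples whose depth in one region is improbably low.

    norm_depth maps sample -> normalized depth of the SAME region.
    Returns {sample: (p_value, is_candidate)}.
    """
    values = list(norm_depth.values())
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    reference = [v for v in values if lo <= v <= hi]  # non-outlier samples
    if len(reference) < 2:
        raise ValueError("need at least two non-outlier samples")
    dist = NormalDist(mean(reference), stdev(reference))
    result = {}
    for sample, depth in norm_depth.items():
        p = dist.cdf(depth)  # lower-tail probability (assumption)
        result[sample] = (p, p < cutoff)
    return result
```

Fitting on non-outlier samples matters: if a deleted sample were included in the fit, it would inflate the variance and mask its own signal.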
在另一优选实施例中, 如图 14所示, 在步骤 1304之后还包括:
步骤 1401 :对所述样本区域有微缺失的结果 R1进行实验验证,根据实验验 证结果, 设置第二概率值阈值, 其中, 所述第二概率值阈值小于第一概率值阈 值;
步骤 1402: 若所述样本区域在特定深度值时的概率值小于所述第二概率值 阈值, 则获得所述样本区域有微缺失的结果 R2。
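In practice, when most samples carry no microdeletion, the depth value of a microdeleted sample in a given region should be an outlier of the normal distribution fitted to the depths of the samples without a microdeletion. A suitable statistical method is used to detect the outlying depth values in each region, and regions whose depth is an outlier are called as microdeletion regions.
Whether a region's depth value is an outlier is decided as follows. For each region remaining after outlier filtering, take the data of the samples remaining after outlier filtering and, from their mean and variance, obtain the normal distribution curve of all non-outlier sample data for that region. From this curve, obtain the probability value (p.value) that the region's depth takes a particular value. Set a suitable threshold (p.value-cutoff) on these probability values; for a sample whose depth has a probability value below the threshold, the region is judged to have a microdeletion.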
对于概率值阈值的确定, 可采用以下方法: 通过以上方法算出每个样本每 个区域深度在所得到的正态分布中的概率值, 采取比较宽松的阈值, 先判断更 多的样本区域有微缺失, 再对这些样本区域进行实验验证, 验证其是否为微缺 失。 实验验证的方法可以为设计合适的引物, 对该样本区域的 DNA进行 PCR 反应,从 PCR产物情况判断其是否有正常的 PCR扩增产物, 以判断其是否有微 缺失。有这些样本区域是否是微缺失的信息之后,就可选择合适的概率值阈值, 使之具有最好的假阳性假阴性指标。
在又一优选实施例中, 如图 15所示, 根据得到的样本的 STS区域的均一化 的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值异常值, 并获 得所述样本 STS区域微缺失的结果的步骤包括:
步骤 1501 : 根据得到的样本的 STS区域的均一化的深度值, 计算所述样本 区域均一化的深度值与所有样本的深度值的中位数的比值 D/S;
步骤 1502: 根据得到的样本的 STS区域的均一化的深度值, 计算所述样本 区域均一化的深度值与所有区域的深度值的中位数的比值 D/R;
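Step 1503: Train a first ratio threshold from the ratio D/S using the ID3 algorithm, and train a second ratio threshold from the ratio D/R using the ID3 algorithm;
Step 1504: If the ratio D/S of the sample region is greater than the first ratio threshold, obtain the result that the sample region has no microdeletion;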
步骤 1505: 若样本区域的比值 D/S小于第一比值阈值, 并且样本区域的比 值 D/R大于第二比值阈值, 则获得所述样本区域没有微缺失的结果;
步骤 1506: 若样本区域的比值 D/S小于第一比值阈值, 并且样本区域的比 值 D/R小于第二比值阈值, 则获得所述样本区域有微缺失的结果。
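In practice, when many samples and regions are to be tested and the microdeleted sample regions make up only a small proportion, the ratio of the depth of a microdeleted region to the median depth of all samples is generally very small, and the ratio of the depth of a microdeleted region to the median depth of all regions is generally also very small. With suitable parameters, a decision tree can therefore be used to test for abnormal depth values.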
决策树检验的具体步骤如下: 算出所要检验的样本区域的深度与所有样本 的深度的中位数的比例, 如该比例大于一定的阈值, 则可判断为无微缺失, 但 如该比例小于该阈值, 则要进一步判断: 算出所要检验的样本区域的深度与所 有区域的深度的中位数的比例, 如该比例大于一定的阈值, 则可判断为无微缺 失, 但如该比例小于该阈值, 则判断为有微缺失。
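The two-level test described above can be sketched as below. The interpretation of the two medians is an assumption: D/S divides by the median depth of the same region across all samples, and D/R divides by the median depth across all regions of the same sample. The function names, the data layout `norm[sample][region]`, and the treatment of a ratio exactly equal to a threshold (counted as "no microdeletion") are also assumptions; medians of 0 are assumed to have been removed by the earlier region filter.

```python
from statistics import median


def depth_ratios(norm: dict, sample: str, region: str):
    """Return (D/S, D/R) for one sample region."""
    d = norm[sample][region]
    s_med = median(norm[s][region] for s in norm)  # same region, all samples
    r_med = median(norm[sample].values())          # same sample, all regions
    return d / s_med, d / r_med


def has_microdeletion(d_s: float, d_r: float,
                      s_cutoff: float, r_cutoff: float) -> bool:
    """Apply the two-level decision tree to the ratios D/S and D/R."""
    if d_s >= s_cutoff:   # depth is normal relative to other samples
        return False
    if d_r >= r_cutoff:   # depth is normal relative to other regions
        return False
    return True           # low on both ratios
```

The second test guards against a sample that is low everywhere, for example from poor capture, being called as deleted in every region.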
决策树检验中阈值的确定是通过决策树的迭代二叉树三代(ID3 )算法进行 计算的 ( Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997); ID3算法的核 心是: 在决策树各级结点上选择属性时, 用信息增益(information gain )作为属 性的选择标准, 以使得在每一个非叶结点进行测试时, 能获得关于被测试记录 最大的类别信息。
决策树 ID3 算法的基本思路是: 第一步, 选取一个属性作为决策树的根节 点, 然后就这个属性的所有取值创建树的分支; 第二步, 用这棵树来对训练数 据进行分类, 如果一个叶节点的所有实例属于同一类, 则以该类标记此节点, 如果所有的叶节点都有类标记, 则算法终止; 第三步, 若还有叶节点没有标记, 则选取一个从该节点到根路径中没有出现过的属性标记该结点, 然后就这个属 性所有的取值继续创建树的分支; 重复算法步骤第二步。
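Choosing different attributes in the first step produces different decision trees, so choosing suitable attributes yields a simple decision tree. ID3 usually uses an information-based heuristic to decide how to choose attributes: it picks the attribute carrying the most information, that is, the attribute that produces the decision tree with the fewest branches. In this invention two attributes are available for selection: T1 (D/S), the ratio of the sample-region depth to the median sample-region depth, and T2 (D/R), the ratio of the depth of the same region in different samples to the median depth of that region across samples.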
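The information carried by an attribute is measured by information gain, which is computed as follows. Let D be the partition of the training tuples by class. The entropy of D is

$$\mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where p_i is the probability that class i occurs in the whole training set, estimated as the number of elements in that class divided by the total number of training tuples. In practical terms, entropy is the average amount of information needed to identify the class label of a tuple in D.
Now suppose the training tuples D are partitioned by attribute A. The expected information of D after partitioning by A is

$$\mathrm{info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{info}(D_j)$$

and the information gain is the difference between the two:

$$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$$

Each time a split is needed, ID3 computes the information gain of every attribute and splits on the attribute with the largest information gain.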
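The entropy and information-gain formulas can be checked numerically with a short sketch. The ten records below are an assumed data set chosen for illustration (columns L, F, H and class R); they are not taken from the source, whose own example table survives only in part. With these records, exact arithmetic gives gains of about 0.281, 0.557 and 0.035 for L, F and H.

```python
from collections import Counter
from math import log2


def entropy(labels) -> float:
    """info(D) = -sum(p_i * log2(p_i)) over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr: str, target: str) -> float:
    """gain(A) = info(D) - info_A(D) for a categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - expected


# Assumed illustrative records: L, F take s/m/l; H and class R take yes/no.
RECORDS = ["s s no no", "s l yes yes", "l m yes yes", "m m yes yes",
           "l m yes yes", "m l no yes", "m s no no", "l m no yes",
           "m s no yes", "s s yes no"]
ROWS = [dict(zip("LFHR", rec.split())) for rec in RECORDS]

GAINS = {attr: info_gain(ROWS, attr, "R") for attr in "LFH"}
BEST = max(GAINS, key=GAINS.get)  # attribute ID3 would split on first
```

ID3 would split on `BEST` first and then repeat the same computation inside each branch.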
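The following example, detecting fake accounts in an online community, illustrates how ID3 builds a decision tree. For simplicity, assume the training set contains 10 elements, as in Table 1:

[Table 1: Fake-account detection in an online community. The table was an image in the original; only its last three rows survive in the text.]
Log density   Friend density   Real avatar   Account is real
l             m                no            yes
m             s                no            yes
s             s                yes           no

Here s, m, and l denote small, medium, and large. Let L, F, H, and R denote log density, friend density, whether a real avatar is used, and whether the account is real. The information gain of each attribute is computed below.

info(D) = -0.7 log2 0.7 - 0.3 log2 0.3 = 0.7 × 0.51 + 0.3 × 1.74 = 0.879
gain(L) = 0.879 - 0.603 = 0.276

The information gain of log density is therefore 0.276. In the same way, the information gains of H and F are 0.033 and 0.553 respectively. Because F has the largest information gain, F is chosen as the splitting attribute for the first split; the result is shown in Figure 16.
To handle data with continuous attributes such as T1 and T2, the data are first discretized. A simple approach is to split the attribute values into two segments, Ai less than or equal to N and Ai greater than N. For any attribute, the set of values it takes in a data set is finite. If the attribute takes the values (v1, v2, ..., vn), there are n-1 possible split values in this set, and the decision tree is then built from them.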
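The discretization proceeds as follows:
1) Find the minimum value MIN and the maximum value MAX of the continuous attribute.
2) Set N equally spaced breakpoints Ai in the interval (MIN, MAX), where Ai = MIN + ((MAX - MIN)/N) * i and i = 1, 2, 3.
3) For each Ai, compute the gain obtained when (MIN, Ai) and (Ai, MAX) are used as the two intervals, and compare the gains.
4) Choose the Ak with the largest gain as the breakpoint of the continuous attribute, and split the attribute values into the two intervals (MIN, Ak) and (Ak, MAX).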
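The four discretization steps can be sketched as a search over equally spaced breakpoints for the one with the largest information gain. The function name, the default of 10 intervals, and the range of i (taken here as 1 to N-1, the interior breakpoints) are assumptions; ties are resolved in favour of the smallest breakpoint.

```python
from collections import Counter
from math import log2


def entropy(labels) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_breakpoint(values, labels, n_intervals: int = 10):
    """Return (breakpoint, gain) for the best two-way split of a continuous attribute."""
    lo, hi = min(values), max(values)
    base = entropy(labels)
    best_cut, best_gain = None, -1.0
    for i in range(1, n_intervals):
        cut = lo + (hi - lo) / n_intervals * i   # Ai = MIN + ((MAX-MIN)/N)*i
        left = [lab for v, lab in zip(values, labels) if v <= cut]
        right = [lab for v, lab in zip(values, labels) if v > cut]
        if not left or not right:
            continue
        expected = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(values)
        gain = base - expected
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```

Applied to the depth ratios T1 or T2 with PCR-validated "+" and "-" labels, the returned breakpoint plays the role of the ratio threshold.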
在本发明实施例中, 通过决策树确定阈值的具体步骤是:
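First, perform PCR validation on some regions of several selected samples to obtain microdeletion results. "+" denotes deletion-positive (a microdeletion is present) and "-" denotes deletion-negative (no microdeletion). Compute the entropy:
Info(D) = -(P+) log2(P+) - (P-) log2(P-)
where P+ (P-) is the probability of a deletion-positive (deletion-negative) result in the whole training set, estimated as the number of elements in that class divided by the total number of elements in the training set. In practical terms, entropy is the average amount of information needed to identify the class label of a tuple in D.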
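Second, compute the ratio T1, where T1 is the ratio of the normalized depth value of a sample region to the median depth value of all samples. Sort the T1 values in ascending order to obtain V1, V2, V3, ..., Vn. Take the mean of each pair of adjacent values in turn as a candidate threshold: a1' = (V1 + V2)/2, a2' = (V2 + V3)/2, ..., a(n-1)' = (V(n-1) + Vn)/2. This gives the smallest threshold a1' and the largest threshold a(n-1)'.
Third, compute the ratio T2, where T2 is the ratio of the normalized depth value of a sample region to the median depth value of all regions. Sort the T2 values in ascending order to obtain U1, U2, U3, ..., Un. Take the mean of each pair of adjacent values in turn as a candidate threshold: b1' = (U1 + U2)/2, b2' = (U2 + U3)/2, ..., b(n-1)' = (U(n-1) + Un)/2. This gives the smallest threshold b1' and the largest threshold b(n-1)'.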
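Fourth, loop over the candidate split values a1', a2', ..., and compute the entropy for each split value:

$$\mathrm{info}_A(D) = \sum_{j} \frac{|D_j|}{|D|}\,\mathrm{info}(D_j)$$

The information gain is the difference between the two:

$$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$$

Fifth, loop over the candidate split values b1', b2', ..., and compute the entropy for each split value:

$$\mathrm{info}_A(D) = \sum_{j} \frac{|D_j|}{|D|}\,\mathrm{info}(D_j)$$

The information gain is the difference between the two:

$$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$$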
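Sixth, compare the maximum information gains of the two groups; the attribute (a) with the larger gain becomes the root of the tree. The threshold with the largest gain is used as the cutoff for classifying the data, as shown in Figure 17.
Seventh, following the steps above, compute the maximum gain for the two groups of data using the depth divided by the median region depth, and use the threshold with the largest gain as the cutoff for classifying the data. The result, shown in Figure 18, is the decision tree finally determined in the present invention.
In yet another preferred embodiment, as shown in Figure 19, the step of detecting abnormal depth values in the STS regions of the samples by mathematical statistics, based on the normalized depth values of the STS regions, and obtaining the microdeletion result for those regions comprises:
Step 1901: From the normalized depth values of the same region in all samples, compute the mean and variance of those normalized depth values;
Step 1902: From the mean and variance of the normalized depth values of the same region in all samples, obtain the normal distribution curve of all non-outlier samples for that region;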
步骤 1903: 根据所述正态分布曲线, 计算每个样本在每个区域在特定深度 值时的概率值;
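Step 1904: Based on the probability value of each sample in each region at its particular depth value, set a third probability threshold; if the probability value of the sample region at its depth value is below the third probability threshold, obtain result R3 that the sample region has a microdeletion;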
步骤 1905:计算所述结果 R3中样本区域均一化的深度值与所有样本的深度 值的中位数的比值 D/S;
步骤 1906:计算所述结果 R3中样本区域均一化的深度值与所有区域的深度 值的中位数的比值 D/R;
步骤 1907: 将所述比值 D/S 通过 ID3算法训练出第三比值阈值, 将所述比 值 D/R通过 ID3算法训练出第四比值阈值;
步骤 1908: 若样本区域的比值 D/S大于第三比值阈值, 则获得所述样本区 域没有微缺失的结果;
步骤 1909: 若样本区域的比值 D/S小于第三比值阈值, 并且样本区域的比 值 D/R大于第四比值阈值, 则获得所述样本区域没有微缺失的结果;
步骤 1910: 若样本区域的比值 D/S小于第三比值阈值, 并且样本区域的比 值 D/R小于第四比值阈值, 则获得所述样本区域有微缺失的结果。
在实际应用中, 为了更精确地检测出微缺失区域, 可以采用了结合以上两 种方法的策略; 先预设一个较为宽松的 p.value-阈值, 然后对得到的结果再进行 决策树的过滤, 得到染色体 STS区域微缺失的结果。
本发明基于染色体序列标签位点 STS区域微缺失检测的方法, 根据 STS区 域的 DNA序列, 设计获得捕获探针, 探针涵盖了整个染色体上 STS区域, 与多 样本的 DNA混合文库进行杂交后, 捕获到的多样本中 STS区域的 DNA序列, 能够大量、 高效、 准确地检测出染色体上已经报道或者未经报道的 STS相关区 域的微缺失, 另外, 本发明的数理统计信息分析可以按照正态分布的方法, 也 可以按照决策树的分析方法, 或者将两种方法结合起来, 最后还通过实验进行 验证, 这种信息分析流程科学、 稳定, 灵敏度高、 假阳性低, 可以有效的针对 微缺失进行分析。
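The following example is intended to illustrate the invention, not to limit it. The operations in this example will be understood by those skilled in the art. Reagents and consumables whose manufacturer is not stated are general-purpose products available commercially. This example detects microdeletions in STS regions of the Y chromosome, but the method is not limited to the Y chromosome. It uses 10 infertility samples and 1 healthy sample, 11 samples in total, whose libraries were built together and hybridized to a single NimbleGen (Roche) chip; the number of samples is illustrative and does not limit the number of samples in one experiment.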
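The reagents used in the example are listed in Table 2:
[Table 2: Reagents used in the example of the present invention. The table was an image in the original; only the following row survives in the text.]
Cot-1 DNA    15279-011    Invitrogen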
本发明实施例的实验流程包括:
(一 )基因组 DNA片段化, 就是单样本的 DNA文库的制备的步骤中: 利 用物理或化学的方法将基因组 DNA打断成预定大小的 DNA片段, 回收所述打 断的 DNA片段。
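Starting from 3 μg of YanHuang genomic DNA (http://yh.genomics.org.cn/), free of protein and RNA contamination and not degraded, the DNA was fragmented on a Covaris-S2 ultrasonicator (Covaris, US). The fragmentation parameters are given in Table 3:
[Table 3: Covaris-S2 ultrasonicator parameter settings (an image in the original)]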
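After the fragments pass the electrophoresis check, they are recovered and purified with the QIAquick PCR Purification Kit, and the sample is dissolved in 75 μL Elution Buffer. Passing the electrophoresis check here mainly means that the main band lies between 200 bp and 300 bp.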
(二) 片段 DNA末端修复, 也就是单样本的 DNA文库的制备的步骤中: 利用酶对所述回收的 DNA片段进行末端修复, 形成补平的末端磷酸化的 DNA 片段, 回收所述补平的末端磷酸化的 DNA片段。
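Using the DNA from the previous step, prepare the end-repair reaction in a 1.5 ml centrifuge tube according to Table 4:
Sample DNA                           75 μL
10x Polynucleotide Kinase Buffer     10 μL
dNTP Solution Set (10 mM each)       4 μL
T4 DNA Polymerase                    5 μL
Klenow Fragment                      1 μL
T4 Polynucleotide Kinase             5 μL
Total volume                         100 μL
Table 4: End-repair reaction mix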
Gently mix the 100 μL reaction mixture, incubate at 20°C for 30 minutes, then purify and recover with the QIAquick PCR Purification Kit; the recovered DNA is finally fully dissolved in 32 μL ddH2O.
(三) DNA片段末端加" A", 也就是单样本的 DNA文库的制备的步骤中: 利用酶对所述回收的补平的 DNA片段的 3,末端加上" A"碱基, 回收所述 3,末端 加上" A"碱基的 DNA片段。
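Using the end-repaired DNA fragments, prepare the A-tailing reaction in a 1.5 ml centrifuge tube according to Table 5:
DNA                     32 μL
10x blue buffer         5 μL
dATP (1 mM)
Klenow (3'-5' exo-)     3 μL
Total volume            50 μL
Table 5: A-tailing reaction mix for the end-repaired DNA fragments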
Gently mix the 50 μL reaction mixture, incubate at 37°C for 30 minutes, then purify and recover with the QIAquick PCR Purification Kit; the recovered DNA is finally fully dissolved in 15 μL ddH2O.
(四 )标签接头 Adapter的连接, 也就是单样本的 DNA文库的制备的步骤 中: 使所述回收的 3,末端加上" A"碱基的 DNA 片段在酶的作用下与标签接头 Index Adapter连接, 并回收带有标签接头的 DNA片段。
Prepare the adapter ligation reaction in a 1.5 ml centrifuge tube as shown in Table 6:
A-tailed DNA                                15 μL
2x Rapid ligation buffer                    25 μL
PE/PE index Adapter oligo mix (40 μM)       5 μL
T4 DNA Ligase (Rapid)                       5 μL
Total volume                                50 μL
Table 6: Index adapter ligation reaction mix
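Gently vortex the 50 μL reaction mixture until homogeneous, spin briefly, and incubate at 20°C for 15 minutes. After the reaction, purify and recover with the MinElute PCR Purification Kit, and dissolve the recovered sample in 25 μL Elution Buffer.
(5) Pre-hybridization PCR, i.e. the step in the preparation of the single-sample DNA library in which the DNA fragments carrying index adapters are amplified using primers for the index adapter sequences and the amplification products are recovered.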
Amplify using the DNA from step (4) as template and primers containing the adapter sequences. The amplification mix is given in Table 7:
DNA with adapter sequences                               25 μL
10x pfx amplification buffer                             10 μL
MgSO4 (50 mM)                                            4 μL
dNTP mix (10 mM)                                         4 μL
Platinum® Pfx DNA Polymerase                             2 μL
PCR Primer PE 1.0/PE Index Primer 1.0 (10 pM)            10 μL
PCR Primer PE 2.0/PE Index Primer 2.0 (index, 10 pM)     10 μL
ddH2O                                                    35 μL
Total volume                                             100 μL
Table 7: PCR amplification mix
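The PCR program is 94°C for 2 minutes; 4 cycles of 94°C for 15 seconds, 62°C for 30 seconds, and 72°C for 30 seconds; then 72°C for 5 minutes. The PCR product is purified with the QIAquick PCR Purification Kit and eluted in 30 μL.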
(六)混合文库构建,也就是所述多样本的 DNA混合文库的制备的步骤中: 将所述多个单样本的 DNA文库按照预定比例混合, 比如可以是相同比例。 在实 际应用中, 可以根据构建文库的需要, 确定合适的比例。
炎黄文库(http://yh.genomics.org.cn ) 与按照步骤(一) 至 (五)构建的其 他 10个文库取等量的 DNA混合在一起。
(七) 目标区域与探针的杂交: 也就是一种基于染色体序列标签位点 STS 区域微缺失检测的方法中:将所述捕获探针与多样本的 DNA混合文库进行杂交, 以捕获多样本中 STS区域的 DNA序列。
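1) In a 1.5 ml centrifuge tube, add 450 μg of COT-1 DNA, 3 μg of the pooled library from step (6), and 1 nmol of Index-adapter1-block and Index-adapter2-block (Multiplexing Sample Preparation Oligonucleotide Kit, Illumina). Dry the mixture in a SpeedVac (Thermo) with the temperature set to 60°C.
2) Add 11.1 μL of pure water to the dried tube and fully dissolve the DNA, then add 18.5 μL of 2x SC Hybridization Buffer and 7.3 μL of SC Hybridization. Mix thoroughly, transfer the mixture to the 95°C dry bath on the hybridization system (NimbleGen), and denature the DNA for 10 minutes.
3) Remove the sample, vortex, centrifuge at full speed for 30 seconds, and place it in the 42°C tube position on the hybridization system (NimbleGen), ready for hybridization.
4) Hybridize according to the NimbleGen chip hybridization method (NimbleGen Arrays User's Guide, Version 3.1, 7 Jul 2009, Roche NimbleGen, Inc.). Load 35 μL of sample and hybridize at 42°C for 64-72 hours. Elute with 900 μL of 160 mM NaOH, purify the eluate with the MinElute PCR Purification Kit, and finally elute with 80 μL Elution Buffer.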
(八 )捕获后 PCR:
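The captured library is amplified by PCR. The reaction contains 150 μL Phusion Mix, 4.2 μL each of the forward and reverse primers (Multiplexing Sequencing Primers and Phix Control Kit), and the 80 μL eluted sample above plus 85 μL ddH2O. After mixing, the reaction is split into 6 tubes for PCR with 16 cycles. After PCR the 6 tubes are pooled, and fragments of 300-450 bp are purified and recovered using the QIAquick PCR Purification Kit with magnetic-bead purification; the elution volume is 50 μL.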
(九)文库检测:
Bioanalyzer analysis system (Agilent, Santa Clara, USA)检测文库插入片段大 小及含量; Q-PCR精确定量文库的浓度。
(十)测序及数据分析: 也就是一种基于染色体序列标签位点 STS区域微 缺失检测的方法中:将所述捕获的相应捕获探针的多样本中 STS区域的 DNA序 列进行测序, 得到测序数据; 采用数理统计方法对所述测序数据进行分析, 根 据所述分析结论, 获得每个样本中染色体 STS区域微缺失的结果。
步骤(九) 中检测合格后的文库上机测序, 测序方法参照 Illumina公司 HiSeq2000 操作方法 ( HiSeq 2000 User Guide. Catalog # SY-940-1001 Part # 15011190 Rev B , Illumina )。 本发明实施例的信息分析流程如下:
1. 接收高通量测序技术得到的测序数据, 对测序数据进行质控: 也就是所 述对多样本中 STS区域的 DNA序列的测序数据进行质量控制的步骤中:对所述 多样本中 STS区域的 DNA序列的测序数据的序列中不合格的序列进行过滤,得 到合格的多样本的测序数据的序列。
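In this embodiment, Illumina HiSeq 2000 high-throughput sequencing is used. After the reads are received, they are filtered to remove unqualified reads. A read is unqualified if: bases with a sequencing quality value below 5 make up more than 50% of the bases in the read; the number of N bases exceeds 10% of the bases in the read; or alignment against the sequencing adapter library shows that the read contains a sequencing adapter sequence.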
2. 样本区域深度统计: 也就是所述对多样本中 STS区域的 DNA序列的测 序数据进行质量控制的步骤中: 通过短序列对比软件, 将所述合格的多样本的 测序数据与参考基因组序列进行对比, 并统计每个样本的测序深度的相关参数 以及不同样本之间相同的 STS区域的测序深度的相关参数。
在本实施例中,采用 SOAPaligner比对程序,将高通量测序技术得到的测序 数据比对到人参考基因组序列上, 人参考基因组序列采用 HG19 ( http://genome.ucsc.edu/ )。 比对后, 进行样品区域测序深度进行统计。
3. 对每个样品进行深度统计, 也就是所述对多样本中 STS区域的 DNA序 列的测序数据进行质量控制的步骤中: 根据所述统计得到的每个样本的测序深 度的相关参数,过滤掉不合格的样本的测序数据,得到合格的样本的测序数据。
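Samples whose depth values are outliers are filtered out. A sample whose depth lies outside the range from Q1 minus 1.5 times the IQR to Q3 plus 1.5 times the IQR (Q1 and Q3 are the lower and upper quartiles; IQR is the interquartile range) is defined as an outlier sample and is not used in the next step, the construction of the normal distribution curve. Figure 20 shows the filtering of samples with abnormal depth: the left panel is a histogram of the median sequencing depth of the samples, and the right panel shows the same result as a box plot. Both show that the median sample sequencing depth falls mainly between 42X and 60X.
4. Compute depth statistics for each region, i.e. the step in the quality control of the sequencing data of the STS-region DNA sequences of the multiple samples in which, based on the computed sequencing-depth parameters of the same STS region across different samples, the sequencing data of unqualified STS regions are filtered out to obtain the sequencing data of qualified STS regions.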
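Regions whose median depth is 0 or greater than Q3 plus 1.5 times the IQR (Q3 is the upper quartile; IQR is the interquartile range) are removed as outliers and are not tested for microdeletion. In Figure 21, the left panel is a histogram of the median region sequencing depth, and the right panel shows the same result as a box plot. Both show that the median region sequencing depth falls mainly between 35X and 75X.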
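5. After data quality control has removed the regions and samples with outlying depth values, normalize the depth values of each sample: each region's depth value is divided by the mean depth of that sample to give the standardized depth value. Extreme values are then tested using the normal distribution curve. That is, the step of detecting abnormal depth values in the STS regions of the samples by mathematical statistics, based on the normalized depth values of the STS regions, and obtaining the microdeletion result for those regions comprises: computing, from the normalized depth values of the same region in all samples, their mean and variance; obtaining from that mean and variance the normal distribution curve of all non-outlier samples for that region; computing from the curve the probability value of each sample in each region at its particular depth value; setting a first probability threshold and, if the probability value of a sample's region at its depth value is below the first probability threshold, obtaining result R1 that the sample region has a microdeletion; experimentally validating result R1 and, based on the validation, setting a second probability threshold that is smaller than the first; and, if the probability value of the sample region at its depth value is below the second probability threshold, obtaining result R2 that the sample region has a microdeletion.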
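For each region remaining after outlier filtering, take the data of the samples remaining after outlier filtering and, from their mean and variance, obtain the normal distribution curve of all non-outlier sample data for that region. From the curve, obtain the probability value (p.value) that the region's depth takes a particular value. In Figure 22 the horizontal axis is the sample and the vertical axis is -log10(p-value); Figure 23 shows the expected versus observed p-values. In this way the probability value of the depth of every sample in every region is obtained from the sequencing data. The 20 sample regions with low probability values were validated by PCR; the validation results were used to determine the probability threshold for the microdeletion test and to assess its accuracy. The specific steps are:
First, select some regions from some samples, including deletion-positive regions, deletion-negative regions, and regions spanning a wide range of p.value. Then design primers at the two ends of each region and PCR-amplify the sample libraries. Finally, analyze by electrophoresis to obtain the microdeletion results, and from the statistics obtain a statistically meaningful p.value cutoff. Table 8 gives the true-positive and false-positive rates obtained at different thresholds. The validation shows that the AUC (area under curve) of this high-throughput-sequencing-based microdeletion test reaches 0.9968254. In Figure 24 the horizontal axis is the false-positive rate and the vertical axis is the true-positive rate; different probability thresholds can be chosen according to the required true-positive and false-positive rates.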
Threshold   True-positive rate   False-positive rate
0           1                    1
1           1                    0.20952381
2           1                    0.123809524
3           1                    0.076190476
4           1                    0.038095238
5           1                    0.028571429
6           0.974358974          0.028571429
7           0.897435897          0.019047619
8           0.871794872          0.00952381
9           0.846153846          0.00952381
10          0.846153846          0
11          0.820512821          0
12          0.512820513          0
13          0.512820513          0
14          0.512820513          0
15          0.512820513          0
16          0.512820513          0
17          0.512820513          0
18          0.512820513          0
19          0.512820513          0
20          0.487179487          0
21          0.256410256          0
22          0.256410256          0
23          0                    0
Table 8: True-positive and false-positive rates at different thresholds on -log10(p-value)
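The reported AUC can be reproduced from Table 8 by trapezoidal integration of the true-positive rate over the false-positive rate. The sketch below is illustrative; the variable and function names are hypothetical, and the points are the table rows as (threshold, true-positive rate, false-positive rate).

```python
TABLE8 = [
    (0, 1.0, 1.0), (1, 1.0, 0.20952381), (2, 1.0, 0.123809524),
    (3, 1.0, 0.076190476), (4, 1.0, 0.038095238), (5, 1.0, 0.028571429),
    (6, 0.974358974, 0.028571429), (7, 0.897435897, 0.019047619),
    (8, 0.871794872, 0.00952381), (9, 0.846153846, 0.00952381),
    (10, 0.846153846, 0.0), (11, 0.820512821, 0.0),
    (12, 0.512820513, 0.0), (13, 0.512820513, 0.0), (14, 0.512820513, 0.0),
    (15, 0.512820513, 0.0), (16, 0.512820513, 0.0), (17, 0.512820513, 0.0),
    (18, 0.512820513, 0.0), (19, 0.512820513, 0.0),
    (20, 0.487179487, 0.0), (21, 0.256410256, 0.0),
    (22, 0.256410256, 0.0), (23, 0.0, 0.0),
]


def roc_auc(rows) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    # Order the points along the curve: by false-positive rate, then true-positive rate.
    points = sorted((fpr, tpr) for _, tpr, fpr in rows)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


AUC = roc_auc(TABLE8)
```

The result agrees with the stated value of 0.9968254 to within rounding of the table entries.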
6.另外借助样品区域深度值, 通过决策树(如图 18 ), 判定样品区域缺失阳 性还是阴性。 也就是所述根据得到的样本的 STS区域的均一化的深度值, 采用 数理统计方法, 检测所述样本 STS区域的深度值异常值, 并获得所述样本 STS 区域微缺失的结果的步骤包括:根据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化的深度值与所有样本的深度值的中位数的比值 D/S;根 据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化的深度 值与所有区域的深度值的中位数的比值 D/R; 将所述比值 D/S 通过 ID3算法训 练出第一比值阈值, 将所述比值 D/R通过 ID3算法训练出第二比值阈值; 若样 本区域的比值 D/S大于第一比值阈值,则获得所述样本区域没有 缺失的结果; 若样本区域的比值 D/S小于第一比值阈值, 并且样本区域的比值 D/R大于第二 比值阈值, 则获得所述样本区域没有 缺失的结果; 若样本区域的比值 D/S 小 于第一比值阈值, 并且样本区域的比值 D/R小于第二比值阈值, 则获得所述样 本区域有微缺失的结果。
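In this example, a relatively relaxed p-value threshold of 10^-6 was set in advance, and the positive results obtained with it were then classified with the decision tree to further reduce the false-positive rate, finally giving the deletion sites of the sample regions. That is, the step of detecting abnormal depth values in the STS regions of the samples by mathematical statistics, based on the normalized depth values of the STS regions, and obtaining the microdeletion result for those regions comprises: computing, from the normalized depth values of the same region in all samples, their mean and variance; obtaining from that mean and variance the normal distribution curve of all non-outlier samples for that region; computing from the curve the probability value of each sample in each region at its particular depth value; setting a third probability threshold and, if the probability value of a sample region at its depth value is below the third probability threshold, obtaining result R3 that the sample region has a microdeletion; computing, for the sample regions in result R3, the ratio D/S of the normalized depth value to the median depth value of all samples; computing, for the sample regions in result R3, the ratio D/R of the normalized depth value to the median depth value of all regions; training a third ratio threshold from D/S and a fourth ratio threshold from D/R using the ID3 algorithm; if the ratio D/S of a sample region is greater than the third ratio threshold, obtaining the result that the sample region has no microdeletion; if D/S is less than the third ratio threshold and D/R is greater than the fourth ratio threshold, obtaining the result that the sample region has no microdeletion; and if D/S is less than the third ratio threshold and D/R is less than the fourth ratio threshold, obtaining the result that the sample region has a microdeletion.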
本发明还提供一种计算机可读介质, 所述介质承载一系列指令以控制计算 机处理器执行如上所述的方法, 在此也不再赘述。
图 25是本发明基于染色体序列标签位点 STS区域微缺失检测的装置一实施 例的结构示意图。 如图 25 , 所述装置包括: 捕获探针获得模块 2501、 杂交模块 2502、 测序数据获得模块 2503以及微缺失结果获得模块 2504。
捕获探针获得模块 2501用于选取染色体上的 STS区域, 根据所述 STS区 域的 DNA序列, 设计得到相应的捕获探针。
STS ( sequence-tagged site )序列标签位点, 是基因组上定位明确、 作为界 标并能够通过 PCR扩增、 被唯一操作的短的、 单拷贝的 DNA序列, 用于产生 作图位点, 即测定一系列 STS的次序即可作出基因组区域的图谱。 探针是一小 段单链 DNA或者 RNA片段(大约是 20到 500bp ) , 用于检测与其互补的核酸 序列。
杂交模块 2502用于将所述捕获探针与多样本的 DNA混合文库进行杂交, 以捕获多样本中 STS区域的 DNA序列。
DNA文库是指某一特定来源 DNA通过细胞 - DNA克隆技术构建呈含有所 用 DNA片段的重组 DNA分子,并转化至细菌内,构成 DNA文库。此处的 DNA 混合文库是指所有样本混合在一起的所用 DNA片段。 测序数据获得模块 2503用于将所述捕获的相应捕获探针的多样本中 STS区 域的 DNA序列进行测序, 得到测序数据。
微缺失结果获得模块 2504用于采用数理统计方法对所述测序数据进行分析, 根据所述分析结论, 获得每个样本中染色体 STS区域微缺失的结果。
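The capture probe obtaining module 2501 comprises: a region lookup unit for finding the DNA sequences of the STS regions on the chromosome in a genome database (GDB); a sequence selection unit for selecting, from the DNA sequences of the STS regions found, the sequences that meet the capture probe design conditions; and a capture probe obtaining unit for designing and synthesizing the capture probes from the selected sequences that meet the capture probe design conditions.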
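The genome database (GDB, http://www.gdb.org/) is the database in which the Human Genome Project (HGP) stores and processes genome map data. GDB aims to build an encyclopedia of the human genome: besides building genome maps, it has developed ways of describing genome content at the sequence level, including sequence variation and other descriptions of function and phenotype. GDB stores data in an object model and provides a Web-based retrieval service for data objects; users can search for objects of various types and view genome maps graphically. For example, find the position coordinates of the Y-chromosome STS sequences in the UCSC database on the human genome reference sequence Hg19 (http://genome.ucsc.edu/); the DNA sequences of the STS regions on the Y chromosome can then be found from those coordinates. In a specific embodiment, the position coordinates of the DNA sequences of the STS regions on the chromosome are looked up in a genome database and submitted to Roche-NimbleGen or another company that provides DNA capture services, and that company designs and synthesizes the capture probes.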
其中, 所述装置还包括多样本 DNA混合文库制备模块, 所述多样本 DNA 混合文库制备模块包括: 单样本 DNA文库制备单元, 用于制备多个带有不同接 头的质量控制合格的单样本的 DNA文库; 单样本文库混合单元, 用于将所述多 个单样本的 DNA文库按照预定比例混合; 多样本 DNA混合文库获得单元, 用 于检验所述混合的多样本的 DNA文库的质量是否合格, 若是, 即为制备的多样 本的 DNA混合文库。 其中, 所述装置还包括测序数据质控模块, 所述测序数据质控模块包括: 合格序列获得单元,用于对所述多样本中 STS区域的 DNA序列的测序数据中不 合格的数据进行过滤, 得到合格的多样本的测序数据; 测序深度统计单元, 用 于通过短序列对比软件, 将所述合格的多样本的测序数据与参考基因组序列进 行对比, 并统计每个样本的测序深度的相关参数以及不同样本之间相同的 STS 区域的测序深度的相关参数; 合格样本获得单元, 用于根据所述统计得到的每 个样本的测序深度的相关参数, 过滤掉不合格的样本的测序数据, 得到合格的 样本的测序数据; 合格区域获得单元, 用于根据所述统计得到的不同样本之间 相同的 STS 区域的测序深度的相关参数, 过滤掉不合格 STS 区域的测序数据, 得到合格 STS区域的测序数据。
其中, 所述微缺失结果获得模块包括: 深度值均一化单元, 用于将样本的 STS 区域的测序深度值进行均一化, 得到均一化的深度值; 微缺失结果获得单 元, 用于根据得到的样本的 STS区域的均一化的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值异常值, 并获得所述样本 STS区域微缺失的结 果。
其中, 所述深度值均一化单元具体用于将所有样本中相同区域的深度值除 以每个样本深度值的平均值, 得到所述样本区域均一化的深度值。
其中, 如图 26所示, 所述微缺失结果获得单元包括:
平均值方差获得单元 2601 , 用于根据得到的所有样本的同一区域的均一化 的深度值, 计算所述所有样本的同一区域的均一化深度值的平均值以及方差; 正态分布曲线获得单元 2602 , 用于根据所述所有样本的同一区域的均一化 深度值的平均值以及方差, 获得所述同一区域所有非离群样本的正态分布曲线; 概率值计算单元 2603 , 用于根据所述正态分布曲线, 计算每个样本在每个 区域在特定深度值时的概率值;
第一判断单元 2604, 用于根据每个样本在相应区域在特定深度值时的概率 值, 设置第一概率值阈值, 若所述样本所在区域在特定深度值时的概率值小于 所述概率值第一概率值阈值, 则获得所述样本区域有微缺失的结果 Rl。 在一优选实施例中, 如图 27所示, 所述 缺失结果获得单元还包括: 概率值阈值确定单元 2701 ,用于对所述样本区域有微缺失的结果 R1进行实 验验证, 根据实验验证结果, 设置第二概率值阈值, 其中, 所述第二概率值阈 值小于第一概率值阈值;
第二判断单元 2702, 用于若所述样本区域在特定深度值时的概率值小于所 述第二概率值阈值, 则获得所述样本区域有微缺失的结果 R2。
如图 28所示, 在又一优选实施例中, 所述 缺失结果获得单元包括: 比值 D/S获得单元 2801 , 用于根据得到的样本的 STS区域的均一化的深度 值,计算所述样本区域均一化的深度值与所有样本的深度值的中位数的比值 D/S; 比值 D/R获得单元 2802,用于根据得到的样本的 STS区域的均一化的深度 值,计算所述样本区域均一化的深度值与所有区域的深度值的中位数的比值 D/R; 第一、 二比值阈值获得单元 2803 , 用于将所述比值 D/S 通过 ID3算法训练 出第一比值阈值, 将所述比值 D/R通过 ID3算法训练出第二比值阈值;
第一判断单元 2804, 用于若样本区域的比值 D/S大于第一比值阈值, 则获 得所述样本区域没有微缺失的结果;
第二判断单元 2805, 用于若样本区域的比值 D/S小于第一比值阈值, 并且 样本区域的比值 D/R大于第二比值阈值, 则获得所述样本区域没有微缺失的结 果;
第三判断单元 2806, 用于若样本区域的比值 D/S小于第一比值阈值, 并且 样本区域的比值 D/R小于第二比值阈值,则获得所述样本区域有微缺失的结果。
如图 29所示, 在又一优选实施例中, 所述 缺失结果获得单元包括: 平均值方差获得单元 2901 , 用于根据得到的所有样本的同一区域的均一化 的深度值, 计算所述所有样本的同一区域的均一化深度值的平均值以及方差; 正态分布曲线获得单元 2902, 用于根据所述所有样本的同一区域的均一化 深度值的平均值以及方差, 获得所述同一区域所有非离群样本的正态分布曲线; 概率值计算单元 2903 , 用于根据所述正态分布曲线, 计算每个样本在每个 区域在特定深度值时的概率值;
第一判断单元 2904, 用于根据每个样本在每个区域在特定深度值时的概率 值, 设置第三概率值阈值, 若所述样本区域在特定深度值时的概率值小于所述 第三概率值阈值, 则获得所述样本区域有微缺失的结果 R3;
比值 D/S获得单元 2905, 用于计算所述结果 R3中样本区域均一化的深度 值与所有样本的深度值的中位数的比值 D/S;
比值 D/R获得单元 2906, 用于计算所述结果 R3中样本区域均一化的深度 值与所有区域的深度值的中位数的比值 D/R;
第三、 四比值阈值获得单元 2907, 用于将所述比值 D/S 通过 ID3算法训练 出第三比值阈值, 将所述比值 D/R通过 ID3算法训练出第四比值阈值;
第二判断单元 2908, 用于若样本区域的比值 D/S大于第三比值阈值, 则获 得所述样本区域没有微缺失的结果;
第三判断单元 2909, 用于若样本区域的比值 D/S小于第三比值阈值, 并且 样本区域的比值 D/R大于第四比值阈值, 则获得所述样本区域没有微缺失的结 果;
第四判断单元 2910, 用于若样本区域的比值 D/S小于第三比值阈值, 并且 样本区域的比值 D/R小于第四比值阈值,则获得所述样本区域有微缺失的结果。
本发明基于染色体序列标签位点 STS区域微缺失检测的装置, 根据 STS区 域的 DNA序列, 设计获得捕获探针, 探针涵盖了整个染色体上 STS区域, 与多 样本的 DNA混合文库进行杂交后, 捕获到的多样本中 STS区域的 DNA序列, 能够大量、 高效、 准确地检测出染色体上已经报道或者未经报道的 STS相关区 域的微缺失, 另外, 本发明的数理统计信息分析可以按照正态分布的方法, 也 可以按照决策树的分析方法, 或者将两种方法结合起来, 最后还通过实验进行 验证, 这种信息分析流程科学、 稳定, 灵敏度高、 假阳性低, 可以有效的针对 微缺失进行分析。 以上所述仅为本发明的实施例, 并非因此限制本发明的专利范围, 凡是利 用本发明说明书及附图内容所作的等效结构或等效流程变换, 或直接或间接运 用在其他相关的技术领域, 均同理包括在本发明的专利保护范围内。

Claims

权利要求
1.一种基于染色体序列标签位点 STS区域微缺失检测的方法,其特征在于: 所述方法包括:
选取染色体上的 STS区域, 根据所述 STS区域的 DNA序列, 设计得到相 应的捕获探针;
将所述捕获探针与多样本的 DNA混合文库进行杂交, 以捕获多样本中 STS 区域的 DNA序列;
将所述捕获的相应捕获探针的多样本中 STS区域的 DNA序列进行测序,得 到测序数据;
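analyzing the sequencing data using mathematical statistics and, from the conclusions of that analysis, obtaining the microdeletion result for the chromosomal STS regions of each sample.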
2.根据权利要求 1所述的方法, 其特征在于: 所述采用数理统计方法对所述 测序数据进行分析, 根据所述分析结论, 获得每个样本中染色体 STS区域微缺 失的结果的步骤包括:
将样本的 STS区域的测序深度值进行均一化, 得到均一化的深度值; 根据得到的样本的 STS区域的均一化的深度值, 采用数理统计方法, 检测 所述样本 STS区域的深度值异常值, 并获得所述样本 STS区域微缺失的结果。
3.根据权利要求 2所述的方法, 其特征在于: 所述将样本的 STS 区域的测 序深度值进行均一化的步骤包括: 将所有样本中相同区域的深度值除以每个样 本深度值的平均值, 得到所述样本区域均一化的深度值。
4.根据权利要求 2 所述的方法, 其特征在于: 所述根据得到的样本的 STS 区域的均一化的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值 异常值, 并获得所述样本 STS区域微缺失的结果的步骤包括:
根据得到的所有样本的同一区域的均一化的深度值, 计算所述所有样本的 同一区域的均一化深度值的平均值以及方差; 根据所述所有样本的同一区域的均一化深度值的平均值以及方差, 获得所 述同一区域所有非离群样本的正态分布曲线;
根据所述正态分布曲线, 计算每个样本在每个区域在特定深度值时的概率 值;
根据每个样本在相应区域在特定深度值时的概率值, 设置第一概率值阈值, 若所述样本所在区域在特定深度值时的概率值小于所述概率值第一概率值阈值, 则获得所述样本区域有微缺失的结果 Rl。
5.根据权利要求 4所述的方法, 其特征在于: 所述根据每个样本在相应区域 在特定深度值时的概率值, 设置第一概率值阈值, 若所述样本所在区域在特定 深度值时的概率值小于所述概率值第一概率值阈值, 则获得所述样本区域有微 缺失的结果 R1的步骤之后还包括:
对所述样本区域有微缺失的结果 R1进行实验验证, 根据实验验证结果, 设 置第二概率值阈值, 其中, 所述第二概率值阈值小于第一概率值阈值;
若所述样本区域在特定深度值时的概率值小于所述第二概率值阈值, 则获 得所述样本区域有微缺失的结果 R2。
6.根据权利要求 2 所述的方法, 其特征在于: 所述根据得到的样本的 STS 区域的均一化的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值 异常值, 并获得所述样本 STS区域微缺失的结果的步骤包括:
根据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化 的深度值与所有样本的深度值的中位数的比值 D/S;
根据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化 的深度值与所有区域的深度值的中位数的比值 D/R;
将所述比值 D/S 通过 ID3算法训练出第一比值阈值, 将所述比值 D/R通过 ID3算法训练出第二比值阈值;
若样本区域的比值 D/S 大于第一比值阈值, 则获得所述样本区域没有 缺 失的结果; 若样本区域的比值 D/S小于第一比值阈值, 并且样本区域的比值 D/R大于 第二比值阈值, 则获得所述样本区域没有微缺失的结果;
若样本区域的比值 D/S小于第一比值阈值, 并且样本区域的比值 D/R小于 第二比值阈值, 则获得所述样本区域有微缺失的结果。
7.根据权利要求 2 所述的方法, 其特征在于: 所述根据得到的样本的 STS 区域的均一化的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值 异常值, 并获得所述样本 STS区域微缺失的结果的步骤包括:
根据得到的所有样本的同一区域的均一化的深度值, 计算所述所有样本的 同一区域的均一化深度值的平均值以及方差;
根据所述所有样本的同一区域的均一化深度值的平均值以及方差, 获得所 述同一区域所有非离群样本的正态分布曲线;
根据所述正态分布曲线, 计算每个样本在每个区域在特定深度值时的概率 值;
根据每个样本在每个区域在特定深度值时的概率值, 设置第三概率值阈值, 若所述样本区域在特定深度值时的概率值小于所述第三概率值阈值, 则获得所 述样本区域有微缺失的结果 R3;
计算所述结果 R3中样本区域均一化的深度值与所有样本的深度值的中位数 的比值 D/S;
计算所述结果 R3中样本区域均一化的深度值与所有区域的深度值的中位数 的比值 D/R;
将所述比值 D/S 通过 ID3算法训练出第三比值阈值, 将所述比值 D/R通过 ID3算法训练出第四比值阈值;
若样本区域的比值 D/S 大于第三比值阈值, 则获得所述样本区域没有微缺 失的结果;
若样本区域的比值 D/S小于第三比值阈值, 并且样本区域的比值 D/R大于 第四比值阈值, 则获得所述样本区域没有微缺失的结果; 若样本区域的比值 D/S小于第三比值阈值, 并且样本区域的比值 D/R小于 第四比值阈值, 则获得所述样本区域有微缺失的结果。
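8. The method according to claim 1, characterized in that the step of selecting STS regions on the chromosome and designing and synthesizing the corresponding capture probes from the DNA sequences of the STS regions comprises:
finding the DNA sequences of the STS regions on the chromosome in a genome database;
selecting, from the DNA sequences of the STS regions found, the sequences that meet the capture probe design conditions;
designing and synthesizing the capture probes from the selected sequences that meet the capture probe design conditions.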
9.根据权利要求 1所述的方法, 其特征在于: 所述多样本的 DNA混合文库 的制备的步骤包括:
制备多个带有不同接头的质量控制合格的单样本的 DNA文库;
将所述多个单样本的 DNA文库按照预定比例混合;
检验所述混合的多样本的 DNA文库的质量是否合格, 若是, 即为制备的多 样本的 DNA混合文库。
10.根据权利要求 8所述的方法,其特征在于: 所述单样本的 DNA文库的制 备的步骤包括:
利用物理或化学的方法将基因组 DNA打断成预定大小的 DNA片段, 回收 所述打断的 DNA片段;
利用酶对所述回收的 DNA 片段进行末端修复, 形成补平的末端磷酸化的 DNA片段, 回收所述补平的末端磷酸化的 DNA片段;
利用酶对所述回收的补平的 DNA片段的 3,末端加上" A"碱基, 回收所述 3, 末端加上" A"碱基的 DNA片段;
使所述回收的 3,末端加上" A"碱基的 DNA 片段在酶的作用下与标签接头 Index Adapter连接, 并回收带有标签接头的 DNA片段;
以标签接头序列的引物做为引物,对所述带有标签接头的 DNA片段进行扩 增, 回收所述扩增的产物; 检验所述扩增的产物的质量控制是否合格,若是,即为制备的单样本的 DNA 文库。
11.根据权利要求 1所述的方法, 其特征在于: 所述将捕获的相应捕获探针 的多样本中 STS区域的 DNA序列进行测序, 得到测序数据的步骤之后还包括: 对所述对多样本中 STS区域的 DNA序列的测序数据进行质量控制。
12.根据权利要求 10所述的方法, 其特征在于: 所述对多样本中 STS 区域 的 DNA序列的测序数据进行质量控制的步骤包括:
对所述多样本中 STS区域的 DNA序列的测序数据中不合格的数据进行过滤, 得到合格的多样本的测序数据;
通过短序列对比软件, 将所述合格的多样本的测序数据与参考基因组序列 进行对比,并统计每个样本的测序深度的相关参数以及不同样本之间相同的 STS 区域的测序深度的相关参数;
根据所述统计得到的每个样本的测序深度的相关参数, 过滤掉不合格的样 本的测序数据, 得到合格的样本的测序数据;
根据所述统计得到的不同样本之间相同的 STS区域的测序深度的相关参数, 过滤掉不合格 STS区域的测序数据, 得到合格 STS区域的测序数据。
13.根据权利要求 11所述的方法, 其特征在于: 所述对多样本中 STS 区域 的 DNA序列的测序数据中不合格的数据进行过滤,得到合格的多样本的测序数 据的步骤包括:
通过测序数据中低质量值碱基的比例进行测序质量过滤, 若低质量值碱基 个数超过整条序列碱基个数的预定比例, 则判断为是不合格的数据, 将所述不 合格的测序数据过滤掉, 获得初步合格的第一测序数据集合;
若所述初步合格的第一测序数据集合中测序结果不确定的碱基个数超过整 条序列碱基个数的 10%, 则判断为是不合格的数据, 将所述不合格的测序数据 过滤掉, 获得初步合格的第二测序数据集合;
将所述初步合格的第二测序数据集合中所有测序数据与测序接头序列库进 行比对, 若所述初步合格的第二测序数据集合中存在测序接头序列, 则判断为 是不合格的数据, 将所述不合格的测序数据过滤掉, 获得初步合格的第三测序 数据集合;
将所述初步合格的第三测序数据集合中所有测序数据与试验中引入的所有 外源序列比对, 若所述初步合格的第三测序数据集合中存在外源序列, 则判断 为是不合格的数据, 将所述不合格的测序数据过滤掉, 获得合格的多样本的测 序数据。
14.根据权利要求 11所述的方法, 其特征在于: 所述根据统计得到的每个样 本的测序深度的相关参数, 过滤掉不合格的样本的测序数据, 得到合格的样本 的测序数据的步骤包括:
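sorting the sequencing depth values of all samples in ascending order, and using the quartile function to determine the lower quartile Q1, the upper quartile Q3, and the interquartile range IQR of the sorted sequencing depth values;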
将所有样本的测序深度值在 Q1减去 1.5倍 IQR和 Q3加上 1.5倍 IQR范围 之外的不合格的样本的测序数据过滤掉, 得到合格的样本的测序数据。
15.根据权利要求 11所述的方法, 其特征在于: 所述根据统计得到的不同样 本之间相同的 STS区域的测序深度的相关参数, 过滤掉不合格 STS区域的测序 数据, 得到合格 STS区域的测序数据的步骤包括:
将不同样本之间相同的 STS区域的测序深度值按照从小到大的顺序进行排 序, 利用四分位函数确定所述排序后的不同样本之间相同的 STS区域的测序深 度值的中位数、 上四分位数 Q3以及四分位数间距 IQR;
将不同样本之间相同的 STS区域的测序深度值中位数为 0或者中位数大于 Q3加上 1.5倍 IQR的不合格的 STS区域的测序数据过滤掉, 得到合格的 STS 区域的测序数据。
16.—种基于染色体序列标签位点 STS区域微缺失检测的装置,其特征在于: 所述装置包括: 捕获探针获得模块, 用于选取染色体上的 STS区域, 根据所述 STS区域的 DNA序列, 设计得到相应的捕获探针;
杂交模块, 用于将所述捕获探针与多样本的 DNA混合文库进行杂交, 以捕 获多样本中 STS区域的 DNA序列;
测序数据获得模块, 用于将所述捕获的相应捕获探针的多样本中 STS区域 的 DNA序列进行测序, 得到测序数据;
微缺失结果获得模块, 用于采用数理统计方法对所述测序数据进行分析, 根据所述分析结论, 获得每个样本中染色体 STS区域微缺失的结果。
17.根据权利要求 15所述的装置, 其特征在于: 所述微缺失结果获得模块包 括:
深度值均一化单元, 用于将样本的 STS区域的测序深度值进行均一化, 得 到均一化的深度值;
微缺失结果获得单元,用于根据得到的样本的 STS区域的均一化的深度值, 采用数理统计方法, 检测所述样本 STS区域的深度值异常值, 并获得所述样本 STS区域微缺失的结果。
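18. The device according to claim 16, characterized in that the depth value normalization unit is specifically configured to divide the depth value of a given region in each sample by the mean depth value of that sample, giving the normalized depth value of the sample region.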
19.根据权利要求 16所述的装置, 其特征在于: 所述微缺失结果获得单元包 括:
平均值方差获得单元, 用于根据得到的所有样本的同一区域的均一化的深 度值, 计算所述所有样本的同一区域的均一化深度值的平均值以及方差;
正态分布曲线获得单元, 用于根据所述所有样本的同一区域的均一化深度 值的平均值以及方差, 获得所述同一区域所有非离群样本的正态分布曲线; 概率值计算单元, 用于根据所述正态分布曲线, 计算每个样本在每个区域 在特定深度值时的概率值; 第一判断单元, 用于根据每个样本在相应区域在特定深度值时的概率值, 设置第一概率值阈值, 若所述样本所在区域在特定深度值时的概率值小于所述 概率值第一概率值阈值, 则获得所述样本区域有微缺失的结果 Rl。
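20. The device according to claim 19, characterized in that the microdeletion result obtaining unit further comprises:
a probability threshold determining unit for experimentally validating result R1 that the sample region has a microdeletion and, based on the validation results, setting a second probability threshold, wherein the second probability threshold is smaller than the first probability threshold;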
第二判断单元, 用于若所述样本区域在特定深度值时的概率值小于所述第 二概率值阈值, 则获得所述样本区域有微缺失的结果 R2。
21.根据权利要求 16所述的装置, 其特征在于: 所述微缺失结果获得单元包 括:
比值 D/S获得单元, 用于根据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化的深度值与所有样本的深度值的中位数的比值 D/S; 比值 D/R获得单元, 用于根据得到的样本的 STS区域的均一化的深度值, 计算所述样本区域均一化的深度值与所有区域的深度值的中位数的比值 D/R;
第一、 二比值阈值获得单元, 用于将所述比值 D/S 通过 ID3算法训练出第 一比值阈值, 将所述比值 D/R通过 ID3算法训练出第二比值阈值;
第一判断单元, 用于若样本区域的比值 D/S 大于第一比值阈值, 则获得所 述样本区域没有微缺失的结果;
第二判断单元, 用于若样本区域的比值 D/S 小于第一比值阈值, 并且样本 区域的比值 D/R大于第二比值阈值, 则获得所述样本区域没有 缺失的结果; 第三判断单元, 用于若样本区域的比值 D/S 小于第一比值阈值, 并且样本 区域的比值 D/R小于第二比值阈值, 则获得所述样本区域有 缺失的结果。
22.根据权利要求 16所述的装置, 其特征在于: 所述微缺失结果获得单元包 括: 平均值方差获得单元, 用于根据得到的所有样本的同一区域的均一化的深 度值, 计算所述所有样本的同一区域的均一化深度值的平均值以及方差;
正态分布曲线获得单元, 用于根据所述所有样本的同一区域的均一化深度 值的平均值以及方差, 获得所述同一区域所有非离群样本的正态分布曲线; 概率值计算单元, 用于根据所述正态分布曲线, 计算每个样本在每个区域 在特定深度值时的概率值; 设置第三概率值阈值, 若所述样本区域在特定深度值时的概率值小于所述第三 概率值阈值, 则获得所述样本区域有微缺失的结果 R3;
比值 D/S获得单元,用于计算所述结果 R3中样本区域均一化的深度值与所 有样本的深度值的中位数的比值 D/S;
比值 D/R获得单元,用于计算所述结果 R3中样本区域均一化的深度值与所 有区域的深度值的中位数的比值 D/R;
第三、 四比值阈值获得单元, 用于将所述比值 D/S 通过 ID3算法训练出第 三比值阈值, 将所述比值 D/R通过 ID3算法训练出第四比值阈值;
第二判断单元, 用于若样本区域的比值 D/S 大于第三比值阈值, 则获得所 述样本区域没有微缺失的结果;
第三判断单元, 用于若样本区域的比值 D/S 小于第三比值阈值, 并且样本 区域的比值 D/R大于第四比值阈值, 则获得所述样本区域没有微缺失的结果; 第四判断单元, 用于若样本区域的比值 D/S 小于第三比值阈值, 并且样本 区域的比值 D/R小于第四比值阈值, 则获得所述样本区域有微缺失的结果。
23.根据权利要求 15所述的装置,其特征在于:所述捕获探针获得模块包括: 区域查找单元,用于在基因组数据库中查找染色体上的 STS区域的 DNA序 列;
序列挑选单元,用于在所述查找到的 STS区域的 DNA序列中挑选符合捕获 探针设计条件的序列; 捕获探针获得单元, 用于根据所述挑选到的符合捕获探针设计条件的序列, 设计并合成得到捕获探针。
24.根据权利要求 15所述的装置,其特征在于:所述装置还包括多样本 DNA 混合文库制备模块, 所述多样本 DNA混合文库制备模块包括:
单样本 DNA文库制备单元,用于制备多个带有不同接头的质量控制合格的 单样本的 DNA文库;
单样本文库混合单元,用于将所述多个单样本的 DNA文库按照预定比例混 合;
多样本 DNA混合文库获得单元, 用于检验所述混合的多样本的 DNA文库 的质量是否合格, 若是, 即为制备的多样本的 DNA混合文库。
25.根据权利要求 15所述的装置, 其特征在于: 所述装置还包括测序数据质 控模块, 所述测序数据质控模块包括:
合格序列获得单元,用于对所述多样本中 STS区域的 DNA序列的测序数据 中不合格的数据进行过滤, 得到合格的多样本的测序数据的序列;
测序深度统计单元, 用于通过短序列对比软件, 将所述合格的多样本的测 序数据与参考基因组序列进行对比, 并统计每个样本的测序深度的相关参数以 及不同样本之间相同的 STS区域的测序深度的相关参数;
合格样本获得单元, 用于根据所述统计得到的每个样本的测序深度的相关 参数, 过滤掉不合格的样本的测序数据, 得到合格的样本的测序数据;
合格区域获得单元, 用于根据所述统计得到的不同样本之间相同的 STS区 域的测序深度的相关参数, 过滤掉不合格 STS区域的测序数据, 得到合格 STS 区域的测序数据。
26.—种计算机可读介质, 其特征在于, 所述介质承载一系列指令以控制计 算机处理器执行如权利要求 1至 15中任一项所述的方法。
PCT/CN2012/071648 2012-02-27 2012-02-27 一种检测染色体sts区域微缺失的方法及其装置 WO2013127049A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
RU2014138794A RU2610691C2 (ru) 2012-02-27 2012-02-27 Способ обнаружения микроделеций в области хромосомы с днк-маркирующим участком
ES12870229T ES2701775T3 (es) 2012-02-27 2012-02-27 Procedimiento y dispositivo para detectar microdeleción en el área del cromosoma STS
EP12870229.7A EP2821501B1 (en) 2012-02-27 2012-02-27 Method and device for detecting microdeletion in chromosome sts area
PCT/CN2012/071648 WO2013127049A1 (zh) 2012-02-27 2012-02-27 一种检测染色体sts区域微缺失的方法及其装置
CN201280070387.7A CN104145028B (zh) 2012-02-27 2012-02-27 一种检测染色体sts区域微缺失的方法及其装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/071648 WO2013127049A1 (zh) 2012-02-27 2012-02-27 一种检测染色体sts区域微缺失的方法及其装置

Publications (1)

Publication Number Publication Date
WO2013127049A1 true WO2013127049A1 (zh) 2013-09-06

Family

ID=49081520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/071648 WO2013127049A1 (zh) 2012-02-27 2012-02-27 一种检测染色体sts区域微缺失的方法及其装置

Country Status (5)

Country Link
EP (1) EP2821501B1 (zh)
CN (1) CN104145028B (zh)
ES (1) ES2701775T3 (zh)
RU (1) RU2610691C2 (zh)
WO (1) WO2013127049A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063959A (zh) * 2018-06-22 2018-12-21 深圳弘睿康生物科技有限公司 一种样本质量控制分析方法和系统

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090112235A (ko) * 2008-04-24 2009-10-28 박민구 인간 y 염색체 미세결실 분석용 칩 및 이를 통한 특발성불임 스크리닝 검사

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100210469A1 (en) * 2006-09-08 2010-08-19 Macrogen Inc. Microarray chip and method for detection of chromosomal abnormality
US8003326B2 (en) * 2008-01-02 2011-08-23 Children's Medical Center Corporation Method for diagnosing autism spectrum disorder
US20090176226A1 (en) * 2008-01-02 2009-07-09 Children's Medical Center Corporation Method for diagnosing autism spectrum disorder
RU2402771C2 (ru) * 2008-10-10 2010-10-27 Татьяна Павловна Шкурат Способ скрининга сердечно-сосудистых заболеваний и биочип для осуществления этого способа

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090112235A (ko) * 2008-04-24 2009-10-28 박민구 Chip for analyzing human Y chromosome microdeletions and idiopathic infertility screening test using the same

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"NimbleGen Arrays User's Guide", 7 July 2009, ROCHE NIMBLEGEN, INC.
J. SAMBROOK ET AL.: "Molecular Cloning: A Laboratory Manual", SCIENCE PRESS
META-GENOMICS DNA SEQUENCING, 20 December 2009 (2009-12-20), XP008174376, Retrieved from the Internet <URL:http://biolab.blogus.com/logs54461787.html> *
MITCHELL, TOM M.: "Machine Learning.", 1997, MCGRAW-HILL
See also references of EP2821501A4 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
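CN105779434A (zh) * 2014-12-15 2016-07-20 天津华大基因科技有限公司 Kit and use thereof
CN105779432A (zh) * 2014-12-15 2016-07-20 天津华大基因科技有限公司 Kit and use thereof
CN105779433A (zh) * 2014-12-15 2016-07-20 天津华大基因科技有限公司 Kit and use thereof
CN105779435A (zh) * 2014-12-15 2016-07-20 天津华大基因科技有限公司 Kit and use thereof
CN106554993A (zh) * 2015-09-30 2017-04-05 广州华大基因医学检验所有限公司 Kit and use thereof
CN106916881A (zh) * 2015-12-28 2017-07-04 广州华大基因医学检验所有限公司 Kit and use thereof
CN105986032A (zh) * 2016-03-30 2016-10-05 广州精科生物技术有限公司 Kit, library construction method, and method and system for detecting variation in a target region
CN105925663A (zh) * 2016-03-30 2016-09-07 广州精科生物技术有限公司 Kit, use of the kit, and method and system for detecting variation in a target region
CN105925666A (zh) * 2016-03-30 2016-09-07 广州精科生物技术有限公司 Kit, use of the kit, and method and system for detecting variation in a target region
CN106755454A (zh) * 2017-01-06 2017-05-31 杭州杰毅麦特医疗器械有限公司 Nucleic acid detection method using molecular tags
CN109402241A (zh) * 2017-08-07 2019-03-01 深圳华大基因研究院 Method for identifying and analyzing ancient DNA samples
CN117051129A (zh) * 2023-10-10 2023-11-14 瑞因迈拓科技(广州)有限公司 Method for setting a background bacteria threshold in microbial detection and application thereof
CN117051129B (zh) * 2023-10-10 2024-03-22 瑞因迈拓科技(广州)有限公司 Method for setting a background bacteria threshold in microbial detection and application thereof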

Also Published As

Publication number Publication date
EP2821501A4 (en) 2015-11-11
RU2014138794A (ru) 2016-04-20
EP2821501A1 (en) 2015-01-07
EP2821501B1 (en) 2018-09-19
CN104145028B (zh) 2016-10-12
ES2701775T3 (es) 2019-02-25
CN104145028A (zh) 2014-11-12
RU2610691C2 (ru) 2017-02-14

Similar Documents

Publication Publication Date Title
WO2013127049A1 (zh) Method and device for detecting microdeletions in chromosomal STS regions
US10538806B2 (en) High throughput screening of populations carrying naturally occurring mutations
JP5389638B2 (ja) High-throughput detection of molecular markers based on restriction fragments
US20130331277A1 (en) Paired end random sequence based genotyping
JP2019523638A (ja) Multi-positioning double-tag adapter set for detecting gene mutations, and preparation method and application thereof
CN104894271B (zh) Method and device for detecting gene fusion
WO2012042374A2 (en) Method of determining number or concentration of molecules
CN107002120B (zh) Sequencing method
CN117095746A (zh) GBS-based genome-wide association analysis method for buffalo
US20230416812A1 (en) Method capable of making one cluster by connecting information of strands generated during pcr process and tracking generation order of generated strands
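CN111667883B (zh) Method for analyzing forensic mixed DNA based on interpretation of compound microhaplotype pyrosequencing profiles
KR100971153B1 (ko) Method for analyzing SNPs using PCR and restriction enzymes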
Chen et al. Supplemental materials Comparative Analysis of the D. melanogaster modENCODE Transcriptome Annotation
WO2011071382A1 (en) Polymorphic whole genome profiling

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 12870229

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE WIPO information: entry into national phase

Ref document number: 2012870229

Country of ref document: EP

WWE WIPO information: entry into national phase

Ref document number: 2014138794

Country of ref document: RU