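WO2019051812A1 - Method for determining a conserved region of a predetermined chromosome, method and system for determining whether copy number variation is present in a sample genome, and computer readable medium - Google Patents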

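Method for determining a conserved region of a predetermined chromosome, method and system for determining whether copy number variation is present in a sample genome, and computer readable medium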

Info

Publication number
WO2019051812A1
Authority
WO
WIPO (PCT)
Prior art keywords
chromosome
determining
value
window
sequencing
Prior art date
Application number
PCT/CN2017/101965
Other languages
English (en)
French (fr)
Inventor
陈大洋
史千玉
刘萍
朱珠
邱咏
谢林
夏军
潘健昌
陈芳
蒋慧
徐讯
牟峰
Original Assignee
深圳华大智造科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大智造科技有限公司
Priority to CN201780094527.7A (patent CN111052249B)
Priority to PCT/CN2017/101965 (WO2019051812A1)
Publication of WO2019051812A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The present invention relates to a method of determining a conserved region of a predetermined chromosome, a method of determining whether a copy number variation is present in a sample genome, and a system and computer readable medium suitable for performing the method.
  • Aneuploidy, that is, an abnormal number of chromosomes, means that the number of chromosomes in a cell is not an integer multiple of the number of chromosomes in the normal sperm or egg of the species.
  • There are more than 200 known human chromosomal diseases. Most of them are caused by abnormal chromosome numbers, which are an important cause of infertility and congenital birth defects; common examples are trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome), and trisomy 13 (Patau syndrome). According to data from the Southern California Reproductive Medicine Center, amniocentesis is used to detect maternal ages of 35 years.
  • Single-cell sequencing technology has been widely used in preimplantation detection, tumor cell research and cell development.
  • Single-cell detection of chromosomal aneuploidy has a wide range of uses in clinical and scientific research.
  • At present, the methods applied to single-cell aneuploidy detection mainly rely on comparing the test value of the sample under test with those of a large number of normal samples in order to judge chromosome aneuploidy.
  • The disadvantage of this approach is that, for test samples obtained under different experimental conditions, a large number of normal samples must be prepared to delineate the normal range of values. Samples whose experimental conditions cannot be repeated, or which are rare, cannot be detected effectively; aneuploidy detection in fetal nucleated red blood cells is one example.
  • the present invention aims to solve at least one of the technical problems existing in the prior art.
  • Whole genome amplification can efficiently and accurately amplify the nanogram-scale starting sample genomic DNA to obtain microgram-level DNA to meet sequencing requirements.
  • However, whole genome amplification methods have a strong amplification bias, which leads to poor uniformity of the sequenced sequences across the genome and affects aneuploidy detection.
  • In addition, single cells are highly heterogeneous and each amplification introduces its own errors, so an erroneous signal can be generated when a normal sample is used as the reference for the sample under test.
  • To address the amplification bias caused by GC content, the inventors propose a method for determining the presence or absence of copy number variation in a sample genome. The method applies a correction based on the relationship between GC content and the number of sequences, which reduces GC bias, and it is effective in determining single-cell chromosome aneuploidy without the need for a normal sample.
  • the invention proposes a method of determining a predetermined chromosomal conserved region.
  • the method comprises the steps of: (1) sequencing a whole genome sample from a single cell to obtain sequencing results consisting of multiple sequencing sequences; (2) aligning the sequencing results with a reference genomic sequence to determine the distribution of the sequencing sequences on the reference genomic sequence; (3) determining, for a predetermined chromosome, an abnormal region based on the distribution of the sequencing sequences on the reference genomic sequence; and (4) selecting, for the predetermined chromosome, at least a part of the region other than the abnormal region as the conserved region of the predetermined chromosome.
  • the term "abnormal region" refers to a region in which a non-copy-number variation exists, such as a region in which an inversion, a translocation, or a chromosome structural variation exists.
  • with this method, a conserved region in the predetermined chromosome can be effectively determined and, based on the determined conserved region, the copy number variation of a single-cell chromosome, especially aneuploidy, can be effectively determined without the need for a normal sample.
  • the invention proposes a method of determining the presence or absence of a chromosome copy number variation in a sample genome.
  • the method comprises: (a) determining a conserved region of a predetermined chromosome using the method described above; (b) determining a characteristic value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and (c) determining, based on the characteristic value obtained in step (b), whether the predetermined chromosome has a copy number variation in the sample genome.
  • this method for determining whether a chromosome copy number variation exists in a sample genome can effectively determine copy number variation of a single-cell chromosome, particularly aneuploidy, without requiring a normal sample, and can detect chromosome copy number variation, especially aneuploidy, with low-depth whole-genome sequencing at 28 bp.
  • the invention proposes a device for determining a predetermined chromosomal conserved region.
  • the apparatus comprises: a sequencing unit for sequencing a whole genome sample from a single cell to obtain a sequencing result composed of a plurality of sequencing sequences; an alignment unit for aligning the sequencing result with a reference genomic sequence to determine the distribution of the sequencing sequences on the reference genomic sequence; an abnormal region determining unit for determining, for a predetermined chromosome, an abnormal region based on the distribution of the sequencing sequences on the reference genomic sequence; and a conserved region determining unit for selecting, for the predetermined chromosome, at least a part of the region other than the abnormal region as the conserved region of the predetermined chromosome.
  • the invention proposes a system for determining the presence or absence of chromosome copy number variation in a sample genome.
  • the system comprises: a device for determining a conserved region of a predetermined chromosome, as defined above, for determining the conserved region of the predetermined chromosome; a device for determining a feature value, for determining a feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and a device for determining copy number variation, for determining, based on the feature value obtained by the device for determining a feature value, whether the predetermined chromosome has a copy number variation in the sample genome.
  • with this system, the method of determining whether a chromosome copy number variation exists in a sample genome according to an embodiment of the present invention can be effectively implemented, so that copy number variation of single-cell chromosomes, especially aneuploidy, can be effectively determined without a normal sample, and chromosome copy number variation, especially aneuploidy, can be detected with low-depth whole-genome sequencing at 28 bp.
  • the invention proposes a computer readable medium.
  • instructions are stored on the computer readable medium, the instructions being adapted to be executed by a processor to determine whether copy number variation is present in the sample genome by: (a) determining a conserved region of a predetermined chromosome using the steps described above for determining a conserved region of a chromosome; (b) determining a characteristic value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and (c) determining, based on the characteristic value obtained in step (b), whether the predetermined chromosome has a copy number variation in the sample genome.
  • with this computer readable medium, the method for determining whether a copy number variation exists in a sample genome can be effectively implemented, so that copy number variation of a single-cell chromosome, in particular aneuploidy, can be effectively determined without a normal sample, and chromosome copy number variation, especially aneuploidy, can be detected by low-depth whole genome sequencing at 28 bp.
  • FIG. 1 shows a schematic flow chart of a method of determining a conserved region of a predetermined chromosome according to an embodiment of the present invention.
  • FIG. 2 shows a flow chart of a method of determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the structure of an apparatus for determining a conserved region of a predetermined chromosome according to still another embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing the structure of a system for determining whether a chromosome copy number variation exists in a sample genome according to still another embodiment of the present invention.
  • FIG. 5 is a flow chart showing a method of determining whether a copy number variation exists in a sample genome according to still another embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing the distribution of depth values after data correction of sample 1 according to still another embodiment of the present invention.
  • FIG. 7 is a schematic diagram showing the distribution of depth values after data correction of sample 2 according to still another embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing the distribution of depth values after data correction of sample 3 according to still another embodiment of the present invention.
  • FIG. 9 is a schematic diagram showing the distribution of depth values after data correction of sample 4 according to still another embodiment of the present invention.
  • FIG. 10 is a schematic diagram showing the distribution of depth values after data correction of sample 5 according to still another embodiment of the present invention.
  • FIG. 11 is a schematic diagram showing the distribution of depth values after data correction of sample 6 according to still another embodiment of the present invention.
  • FIG. 12 is a schematic diagram showing the distribution of depth values after data correction of sample 7 according to still another embodiment of the present invention.
  • FIG. 13 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 1 according to still another embodiment of the present invention.
  • FIG. 14 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 2 according to still another embodiment of the present invention.
  • FIG. 15 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 3 according to still another embodiment of the present invention.
  • FIG. 16 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 4 according to still another embodiment of the present invention.
  • FIG. 17 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 5 according to still another embodiment of the present invention.
  • FIG. 18 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 6 according to still another embodiment of the present invention.
  • FIG. 19 is a diagram showing the fitted distribution of the whole-genome test significance values of sample 7 according to still another embodiment of the present invention.
  • the terms "first" and "second" are used for descriptive purposes only, and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to.
  • thus, features defined as "first" and "second" may explicitly or implicitly include one or more of such features.
  • "a plurality" means two or more unless otherwise specified. Unless explicitly stated otherwise, the same letters in the formulas or labels herein have the same meaning.
  • the invention proposes a method of determining a predetermined chromosomal conserved region.
  • a method of determining a predetermined chromosomal conserved region includes:
  • sequencing the whole genome sample further comprises: amplifying the whole genome sample; constructing a sequencing library using the amplified genomic sample; and sequencing the sequencing library. In this way, whole-genome sequencing results for the sample can be obtained efficiently, and single-cell genomes or trace nucleic acid samples can be sequenced efficiently.
  • Those skilled in the art can select different methods for constructing a sequencing library according to the specific scheme of the genome sequencing technology employed. For details on constructing the genome sequencing library, refer to the protocol provided by the manufacturer of the sequencing instrument, such as Illumina, for example, see Illumina Multiplexing Sample Preparation Guide (Part #1005361; Feb 2010) or Paired-End SamplePrep Guide (Part #1005063; Feb 2010), which is incorporated herein by reference.
  • the whole genome is amplified using a PCR-based whole genome amplification method or a non-PCR based method.
  • the PCR-based whole genome amplification method is a PEP-PCR, DOP-PCR or OmniPlex WGA method; or the non-PCR based method is MDA.
  • the sequencing library is sequenced using at least one selected from the group consisting of a Hiseq system, a Miseq system, a Genome Analyzer (GA) system, a 454 FLX, a SOLiD system, an Ion Torrent system, and a single molecule sequencing device.
  • the step of extracting the sample genome from the biological sample is further included prior to sequencing the genome.
  • the biological sample that can be employed is not particularly limited.
  • the biological sample that can be employed is any one selected from the group consisting of blood, urine, saliva, tissue, germ cells, fertilized eggs, blastomeres, and embryos.
  • the method and apparatus for separating single cells from a biological sample are not particularly limited.
  • single cells may be isolated from a biological sample using at least one selected from the group consisting of a dilution method, a pipette separation method, a micromanipulation (preferably microdissection), a flow cytometry, and a microfluidic method.
  • the resulting sequencing results include multiple sequencing sequences.
  • the resulting sequencing results are aligned with a reference genomic sequence to determine the location of the resulting sequencing sequence on the reference genomic sequence.
  • the total number of these sequencing data can be calculated using any known method.
  • analysis can be performed using software provided by the manufacturer of the sequencing instrument.
  • alignment tools such as the Short Oligonucleotide Analysis Package (SOAP) and BWA are used to align the sequencing sequences with the reference genome sequence and obtain the positions of the sequencing sequences on the reference genome.
  • Sequence alignment can be performed using default parameters provided by the program, or can be selected by those skilled in the art as needed.
  • the alignment software employed is SOAPaligner/soap2.
  • the reference genomic sequence is a standard human genome reference sequence in the NCBI database (e.g., hg18, NCBI Build 36). It may also be a part of a known genomic sequence, for example a sequence selected from at least one of human chromosome 21, chromosome 18, chromosome 13, the X chromosome, and the Y chromosome.
  • a sequence uniquely aligned with the reference genomic sequence can be selected for subsequent analysis, thereby avoiding interference of the repeated sequence on the copy number variation analysis.
  • the efficiency of determining a predetermined chromosome conserved region and determining whether a copy number variation exists in the sample genome is further improved.
  • S300 determining an abnormal region based on a distribution of the sequencing sequence on the reference genome sequence for a predetermined chromosome
  • step S300 further comprises: (3-1) dividing the reference genome sequence into a plurality of windows and separately counting the sequencing depth of each of the windows; (3-2) selecting an initial breakthrough point based on the sequencing depth of the same number of windows on both sides of each endpoint of the plurality of windows; and (3-3) determining the abnormal region based on the initial breakthrough point.
  • the sequencing depth of each of the windows is determined according to a formula [formula not reproduced in this text], in which one symbol (also not reproduced) represents the sequencing depth of the window, W represents the number of unique alignment sequences in each of the windows, R_t represents the sum of the numbers of unique alignment sequences in the windows, and N represents the total number of windows.
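Step (3-1) can be illustrated with a minimal Python sketch. It assumes the uniquely aligned sequencing sequences are available as 0-based start positions on a single chromosome and that the windows are fixed-size and non-overlapping. The function name, the input layout, and the example values are illustrative assumptions, not part of the source; only the 150 Kbp window length comes from the text.

```python
WINDOW_SIZE = 150_000  # 150 Kbp, the preferred window length stated in the text


def count_reads_per_window(positions, chrom_length, window_size=WINDOW_SIZE):
    """Count uniquely aligned reads per fixed-size window (hypothetical helper).

    positions    -- 0-based start positions of uniquely aligned reads
    chrom_length -- length of the chromosome in base pairs
    """
    # Number of windows, rounding up so the last partial window is kept.
    n_windows = (chrom_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < chrom_length:  # ignore positions outside the chromosome
            counts[pos // window_size] += 1
    return counts


# Toy example: four reads on a 450 Kbp chromosome (three windows).
example_counts = count_reads_per_window(
    [10, 149_999, 150_000, 420_000], chrom_length=450_000
)
```

The resulting per-window counts play the role of W in the text; R_t would be their sum and N the number of windows.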
  • the sequencing depth of each of the windows is corrected in advance based on the GC content of each of the windows.
  • the correction process comprises: (3-2-a) counting the GC content of each of the windows and segmenting the windows according to a predetermined step size to obtain a plurality of GC content segments; (3-2-b) calculating, for each of the GC content segments, the median of the numbers of unique alignment sequences of the windows in that segment; and (3-2-c) determining the corrected sequencing depth of each of the windows based on a formula [formula not reproduced in this text], wherein T represents the corrected sequencing depth of the window, one symbol (also not reproduced) represents the sequencing depth of the window, M represents the median value determined in step (3-2-b), and W represents the number of unique alignment sequences in the window.
  • in step (3-2-a), the predetermined step size is 0.01.
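Because the correction formula itself is not reproduced in this text, the following Python sketch only illustrates the general idea of steps (3-2-a) to (3-2-c) under an explicit assumption: each window count W is rescaled by the ratio of the overall mean count to the median count M of its GC content segment, i.e. T = W × mean / M. This is a common form of GC normalization and may differ from the formula in the original filing. All function names and example values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, median


def gc_bin(gc_fraction, step=0.01):
    """Index of the GC content segment for a window (step 0.01 as in the text)."""
    # The small epsilon guards against floating-point values such as 0.29999...
    return math.floor(gc_fraction / step + 1e-9)


def gc_correct(counts, gc_fractions, step=0.01):
    """Return GC-corrected depths, ASSUMING T = W * overall_mean / M."""
    # (3-2-a) group window counts by GC content segment
    segments = defaultdict(list)
    for w, gc in zip(counts, gc_fractions):
        segments[gc_bin(gc, step)].append(w)
    # (3-2-b) median count of the windows in each segment
    medians = {b: median(ws) for b, ws in segments.items()}
    # (3-2-c) rescale each window; the scaling rule is an assumption
    overall = mean(counts)
    corrected = []
    for w, gc in zip(counts, gc_fractions):
        m = medians[gc_bin(gc, step)]
        corrected.append(w * overall / m if m > 0 else 0.0)
    return corrected


# Toy example: high-GC windows are over-represented two-fold before correction.
example_corrected = gc_correct([100, 100, 200, 200], [0.40, 0.40, 0.50, 0.50])
```

In this toy input the two GC segments differ only by amplification bias, so all four corrected depths come out equal.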
  • in step (3-2), an endpoint is selected as an initial breakthrough point if the sequencing depths of the same number of windows on its two sides differ significantly.
  • the initial breakthrough point is determined by calculating a p-value for each endpoint, the p-value representing the significance of the difference in the number of sequencing data between the two sides of the endpoint; if the p-value of a site is less than the terminating p-value, the site is judged to be a breakthrough point. Preferably, the terminating p-value is at most 1.1 × 10^-50.
  • 100 windows are taken on each side of the respective endpoint.
  • the window has a length of 100-200 Kbp, preferably 150 Kbp.
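The selection of initial breakthrough points can be sketched in Python as follows. The source does not name the statistical test used to obtain the p-value, so this sketch assumes a two-sided test on the difference of means with a normal approximation (a Welch-style z statistic). The function names and the toy data are hypothetical; the 100 flanking windows and the cutoff of 1.1 × 10^-50 come from the text.

```python
import math
from statistics import mean, variance


def endpoint_p_value(depths, i, flank=100):
    """p-value for a depth difference across the endpoint between windows i-1 and i.

    ASSUMPTION: normal approximation on the difference of flank means; the
    source does not specify which significance test is used.
    """
    left = depths[max(0, i - flank):i]
    right = depths[i:i + flank]
    if len(left) < 2 or len(right) < 2:
        return 1.0  # too few windows to test
    se = math.sqrt(variance(left) / len(left) + variance(right) / len(right))
    if se == 0:
        return 1.0 if mean(left) == mean(right) else 0.0
    z = abs(mean(left) - mean(right)) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail probability


def initial_breakthrough_points(depths, flank=100, p_cut=1.1e-50):
    """Endpoints whose flanking windows differ with p below the cutoff."""
    return [i for i in range(1, len(depths))
            if endpoint_p_value(depths, i, flank) < p_cut]


# Toy data: 100 windows near depth 1.05 followed by 100 windows near 1.55.
example_depths = [1.0, 1.1] * 50 + [1.5, 1.6] * 50
example_hits = initial_breakthrough_points(example_depths)
```

With such a strict cutoff, endpoints adjacent to the true change may also pass, which is consistent with the text's later step of screening candidate breakthrough points.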
  • step (3-3) comprises: (3-3-a) determining a plurality of verification windows based on the initial breakthrough point; and (3-3-b) determining whether a verification window is an abnormal region based on the difference between the average sequencing depth of the verification window and a predetermined threshold.
  • the abnormal region is determined by the following steps: (3-3-1) determining a plurality of candidate breakthrough points, each candidate having other breakthrough points before and after it; (3-3-2) determining the p-value of each candidate breakthrough point and eliminating the candidate with the largest p-value; (3-3-3) repeating step (3-3-2) for the remaining candidate breakthrough points until the p-values of all remaining candidates are smaller than the terminating p-value, and taking the remaining candidates as the selected breakthrough points; (3-3-4) taking the region between two adjacent selected breakthrough points as a verification window; and (3-3-5) determining whether the verification window is an abnormal region based on the difference between the average sequencing depth of the verification window and a predetermined threshold.
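Steps (3-3-1) to (3-3-5) can be sketched in Python as an iterative elimination. Several details are assumptions rather than statements from the source: the p-value is again a normal-approximation test on segment means, the chromosome ends serve as outer boundaries for the first and last candidates, and the baseline depth and threshold used to flag abnormal windows are hypothetical inputs.

```python
import math
from statistics import mean, median, variance


def segment_p_value(left, right):
    """Two-sided normal-approximation p-value (ASSUMED test, not from the source)."""
    if len(left) < 2 or len(right) < 2:
        return 1.0
    se = math.sqrt(variance(left) / len(left) + variance(right) / len(right))
    if se == 0:
        return 1.0 if mean(left) == mean(right) else 0.0
    return math.erfc(abs(mean(left) - mean(right)) / se / math.sqrt(2))


def refine_breakthrough_points(depths, candidates, p_end=1.1e-50):
    """Drop the candidate with the largest p-value until all are below p_end."""
    points = sorted(set(candidates))
    while points:
        bounds = [0] + points + [len(depths)]
        # Compare the segment before each candidate with the segment after it.
        p_values = [
            segment_p_value(depths[bounds[k]:bounds[k + 1]],
                            depths[bounds[k + 1]:bounds[k + 2]])
            for k in range(len(points))
        ]
        worst = max(range(len(points)), key=lambda k: p_values[k])
        if p_values[worst] < p_end:
            break  # every remaining candidate is significant
        del points[worst]
    return points


def abnormal_windows(depths, points, baseline, threshold):
    """Flag regions between adjacent points whose mean depth deviates from baseline."""
    bounds = [0] + points + [len(depths)]
    return [(s, e) for s, e in zip(bounds, bounds[1:])
            if e > s and abs(mean(depths[s:e]) - baseline) > threshold]


# Toy data: a raised segment in windows 200-399; candidate 150 is spurious.
example_depths = [1.0, 1.1] * 100 + [1.5, 1.6] * 100 + [1.0, 1.1] * 100
example_points = refine_breakthrough_points(example_depths, [150, 200, 400])
example_abnormal = abnormal_windows(
    example_depths, example_points,
    baseline=median(example_depths), threshold=0.25,  # hypothetical values
)
```

The spurious candidate at window 150 separates two segments with the same mean, so it has the largest p-value and is removed first; the two remaining points delimit the raised segment.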
  • S400 selecting, for the predetermined chromosome, at least a part of the region other than the abnormal region as a conserved region of the predetermined chromosome
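A minimal Python sketch of step S400 follows. It assumes abnormal regions are given as half-open window-index intervals and returns everything outside them; the source only requires that at least a part of this remainder be selected, so taking the whole complement is a simplifying assumption, and the function name is hypothetical.

```python
def conserved_regions(n_windows, abnormal):
    """Return the intervals of [0, n_windows) not covered by abnormal regions.

    abnormal -- iterable of (start, end) half-open window-index intervals
    """
    regions = []
    cursor = 0
    for start, end in sorted(abnormal):
        if start > cursor:
            regions.append((cursor, start))  # gap before this abnormal region
        cursor = max(cursor, end)            # also handles overlapping regions
    if cursor < n_windows:
        regions.append((cursor, n_windows))  # tail after the last abnormal region
    return regions


example_conserved = conserved_regions(600, [(200, 400)])
```

Sorting the intervals first and tracking a single cursor keeps the function correct when abnormal regions overlap or arrive out of order.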
  • the invention proposes a method of determining the presence or absence of a chromosome copy number variation in a sample genome.
  • this step can refer to the method of determining a predetermined chromosomal conserved region as described above.
  • S2000 further includes: (b-1) determining the average depth value of the predetermined chromosome according to a formula [formula not reproduced in this text], where R_c represents the average depth value of chromosome c, c represents the number of the chromosome, j represents the total number of windows in the conserved region on chromosome c, and T_j represents the corrected sequencing depth of a window; and (b-2) determining the characteristic value of the predetermined chromosome based on a formula [formula not reproduced in this text], wherein R_c is the average depth value of the predetermined chromosome, a further symbol (not reproduced) represents the average of the R_c values of the chromosomes in the sample, and sd represents the standard deviation of the R_c values of the chromosomes in the sample.
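Since the two formulas are not reproduced in this text, the Python sketch below rests on an assumption drawn from the symbol definitions: R_c is taken as the arithmetic mean of the corrected depths T_j over the conserved windows of chromosome c, and the characteristic value is taken as the standard score (R_c minus the mean of all R_c, divided by sd). Whether the original uses the sample or the population standard deviation is not stated; the sample form is used here. Names and example values are hypothetical.

```python
from statistics import mean, stdev


def chromosome_characteristic_values(corrected_depths_by_chrom):
    """Map chromosome name -> characteristic value, ASSUMING a z-score form.

    corrected_depths_by_chrom -- dict of chromosome name to the list of
    corrected depths T_j of the windows in its conserved region
    """
    # (b-1) average depth value R_c per chromosome (assumed arithmetic mean)
    r = {c: mean(t) for c, t in corrected_depths_by_chrom.items()}
    # (b-2) standardize against the R_c values of all chromosomes in the sample
    r_values = list(r.values())
    r_bar = mean(r_values)
    sd = stdev(r_values)  # sample standard deviation (an assumption)
    return {c: (rc - r_bar) / sd for c, rc in r.items()}


# Toy example: chr21 has 1.5-fold depth relative to three other chromosomes.
example_z = chromosome_characteristic_values({
    "chr1": [1.0, 1.0],
    "chr2": [1.0, 1.0],
    "chr3": [1.0, 1.0],
    "chr21": [1.5, 1.5],
})
```

Because the reference distribution comes from the other chromosomes of the same cell, this computation needs no normal control sample, in line with the stated aim of the method.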
  • S3000 determining, according to the characteristic value obtained in S2000, whether the predetermined chromosome has copy number variation for the sample genome
  • in step S3000, whether copy number variation exists in the check window is determined for the sample genome based on the difference between the characteristic value and a predetermined threshold.
  • the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold; a characteristic value greater than the second threshold indicates that the predetermined chromosome has a chromosomal duplication, and a characteristic value less than the first threshold indicates that the predetermined chromosome has a chromosomal deletion.
  • the first threshold and the second threshold are determined based on the range of R_c value fluctuations of a plurality of reference samples, the reference samples being known to be free of copy number variation.
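The two-threshold decision rule can be written as a short Python sketch. The threshold values in the example (-3 and 3) are placeholders chosen for illustration only; the source states that the thresholds are derived from reference samples and gives no numeric values here. The function name and return labels are likewise hypothetical.

```python
def classify_chromosome(characteristic_value, first_threshold, second_threshold):
    """Apply the two-threshold rule described in the text."""
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must be greater than first threshold")
    if characteristic_value > second_threshold:
        return "duplication"
    if characteristic_value < first_threshold:
        return "deletion"
    return "no copy number variation detected"


# Placeholder thresholds; real values would come from reference samples.
example_calls = [classify_chromosome(v, -3.0, 3.0) for v in (4.2, -5.0, 0.3)]
```

Values exactly equal to a threshold fall in the middle category, since the text uses strict "greater than" and "less than" comparisons.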
  • the chromosome copy number variation is at least one selected from the group consisting of chromosome aneuploidy, chromosome fragment deletion, chromosome fragment increase, microdeletion, and microduplication.
  • the detection of chromosomal aneuploidy is more effective by the method of determining whether a chromosome copy number variation is present in the sample genome according to an embodiment of the present invention.
  • the present invention provides a device for determining a predetermined chromosomal conserved region, by which the aforementioned method for determining a predetermined chromosomal conserved region can be effectively implemented, thereby enabling effective determination of a conserved region in a predetermined chromosome. Further, based on the determined conserved regions, the copy number variation of the single cell chromosome, particularly the aneuploidy variation, is effectively determined without the need for a normal sample.
  • in accordance with an embodiment of the present invention, an apparatus 100 for determining a predetermined chromosomal conserved region includes a sequencing unit 110, an alignment unit 120, an abnormal region determining unit 130, and a conserved region determining unit 140.
  • the sequencing unit 110 sequences a whole genome sample from a single cell to obtain a sequencing result composed of a plurality of sequencing sequences;
  • the device 100 for determining a predetermined chromosomal conserved region according to an embodiment of the present invention may further comprise a genome extraction unit (not shown) adapted to separate single cells from the biological sample and thereby extract the sample genome; the genome extraction unit is coupled to the sequencing unit 110 to provide it with the sample genome.
  • thus, the biological sample can be used directly as the raw material to obtain, for the sample of the organism, the conserved region of the predetermined chromosome and information on copy number variation, thereby reflecting the health status of the organism.
  • the sequencing unit may further include: a genome amplification component, a sequencing library construction component, and a sequencing component.
  • the genomic amplification component is adapted to amplify the sample genome
  • the sequencing library construction component is coupled to the genomic amplification component and is adapted to construct a sequencing library using the amplified sample genome; and the sequencing component is coupled to the sequencing library construction component and is adapted to sequence the sequencing library.
  • the sequencing unit is at least one selected from second-generation sequencing platforms, such as Illumina's Hiseq system, Miseq system, and Genome Analyzer (GA) system, Roche's 454 FLX, Applied Biosystems' SOLiD system, and Life Technologies' Ion Torrent system, and a single molecule sequencing device.
  • the alignment unit 120 is coupled to the sequencing unit 110 and is used for aligning the sequencing result with a reference genomic sequence to determine the distribution of the sequencing sequences on the reference genomic sequence.
  • the abnormal region determining unit 130 is connected to the alignment unit 120 and is used for determining an abnormal region for a predetermined chromosome based on the distribution of the sequencing sequences on the reference genome sequence.
  • the abnormal region determining unit 130 includes: a window dividing component configured to divide the reference genome sequence into a plurality of windows and separately count the sequencing depth of each of the windows; an initial mutation point determining component for selecting an initial breakthrough point based on the sequencing depth of the same number of windows on both sides of each endpoint of the plurality of windows; and an abnormal area determining component for determining the abnormal region based on the initial breakthrough point.
  • the abnormal area determining component includes: a check window determining module configured to determine a plurality of check windows based on the initial breakthrough point; and a difference comparison module configured to determine whether a check window is an abnormal region based on the difference between the average sequencing depth of the check window and a predetermined threshold.
  • the abnormal region is determined by: determining a plurality of candidate breakthrough points, each candidate having other breakthrough points before and after it; determining the p-value of each candidate breakthrough point and eliminating the candidate with the largest p-value; repeating the p-value determination and elimination for the remaining candidate breakthrough points until the p-values of all remaining candidates are less than the terminating p-value, and taking the remaining candidates as the selected breakthrough points; taking the region between two adjacent selected breakthrough points as a check window; and determining whether the check window is an abnormal region based on the difference between the average sequencing depth of the check window and a predetermined threshold.
  • the abnormal region determining unit 130 may further include a correction component connected to the initial mutation point determining component, for correcting the sequencing depth of each of the windows based on the GC content of each of the windows.
  • the correction component includes: a GC content confirmation module adapted to count the GC content of each of the windows and segment the windows according to a predetermined step size to obtain a plurality of GC content segments; a median statistics module adapted to calculate, for each of the GC content segments, the median of the numbers of unique alignment sequences of the windows in that segment; and a sequencing depth confirmation module adapted to determine the corrected sequencing depth of each of the windows based on a formula [formula not reproduced in this text], wherein T represents the corrected sequencing depth of the window, one symbol (also not reproduced) represents the sequencing depth of the window, M represents the median value determined by the median statistics module, and W represents the number of unique alignment sequences in the window.
  • the predetermined step size is 0.01.
  • an endpoint is selected as an initial breakthrough point if the sequencing depths of the same number of windows on its two sides differ significantly.
  • the initial breakthrough point is determined by calculating a p-value for each endpoint, the p-value representing the significance of the difference in the number of sequencing data between the two sides of the endpoint; if the p-value of a site is less than the terminating p-value, the site is judged to be a breakthrough point. The terminating p-value is preferably at most 1.1 × 10^-50.
  • 100 windows are taken on each side of the respective endpoints.
  • the window has a length of 100-200 Kbp, preferably 150 Kbp.
  • the abnormal region determining unit may further include: a window T-value stable-value determining component which, for the reference sequence of the predetermined chromosome and after excluding the region determined by the abnormal region determining component, determines a stable T value for each of the windows in the remaining region according to a formula (not reproduced in this extract), where i denotes the window number, n denotes the number of consecutive windows following window i, n being an integer of at least 1 and preferably at least 10, and Tni denotes the stable T value of window i; and a significant-difference window determining component which, based on the stable T values obtained by the window T-value stable-value determining component, selects the windows with significant differences as abnormal regions.
  • the conserved region determining unit 140 is connected to the abnormal region determining unit 130 for selecting at least a part of the region other than the abnormal region as the conserved region of the predetermined chromosome for the predetermined chromosome.
  • the present invention provides a system for determining whether a chromosome copy number variation exists in a sample genome. The system can effectively implement the aforementioned method for determining whether a chromosome copy number variation exists in a sample genome, so that copy number variations of single-cell chromosomes, especially aneuploidy, can be determined effectively.
  • a system for determining whether a chromosome copy number variation exists in a sample genome includes: a device 100 for determining a predetermined chromosomal conserved region, as previously described, for determining the conserved region of a predetermined chromosome; a device 200 for determining a feature value, for determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and a device 300 for determining copy number variation, for determining, for the sample genome and based on the feature value obtained by the device for determining a feature value, whether a copy number variation exists in the predetermined chromosome.
  • the device 200 for determining a feature value, which determines the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region, includes: a chromosome average depth determining unit adapted to determine the average depth value of the predetermined chromosome according to a formula (not reproduced in this extract), where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window; and a chromosome feature value determining unit adapted to determine the feature value of the predetermined chromosome based on a formula (not reproduced in this extract), where Rc is the average depth value of the predetermined chromosome, the mean term denotes the average of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.
  • the means for determining copy number variation is adapted to determine, based on the difference in the feature value from a predetermined threshold, for the sample genome whether a copy number variation exists in the check window.
  • the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known to be free of the copy number variation.
  • the first threshold does not exceed a lower end value of the fluctuation range
  • the second threshold is not lower than an upper end value of the fluctuation range
  • the first threshold is at most 0.7 and the second threshold is at least 1.3.
  • the chromosome copy number variation is at least one selected from the group consisting of chromosome aneuploidy, chromosome fragment deletion, chromosome fragment gain, microdeletion, and microduplication.
  • the system for determining whether a chromosome copy number variation exists in a sample genome is suitable for implementing the aforementioned method, and performs particularly well in detecting chromosome aneuploidy.
  • the invention proposes a computer readable medium.
  • instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor to determine whether a copy number variation exists in a sample genome by means of: means for determining a predetermined chromosomal conserved region, as defined above, for determining the conserved region of a predetermined chromosome; means for determining a feature value, for determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and means for determining copy number variation, for determining, for the sample genome and based on the feature value obtained by the means for determining a feature value, whether a copy number variation exists in the predetermined chromosome.
  • with this computer-readable medium, the method for determining whether a copy number variation exists in a sample genome can be implemented effectively, so that the presence of a copy number variation in a sample genome can be determined effectively; the medium is particularly suitable for chromosome aneuploidy.
  • the method for determining the presence or absence of copy number variation in a sample genome employed in the examples includes the following:
  • Single cells are isolated from the sample, which may be selected from the group consisting of blood, urine, saliva, tissues, germ cells, fertilized eggs, blastomeres, and embryos.
  • the lysis method is not limited.
  • the extracted genome is then amplified. The amplification may use PCR-based methods such as PEP-PCR (primer-extension-preamplification PCR), DOP-PCR and OmniPlex WGA, or non-PCR-based methods such as MDA (multiple displacement amplification). In the examples of this patent, PicoPlex from Rubicon Genomics is preferred in view of amplification time and amplification uniformity.
  • library construction can be performed according to the operating manual of a second-generation sequencing platform, which can be Illumina's HiSeq, MiSeq or Genome Analyzer (GA) platform, Roche's 454 FLX, Applied Biosystems' SOLiD platform, Life Technologies' Ion Torrent platform, and so on; in this example, BGI's BGISEQ-500 platform is used for sequencing.
  • the sequencing platform in the present invention is not limited to second-generation platforms; other sequencing methods and devices may be used, such as third-generation sequencing technology and more advanced future devices. Testing for this patent found that, with reads of 28 bp, single-cell aneuploidy can be detected effectively by low-depth whole-genome sequencing.
  • the quality of sample sequencing is filtered according to the data characteristics of the sequencing platform used.
  • in this example, the raw off-machine data are quality-controlled according to the sequencing signal intensity, the number of reads, the barcode splitting rate, and the proportion of G and C bases.
  • the sequencing result is aligned with the reference genome sequence; in this example the standard human genome sequence (hg19, NCBI Build 37) is used, and the reference can also be limited to certain chromosomes of the known genome.
  • the position of each read on the genome is determined by aligning the reads that pass quality control with the standard human genome sequence. The alignment can be performed with existing alignment software; in the examples of the present invention, BWA (Burrows-Wheeler Aligner) is used.
  • the inventors have found that this method can successfully detect aneuploidy even when the sequencing sequence is 28 bases in length.
  • Those skilled in the art can select other alignment software according to the sequence length of other sequencing platforms, and perform parameter adjustment according to the comparison result.
  • the unique alignment rate, the duplication rate, the number of uniquely aligned reads, the GC content and similar metrics of the sample are calculated from the alignment results.
  • in this example, the percentile method is used to estimate the normal range (for data that follow neither a normal nor a skewed distribution), with bounds at the 5th and 95th percentiles; the estimation software is SPSS Statistics 17.0.
  • metrics that characterize the sequencing run are selected to define the quality control range.
  • in this example, the unique alignment rate, duplication rate, GC content and depth coefficient of variation are selected as the quality control metrics.
  • the human reference genome is first divided into different windows, and the windows can be of equal length or unequal length.
  • the number of sequences in each interval is counted based on the sample sequence alignment results.
  • the best solution for determining the length of the interval is based on whether the number of sequences falling within all windows exhibits a normal distribution.
  • the depth value of the window is corrected based on the number of unique alignment sequences that fall within the window after alignment and the GC content of the window.
  • the number of sequences in each window is normalized by the average number of sequences in the genome-wide window, and the normalized depth value is obtained.
  • the number of unique alignment sequences in the window is W, Rt is the total number of unique alignment sequences, and N is the total number of windows.
  • the windows are classified by GC content in steps of 0.01. The ratio of W to the median M of the read counts of the windows in the same GC interval is multiplied by the average depth value to obtain the corrected value of each window.
  • on the basis of the breakpoint P values above, for a given breakpoint a statistical test is performed on the intervals to its left and right, and insignificant breakpoints are deleted iteratively. The P value and the mean depth value of each breakpoint interval are obtained. Whether a breakpoint is real is judged from the significance of its P value, whether the event is a deletion or a duplication is judged from the magnitude of the depth value, and the detection resolution is judged from the size of the breakpoint interval.
  • the interval correction is performed on the correction depth T in the window.
  • n ≥ 10
  • the depth of the n windows behind the i window is used to correct Ti.
  • j is the number of windows for this chromosome
  • c is the chromosome number.
  • the embryonic cells are subjected to WGA using the reagents in the kit, and the specific amplification process includes three steps.
  • cell lysis: a mixture of cell lysis buffer and cell lysis enzyme is added to the PCR tube in which the cells have been collected, and reacted at 75 °C for 10 min and at 95 °C for 4 min to lyse the cells and release their DNA.
  • pre-amplification: a mixture of pre-amplification buffer and pre-amplification enzyme is added to the reaction solution from the previous step and reacted at 95 °C for 2 min, followed by 12 cycles of 95 °C for 15 s, 15 °C for 50 s, 25 °C for 40 s, 35 °C for 30 s, 65 °C for 40 s and 75 °C for 40 s.
  • post-amplification: a mixture of post-amplification buffer, post-amplification enzyme and nuclease-free water is added to the reaction solution from the previous step and reacted at 95 °C for 2 min, followed by 14 cycles of 95 °C for 15 s, 65 °C for 1 min and 75 °C for 1 min; the amplified product can be used directly for downstream analysis or stored in a freezer at -20 °C.
  • a library is constructed from the WGA product of the cells using the reagents in the kit.
  • the specific library construction process includes four steps: DNA disruption, terminal repair, linker ligation, and PCR amplification.
  • end repair: the fragmented and purified product is quantified, a defined amount of it is taken, a mixture of end-repair buffer and end-repair enzyme is added, and the reaction is run at 37 °C for 30 min and then at 75 °C for 15 min.
  • adapter ligation: a mixture of ligation buffer and ligase is added to the reaction solution from the previous step, followed by barcoded adapters 1-48 (one adapter per sample); the reaction is run at 20 °C for 20 min, and the ligation product is purified with magnetic beads.
  • PCR amplification: a mixture of PCR reaction solution and PCR primers is added to the DNA purified in the previous step and reacted at 98 °C for 2 min, followed by 12 cycles of 98 °C for 15 s, 56 °C for 15 s and 72 °C for 30 s, a final extension at 72 °C for 5 min and a hold at 4 °C; after amplification the product is purified with magnetic beads and the concentration of the purified sample is measured.
  • sequencing was performed on the BGISEQ-500 sequencing platform developed in-house by BGI.
  • the sequencing reagents are those in the kit, and the instrument parameter settings and operating procedure must strictly follow the operation manual.
  • the instrument used in the present invention is BGISEQ-500
  • the number of sequencing cycles in the kit is SE28+10; however, because the instrument, the library construction methods and the sequencing methods are continuously upgraded, practical use of the kit is not limited to this instrument,
  • to this library construction method or to this number of sequencing cycles, and the kit is applicable to the various library construction methods, sequencing platforms and sequencing methods of the BGISEQ series.
  • the measured sample sequences were aligned to the reference genome (hg19, NCBI Build 37) using BWA software (version number: 0.7.7-r441).
  • the alignment information was obtained based on the alignment results.
  • Table 2 the unique alignment sequence was selected from the alignment results, and the repeated sequences were removed and used in the following analysis.
  • the sample is quality controlled based on the information generated by the alignment.
  • Sample name    Total number of sequences    Unique alignment rate    Effective sequence number    Repetition rate    GC content    Coefficient of variation
    Sample1    9,722,711     0.5539    4,795,452    0.1096    0.4112    0.2128
    Sample2    18,484,072    0.5671    8,605,341    0.1791    0.3898    0.2230
    Sample3    15,861,223    0.5832    7,530,114    0.1860    0.3807    0.2333
    Sample4    12,089,842    0.5366    6,136,076    0.0541    0.4087    0.2987
    Sample5    14,012,636    0.5082    6,669,696    0.0634    0.4115    0.3275

Abstract

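Provided are a method for determining a conserved region of a predetermined chromosome, a method for determining whether a copy number variation is present in a sample genome, and an apparatus, a system and a computer-readable medium suitable for performing the methods. The method for determining a conserved region of a predetermined chromosome comprises the steps of: (1) sequencing a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads; (2) aligning the sequencing result with a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence; (3) determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence; and (4) selecting, for the predetermined chromosome, at least a part of the region outside the abnormal region as the conserved region of the predetermined chromosome.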

Description

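Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium
Priority Information
Technical Field
The present invention relates to a method for determining a conserved region of a predetermined chromosome, a method for determining whether a copy number variation is present in a sample genome, and a system and a computer-readable medium suitable for performing the methods.
Background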
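Aneuploidy, i.e. an abnormal chromosome number, means that the number of chromosomes in a cell is not an integer multiple of the number in a normal sperm or egg of the species. More than 200 human chromosomal diseases are known; most are caused by abnormal chromosome numbers and are an important cause of infertility and congenital birth defects. Examples common in newborns are trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome). Research data from a reproductive medicine center in Southern California, USA, show that, when tested by amniocentesis, the probability of a fetal chromosomal abnormality is 1/132 for a 35-year-old mother, 1/38 at age 40 and 1/12 at age 45. During embryo formation in in vitro fertilization and embryo transfer (IVF), about 50% of embryos carry chromosomal abnormalities, which can lead to early embryo loss, spontaneous abortion and stillbirth. There is currently no effective treatment for such chromosomal diseases; prevention focuses on widespread prenatal screening and prenatal diagnosis.
Single-cell sequencing has been widely applied to preimplantation testing of embryos, tumor cell research, cell development and other fields. Single-cell detection of chromosomal aneuploidy has broad clinical and research uses. Current methods for single-cell aneuploidy detection mainly rely on a test statistic obtained by comparing the sample under test with a large number of normal samples. The drawback is that, for test samples obtained under different experimental conditions, a large number of normal samples must be prepared to define the normal range. Samples that are rare, or whose experimental conditions cannot be reproduced, cannot be detected effectively, for example in aneuploidy detection of fetal nucleated red blood cells.
Therefore, current methods for single-cell detection of chromosomal aneuploidy still need to be improved.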
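Summary of the Invention
The present invention aims to solve at least one of the technical problems existing in the prior art.
Whole-genome amplification can efficiently and accurately amplify nanogram-scale starting genomic DNA to microgram quantities, meeting sequencing requirements. However, because of the GC content of the sequence itself and the non-linear amplification process, whole-genome amplification methods have a strong amplification bias, which ultimately leads to poor uniformity of the sequenced reads across the genome and affects aneuploidy detection. Moreover, single cells are highly heterogeneous and produce unique errors during amplification, so that using normal samples as a reference when processing the sample under test can generate false signals. Based on these findings, the inventors propose a method for determining whether a copy number variation is present in a sample genome. To address the amplification bias caused by GC content, the method uses the relationship between GC content and read count to correct the sequencing depth, thereby reducing GC bias; and the method can effectively determine single-cell chromosomal aneuploidy without requiring normal samples.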
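In one aspect, the present invention provides a method for determining a conserved region of a predetermined chromosome. According to an embodiment of the present invention, the method comprises the following steps: (1) sequencing a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads; (2) aligning the sequencing result with a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence; (3) determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence; and (4) selecting, for the predetermined chromosome, at least a part of the region outside the abnormal region as the conserved region of the predetermined chromosome. It should be noted that the "abnormal region" referred to in this application is a region in which a non-copy-number variation is present, such as a region containing an inversion, a translocation or a chromosomal structural variation. With the method according to an embodiment of the present invention, the conserved region of the predetermined chromosome can be determined effectively and, based on the conserved region so determined, copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples.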
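In a second aspect, the present invention provides a method for determining whether a chromosome copy number variation is present in a sample genome. According to an embodiment of the present invention, the method comprises: (a) determining the conserved region of a predetermined chromosome using the method described above; (b) determining a feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and (c) determining, for the sample genome and based on the feature value obtained in step (b), whether a copy number variation is present in the predetermined chromosome. With this method, copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples; and, with reads of 28 bp, chromosome copy number variations, in particular aneuploidy, can be detected effectively by low-depth whole-genome sequencing.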
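In a third aspect, the present invention provides an apparatus for determining a conserved region of a predetermined chromosome. According to an embodiment of the present invention, the apparatus comprises: a sequencing unit for sequencing a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads; an alignment unit for aligning the sequencing result with a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence; an abnormal region determining unit for determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence; and a conserved region determining unit for selecting, for the predetermined chromosome, at least a part of the region outside the abnormal region as the conserved region of the predetermined chromosome. With this apparatus, the method for determining a conserved region of a predetermined chromosome according to an embodiment of the present invention can be implemented effectively, so that the conserved region of the predetermined chromosome can be determined effectively and, based on the conserved region so determined, copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples.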
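In a fourth aspect, the present invention provides a system for determining whether a chromosome copy number variation is present in a sample genome. According to an embodiment of the present invention, the system comprises: a device for determining a conserved region of a predetermined chromosome, as defined above, for determining the conserved region of the predetermined chromosome; a device for determining a feature value, for determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and a device for determining copy number variation, for determining, for the sample genome and based on the feature value obtained by the device for determining a feature value, whether a copy number variation is present in the predetermined chromosome. With this system, the method for determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention can be implemented effectively, so that copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples; and, with reads of 28 bp, chromosome copy number variations, in particular aneuploidy, can be detected effectively by low-depth whole-genome sequencing.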
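In a fifth aspect, the present invention provides a computer-readable medium. According to an embodiment of the present invention, instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor to determine whether a copy number variation is present in a sample genome by the following steps: (a) determining the conserved region of a predetermined chromosome using the steps for determining a conserved region of a predetermined chromosome described above; (b) determining a feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and (c) determining, for the sample genome and based on the feature value obtained in step (b), whether a copy number variation is present in the predetermined chromosome. With this computer-readable medium, the method for determining whether a copy number variation is present in a sample genome according to an embodiment of the present invention can be implemented effectively, so that copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples; and, with reads of 28 bp, chromosome copy number variations, in particular aneuploidy, can be detected effectively by low-depth whole-genome sequencing.
Additional aspects and advantages of the present invention will be given in part in the following description, will in part become apparent from the following description, or will be learned through practice of the present invention.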
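Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:
Fig. 1 shows a schematic flowchart of a method for determining a conserved region of a predetermined chromosome according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a method for determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention;
Fig. 3 shows a schematic structural diagram of an apparatus for determining a conserved region of a predetermined chromosome according to a further embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a system for determining whether a chromosome copy number variation is present in a sample genome according to a further embodiment of the present invention;
Fig. 5 shows a schematic flowchart of a method for determining whether a copy number variation is present in a sample genome according to a further embodiment of the present invention;
Fig. 6 shows the distribution of depth values after data correction for sample 1 according to a further embodiment of the present invention;
Fig. 7 shows the distribution of depth values after data correction for sample 2 according to a further embodiment of the present invention;
Fig. 8 shows the distribution of depth values after data correction for sample 3 according to a further embodiment of the present invention;
Fig. 9 shows the distribution of depth values after data correction for sample 4 according to a further embodiment of the present invention;
Fig. 10 shows the distribution of depth values after data correction for sample 5 according to a further embodiment of the present invention;
Fig. 11 shows the distribution of depth values after data correction for sample 6 according to a further embodiment of the present invention;
Fig. 12 shows the distribution of depth values after data correction for sample 7 according to a further embodiment of the present invention;
Fig. 13 shows the fitted distribution of genome-wide test significance values for sample 1 according to a further embodiment of the present invention;
Fig. 14 shows the fitted distribution of genome-wide test significance values for sample 2 according to a further embodiment of the present invention;
Fig. 15 shows the fitted distribution of genome-wide test significance values for sample 3 according to a further embodiment of the present invention;
Fig. 16 shows the fitted distribution of genome-wide test significance values for sample 4 according to a further embodiment of the present invention;
Fig. 17 shows the fitted distribution of genome-wide test significance values for sample 5 according to a further embodiment of the present invention;
Fig. 18 shows the fitted distribution of genome-wide test significance values for sample 6 according to a further embodiment of the present invention; and
Fig. 19 shows the fitted distribution of genome-wide test significance values for sample 7 according to a further embodiment of the present invention.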
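Detailed Description of the Invention
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality of" means two or more unless otherwise stated. Unless explicitly stated otherwise, the same letter has the same meaning in the formulas and labels herein.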
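I. A method for determining a conserved region of a predetermined chromosome
According to one aspect of the present invention, the present invention provides a method for determining a conserved region of a predetermined chromosome.
Referring to Fig. 1, the method for determining a conserved region of a predetermined chromosome according to an embodiment of the present invention comprises:
S100: sequencing the sample genome to obtain a sequencing result consisting of a plurality of sequencing reads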
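According to an embodiment of the present invention, sequencing the whole-genome sample further comprises: amplifying the whole-genome sample; constructing a sequencing library from the amplified genome sample; and sequencing the sequencing library. In this way, whole-genome information can be obtained effectively from the sequencing result of the sample genome, and single-cell genomes or trace nucleic acid samples can be sequenced effectively. Those skilled in the art can choose among different methods of constructing the sequencing library according to the specific genome sequencing technology used. For details of constructing a genome sequencing library, see the protocols provided by sequencer manufacturers such as Illumina, for example Illumina's Multiplexing Sample Preparation Guide (Part#1005361; Feb 2010) or Paired-End SamplePrep Guide (Part#1005063; Feb 2010), which are incorporated herein by reference.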
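According to an embodiment of the present invention, the whole genome is amplified using a PCR-based whole-genome amplification method or a non-PCR-based method.
According to a specific embodiment of the present invention, the PCR-based whole-genome amplification method is PEP-PCR, DOP-PCR or OmniPlex WGA, or the non-PCR-based method is MDA.
According to a specific embodiment of the present invention, the sequencing library is sequenced using at least one selected from the HiSeq system, the MiSeq system, the Genome Analyzer (GA) system, 454 FLX, the SOLiD system, the Ion Torrent system and single-molecule sequencing devices.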
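In addition, according to an embodiment of the present invention, the method further comprises, before sequencing the genome, a step of extracting the sample genome from a biological sample. A biological sample can thus be used directly as the starting material to obtain information on whether the sample carries a copy number variation, reflecting the health status of the organism. According to embodiments of the present invention, the biological sample that can be used is not particularly limited. According to some specific examples, the biological sample is any one selected from blood, urine, saliva, tissue, germ cells, fertilized eggs, blastomeres and embryos.
According to an embodiment of the present invention, the method and equipment for isolating single cells from the biological sample are not particularly limited. According to some specific examples, single cells can be isolated from the biological sample by at least one selected from dilution, pipette separation, micromanipulation (preferably microdissection), flow cytometric sorting and microfluidics. Single cells of the biological sample can thus be obtained effectively and conveniently for the subsequent operations, which further improves the efficiency of determining whether a copy number variation is present in the sample genome.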
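S200: aligning the sequencing result with a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence
After the sample genome has been sequenced, the sequencing result contains a plurality of sequencing reads. Aligning the sequencing result with the reference genome sequence determines the position of each read on the reference genome sequence. According to embodiments of the present invention, any known method can be used to count the total number of these sequencing data. For example, the software provided by the sequencer manufacturer can be used for the analysis. Preferably, the Short Oligonucleotide Analysis Package (SOAP) and BWA (Burrows-Wheeler Aligner) are used to align the reads with the reference genome sequence and obtain their positions on the reference genome. The alignment can be run with the default parameters of the program, or with parameters chosen by those skilled in the art as needed. In one embodiment of the present invention, the alignment software used is SOAPaligner/soap2.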
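According to an embodiment of the present invention, the reference genome sequence is the standard human genome reference sequence in the NCBI database (for example hg18, NCBI Build 36). It may also be a part of a known genome sequence, for example the sequence of at least one selected from human chromosome 21, chromosome 18, chromosome 13, the X chromosome and the Y chromosome.
According to an embodiment of the present invention, by aligning the sequencing result with the reference genome sequence, the reads that align uniquely to the reference genome sequence can be selected for subsequent analysis. This avoids interference from repetitive sequences in the copy number variation analysis and further improves the efficiency of determining the conserved region of the predetermined chromosome and of determining whether a copy number variation is present in the sample genome.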
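S300: determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence
According to an embodiment of the present invention, step S300 further comprises: (3-1) dividing the reference genome sequence into a plurality of windows and determining the sequencing depth of each window; (3-2) selecting initial breakpoints based on the sequencing depths of equal numbers of windows on either side of every endpoint of the plurality of windows; and (3-3) determining the abnormal region based on the initial breakpoints.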
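According to an embodiment of the present invention, the sequencing depth of each window is determined according to the formula
Figure PCTCN2017101965-appb-000001
where
Figure PCTCN2017101965-appb-000002
denotes the sequencing depth of the window, W denotes the number of uniquely aligned reads in each window, Rt denotes the sum of the numbers of uniquely aligned reads over all windows, and N denotes the total number of windows.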
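The depth formula itself is only available as an image, so the following is a minimal sketch under a stated assumption: the general method later in this description says that the read count of each window is normalized by the genome-wide average read count per window, which is read here as depth = W / (Rt / N). The function name `normalized_depth` is hypothetical.

```python
from typing import List


def normalized_depth(unique_counts: List[int]) -> List[float]:
    """Scale each window's unique-read count W by the mean count Rt / N.

    Assumed reading of the patent formula: depth = W / (Rt / N).
    """
    n_windows = len(unique_counts)    # N: total number of windows
    total_reads = sum(unique_counts)  # Rt: unique reads summed over all windows
    if n_windows == 0 or total_reads == 0:
        raise ValueError("need at least one window containing reads")
    mean_count = total_reads / n_windows
    return [w / mean_count for w in unique_counts]
```

Under this reading a window with exactly the average number of reads has depth 1, which is consistent with the 0.7 to 1.3 range quoted later for normal samples.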
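According to an embodiment of the present invention, before step (3-2) is carried out, the sequencing depth of each window is corrected in advance based on the GC content of that window.
According to an embodiment of the present invention, the correction comprises: (3-2-a) compiling statistics on the GC content of each window and dividing the windows into segments according to a predetermined step size, so as to obtain a plurality of GC content segments; (3-2-b) determining the median of the numbers of uniquely aligned reads of the windows in each GC content segment; and (3-2-c) determining, based on the formula
Figure PCTCN2017101965-appb-000003
the corrected sequencing depth of each window, where T denotes the corrected sequencing depth of the window,
Figure PCTCN2017101965-appb-000004
denotes the sequencing depth of the window, M denotes the median determined in step (3-2-b), and W denotes the number of uniquely aligned reads in the window.
According to a specific embodiment of the present invention, in step (3-2-a) the predetermined step size is 0.01.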
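Steps (3-2-a) to (3-2-c) can be sketched as follows. The correction formula is only available as an image, so this sketch assumes the reading given in the general method below: the ratio of W to the segment median M, multiplied by the mean normalized depth. The function name, the handling of segments whose median is zero, and the rounding guard used when binning are all illustrative choices, not taken from the patent.

```python
import math
from collections import defaultdict
from statistics import fmean, median
from typing import Dict, List, Sequence


def gc_corrected_depth(counts: Sequence[int], gc: Sequence[float],
                       depths: Sequence[float], step: float = 0.01) -> List[float]:
    """Return an assumed corrected depth T = (W / M) * mean(depths) per window."""
    # (3-2-a) assign every window to a GC segment of width `step`;
    # round() guards against floating-point noise such as 0.5 / 0.01 = 49.999...
    keys = [math.floor(round(g / step, 6)) for g in gc]
    by_segment: Dict[int, List[int]] = defaultdict(list)
    for key, w in zip(keys, counts):
        by_segment[key].append(w)
    # (3-2-b) M: median unique-read count of the windows in each GC segment
    medians = {key: median(ws) for key, ws in by_segment.items()}
    mean_depth = fmean(depths)
    corrected = []
    for key, w in zip(keys, counts):
        m = medians[key]
        # (3-2-c) hypothetical choice: windows in an all-zero segment get depth 0
        corrected.append(w / m * mean_depth if m else 0.0)
    return corrected
```

Dividing by a per-segment median removes the part of the read count that is explained by GC content alone, so windows are compared only with windows of similar GC content.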
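According to a specific embodiment of the present invention, in step (3-2) an endpoint is selected as an initial breakpoint when the sequencing depth differs significantly between equal numbers of windows on either side of it.
According to a specific embodiment of the present invention, the initial breakpoints are determined by the following steps: determining a p-value for each endpoint, the p-value representing the significance of the difference in the amount of sequencing data on the two sides; and, if the p-value of a site is less than the termination p-value, judging the site to be a breakpoint, the termination p-value preferably being at most 1.1×10^-50.
According to a specific embodiment of the present invention, 100 windows are taken on each side of each endpoint.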
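The endpoint test can be illustrated with a short sketch. The general method below mentions a runs test on equal numbers of windows to the left and right; the exact variant is not given, so this sketch assumes a Wald-Wolfowitz two-sample runs test with a normal approximation and a one-sided p-value (few runs means the two sides differ). The function names are hypothetical, ties between depth values are not treated specially, and the p-value scale of this approximation need not match the patent's termination p-value.

```python
import math
from typing import Dict, Sequence


def runs_test_p(left: Sequence[float], right: Sequence[float]) -> float:
    """One-sided p-value of a two-sample runs test (normal approximation)."""
    n1, n2 = len(left), len(right)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two windows on each side")
    # Pool both sides, order by depth, and count runs of identical side labels.
    labelled = sorted([(v, 0) for v in left] + [(v, 1) for v in right])
    runs = 1 + sum(1 for a, b in zip(labelled, labelled[1:]) if a[1] != b[1])
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)


def endpoint_p_values(depths: Sequence[float], k: int = 100) -> Dict[int, float]:
    """p-value for every endpoint i that has k windows on both sides.

    Endpoint i is the boundary between window i - 1 and window i.
    """
    return {i: runs_test_p(depths[i - k:i], depths[i:i + k])
            for i in range(k, len(depths) - k + 1)}
```

Endpoints whose p-value falls below the chosen termination p-value would then form the initial breakpoint set.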
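According to a specific embodiment of the present invention, each window has a length of 100-200 Kbp, preferably 150 Kbp.
According to a specific embodiment of the present invention, step (3-3) comprises: (3-3-a) determining a plurality of test windows based on the initial breakpoints; and (3-3-b) determining whether a test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.
According to an embodiment of the present invention, in step (3-3) the abnormal region is determined by the following steps: (3-3-1) determining a plurality of candidate breakpoints, each of which has other breakpoints both before and after it; (3-3-2) determining a p-value for each candidate breakpoint and removing the candidate with the largest p-value; (3-3-3) repeating step (3-3-2) on the remaining candidate breakpoints until the p-values of all remaining candidates are less than the termination p-value, the remaining candidates being the screened candidate breakpoints; (3-3-4) taking the region between two adjacent screened candidate breakpoints as a test window; and (3-3-5) determining whether the test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.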
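According to yet another specific embodiment of the present invention, the method further comprises: (3-4) for the reference sequence of the predetermined chromosome, after excluding the region determined in step (3-3), determining for each of the windows in the remaining region, according to the formula
Figure PCTCN2017101965-appb-000005
a stable T value of the window, where i denotes the window number, n denotes the number of consecutive windows following window i, n being an integer of at least 1 and preferably at least 10, and Tni denotes the stable T value of window i; and (3-5) selecting, based on the stable T values obtained in step (3-4), the windows with significant differences as abnormal regions.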
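The stable-value formula is only available as an image. The following sketch shows one possible reading, in which the stable T value of window i is the mean of T over window i and the n windows that follow it; the patent may instead average only the following windows. The function name and the truncation of the average near the end of the chromosome are illustrative assumptions.

```python
from statistics import fmean
from typing import List, Sequence


def stable_t(t: Sequence[float], n: int = 10) -> List[float]:
    """Assumed stable value: mean of T over window i and its n following windows.

    Windows with fewer than n followers are averaged over what remains.
    """
    if n < 1:
        raise ValueError("n must be an integer of at least 1")
    return [fmean(t[i:i + n + 1]) for i in range(len(t))]
```

For example, `stable_t([1, 2, 3, 4], n=1)` averages each window with its single successor, which damps isolated spikes before windows with significant differences are selected.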
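S400: selecting, for the predetermined chromosome, at least a part of the region outside the abnormal region as the conserved region of the predetermined chromosome
II. A method for determining whether a chromosome copy number variation is present in a sample genome
In one aspect, the present invention provides a method for determining whether a chromosome copy number variation is present in a sample genome.
Referring to Fig. 2, the method for determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention comprises:
S1000: determining the conserved region of a predetermined chromosome
Specifically, this step can follow the method for determining a conserved region of a predetermined chromosome described above.
S2000: determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region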
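According to an embodiment of the present invention, S2000 further comprises: (b-1) determining, according to the formula
Figure PCTCN2017101965-appb-000006
the average depth value of the predetermined chromosome, where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window; and (b-2) determining, based on the formula
Figure PCTCN2017101965-appb-000007
the feature value of the predetermined chromosome, where Rc is the average depth value of the predetermined chromosome,
Figure PCTCN2017101965-appb-000008
denotes the average of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.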
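Both formulas are only available as images. From the symbol definitions, a plausible reading is that Rc is the mean of the corrected depths T over the conserved windows of chromosome c, and that the feature value standardizes Rc against the other chromosomes of the same sample, i.e. (Rc - mean of Rc) / sd. The sketch below follows that assumed reading; the function names and the use of the sample (n - 1) standard deviation are illustrative choices.

```python
from statistics import fmean, stdev
from typing import Dict, List


def chromosome_mean_depth(t_by_chrom: Dict[str, List[float]]) -> Dict[str, float]:
    """Rc: mean corrected depth T over the conserved windows of each chromosome."""
    return {c: fmean(ts) for c, ts in t_by_chrom.items() if ts}


def feature_values(rc: Dict[str, float]) -> Dict[str, float]:
    """Assumed feature value: (Rc - mean Rc) / sd across chromosomes of one sample."""
    values = list(rc.values())
    if len(values) < 2:
        raise ValueError("need at least two chromosomes")
    mean_rc = fmean(values)
    sd = stdev(values)  # sample standard deviation; the patent does not specify
    if sd == 0:
        raise ValueError("all chromosomes have identical Rc")
    return {c: (r - mean_rc) / sd for c, r in rc.items()}
```

Because the mean and standard deviation come from the sample's own chromosomes, no external panel of normal samples is needed for this step, which matches the stated aim of the method.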
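S3000: determining, for the sample genome and based on the feature value obtained in S2000, whether a copy number variation is present in the predetermined chromosome
According to an embodiment of the present invention, in step S3000 whether a copy number variation is present in the test window is determined for the sample genome based on the difference between the feature value and a predetermined threshold.
According to a specific embodiment of the present invention, the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold, wherein a feature value greater than the second threshold indicates that the predetermined chromosome carries a chromosomal duplication and a feature value less than the first threshold indicates that the predetermined chromosome carries a chromosomal deletion.
According to yet another specific embodiment of the present invention, the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known to be free of the copy number variation.
According to yet another specific embodiment of the present invention, the first threshold does not exceed the lower end of the fluctuation range, and the second threshold is not lower than the upper end of the fluctuation range.
According to yet another specific embodiment of the present invention, the first threshold is at most 0.7 and the second threshold is at least 1.3.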
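The two-threshold rule can be written down directly. This minimal sketch uses the boundary values 0.7 and 1.3 from the text as defaults; the function name and the returned labels are hypothetical, and treating a value exactly equal to a threshold as normal is an assumption.

```python
def classify_chromosome(value: float, first: float = 0.7, second: float = 1.3) -> str:
    """Compare a chromosome's value with the first (lower) and second (upper) threshold."""
    if second <= first:
        raise ValueError("the second threshold must be greater than the first")
    if value > second:
        return "duplication"
    if value < first:
        return "deletion"
    return "normal"
```

For instance, the array-CGH ratios listed for the trisomic cell lines in Table 1 (1.38 to 1.48) would all be labelled as duplications by this rule.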
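According to yet another specific embodiment of the present invention, the chromosome copy number variation is at least one selected from chromosomal aneuploidy, chromosome fragment deletion, chromosome fragment gain, microdeletion and microduplication. Preferably, the method for determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention performs particularly well in detecting chromosomal aneuploidy.
III. An apparatus for determining a conserved region of a predetermined chromosome
According to a third aspect, the present invention provides an apparatus for determining a conserved region of a predetermined chromosome. The apparatus can effectively implement the aforementioned method for determining a conserved region of a predetermined chromosome, so that the conserved region of the predetermined chromosome can be determined effectively and, based on the conserved region so determined, copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively without requiring normal samples.
Referring to Fig. 3, according to an embodiment of the present invention, the apparatus 100 for determining a conserved region of a predetermined chromosome comprises: a sequencing unit 110, an alignment unit 120, an abnormal region determining unit 130 and a conserved region determining unit 140.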
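According to an embodiment of the present invention, the sequencing unit 110 sequences a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads. According to an embodiment, the apparatus 100 may further comprise a genome extraction unit (not shown in the figure) adapted to isolate a single cell from a biological sample and then extract the sample genome; the genome extraction unit is connected to the sequencing unit so as to supply the sample genome to the sequencing unit 110. A biological sample can thus be used directly as the starting material to obtain information on the conserved region of the predetermined chromosome and on copy number variation in that sample, reflecting the health status of the organism. According to an embodiment, the sequencing unit may further comprise a genome amplification component, a sequencing library construction component and a sequencing component. The genome amplification component is adapted to amplify the sample genome; the sequencing library construction component is connected to the genome amplification component and adapted to construct a sequencing library from the amplified sample genome; and the sequencing component is connected to the sequencing library construction component and adapted to sequence the sequencing library. According to an embodiment, the sequencing unit is at least one selected from second-generation sequencing technologies, such as Illumina's HiSeq system, MiSeq system and Genome Analyzer (GA) system, Roche's 454 FLX, Applied Biosystems' SOLiD system and Life Technologies' Ion Torrent system, and single-molecule sequencing devices. The high-throughput, deep-sequencing characteristics of these devices can thus be exploited to further improve the efficiency of determining the conserved region of the predetermined chromosome and of determining single-cell chromosomal aneuploidy.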
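According to an embodiment of the present invention, the alignment unit 120 is connected to the sequencing unit 110 and is used to align the sequencing result with the reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence.
According to an embodiment of the present invention, the abnormal region determining unit 130 is connected to the alignment unit 120 and is used to determine, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence.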
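According to an embodiment of the present invention, the abnormal region determining unit 130 comprises: a window division component for dividing the reference genome sequence into a plurality of windows and determining the sequencing depth of each window; an initial mutation point determining component for selecting initial breakpoints based on the sequencing depths of equal numbers of windows on either side of every endpoint of the plurality of windows; and an abnormal region determining component for determining the abnormal region based on the initial breakpoints. The abnormal region determining component comprises: a test window determining module for determining a plurality of test windows based on the initial breakpoints; and a difference comparison module for determining whether a test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold. According to an embodiment of the present invention, the abnormal region is determined as follows: determining a plurality of candidate breakpoints, each of which has other breakpoints both before and after it; determining a p-value for each candidate breakpoint and removing the candidate with the largest p-value; repeating this determination and removal on the remaining candidates until the p-values of all remaining candidates are less than the termination p-value, the remaining candidates being the screened candidate breakpoints; taking the region between two adjacent screened candidate breakpoints as a test window; and determining whether the test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.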
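The sequencing depth of each window is determined according to the formula
Figure PCTCN2017101965-appb-000009
where
Figure PCTCN2017101965-appb-000010
denotes the sequencing depth of the window, W denotes the number of uniquely aligned reads in each window, Rt denotes the sum of the numbers of uniquely aligned reads over all windows, and N denotes the total number of windows. According to a specific embodiment of the present invention, the abnormal region determining unit 130 may further comprise a correction component, connected to the initial mutation point determining component, for correcting the sequencing depth of each window based on the GC content of that window. Specifically, the correction component comprises: a GC content confirmation module adapted to compile statistics on the GC content of each window and to divide the windows into segments according to a predetermined step size, so as to obtain a plurality of GC content segments; a median statistics module adapted to determine the median of the numbers of uniquely aligned reads of the windows in each GC content segment; and a sequencing depth confirmation module adapted to determine, based on the formula
Figure PCTCN2017101965-appb-000011
the corrected sequencing depth of each window, where T denotes the corrected sequencing depth of the window,
Figure PCTCN2017101965-appb-000012
denotes the sequencing depth of the window, M denotes the median determined in step (3-2-b), and W denotes the number of uniquely aligned reads in the window.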
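Preferably, the predetermined step size is 0.01.
According to an embodiment of the present invention, an endpoint is selected as an initial breakpoint when the sequencing depth differs significantly between equal numbers of windows on either side of it.
According to an embodiment of the present invention, the initial breakpoints are determined as follows: determining a p-value for each endpoint, the p-value representing the significance of the difference in the amount of sequencing data on the two sides; and, if the p-value of a site is less than the termination p-value, judging the site to be a breakpoint, the termination p-value preferably being at most 1.1×10^-50.
According to an embodiment of the present invention, 100 windows are taken on each side of each endpoint.
According to an embodiment of the present invention, each window has a length of 100-200 Kbp, preferably 150 Kbp.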
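According to an embodiment of the present invention, the abnormal region determining unit may further comprise: a window T-value stable-value determining component which, for the reference sequence of the predetermined chromosome and after excluding the region determined by the abnormal region determining component, determines for each of the windows in the remaining region, according to the formula
Figure PCTCN2017101965-appb-000013
a stable T value of the window, where i denotes the window number, n denotes the number of consecutive windows following window i, n being an integer of at least 1 and preferably at least 10, and Tni denotes the stable T value of window i; and a significant-difference window determining component which, based on the stable T values obtained by the window T-value stable-value determining component, selects the windows with significant differences as abnormal regions.
According to an embodiment of the present invention, the conserved region determining unit 140 is connected to the abnormal region determining unit 130 and is used to select, for the predetermined chromosome, at least a part of the region outside the abnormal region as the conserved region of the predetermined chromosome.
It should be noted that, as those skilled in the art will understand, the features and advantages of the method for determining a conserved region of a predetermined chromosome described above also apply to the apparatus for determining a conserved region of a predetermined chromosome and, for brevity, are not repeated here.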
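IV. A system for determining whether a chromosome copy number variation is present in a sample genome
According to a fourth aspect, the present invention provides a system for determining whether a chromosome copy number variation is present in a sample genome. The system can effectively implement the aforementioned method for determining whether a chromosome copy number variation is present in a sample genome, so that copy number variations of single-cell chromosomes, in particular aneuploidy, can be determined effectively.
Referring to Fig. 4, the system for determining whether a chromosome copy number variation is present in a sample genome according to an embodiment of the present invention comprises: a device 100 for determining a conserved region of a predetermined chromosome, as described above, for determining the conserved region of the predetermined chromosome; a device 200 for determining a feature value, for determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and a device 300 for determining copy number variation, for determining, for the sample genome and based on the feature value obtained by the device for determining a feature value, whether a copy number variation is present in the predetermined chromosome.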
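According to an embodiment of the present invention, the device 200 for determining a feature value, which determines the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region, comprises: a chromosome average depth determining unit adapted to determine, according to the formula
Figure PCTCN2017101965-appb-000014
the average depth value of the predetermined chromosome, where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window; and a chromosome feature value determining unit adapted to determine, based on the formula
Figure PCTCN2017101965-appb-000015
the feature value of the predetermined chromosome, where Rc is the average depth value of the predetermined chromosome,
Figure PCTCN2017101965-appb-000016
denotes the average of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.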
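According to an embodiment of the present invention, the device for determining copy number variation is adapted to determine, for the sample genome and based on the difference between the feature value and a predetermined threshold, whether a copy number variation is present in the test window.
According to an embodiment of the present invention, the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold, wherein a feature value greater than the second threshold indicates that the predetermined chromosome carries a chromosomal duplication and a feature value less than the first threshold indicates that the predetermined chromosome carries a chromosomal deletion.
According to an embodiment of the present invention, the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known to be free of the copy number variation.
According to an embodiment of the present invention, the first threshold does not exceed the lower end of the fluctuation range, and the second threshold is not lower than the upper end of the fluctuation range.
According to an embodiment of the present invention, the first threshold is at most 0.7 and the second threshold is at least 1.3.
According to an embodiment of the present invention, the chromosome copy number variation is at least one selected from chromosomal aneuploidy, chromosome fragment deletion, chromosome fragment gain, microdeletion and microduplication. Preferably, the system according to an embodiment of the present invention is suitable for implementing the aforementioned method for determining whether a chromosome copy number variation is present in a sample genome, and performs particularly well in detecting chromosomal aneuploidy.
It should be noted that, as those skilled in the art will understand, the features and advantages of the method for determining whether a chromosome copy number variation is present in a sample genome described above also apply to the system for determining whether a chromosome copy number variation is present in a sample genome and, for brevity, are not repeated here.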
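V. Computer-readable medium
According to a fifth aspect, the present invention provides a computer-readable medium. According to an embodiment of the present invention, instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor to determine whether a copy number variation is present in a sample genome by means of: a device for determining a conserved region of a predetermined chromosome, as defined above, for determining the conserved region of the predetermined chromosome; a device for determining a feature value, for determining the feature value of the predetermined chromosome based on the sequencing depth of the windows in the conserved region; and a device for determining copy number variation, for determining, for the sample genome and based on the feature value obtained by the device for determining a feature value, whether a copy number variation is present in the predetermined chromosome. With this computer-readable medium, the method for determining whether a copy number variation is present in a sample genome according to an embodiment of the present invention can be implemented effectively, so that the presence of a copy number variation in a sample genome can be determined effectively; the medium is particularly suitable for chromosomal aneuploidy.
It should be noted that, as those skilled in the art will understand, the features and advantages of the method for determining whether a copy number variation is present in a sample genome described above also apply to the computer-readable medium and, for brevity, are not repeated here.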
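The solution of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the present invention and should not be regarded as limiting its scope. Where no specific technique or condition is indicated in the examples, the techniques or conditions described in the literature in the field (for example J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions are followed. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products, which can for example be purchased from Illumina.
General Method
Referring to Fig. 5, the method used in the examples for determining whether a copy number variation is present in a sample genome comprises the following: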
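1) Sequencing of the sample genome
Single cells are isolated from the sample, which may be selected from blood, urine, saliva, tissue, germ cells, fertilized eggs, blastomeres, embryos and the like. The single cell is then lysed to extract the whole genome of the sample; the lysis method is not limited. The extracted genome is then amplified. The amplification may use PCR-based methods such as PEP-PCR (primer-extension-preamplification PCR), DOP-PCR and OmniPlex WGA, or non-PCR-based methods such as MDA (multiple displacement amplification). In the examples of this patent, PicoPlex from Rubicon Genomics is preferred in view of amplification time and amplification uniformity. After amplification, library construction can be performed according to the operating manual of a second-generation sequencing platform, which can be Illumina's HiSeq, MiSeq or Genome Analyzer (GA) platform, Roche's 454 FLX, Applied Biosystems' SOLiD platform, Life Technologies' Ion Torrent platform, and so on; in this example, BGI's BGISEQ-500 platform is used for sequencing. The sequencing platform in the present invention is not limited to second-generation platforms; other sequencing methods and devices may be used, such as third-generation sequencing technology and more advanced future devices. Testing for this patent found that, with reads of 28 bp, single-cell aneuploidy can be detected effectively by low-depth whole-genome sequencing.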
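2) Quality control of raw off-machine data
The sequencing quality of the sample is filtered according to the data characteristics of the sequencing platform used. In this example, in line with the characteristics of BGISEQ-500 sequencing, the raw data are quality-controlled according to the sequencing signal intensity, the number of reads, the barcode splitting rate, and the proportion of G and C bases.
3) Sequence alignment
The sequencing result is aligned with the reference genome sequence; in this example the standard human genome sequence (hg19, NCBI Build 37) is used, and the reference can also be limited to certain chromosomes of the known genome. The reads that pass quality control are aligned with the standard human genome sequence to determine the position of each read on the genome. The alignment can be performed with existing alignment software; in the examples of the present invention, BWA (Burrows-Wheeler Aligner) is used. The inventors found that this method can still successfully detect aneuploidy when the read length is 28 bases. Those skilled in the art can choose other alignment software according to the read length of other sequencing platforms and adjust the parameters according to the alignment results. After alignment, the reads that align uniquely to the reference genome sequence are selected from the alignment result file for subsequent analysis, and duplicate reads are removed to prevent bias in the sequencing depth and erroneous results.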
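4) Quality control of individual samples
The unique alignment rate, the duplication rate, the number of uniquely aligned reads, the GC content and similar metrics of the sample are calculated from the alignment results. In view of the distribution of the data, this example uses the percentile method to estimate the normal range (for data that follow neither a normal nor a skewed distribution), with bounds at the 5th and 95th percentiles; the estimation software is SPSS Statistics 17.0. Metrics that characterize the sequencing run are selected to define the quality control range. In this example, the unique alignment rate, duplication rate, GC content and depth coefficient of variation are selected as the quality control metrics.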
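The percentile-based normal range can be illustrated with the Python standard library instead of SPSS. This is a minimal sketch: `statistics.quantiles` with `n=20` returns the 5%, 10%, ..., 95% cut points, of which the first and last are used; its default interpolation rule may differ slightly from the one used by SPSS, and the function names are hypothetical.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def qc_range(values: Sequence[float]) -> Tuple[float, float]:
    """Estimate the 5th and 95th percentiles of a QC metric over reference runs."""
    cuts = quantiles(values, n=20)  # 19 cut points at 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]


def passes_qc(value: float, low: float, high: float) -> bool:
    """A sample passes when its metric lies inside the estimated normal range."""
    return low <= value <= high
```

The same two functions could be applied separately to each metric named above (unique alignment rate, duplication rate, GC content and depth coefficient of variation).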
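5) Data correction and normalization
The human reference genome is first divided into windows, which may be of equal or unequal length. The number of reads in each interval is counted from the alignment results of the sample. The best interval length is chosen according to whether the read counts falling into all windows follow a normal distribution. The depth value of each window is corrected according to the number of uniquely aligned reads falling into the window after alignment and the GC content of the window. First, the read count of each window is normalized by the genome-wide average read count per window to obtain the normalized depth value,
Figure PCTCN2017101965-appb-000017
where W is the number of uniquely aligned reads in the window, Rt is the total number of uniquely aligned reads, and N is the total number of windows. The windows are classified by GC content in steps of 0.01. The ratio of W to the median M of the read counts of the windows in the same interval is multiplied by the average
Figure PCTCN2017101965-appb-000018
to obtain the corrected value of each window
Figure PCTCN2017101965-appb-000019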
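6) Detection of chromosome fragment variation within a sample
After data correction, the sex of each sample is determined from the number of reads uniquely aligned to the Y chromosome. The windows of the sample are traversed one by one, and a runs test is performed on equal numbers of windows adjacent to the left and right of each window, giving a test P value for each window. All P values are sorted and the non-significant window positions are removed, giving the initial breakpoint set B={b1,b2,b3,…,bn}. A second round of statistics on the depth values in the intervals to the left and right of adjacent breakpoints gives a new P value for each breakpoint. On the basis of these breakpoint P values, for a given breakpoint a statistical test is performed on the intervals to its left and right, and insignificant breakpoints are deleted iteratively. The P value and the mean depth value of each breakpoint interval are obtained. Whether a breakpoint is real is judged from the significance of its P value, whether the event is a deletion or a duplication is judged from the magnitude of the depth value, and the detection resolution is judged from the size of the breakpoint interval.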
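7) Detection of whole-chromosome variation within a sample
The corrected depth T of each window is further smoothed over an interval. To keep the T values of several consecutive windows stable, the formula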
Figure PCTCN2017101965-appb-000020
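is used with n ≥ 10, that is, the depths of the n windows following window i are used to correct Ti. The coefficient of variation within the n windows is also calculated and regions with an abnormal coefficient of variation are filtered out. The average depth value of each chromosome is calculated as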
Figure PCTCN2017101965-appb-000021
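(where j is the number of windows on the chromosome and c is the chromosome number). The mean of the chromosome depth values within the sample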
Figure PCTCN2017101965-appb-000022
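and the standard deviation sd are calculated, and the following value is calculated for the corresponding chromosome: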
Figure PCTCN2017101965-appb-000023
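A minimal sketch of step 7) follows. Because the formula images are not reproduced here, several details are assumptions: the stabilized value of window i is taken as the mean of the n windows that follow it, a region is dropped when its coefficient of variation exceeds a hypothetical cut-off `max_cv`, Rc is the mean of the retained values, and the final per-chromosome value is taken to be the standardized score (Rc minus the mean of all Rc, divided by sd).

```python
from statistics import mean, stdev


def stabilized_depths(t_values, n=10, max_cv=0.5):
    """Smooth T over the n windows after each window; drop high-CV regions."""
    kept = []
    for i in range(len(t_values) - n):
        following = t_values[i + 1:i + 1 + n]
        m = mean(following)
        # Coefficient of variation of the n windows used for smoothing.
        if m > 0 and stdev(following) / m <= max_cv:
            kept.append(m)
    return kept


def chromosome_means(windows_by_chrom, n=10, max_cv=0.5):
    """Rc: average stabilized depth of each chromosome."""
    return {
        chrom: mean(stabilized_depths(t, n, max_cv))
        for chrom, t in windows_by_chrom.items()
    }


def standardized(r_by_chrom):
    """Assumed final value: (Rc - mean of Rc) / sd across chromosomes."""
    values = list(r_by_chrom.values())
    r_bar, sd = mean(values), stdev(values)
    return {chrom: (r - r_bar) / sd for chrom, r in r_by_chrom.items()}
```

The sketch assumes every chromosome has more than n windows left after filtering; a production implementation would need to handle chromosomes for which no region survives the CV filter.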
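8) Aneuploidy determination for a single sample
Applying the above detection steps to 9 normal samples gave a normal fluctuation range of 0.7 to 1.3 for the Rc value. This range is taken as the fluctuation range of normal samples: in other samples, a value below 0.7 indicates a chromosome deletion and a value above 1.3 indicates a chromosome duplication. Segments are merged according to the deletion and duplication information for the whole chromosome obtained from the statistical tests in step 6), and the variation length is calculated. The variation status of the whole chromosome is judged from its corrected depth value from step 7). Finally, whether an aneuploid variation has occurred is determined from the whole-chromosome depth value, the variation length, and the coefficient of variation.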
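The threshold rule stated above can be written as a short function. This is a sketch of the Rc comparison only; the function name is hypothetical, and the additional criteria mentioned in the text (variation length and coefficient of variation) are not modelled.

```python
def classify_chromosome(r_c, low=0.7, high=1.3):
    """Classify a chromosome from its Rc value using the 0.7 to 1.3 normal range."""
    if r_c < low:
        return "deletion"
    if r_c > high:
        return "duplication"
    return "normal"
```

For instance, a chromosome with an Rc of 1.46 would be labelled a duplication under this rule, while values inside the range are treated as normal fluctuation.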
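Example: Detection of chromosomal aneuploidy in seven commercially purchased cell lines
The present invention uses commercially purchased cell lines, seven samples in total; details are given in Table 1.
Table 1: The seven samples of the example
Sample name Array-CGH result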
Sample1 47,XY,+15(1.38)
Sample2 47,XX,+18(1.48)
Sample3 47,XX,+21(1.46)
Sample4 47,XX,+9(1.41)
Sample5 47,XX,+15(1.43)
Sample6 47,XY,+18(1.46)
Sample7 47,XY,+21(1.45)
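(1) Picking single cells from the cell lines
Cell lines of known karyotype were purchased commercially (Coriell Institute for Medical Research), the cells were digested, and cells were sorted with a micromanipulator (Eppendorf, NK2). Hoechst staining solution (LIFE TECHNOLOGIES, 1660845) was added to the digested cells, which were stained for 15 min at room temperature in the dark. The stained cell suspension was spread on a glass slide pre-coated with 1% BSA (NEB, B9001S) diluted in PBS (LIFE TECHNOLOGIES, 14190-144). One nucleated cell showing clear cell morphology in bright field and meeting the criteria in the fluorescence field was picked into a labelled PCR tube (AXYGEN, MCT-150-C) to which 4 μL of PBS had been added in advance as base solution, as required by the experiment. After picking, the tube was centrifuged in preparation for the single-cell amplification reaction.
(2) Single-cell whole-genome amplification
WGA was performed on the embryonic cells using the reagents in the kit. The amplification comprises three steps. First, cell lysis: a mixture prepared from cell lysis buffer and cell lysis enzyme is added to the PCR tube containing the collected cell and incubated at 75 °C for 10 min and 95 °C for 4 min to lyse the cell and release its DNA. Second, pre-amplification: a mixture prepared from pre-amplification buffer and pre-amplification enzyme is added to the reaction from the previous step, which is incubated at 95 °C for 2 min, followed by 12 cycles of 95 °C for 15 s, 15 °C for 50 s, 25 °C for 40 s, 35 °C for 30 s, 65 °C for 40 s, and 75 °C for 40 s. Third, post-amplification: a mixture prepared from post-amplification buffer, post-amplification enzyme, and nuclease-free water is added to the reaction from the previous step, which is incubated at 95 °C for 2 min, followed by 14 cycles of 95 °C for 15 s, 65 °C for 1 min, and 75 °C for 1 min. The amplified product can be used directly for downstream analysis or stored at -20 °C.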
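(3) Sequencing library preparation
A library was constructed from the WGA product of the cells using the reagents in the kit. Library construction comprises four steps: DNA fragmentation, end repair, adapter ligation, and PCR amplification. First, DNA fragmentation: the WGA product is quantified, a fixed amount is taken, a mixture of DNA fragmentation enzyme and DNA fragmentation buffer is added, and the reaction is incubated at 37 °C for 5 min and 75 °C for 15 min; the fragmented product is then purified. Second, end repair: the purified fragmented product is quantified, a fixed amount is taken, a mixture prepared from end-repair buffer and end-repair enzyme is added, and the reaction is incubated at 37 °C for 30 min and then at 75 °C for 15 min. Third, adapter ligation: a mixture prepared from ligation buffer and ligase is added to the reaction from the previous step, followed by one of the barcode adapters 1 to 48 (a separate adapter for each sample); the reaction is incubated at 20 °C for 20 min and the ligation product is purified with magnetic beads. Fourth, PCR amplification: a mixture prepared from PCR reaction solution and PCR primers is added to the purified DNA from the previous step, which is incubated at 98 °C for 2 min, followed by 12 cycles of 98 °C for 15 s, 56 °C for 15 s, and 72 °C for 30 s, a final extension at 72 °C for 5 min, and a hold at 4 °C. After amplification, the product is purified with magnetic beads and the concentration of the purified sample is measured.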
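(4) DNA sequencing
Sequencing was performed with next-generation high-throughput sequencing technology on the BGISEQ-500 platform developed in-house by BGI. The sequencing reagents were those in the kit, and the instrument parameter settings and operating procedures strictly followed the operating manual.
Although the instrument used in the present invention is the BGISEQ-500 and the number of sequencing cycles in the kit is SE28+10, instruments, library construction methods, and sequencing methods are continually upgraded. In practice, therefore, use of the kit is not limited to this instrument, this library construction method, or this number of sequencing cycles; it is applicable to the various library construction methods, sequencing platforms, and sequencing methods of the BGISEQ series.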
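(5) Data analysis
a. Sequence alignment
In this example, BWA (version 0.7.7-r441) was used to align the sample reads to the reference genome (hg19, NCBI Build 37). The alignment information obtained from the alignment results is shown in Table 2. Uniquely aligned reads were selected from the alignment results and, after removal of duplicate reads, used in the analysis below. The samples were quality-controlled using the information produced by the alignment.
Table 2:
Sample name Total reads Unique alignment rate Effective reads Duplication rate GC content Coefficient of variation
Sample1 9,722,711 0.5539 4,795,452 0.1096 0.4112 0.2128
Sample2 18,484,072 0.5671 8,605,341 0.1791 0.3898 0.2230
Sample3 15,861,223 0.5832 7,530,114 0.1860 0.3807 0.2333
Sample4 12,089,842 0.5366 6,136,076 0.0541 0.4087 0.2987
Sample5 14,012,636 0.5082 6,669,696 0.0634 0.4115 0.3275
Sample6 15,562,858 0.5137 7,459,813 0.067 0.4112 0.2808
Sample7 23,094,668 0.5245 11,116,101 0.0824 0.4054 0.2687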
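b. Single-sample data correction
Schematic distributions of the corrected depth values of the seven samples are shown in Figures 6 to 12. The corrected depth value of each chromosome corresponding to Figures 6 to 12 is given in Table 3.
Table 3: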
Figure PCTCN2017101965-appb-000024
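c. Segmentation breakpoint positions in single samples
Fitted distributions of the genome-wide test significance values for the seven samples are shown in Figures 13 to 19. As can be seen from Figures 13 to 19, the fitted line of significant breakpoints for the whole chromosome is above 1.3 in every case, so each sample can be judged to be aneuploid.
Industrial applicability
The method for determining a conserved region of a predetermined chromosome, the device for determining a conserved region of a predetermined chromosome, and the method, system, and computer-readable medium for determining whether a copy number variation is present in a sample genome according to the present invention can be used effectively to determine whether a copy number variation is present in a sample genome.
Although specific embodiments of the present invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions can be made to those details, and all such changes fall within the scope of protection of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.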

Claims (59)

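  1. A method for determining a conserved region of a predetermined chromosome, characterized by comprising:
    (1) sequencing a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads;
    (2) aligning the sequencing result to a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence;
    (3) determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence;
    (4) selecting, for the predetermined chromosome, at least part of the region outside the abnormal region as the conserved region of the predetermined chromosome.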
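  2. The method according to claim 1, characterized in that step (3) further comprises:
    (3-1) dividing the reference genome sequence into a plurality of windows and counting the sequencing depth of each window;
    (3-2) selecting initial breakpoints based on the sequencing depths of an equal number of windows on either side of every endpoint of the plurality of windows;
    (3-3) determining the abnormal region based on the initial breakpoints.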
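  3. The method according to claim 2, characterized in that the sequencing depth of each window is determined according to the formula
    Figure PCTCN2017101965-appb-100001
    where
    Figure PCTCN2017101965-appb-100002
    denotes the sequencing depth of the window, W denotes the number of uniquely aligned reads in each window, Rt denotes the sum of the numbers of uniquely aligned reads over all windows, and N denotes the total number of windows.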
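  4. The method according to claim 3, characterized in that, before step (3-2), the sequencing depth of each window is corrected in advance based on the GC content of that window.
  5. The method according to claim 4, characterized in that the correction comprises:
    (3-2-a) computing the GC content of each window and dividing the windows into sections by a predetermined step to obtain a plurality of GC content sections;
    (3-2-b) computing the median number of uniquely aligned reads of the windows in each GC content section;
    (3-2-c) determining the corrected sequencing depth of each window based on the formula
    Figure PCTCN2017101965-appb-100003
    where T denotes the corrected sequencing depth of the window,
    Figure PCTCN2017101965-appb-100004
    denotes the sequencing depth of the window, M denotes the median determined in step (3-2-b), and W denotes the number of uniquely aligned reads in the window.
  6. The method according to claim 5, characterized in that, in step (3-2-a), the predetermined step is 0.01.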
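  7. The method according to claim 2, characterized in that, in step (3-2), an endpoint is selected as an initial breakpoint if the sequencing depths of an equal number of windows on either side of that endpoint differ significantly.
  8. The method according to claim 7, characterized in that the initial breakpoints are determined by the following steps:
    determining a p-value for each endpoint, the p-value indicating the significance of the difference in the amount of sequencing data on the two sides; and
    if the p-value of the site is smaller than a termination p-value, judging the site to be a breakpoint, the termination p-value preferably being at most 1.1×10⁻⁵⁰.
  9. The method according to claim 7, characterized in that 100 windows are taken on each side of each endpoint.
  10. The method according to claim 2, characterized in that the windows are each 100-200 Kbp long, preferably 150 Kbp.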
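  11. The method according to claim 1, characterized in that step (3-3) comprises:
    (3-3-a) determining a plurality of test windows based on the initial breakpoints; and
    (3-3-b) determining whether a test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.
  12. The method according to claim 2, characterized in that, in step (3-3), the abnormal region is determined by the following steps:
    (3-3-1) determining a plurality of candidate breakpoints, wherein other breakpoints exist both before and after each candidate breakpoint;
    (3-3-2) determining the p-value of each candidate breakpoint and removing the candidate breakpoint with the largest p-value;
    (3-3-3) repeating step (3-3-2) on the remaining candidate breakpoints until the p-values of all remaining candidate breakpoints are smaller than the termination p-value, the remaining candidate breakpoints serving as the screened candidate breakpoints;
    (3-3-4) determining the region between two adjacent screened candidate breakpoints to be a test window;
    (3-3-5) determining whether the test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.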
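  13. The method according to claim 5, characterized by further comprising:
    (3-4) for the reference sequence of the predetermined chromosome, after excluding the region determined in step (3-3), determining, for each of all windows in the remaining region, according to the formula
    Figure PCTCN2017101965-appb-100005
    the stabilized T value of each window,
    where i denotes the window number,
    n denotes the number of consecutive windows following window i, n being an integer of at least 1, preferably an integer of at least 10, and
    Tni denotes the stabilized T value of window i;
    (3-5) selecting windows that differ significantly as abnormal regions, based on the stabilized T values of the windows obtained in step (3-4).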
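  14. The method according to claim 1, characterized in that sequencing the whole-genome sample further comprises:
    amplifying the whole-genome sample;
    constructing a sequencing library from the amplified genome sample; and
    sequencing the sequencing library.
  15. The method according to claim 1, characterized by further comprising a step of lysing the single cell to release the whole genome of the single cell.
  16. The method according to claim 15, characterized in that
    the single cell is lysed with an alkaline lysis solution to release the whole genome of the single cell.
  17. The method according to claim 14, characterized in that the whole genome is amplified by a PCR-based whole-genome amplification method or a non-PCR-based method.
  18. The method according to claim 8, characterized in that
    the PCR-based whole-genome amplification method is PEP-PCR, DOP-PCR, or OmniPlex WGA; or
    the non-PCR-based method is MDA.
  19. The method according to claim 14, characterized in that
    the sequencing library is sequenced using at least one selected from the HiSeq system, the MiSeq system, the Genome Analyzer (GA) system, the 454 FLX, the SOLiD system, the Ion Torrent system, and a single-molecule sequencing device.
  20. The method according to claim 1, characterized in that the single cell is derived from blood, urine, saliva, tissue, germ cells, fertilized eggs, blastomeres, or embryos.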
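  21. A method for determining whether a chromosomal copy number variation is present in a sample genome, characterized by comprising:
    (a) determining the conserved region of a predetermined chromosome by the method according to any one of claims 1 to 20;
    (b) determining a characteristic value of the predetermined chromosome based on the sequencing depths of the windows in the conserved region;
    (c) determining, for the sample genome, whether the predetermined chromosome has a copy number variation based on the characteristic value obtained in step (b).
  22. The method according to claim 21, characterized in that step (b) further comprises:
    (b-1) determining, according to the formula
    Figure PCTCN2017101965-appb-100006
    the average depth value of the predetermined chromosome, where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window;
    (b-2) determining, based on the formula
    Figure PCTCN2017101965-appb-100007
    the characteristic value of the predetermined chromosome, where Rc is the average depth value of the predetermined chromosome,
    Figure PCTCN2017101965-appb-100008
    denotes the mean of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.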
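  23. The method according to claim 21, characterized in that, in step (c), whether the test window has a copy number variation is determined for the sample genome based on the difference between the characteristic value and a predetermined threshold.
  24. The method according to claim 22, characterized in that the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold, wherein a characteristic value greater than the second threshold indicates that the predetermined chromosome has a chromosome duplication, and a characteristic value smaller than the first threshold indicates that the predetermined chromosome has a chromosome deletion.
  25. The method according to claim 23, characterized in that the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known not to have the copy number variation.
  26. The method according to claim 23, characterized in that the first threshold does not exceed the lower end of the fluctuation range and the second threshold is not lower than the upper end of the fluctuation range.
  27. The method according to claim 23, characterized in that the first threshold is at most 0.7 and the second threshold is at least 1.3.
  28. The method according to claim 21, characterized in that the chromosomal copy number variation is at least one selected from chromosomal aneuploidy, chromosome segment deletion, chromosome segment gain, microdeletion, and microduplication;
    preferably, the chromosomal copy number variation is chromosomal aneuploidy.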
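  29. A device for determining a conserved region of a predetermined chromosome, characterized by comprising:
    a sequencing unit for sequencing a whole-genome sample derived from a single cell to obtain a sequencing result consisting of a plurality of sequencing reads;
    an alignment unit for aligning the sequencing result to a reference genome sequence to determine the distribution of the sequencing reads on the reference genome sequence;
    an abnormal region determination unit for determining, for the predetermined chromosome, an abnormal region based on the distribution of the sequencing reads on the reference genome sequence;
    a conserved region determination unit for selecting, for the predetermined chromosome, at least part of the region outside the abnormal region as the conserved region of the predetermined chromosome.
  30. The device according to claim 28, characterized in that the abnormal region determination unit comprises:
    a window division component for dividing the reference genome sequence into a plurality of windows and counting the sequencing depth of each window;
    an initial breakpoint determination component for selecting initial breakpoints based on the sequencing depths of an equal number of windows on either side of every endpoint of the plurality of windows;
    an abnormal region determination component for determining the abnormal region based on the initial breakpoints.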
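  31. The device according to claim 29, characterized in that the sequencing depth of each window is determined according to the formula
    Figure PCTCN2017101965-appb-100009
    where
    Figure PCTCN2017101965-appb-100010
    denotes the sequencing depth of the window, W denotes the number of uniquely aligned reads in each window, Rt denotes the sum of the numbers of uniquely aligned reads over all windows, and N denotes the total number of windows.
  32. The device according to claim 30, characterized by further comprising a correction component for correcting the sequencing depth of each window based on the GC content of that window.
  33. The device according to claim 31, characterized in that the correction component comprises:
    a GC content confirmation module for computing the GC content of each window and dividing the windows into sections by a predetermined step to obtain a plurality of GC content sections;
    a median statistics module for computing the median number of uniquely aligned reads of the windows in each GC content section;
    a sequencing depth confirmation module for determining, based on the formula
    Figure PCTCN2017101965-appb-100011
    the corrected sequencing depth of each window, where T denotes the corrected sequencing depth of the window,
    Figure PCTCN2017101965-appb-100012
    denotes the sequencing depth of the window, M denotes the median determined in step (3-2-b), and W denotes the number of uniquely aligned reads in the window.
  34. The device according to claim 32, characterized in that the predetermined step is 0.01.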
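  35. The device according to claim 29, characterized in that an endpoint is selected as an initial breakpoint if the sequencing depths of an equal number of windows on either side of that endpoint differ significantly.
  36. The device according to claim 34, characterized in that the initial breakpoints are determined in the following manner:
    determining a p-value for each endpoint, the p-value indicating the significance of the difference in the amount of sequencing data on the two sides; and
    if the p-value of the site is smaller than a termination p-value, judging the site to be a breakpoint, the termination p-value preferably being at most 1.1×10⁻⁵⁰.
  37. The device according to claim 35, characterized in that 100 windows are taken on each side of each endpoint.
  38. The device according to claim 29, characterized in that the windows are each 100-200 Kbp long, preferably 150 Kbp.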
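  39. The device according to claim 29, characterized in that the abnormal region determination component comprises:
    a test window determination module for determining a plurality of test windows based on the initial breakpoints; and
    a difference comparison module for determining whether a test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.
  40. The device according to claim 29, characterized in that the abnormal region is determined in the following manner:
    determining a plurality of candidate breakpoints, wherein other breakpoints exist both before and after each candidate breakpoint;
    determining the p-value of each candidate breakpoint and removing the candidate breakpoint with the largest p-value;
    repeating, on the remaining candidate breakpoints, the determination of the p-value of each candidate breakpoint and the removal of the candidate breakpoint with the largest p-value, until the p-values of all remaining candidate breakpoints are smaller than the termination p-value, the remaining candidate breakpoints serving as the screened candidate breakpoints;
    determining the region between two adjacent screened candidate breakpoints to be a test window;
    determining whether the test window is an abnormal region based on the difference between the average sequencing depth of the test window and a predetermined threshold.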
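  41. The device according to claim 32, characterized in that the abnormal region determination unit further comprises:
    a window stabilized-T-value determination component for determining, for the reference sequence of the predetermined chromosome, after excluding the region determined by the abnormal region determination component, for each of all windows in the remaining region, according to the formula
    Figure PCTCN2017101965-appb-100013
    the stabilized T value of each window,
    where i denotes the window number,
    n denotes the number of consecutive windows following window i, n being an integer of at least 1, preferably an integer of at least 10, and
    Tni denotes the stabilized T value of window i;
    a significantly-different-window determination component for selecting windows that differ significantly as abnormal regions, based on the stabilized T values of the windows obtained by the window stabilized-T-value determination component.
  42. The device according to claim 28, characterized in that the sequencing unit is at least one selected from the HiSeq system, the MiSeq system, the Genome Analyzer (GA) system, the 454 FLX, the SOLiD system, the Ion Torrent system, and a single-molecule sequencing device.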
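  43. A system for determining whether a chromosomal copy number variation is present in a sample genome, characterized by comprising:
    a device for determining a conserved region of a predetermined chromosome, as defined in any one of claims 28 to 41, for determining the conserved region of the predetermined chromosome;
    a device for determining a characteristic value, for determining the characteristic value of the predetermined chromosome based on the sequencing depths of the windows in the conserved region;
    a device for determining copy number variation, for determining, for the sample genome, whether the predetermined chromosome has a copy number variation based on the characteristic value obtained by the device for determining a characteristic value.
  44. The system according to claim 42, characterized in that the device for determining a characteristic value comprises:
    a chromosome average depth determination unit adapted to determine, according to the formula
    Figure PCTCN2017101965-appb-100014
    the average depth value of the predetermined chromosome, where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window;
    a chromosome characteristic value determination unit adapted to determine, based on the formula
    Figure PCTCN2017101965-appb-100015
    the characteristic value of the predetermined chromosome, where Rc is the average depth value of the predetermined chromosome,
    Figure PCTCN2017101965-appb-100016
    denotes the mean of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.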
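  45. The system according to claim 43, characterized in that the device for determining copy number variation is adapted to determine, for the sample genome, whether the test window has a copy number variation based on the difference between the characteristic value and a predetermined threshold.
  46. The system according to claim 44, characterized in that the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold, wherein a characteristic value greater than the second threshold indicates that the predetermined chromosome has a chromosome duplication, and a characteristic value smaller than the first threshold indicates that the predetermined chromosome has a chromosome deletion.
  47. The system according to claim 45, characterized in that the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known not to have the copy number variation.
  48. The system according to claim 45, characterized in that the first threshold does not exceed the lower end of the fluctuation range and the second threshold is not lower than the upper end of the fluctuation range.
  49. The system according to claim 45, characterized in that the first threshold is at most 0.7 and the second threshold is at least 1.3.
  50. The system according to claim 42, characterized in that the chromosomal copy number variation is at least one selected from chromosomal aneuploidy, chromosome segment deletion, chromosome segment gain, microdeletion, and microduplication.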
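  51. A computer-readable medium, characterized in that instructions are stored in the computer-readable medium, the instructions being adapted to be executed by a processor to determine whether a chromosomal copy number variation is present in a sample genome by the following steps:
    (a) determining the conserved region of a predetermined chromosome by the steps for determining a conserved region of a predetermined chromosome as defined in any one of claims 1 to 20;
    (b) determining a characteristic value of the predetermined chromosome based on the sequencing depths of the windows in the conserved region;
    (c) determining, for the sample genome, whether the predetermined chromosome has a copy number variation based on the characteristic value obtained in step (b).
  52. The computer-readable medium according to claim 50, characterized in that step (b) further comprises:
    (b-1) determining, according to the formula
    Figure PCTCN2017101965-appb-100017
    the average depth value of the predetermined chromosome, where Rc denotes the average depth value of chromosome c, c denotes the chromosome number, j denotes the total number of windows in the conserved region on chromosome c, and Tj denotes the corrected sequencing depth of a window;
    (b-2) determining, based on the formula
    Figure PCTCN2017101965-appb-100018
    the characteristic value of the predetermined chromosome, where Rc is the average depth value of the predetermined chromosome,
    Figure PCTCN2017101965-appb-100019
    denotes the mean of the Rc values of the chromosomes in the sample, and sd denotes the standard deviation of the Rc values of the chromosomes in the sample.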
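  53. The computer-readable medium according to claim 50, characterized in that, in step (c), whether the test window has a copy number variation is determined for the sample genome based on the difference between the characteristic value and a predetermined threshold.
  54. The computer-readable medium according to claim 52, characterized in that the predetermined threshold comprises a first threshold and a second threshold, the second threshold being greater than the first threshold, wherein a characteristic value greater than the second threshold indicates that the predetermined chromosome has a chromosome duplication, and a characteristic value smaller than the first threshold indicates that the predetermined chromosome has a chromosome deletion.
  55. The computer-readable medium according to claim 53, characterized in that the first threshold and the second threshold are determined based on the fluctuation range of the Rc values of a plurality of reference samples, wherein the reference samples are known not to have the copy number variation.
  56. The computer-readable medium according to claim 53, characterized in that the first threshold does not exceed the lower end of the fluctuation range and the second threshold is not lower than the upper end of the fluctuation range.
  57. The computer-readable medium according to claim 53, characterized in that the first threshold is at most 0.7 and the second threshold is at least 1.3.
  58. The computer-readable medium according to claim 50, characterized in that the chromosomal copy number variation is at least one selected from chromosomal aneuploidy, chromosome segment deletion, chromosome segment gain, microdeletion, and microduplication.
  59. The computer-readable medium according to claim 50, characterized in that the chromosomal copy number variation is chromosomal aneuploidy.
PCT/CN2017/101965 2017-09-15 2017-09-15 Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium WO2019051812A1 (zh)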

Priority Applications (2)

Application Number Priority Date Filing Date Title
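CN201780094527.7A CN111052249B (zh) 2017-09-15 2017-09-15 Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium
PCT/CN2017/101965 WO2019051812A1 (zh) 2017-09-15 2017-09-15 Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium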

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
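PCT/CN2017/101965 WO2019051812A1 (zh) 2017-09-15 2017-09-15 Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium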

Publications (1)

Publication Number Publication Date
WO2019051812A1 true WO2019051812A1 (zh) 2019-03-21

Family

ID=65722245

Family Applications (1)

Application Number Title Priority Date Filing Date
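PCT/CN2017/101965 WO2019051812A1 (zh) 2017-09-15 2017-09-15 Method for determining a conserved region of a predetermined chromosome, method for determining whether a copy number variation is present in a sample genome, system, and computer-readable medium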

Country Status (2)

Country Link
CN (1) CN111052249B (zh)
WO (1) WO2019051812A1 (zh)

Also Published As

Publication number Publication date
CN111052249A (zh) 2020-04-21
CN111052249B (zh) 2024-04-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17925143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17925143

Country of ref document: EP

Kind code of ref document: A1