WO2014183270A1 - Method and device for detecting chromosomal structural abnormalities - Google Patents

Method and device for detecting chromosomal structural abnormalities

Info

Publication number
WO2014183270A1
Authority
WO
WIPO (PCT)
Prior art keywords
read length
read
cluster
pair
clusters
Prior art date
Application number
PCT/CN2013/075622
Other languages
English (en)
French (fr)
Inventor
杨传春
Original Assignee
深圳华大基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
First worldwide family litigation filed litigation Critical https://patents.darts-ip.com/?family=51897591&utm_source=google_patent&utm_medium=platform_link&utm_campaign=public_patent_search&patent=WO2014183270(A1) "Global patent litigation dataset” by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.
Application filed by 深圳华大基因科技有限公司
Priority to PL13884613.4T: patent PL2998407T5 (pl)
Priority to US14/890,989: patent US11004538B2 (en)
Priority to EP13884613.4A: patent EP2998407B2 (en)
Priority to HUE13884613A: patent HUE047501T2 (hu)
Priority to ES13884613T: patent ES2766860T5 (es)
Priority to CN201380004734.0A: patent CN104302781B (zh)
Priority to RU2015153453A: patent RU2654575C2 (ru)
Priority to PCT/CN2013/075622: patent WO2014183270A1 (zh)
Publication of WO2014183270A1: patent WO2014183270A1 (zh)

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • The invention relates to the technical field of genomics and bioinformatics, in particular to a method and a device for detecting chromosomal structural abnormalities.
  • Karyotype analysis, for example G-band karyotype analysis: because chromosomal structural abnormalities are judged from the distribution of 400 to 600 bands, abnormalities can usually be detected only at the chromosome level. In the best case, deletions and duplications above 5 Mbp can be detected; smaller fragments (<5 Mbp) cannot be detected. Moreover, this method requires culturing living cells, so the cells must remain viable.
  • Fluorescence in situ hybridization (FISH): deletions, duplications, and balanced translocations of smaller fragments can be detected, but the chromosome fragments to be detected must be determined in advance in order to prepare the corresponding probes, so the method is limited by probe design. Because FISH cannot detect unknown regions, it is often used to verify test results.
  • Microarray methods: these include two probe designs, one based on single nucleotide polymorphisms (SNP) and one based on copy number variation (CNV), and have limitations similar to those of FISH.
  • A method for detecting a chromosomal structural abnormality includes the steps of: obtaining a whole genome sequencing result of a target individual, the result including a plurality of read pairs, each read pair consisting of two read sequences located at the two ends of the sequenced chromosome fragment, where the two read sequences of each pair come from the positive and negative strands of the corresponding chromosome fragment respectively, or both come from the positive strand or both from the negative strand;
  • aligning the sequencing result with a reference sequence to obtain an abnormal match set, the abnormal match set including a first type of read pair whose two read sequences are matched to different chromosomes of the reference sequence;
  • clustering the read sequences in the abnormal match set into clusters according to the matched positions, each cluster containing the read sequences from one end of a group of read pairs, with the corresponding read sequences from the other end located in another cluster;
  • An apparatus for detecting a chromosomal structural abnormality includes: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, coupled to the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, where executing the program comprises performing the foregoing method for detecting a chromosomal structural abnormality.
  • A computer readable storage medium for storing a program for execution by a computer is also provided. Those skilled in the art will understand that, when the program is executed, all or part of the steps of the above method for detecting a chromosomal structural abnormality can be completed by instructing the related hardware.
  • The storage medium may include a read only memory, a random access memory, a magnetic disk, an optical disk, and the like.
  • The method according to the present invention obtains read pairs matched to different chromosomes by aligning the whole genome sequencing result with the reference sequence, thereby enabling screening for chromosomal translocation structural abnormalities, and further improves the reliability of the obtained results through clustering and filtering, making it possible to obtain analytically meaningful results.
  • FIG. 1 is a schematic diagram of a pair of Reads obtained by double-end sequencing according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the first type of abnormally matched Reads according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the second type of abnormally matched Reads according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the third type of abnormally matched Reads according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a pair of clusters located on different chromosomes according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the RPK of "FA" in Experimental Example 1 according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the RPK of "SON" in Experimental Example 1 according to an embodiment of the present invention.
  • A method for detecting a chromosomal structural abnormality comprises the following steps:
  • Step1. Obtain the whole genome sequencing results of the target individual.
  • The sequencing results include read pairs (Reads). Each pair of Reads consists of two read sequences located at the two ends of the sequenced chromosome fragment; the two read sequences come from the positive strand and the negative strand of the corresponding chromosome fragment respectively, or both come from the positive strand or both from the negative strand.
  • The sequenced chromosome fragments are usually obtained by fragmenting the chromosome sample from the target individual, and the library is prepared according to the selected sequencing method.
  • The sequencing method may be chosen according to the sequencing platform, which includes but is not limited to CG (Complete Genomics), Illumina/Solexa, ABI/SOLiD, and Roche 454; a single-end or double-end sequencing library is prepared according to the selected platform.
  • Double-end sequencing can be performed, in which case the two read sequences Read1 and Read2 in each pair of Reads are derived from the positive strand Sp and the negative strand Sm of the corresponding chromosome fragment respectively, as shown in FIG.
  • The length L-r1 of Read1 and the length L-r2 of Read2 may be the same or different.
  • If a single-read sequencing method is used to obtain the complete sequence of the entire chromosome fragment, it is also feasible to take a sequence of appropriate length from each end of the complete sequence to form a pair of Reads.
  • In that case, the two read sequences in each pair of Reads both come from the positive strand or both from the negative strand of the corresponding chromosome fragment. This embodiment does not limit the specific sequencing method selected.
  • The size of the library used for sequencing is referred to as L-lib. A library in which L-lib is 100 to 1000 bp is generally called a small fragment library; a library in which L-lib is 2 Kbp, 5 Kbp to 6 Kbp, 10 Kbp, 20 Kbp, or 40 Kbp is called a large fragment library.
  • The present invention places no requirement on the size of L-lib, but in general, provided that library construction quality is ensured, a longer library is advantageous for obtaining valid results. Therefore L-lib ≥ 300 bp is preferred. Large fragment libraries, such as a 5 Kbp library, or small fragment libraries, such as a 500 bp library, can generally be used.
  • The sequencing depth of a large fragment library can be chosen to be greater than 2×, and that of a small fragment library greater than 5×.
  • The sequencing depth of the large fragment library is preferably 2×, and that of the small fragment library is preferably 5×.
  • L-r1 and L-r2 are preferably 25 bp or more, because below 25 bp the unique alignment rate drops and the complexity of obtaining the subsequent alignment results increases. L-r1 and L-r2 also need not be too large, to avoid wasting data, so 50 bp is preferred. L-r1 and L-r2 have no maximum value and can change as sequencing technology develops; for example, with current sequencing technology, L-r1 and L-r2 generally do not exceed 150 bp.
  • Step2. Align the sequencing results with the reference sequence.
  • The reference sequence is a known sequence and may be any reference template, obtained in advance, of the biological category to which the target individual belongs.
  • The reference sequence can be, for example, HG19 provided by the National Center for Biotechnology Information (NCBI).
  • A resource library containing more reference sequences may also be configured in advance, and a closer reference sequence may be selected before alignment according to factors such as the sex, race, and region of the target individual, to help obtain more accurate test results.
  • During alignment, a pair of Reads is allowed to have at most n base mismatches; n is preferably 1 or 2.
  • Normal match set *.pair: this includes Reads that meet the following description: the two read sequences Read1 and Read2 in the Reads match the same chromosome of the reference sequence, the positive/negative strand relationship of the matched positions is consistent with the positive/negative strand relationship within the Reads, and the deviation between the length L-pr of the chromosome fragment calculated from the matched positions and L-lib is less than the preset threshold V-lib.
  • V-lib is preferably 5% × L-lib to 15% × L-lib, more preferably 10% × L-lib.
  • The above thresholds are set empirically according to the standard deviation of the library size.
  • The standard deviation of small fragment libraries is about 15 bp, and that of large fragment libraries is about 50 bp. The deviation between L-pr and L-lib can be considered acceptable when it is within 3 standard deviations; for example, for a 500 bp library, the suitable range of L-pr is 455 bp to 545 bp.
  • The number of Reads can be counted according to the matched positions; for example, the number of Reads per unit length (RPU) can be counted.
  • The unit length can be set according to L-lib, for example to 1.5 to 4 times L-lib. If L-lib is 500 bp, the unit length can be set to 1 Kbp, and the RPU can then be recorded as RPK.
  • V-rm, the threshold for the range of variation of the RPU relative to its average, is preferably 10% to 30%, more preferably 20%.
  • The average value of the RPU can be obtained by statistics or by estimation.
  • The average value of the RPU can be estimated as: sequencing depth × (unit length / L-lib). If the RPU is not needed, *.pair need not be obtained.
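The RPU estimate and the per-unit counting described above can be illustrated with a minimal Python sketch. The function names and the use of a plain list of matched start positions are assumptions made for illustration; they are not part of the disclosed method.

```python
def estimate_mean_rpu(sequencing_depth, unit_length_bp, l_lib_bp):
    """Estimate the average RPU as sequencing depth x (unit length / L-lib)."""
    if l_lib_bp <= 0:
        raise ValueError("L-lib must be positive")
    return sequencing_depth * (unit_length_bp / l_lib_bp)


def count_rpu(matched_positions, unit_length_bp):
    """Count Reads per unit-length bin (hypothetical helper).

    matched_positions: matched start positions of Reads on one chromosome.
    Returns a dict mapping bin index -> number of Reads in that bin.
    """
    counts = {}
    for pos in matched_positions:
        bin_index = pos // unit_length_bp
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return counts


# Example: 5x depth, 1 Kbp unit length, 500 bp library
mean_rpk = estimate_mean_rpu(5, 1000, 500)
bins = count_rpu([10, 999, 1000, 2500], 1000)
```

With a 500 bp library and a 1 Kbp unit, the estimate is simply the depth multiplied by two, which gives a quick sanity check against the counted values.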
  • The two read sequences in the first type of Reads are matched to different chromosomes of the reference sequence; such Reads are associated with translocation structural abnormalities, such as balanced and unbalanced translocations.
  • A case of balanced translocation is shown: Read1 in a pair of Reads matches chromosome chra and Read2 matches chromosome chrb. The dotted line connecting Read1 and Read2 indicates their positional relationship in the chromosome fragment (the same below), and pa and pb indicate the positions of the possible breakpoints.
  • A breakpoint refers to the boundary point at which a chromosomal structural abnormality occurs.
  • The two read sequences in the second type of Reads match the same chromosome of the reference sequence, but L-pr is negative; such Reads are associated with tandem duplication structural abnormalities.
  • Both Read1 and Read2 in a pair of Reads match chromosome chra, but the head-to-tail order of the matched positions is opposite to the head-to-tail order of Read1 and Read2 in the chromosome fragment. pa1 and pa2 indicate the possible start and end positions of the duplicated segment, L-sv indicates the length of the duplicated segment, and the dotted line in the middle of chra in the figure indicates an omitted length (the same below).
  • The two read sequences in the third type of Reads match the same chromosome of the reference sequence, but L-pr is greater than L-lib and the deviation exceeds the preset threshold V-lib; such Reads are associated with deletion structural abnormalities.
  • Both Read1 and Read2 in a pair of Reads match chromosome chra, and the head-to-tail order of the matched positions is the same as the head-to-tail order of Read1 and Read2 in the chromosome fragment, but the distance exceeds the suitable range. pa1 and pa2 indicate the possible start and end positions of the deleted fragment, and L-sv indicates the length of the deleted fragment.
  • The abnormal match set is not limited to the above three types of Reads: any Reads, or single read sequence of a pair of Reads, that does not belong to the normal match set but can be matched to the reference sequence can be counted in the abnormal match set.
  • A person of ordinary skill in the art can associate different types of abnormal matching with the corresponding chromosomal structural abnormalities that may occur.
  • Whether the positive/negative strand relationship matches or not may be disregarded in the abnormal match set.
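As an illustration of how the normal match set and the three abnormal types can be told apart, here is a minimal Python sketch. The `AlignedPair` record, the label strings, and the way L-pr is computed (`pos2 - pos1`) are assumptions for this example; strand consistency is deliberately ignored, as the text permits for the abnormal match set.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """Hypothetical record for one pair of Reads after alignment."""
    chrom1: str  # chromosome matched by Read1
    pos1: int    # matched position of the outer end of Read1
    chrom2: str  # chromosome matched by Read2
    pos2: int    # matched position of the outer end of Read2


def classify_pair(pair, l_lib, v_lib):
    """Assign a pair of Reads to the normal set or an abnormal type."""
    if pair.chrom1 != pair.chrom2:
        return "type1_translocation"
    l_pr = pair.pos2 - pair.pos1  # fragment length implied by the matched positions
    if l_pr < 0:
        return "type2_tandem_duplication"
    if l_pr - l_lib > v_lib:
        return "type3_deletion"
    if abs(l_pr - l_lib) < v_lib:
        return "normal"
    return "other_abnormal"  # matched, but fits none of the cases above


# Example with a 500 bp library and V-lib = 10% x L-lib
label = classify_pair(AlignedPair("chr1", 1000, "chr1", 9000), l_lib=500, v_lib=50)
```

The `other_abnormal` branch reflects the statement that the abnormal match set is not limited to the three listed types.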
  • Unmatched set *.unmap: this includes Read sequences that cannot be matched to the reference sequence. These may be paired (neither Read of the pair can be matched) or single-ended (the other Read of the pair can be matched).
  • The single-ended Read sequences in *.unmap can be used for breakpoint assembly after the result clusters are obtained, to obtain a more precise breakpoint range. If breakpoint assembly is not needed, *.unmap need not be obtained.
  • Step3. Cluster the read sequences in *.sin (the abnormal match set) into clusters according to the matched positions.
  • Various clustering algorithms can be used; this embodiment does not limit the choice.
  • A simple method is to divide clusters according to a set minimum inter-cluster distance V-cl: the read sequences (Read) are sorted by position and scanned from the first Read. If the distance between the second Read and the first is less than V-cl, they are placed in the same cluster, and the scan continues from the second Read until the distance between the n-th Read and the (n-1)-th Read is greater than V-cl.
  • The second cluster then starts from the n-th Read, and the process is repeated until all Reads have been traversed.
  • During clustering it is not necessary to treat the positive and negative strands separately; clustering is based only on the positions at which the Read sequences match on the chromosome.
  • Each cluster after clustering contains the read sequences from one end of a group of Reads, and the corresponding read sequences from the other end are located in another cluster; these two clusters can therefore be referred to as a pair of clusters.
  • FIG. 5 is a schematic diagram of a pair of clusters, cluster1 and cluster2, located on different chromosomes; paired clusters may of course also be located on the same chromosome.
  • Each cluster preferably contains more than two Reads. If a single Read is farther than V-cl from both the preceding and the following Read, it can be discarded as abnormal data.
  • V-cl is generally not lower than L-lib. If it is set too low, the number of candidate clusters will be too large and the number of Reads per cluster too small, which hinders later screening and filtering and may also increase false positive results. If it is set too high, determining the breakpoint may become harder and the breakpoint range larger. A value of 10 Kbp may therefore be preferred.
  • V-cl can have different specific meanings, such as the distance between the centers of gravity of two adjacent clusters, or the distance between the two closest Reads of two adjacent clusters.
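The simple distance-based clustering described in this step can be sketched in Python as follows. This is a minimal sketch for the positions matched on a single chromosome; the function name and the `min_reads` parameter are assumptions, and V-cl is interpreted here as the gap between consecutive sorted Read positions.

```python
def cluster_by_gap(positions, v_cl, min_reads=2):
    """Split sorted Read positions into clusters wherever the gap reaches V-cl.

    positions: matched positions of Read sequences on one chromosome.
    v_cl: minimum inter-cluster distance (e.g. 10000 for 10 Kbp).
    min_reads: clusters with fewer Reads are discarded as abnormal data
               (assumed default; raise it to require more supporting Reads).
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap that is not less than V-cl closes the current cluster.
        if current and pos - current[-1] >= v_cl:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_reads]


clusters = cluster_by_gap([100, 300, 50000, 50200, 50900, 200000], v_cl=10000)
```

In the example, the isolated Read at position 200000 is farther than V-cl from its neighbours and is dropped, matching the treatment of single Reads described above.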
  • Step4. Filter the clusters obtained by clustering.
  • Filtering aims to remove as much interference as possible, such as sample contamination, sequencing errors, alignment errors, and noise, so that the results reflect true chromosomal structural abnormalities as far as possible. Filtering conditions can therefore be set according to actual needs and the types of interference that may occur.
  • This embodiment preferably provides the following filtering methods; in practical applications, one or several of them may be used, in combination or separately:
  • (A) According to the degree of compactness of the cluster: calculate the degree of compactness of each cluster, and filter out the clusters whose compactness does not satisfy the preset requirement R-va, together with the clusters paired with them.
  • Various available mathematical methods can be used to calculate the degree of compactness of each cluster.
  • For example, the degree of compactness can be expressed by the variance: the variance of the position of each Read in the cluster relative to the center or center of gravity of the cluster is calculated. The smaller the variance, the higher the compactness.
  • Before the calculation, the read sequences lying within 5% to 25%, preferably 20%, of the cluster length at the two ends of the cluster may be discarded to reduce the influence of peripheral data on the result.
  • R-va may be set as a fixed threshold, for example requiring the variance to be lower than a fixed value, or as an elimination ratio, for example requiring the rank of the variance among all clusters to fall within a preset lowest interval; for example, R-va may require the variance to rank within the lowest 2% to 10%, preferably 5%, of all clusters.
  • The degree of compactness of a cluster reflects the stability of the Read distribution, indicating whether the Read sequences are concentrated in a small interval.
  • Real structural variation is submerged in a large amount of "environmental noise", but the effect of this noise is essentially uniform across the whole genome, so it tends to show a roughly even distribution over the whole sequence (although it may also be affected by, for example, GC (guanine and cytosine) content). Where structural variation actually occurs, the Read sequences in a cluster usually show a trend similar to a normal distribution, so the degree of compactness, such as the variance, reflects the differences between clusters well.
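A minimal Python sketch of the variance-based compactness filter follows. It assumes that the trimming fraction is applied to the cluster span at each end, which is one possible reading of the text, and the function names and the dictionary of per-cluster variances are hypothetical.

```python
from statistics import pvariance


def trimmed_variance(positions, trim_fraction=0.2):
    """Variance of Read positions after discarding the cluster's two ends.

    trim_fraction: fraction of the cluster span removed at EACH end
                   (assumed interpretation of the 5% to 25% range).
    """
    ordered = sorted(positions)
    span = ordered[-1] - ordered[0]
    low = ordered[0] + trim_fraction * span
    high = ordered[-1] - trim_fraction * span
    kept = [p for p in ordered if low <= p <= high]
    if len(kept) < 2:
        kept = ordered  # too few Reads survive trimming; use all of them
    return pvariance(kept)


def keep_most_compact(variances, keep_fraction=0.05):
    """Return the ids of clusters whose variance ranks in the lowest fraction."""
    ranked = sorted(variances, key=variances.get)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:n_keep])


variance = trimmed_variance([0, 40, 50, 60, 100])
kept_ids = keep_most_compact({"a": 10.0, "b": 500.0, "c": 3.0, "d": 80.0}, keep_fraction=0.5)
```

A fixed-threshold version of R-va would compare each variance with a constant instead of ranking; clusters paired with a removed cluster would then be removed as well.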
  • (B) According to the linear correlation of the paired clusters: calculate the linear correlation of the two clusters in each pair, and filter out the paired clusters whose linear correlation does not satisfy the preset requirement R-li.
  • Various available mathematical methods can be used to calculate the linear correlation of a pair of clusters, such as calculating the correlation coefficient of the two clusters; the higher the correlation coefficient, the higher the linear correlation.
  • R-li may be set as a fixed threshold, for example requiring the correlation coefficient to be higher than a fixed value, or as an elimination ratio, for example requiring the rank of the correlation coefficient among all clusters to fall within a preset highest interval; for example, R-li may require the correlation coefficient to rank within the highest 2% to 10%, preferably 5%, of all clusters.
  • Linear correlation focuses on the consistency of the Reads distribution within the paired clusters, that is, whether the distribution trends at the two ends of the Reads are basically the same; it therefore better reflects the distribution inside the paired clusters.
  • Combining the degree of compactness of the clusters, such as the variance, with the linear correlation of the clusters to filter candidate clusters can achieve good results.
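One way to compute the linear correlation of a pair of clusters is the Pearson correlation coefficient between the matched positions of the two ends of each pair of Reads. The Python sketch below assumes this choice; the text only requires some correlation measure, and the fixed threshold value 0.9 for R-li is a hypothetical example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long position lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # all positions identical: no linear trend to measure
    return cov / sqrt(var_x * var_y)


def pair_passes(read1_positions, read2_positions, r_li=0.9):
    """Keep a paired cluster only if its correlation reaches the threshold R-li.

    read1_positions[i] and read2_positions[i] must belong to the same pair of Reads.
    """
    return pearson(read1_positions, read2_positions) >= r_li


consistent = pair_passes([100, 150, 220, 300], [5000, 5060, 5130, 5190])
scattered = pair_passes([100, 150, 220, 300], [5190, 5000, 5130, 5060])
```

An elimination-ratio version of R-li would rank all paired clusters by coefficient and keep only the highest-ranked fraction.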
  • (C) According to a control set of normal samples: the paired clusters are compared with a preset control set containing a plurality of normal samples, and the paired clusters for which the number of normal samples hit reaches the preset threshold V-con are filtered out.
  • A normal sample refers to the collection of result clusters obtained by applying an analysis process such as "alignment, clustering, filtering" to another normal individual of the same biological species as the target individual. To simplify comparison, all the Reads in a cluster can be fused into a single value, so that a pair of clusters yields a pair of fused values (similar to a pair of Reads), and the fused value pairs are used for comparison.
  • In this way, the frequency with which a result cluster occurs in normal individuals can be obtained. If a result cluster occurs frequently, it is likely to have been caused by the nature of the samples, the experimental process, the sequencing process, environmental noise, or the like, and does not mean that such structural variation has occurred in the sample itself.
  • Such a result cluster is a common false positive obtained when different samples are analyzed by the same method, and should be removed. Filtering the clusters with the control set can therefore further reduce the probability of false positives and help to obtain true structural variation results.
  • V-con can be determined according to how the normal samples were established and their characteristics. For example, the ratio of V-con to the number of normal samples in the control set may be 3% to 10%, preferably 5% to 6%; for example, if the control set contains 90 normal samples, 5 hits can be considered as reaching the threshold.
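The control-set comparison can be sketched in Python as below. The fused value is taken here to be the median position of each cluster, and a "hit" is defined as both fused values lying within a tolerance on the same chromosomes; both choices, along with the function names and the 1000 bp tolerance, are assumptions for illustration.

```python
from statistics import median


def fuse(chrom_a, positions_a, chrom_b, positions_b):
    """Fuse a pair of clusters into one value pair (median positions assumed)."""
    return (chrom_a, median(positions_a), chrom_b, median(positions_b))


def is_hit(candidate, control, tolerance):
    """True if two fused value pairs agree on chromosomes and nearby positions."""
    chrom_a, pos_a, chrom_b, pos_b = candidate
    c_chrom_a, c_pos_a, c_chrom_b, c_pos_b = control
    return (chrom_a == c_chrom_a and chrom_b == c_chrom_b
            and abs(pos_a - c_pos_a) <= tolerance
            and abs(pos_b - c_pos_b) <= tolerance)


def count_hit_samples(candidate, control_set, tolerance=1000):
    """Number of normal samples containing at least one matching result cluster."""
    return sum(
        1 for sample in control_set
        if any(is_hit(candidate, fused, tolerance) for fused in sample)
    )


def passes_control_filter(candidate, control_set, v_con, tolerance=1000):
    """Filter out the candidate once the number of hit samples reaches V-con."""
    return count_hit_samples(candidate, control_set, tolerance) < v_con


candidate = fuse("chr1", [100, 200, 300], "chr5", [9000, 9100, 9200])
control_set = [
    [("chr1", 250, "chr5", 9050)],
    [("chr2", 200, "chr5", 9100)],
    [("chr1", 180, "chr5", 9500)],
    [],
]
hits = count_hit_samples(candidate, control_set)
```

With 90 normal samples and a V-con ratio of 5% to 6%, `v_con` would be set to 5, as in the example given above.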
  • Auxiliary parameters include various parameters that help to further confirm or distinguish the type of structural abnormality, or to understand its details, for example the number of mismatches generated during alignment, the number of Reads supporting the cluster, the RPU value of the relevant region obtained from *.pair, and whether the cluster is located in an N region.
  • Auxiliary parameters can be used in two ways. One is as a filtering condition: filtering requirements related to the auxiliary parameters are set, and clusters that do not meet them are filtered out directly. The other is as a reference for judgment: the auxiliary parameters are provided together with the result clusters and judged by manual analysis. The content of this section can therefore be applied in Step 4 (for filtering) and also in the next step, Step 5 (for assisting manual analysis); the specific manner of use is not limited.
  • The following lists some auxiliary parameters and their relationship to the analysis of results. In actual use, they can be set as filtering conditions according to the following description, or used as a basis for manual analysis. Different auxiliary parameters can be used in combination or separately.
  • Number of mismatches: the average number of mismatches of the Reads in a paired cluster is generally no more than one or two, that is, one or two mismatches are allowed for each pair of Reads, preferably no more than one. If the matching requirement during alignment was already set this way, the parameter need not be considered again. If the alignment setting was looser, for example allowing two mismatches, the result clusters can be filtered or judged again according to this parameter, for example by allowing an average of only one mismatch.
  • Number of Reads supporting the cluster: that is, the number of Reads included in the paired cluster.
  • The range over which L-lib influences a breakpoint is usually larger than the sum of the spans of the paired clusters.
  • This range generally fluctuates around 2 times L-lib, for example between 1 and 4 times L-lib; when it is set, it can be relaxed or tightened as appropriate according to actual conditions.
  • RPU value of the relevant region obtained from *.pair: different types of structural abnormalities usually affect the RPU differently. For example, in the case of a balanced translocation, the RPU on both sides of the breakpoint does not change significantly, whereas in the case of a deletion or duplication, the RPU of the region between the breakpoints is significantly reduced or increased respectively. The RPU value of the relevant region can therefore be used to further verify, or help determine, the occurrence of a chromosomal structural abnormality.
  • For a duplication structural abnormality, the RPU of the region between the breakpoints should be higher than the average, with the variation exceeding V-rm;
  • for a deletion structural abnormality, the RPU of the region between the breakpoints should be below the average, with the variation exceeding V-rm.
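The RPU-based check can be sketched in Python as follows. The sketch assumes that V-rm is expressed as a fraction of the average RPU (for example 0.2 for 20%) and uses hypothetical labels for the abnormality types; the treatment of balanced translocation as "no change beyond V-rm" is likewise an assumption.

```python
def rpu_change(region_rpu, mean_rpu):
    """Relative deviation of a region's RPU from the average RPU."""
    if mean_rpu <= 0:
        raise ValueError("average RPU must be positive")
    return (region_rpu - mean_rpu) / mean_rpu


def rpu_supports(sv_type, region_rpu, mean_rpu, v_rm=0.2):
    """Check whether the RPU between breakpoints is consistent with sv_type."""
    change = rpu_change(region_rpu, mean_rpu)
    if sv_type == "duplication":
        return change > v_rm          # RPU raised by more than V-rm
    if sv_type == "deletion":
        return change < -v_rm         # RPU lowered by more than V-rm
    if sv_type == "balanced_translocation":
        return abs(change) <= v_rm    # no significant change expected
    raise ValueError(f"unknown structural abnormality type: {sv_type}")


dup_ok = rpu_supports("duplication", region_rpu=15, mean_rpu=10)
del_weak = rpu_supports("deletion", region_rpu=9, mean_rpu=10)
```

Used as a filtering condition, a result cluster would be dropped when the check returns False; used as an auxiliary parameter, the relative change itself can be reported alongside the cluster.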
  • The RPU of the relevant region may be provided graphically, in a table, or in another easy-to-read form, or the RPU changes over the entire range may be provided as graphs, tables, etc., so that the operator understands the overall situation.
  • Whether the cluster is located in an N region: alignment of Reads near an N region (which includes the centromere and telomere regions) is more complex than in other regions. If the obtained cluster is not located in an N region, a judgment can generally be made on the basis of the information obtained. If it is located in an N region, more careful verification may be required, such as the joint use of filtering conditions and auxiliary parameters, or a final determination made in combination with other external data, such as the phenotype of the target individual, and/or further precise sequencing of the breakpoints (e.g., Sanger sequencing).
  • Step5. Perform data analysis on the filtered result clusters.
  • The existence of a result cluster after filtering already reflects the possible occurrence of a corresponding type of chromosomal structural abnormality, so this step is not necessary if the aim is only to find structural abnormalities that may exist.
  • The result clusters obtained can be analyzed further. Depending on the type of result cluster, the following analysis methods can be used:
  • For each cluster, the position of the innermost Read is obtained, and a preset length extending inward from that position is taken as the range of the breakpoint. The innermost Read is defined as follows: if the cluster contains all left-end Read sequences, the rightmost Read is the innermost; if the cluster contains right-end Read sequences, the leftmost Read is the innermost. This situation is usually related to unbalanced translocations, where the Read sequences in the same cluster are distributed on one side of the breakpoint.
  • The span of the breakpoint range extending from the innermost Read can be determined according to L-lib, L-r1/L-r2, sequencing depth, etc., for example 0.5 to 2 times L-lib, and generally not more than 2 times L-lib.
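The breakpoint range derived from the innermost Read can be sketched in Python as below. The function name, the `side` labels, and the default extension of 1 × L-lib are assumptions; "inward" is taken to mean toward the breakpoint, that is, to the right for a cluster of left-end Reads and to the left for a cluster of right-end Reads.

```python
def breakpoint_range(positions, side, l_lib, extend_factor=1.0):
    """Estimate the breakpoint range from the innermost Read of a cluster.

    positions: matched positions of the Read sequences in the cluster.
    side: "left" if the cluster holds left-end Reads, "right" for right-end Reads.
    extend_factor: span as a multiple of L-lib (0.5 to 2 suggested by the text).
    Returns (start, end) of the estimated breakpoint range.
    """
    extension = int(extend_factor * l_lib)
    if side == "left":
        innermost = max(positions)   # rightmost Read is the innermost
        return (innermost, innermost + extension)
    if side == "right":
        innermost = min(positions)   # leftmost Read is the innermost
        return (max(0, innermost - extension), innermost)
    raise ValueError("side must be 'left' or 'right'")


left_range = breakpoint_range([1000, 1200, 1500], "left", l_lib=500)
right_range = breakpoint_range([3000, 3200], "right", l_lib=500, extend_factor=2.0)
```

A narrower range can be obtained afterwards through breakpoint assembly with the single-ended Read sequences from *.unmap, as noted earlier.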
  • FIG. 2 shows a case of balanced translocation. If a pair of result clusters as shown in FIG. 2 is obtained (only two read sequences are drawn in each cluster; the rest are omitted), one result cluster is located near position pa of chromosome chra, and its paired result cluster is located near position pb of chromosome chrb. In the cluster on chra, Read1 is the Read at the left end of a chromosome fragment and the adjacent Read2 is the Read at the right end of a chromosome fragment, so the breakpoint pa of chra can be considered to lie between Read1 and Read2; the analysis on chrb is similar.
  • The following result data can be output: the numbers of the two chromosomes on which a translocation structural abnormality may have occurred (the chromosomes on which the paired result clusters are respectively located), the position ranges of the two ends of the paired result cluster (the ranges on the two chromosomes, from which the span of each end of the cluster can be obtained), the breakpoint range obtained by the analysis, and the like.
  • The relevant parameters generated during filtering and other auxiliary parameters can also be output together, for example the compactness of each pair of result clusters, their linear correlation, the number of Reads supporting the pair, and graphs, tables, etc. showing the RPU changes on both sides of the breakpoints.
  • Referring to FIG. 3, a case of tandem duplication is shown. Both ends of the paired result clusters (only one read sequence is drawn in each cluster; the rest are omitted) lie within the range between the start and end points of the duplicated segment. The start and end points of the duplicated segment can therefore be taken to lie within ranges extending outward from the outermost Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
  • Compared with translocation-type abnormalities, the types of result data output for a duplication-type abnormality are roughly the same. The differences are that the chromosome numbers at the two ends of the cluster pair are identical, and that data giving the estimated length of the duplicated segment can also be output.
  • Referring to FIG. 4, a case of segment deletion is shown. Both ends of the paired result clusters (only one read sequence is drawn in each cluster; the rest are omitted) lie outside the start and end points of the deleted segment. The start and end points of the deleted segment can therefore be taken to lie within ranges extending inward from the closest Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
  • The types of result data output for a deletion-type abnormality are roughly the same as for a duplication-type abnormality, except that the output length of the segment between the estimated breakpoints represents the length of the deleted segment.
  • Step6. Breakpoint assembly.
  • In practice, N can be set reasonably according to the length of L-r1/L-r2. Because the unique alignment rate drops sharply once a sequence is shorter than 25 bp, N should be chosen so that the truncated subsequences are not shorter, or not significantly shorter, than 25 bp.
  • After breakpoint assembly, the range of the breakpoint can be effectively narrowed.
  • On this basis, probes can be prepared according to the position range of the breakpoint, and other precise sequencing methods, such as Sanger sequencing, can be used to obtain the exact breakpoint position for further study of the breakpoint. If the breakpoint range does not need to be narrowed, this step can be omitted.
  • An apparatus for detecting chromosomal structural abnormalities, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, data-connected to the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, where execution of the program includes completing all or part of the steps of the methods in the foregoing embodiments.
  • L-lib is 500 bp;
  • PE50 sequencing (paired-end sequencing; L-r1 and L-r2 are essentially 50 bp);
  • V-lib is ±45 bp;
  • V-rm for RPK is 20%;
  • V-cl is 10 Kbp (the inter-cluster distance is defined as the distance between the two nearest Reads);
  • the minimum number of Reads in a cluster is 2;
  • the ranking requirement is set to the lowest 5% range;
  • the control set includes 90 normal samples, with V-con set to 5.
  • This example is a family study of cri-du-chat syndrome.
  • The two target individuals in this example belong to one family, where "FA" denotes the father and "SON" denotes the son.
  • Low-depth whole-genome sequencing was performed on each of the two target individuals; the sequencing depth was 2.2 for "FA" and 3.1 for "SON".
  • Numbers of the two chromosomes on which the paired result clusters lie: chr12, chr5
  • Numbers of the two chromosomes on which the paired result clusters lie: chr12, chr5
  • Change of RPK in the relevant chromosome regions: as shown in Figure 7, the abscissa is the position on the chromosome in units of 10 Kbp, the ordinate is RPK, the curve is drawn from the SON.pair data, and pa and pb mark the breakpoint positions. The figure shows that the RPK of "SON" changes markedly. The calculated RPK values show that the RPK of the short arm of chromosome 5 of SON is only 0.5 times the average, while that of the short arm of chromosome 12 is 0.5 times higher than the average.
  • This example is a study of congenital heart disease.
  • The target individual in this example is a patient with congenital heart disease, denoted "XX".
  • The sequencing results are then aligned against the reference sequence HG19 using the SOAP alignment software to obtain XX.sin.
  • Numbers of the two chromosomes on which the paired result clusters lie: chr14, chr14
  • Position ranges of the two ends of the paired result clusters: 73557040-73557288, 73670432-73670682
  • Compactness (variance) of the left and right ends: 100.63, 100.59

Abstract

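A method and device for detecting chromosomal structural abnormalities. The method includes: obtaining whole-genome sequencing results of a target individual, namely multiple read pairs located at the two ends of the sequenced chromosome fragments; aligning the sequencing results against a reference sequence to obtain an abnormal match set, which includes read pairs whose two read sequences match different chromosomes of the reference sequence; clustering the read sequences in the abnormal match set into clusters according to their matched positions; and filtering the resulting clusters using preset requirements, for example requirements related to compactness, to obtain filtered result clusters that are used to determine whether a chromosomal translocation-type structural abnormality has occurred.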

Description

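Method and device for detecting chromosomal structural abnormalities
Technical Field
The present invention relates to the technical fields of genomics and bioinformatics, and in particular to a method and device for detecting chromosomal structural abnormalities.
Background
Common chromosome examination methods currently include the following.
Karyotype analysis: for example, G-banding karyotype analysis. Because it judges chromosomal structural abnormalities from the distribution of 400 to 600 bands, it can usually detect only chromosome-level abnormalities. At best it can detect deletions and duplications larger than 5 Mbp, and it cannot detect smaller fragments (<5 Mbp). In addition, the method requires culturing living cells, so the cells must remain viable.
Fluorescence in situ hybridization (FISH): this method can detect smaller deletions, duplications, and balanced translocations, but the chromosome fragment to be examined must be determined in advance so that the corresponding probes can be prepared, and the method is therefore limited by probe design. Because FISH cannot examine unknown regions, it is commonly used to verify detection results.
Microarray: this includes two probe approaches, one designed around single nucleotide polymorphisms (SNPs) and one designed around copy number variation (CNV), so it has limitations similar to those of FISH.
As whole-genome sequencing technology continues to develop and sequencing costs continue to fall, widespread whole-genome sequencing is becoming feasible. It is therefore necessary to study ways of discovering chromosomal structural abnormalities from whole-genome sequencing results.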
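Summary of the Invention
According to one aspect of the present invention, a method for detecting chromosomal structural abnormalities is provided, including the following steps: obtaining whole-genome sequencing results of a target individual, which include multiple read pairs, each read pair consisting of two read sequences located at the two ends of the sequenced chromosome fragment, where the two read sequences of each pair come from the plus strand and the minus strand of the corresponding chromosome fragment respectively, or both come from the plus strand or both from the minus strand of the corresponding chromosome fragment; aligning the sequencing results against a reference sequence to obtain an abnormal match set, which includes first-type read pairs as described below, where the two read sequences of a first-type read pair match different chromosomes of the reference sequence; clustering the read sequences in the abnormal match set into clusters according to their matched positions, where each cluster contains the read sequences from one end of a group of read pairs and the read sequences from the other end lie in another cluster; and filtering the clusters obtained by clustering, which includes calculating the compactness of each cluster and filtering out any cluster whose compactness does not meet the preset requirement R-va together with its paired cluster, to obtain filtered result clusters containing first-type read pairs, which are used to determine whether a chromosomal translocation-type structural abnormality has occurred.
According to another aspect of the present invention, a device for detecting chromosomal structural abnormalities is provided, including: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, data-connected to the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, where execution of the program includes completing the above method for detecting chromosomal structural abnormalities.
According to yet another aspect of the present invention, a computer-readable storage medium is provided for storing a program to be executed by a computer. A person of ordinary skill in the art will understand that, when the program is executed, all or part of the steps of the above method for detecting chromosomal structural abnormalities can be completed by instructing the relevant hardware. The storage medium may include read-only memory, random-access memory, a magnetic disk, an optical disc, and so on.
The method according to the present invention obtains read pairs that match different chromosomes by aligning whole-genome sequencing results against a reference sequence, which makes it possible to screen for chromosomal translocation-type structural abnormalities. Clustering and filtering further improve the validity and reliability of the results, so that results meaningful for analysis can be obtained.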
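Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the drawings, in which:
Figure 1 is a schematic diagram of a pair of Reads obtained by paired-end sequencing according to an embodiment of the present invention;
Figure 2 is a schematic diagram of abnormally matched first-type Reads according to an embodiment of the present invention;
Figure 3 is a schematic diagram of abnormally matched second-type Reads according to an embodiment of the present invention;
Figure 4 is a schematic diagram of abnormally matched third-type Reads according to an embodiment of the present invention;
Figure 5 is a schematic diagram of a pair of clusters located on different chromosomes according to an embodiment of the present invention;
Figure 6 is a schematic diagram of the RPK of "FA" in Experimental Example 1 according to an embodiment of the present invention;
Figure 7 is a schematic diagram of the RPK of "SON" in Experimental Example 1 according to an embodiment of the present invention.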
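Detailed Description
According to an embodiment of the present invention, a method for detecting chromosomal structural abnormalities is provided, including the following steps:
Step1. Obtain the whole-genome sequencing results of the target individual.
The sequencing results include read pairs (paired reads, referred to below as Reads). Each pair of Reads consists of two read sequences located at the two ends of the sequenced chromosome fragment. The two read sequences of each pair come from the plus strand and the minus strand of the corresponding chromosome fragment respectively, or both come from the plus strand or both from the minus strand of the corresponding chromosome fragment.
The sequenced chromosome fragments are usually obtained by fragmenting a chromosome sample from the target individual, and the corresponding library is prepared according to the chosen sequencing method. Available sequencing methods, by sequencing platform, include but are not limited to CG (Complete Genomics), Illumina/Solexa, ABI/SOLiD, and Roche 454; a single-end or paired-end sequencing library is prepared according to the chosen platform. According to one specific embodiment of the present invention, paired-end sequencing can be performed, in which the two read sequences Read1 and Read2 of each pair of Reads come from the plus strand Sp and the minus strand Sm of the corresponding chromosome fragment respectively, as shown in Figure 1. The length L-r1 of Read1 and the length L-r2 of Read2 may be the same or different. Of course, if the single-end (single-read) sequencing method used can obtain the complete sequence of the entire chromosome fragment, it is also feasible to cut sequences of suitable length from the two ends of the complete sequence to form a pair of Reads; in that case, the two read sequences of each pair of Reads both come from the plus strand or both from the minus strand of the corresponding chromosome fragment. This embodiment does not limit the specific sequencing method used.
In the present invention, the size of the library used for sequencing is denoted L-lib. Libraries with an L-lib of 100 to 1000 bp are generally called small-fragment libraries, and libraries with an L-lib of 2K, 5K-6K, 10K, 20K, or 40 Kbp are called large-fragment libraries. The present invention places no requirement on the size of L-lib. In general, however, provided that library construction quality is assured, a longer library helps in obtaining valid results, so L-lib ≥ 300 bp is preferred. A large-fragment library, for example 5 Kbp, or a small-fragment library, for example 500 bp, can typically be used. To give the sequencing results adequate abundance, the sequencing depth can be chosen to be greater than 2× for a large-fragment library and greater than 5× for a small-fragment library. To avoid wasting data, the sequencing depth is preferably 2× for a large-fragment library and 5× for a small-fragment library. Note that most of the specific values in the present invention are statistical in nature. Unless otherwise stated, any value expressed exactly therefore represents a range, namely the interval of plus or minus 10% around that value; this will not be repeated below.
L-r1 and L-r2 are preferably at least 25 bp, because below 25 bp the unique alignment rate falls, which increases the complexity of the subsequent alignment results. L-r1 and L-r2 also need not be too large, to avoid wasting data, so 50 bp is preferred. L-r1 and L-r2 have no upper limit and can change as sequencing technology develops; with current sequencing technology, for example, L-r1 and L-r2 generally do not exceed 150 bp.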
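Step2. Align the sequencing results against the reference sequence.
The reference sequence used is a known sequence and can be any reference template, obtained in advance, for the biological category to which the target individual belongs. For example, if the target individual is human, HG19 provided by the National Center for Biotechnology Information (NCBI) can be chosen as the reference sequence. Further, a resource library containing more reference sequences can be configured in advance, and before alignment a closer reference sequence can be selected according to factors such as the sex, ethnicity, and region of the target individual, which helps to obtain more accurate detection results. During alignment, depending on the alignment parameter settings, a pair of Reads is allowed at most n base mismatches, with n preferably 1 or 2. If more than n bases in a pair of Reads are mismatched, the pair is regarded as unalignable to the reference sequence; or, if all n mismatched bases fall in one Read of the pair, that Read is regarded as unalignable to the reference sequence. Various alignment software can be used for the alignment, for example SOAP (Short Oligonucleotide Analysis Package), bwa, samtools, and so on; this embodiment does not limit the choice.
According to how the Reads align, the following categories can be obtained:
(I) Normal match set *.pair: this includes Reads that meet the following description. The two read sequences Read1 and Read2 of the pair match the same chromosome of the reference sequence; the plus/minus strand relationship of the matched positions is consistent with the strand relationship within the pair; and the deviation between the chromosome fragment length L-pr calculated from the matched positions and L-lib is smaller than the preset threshold V-lib. V-lib is preferably 5% × L-lib to 15% × L-lib, and more preferably 10% × L-lib. This threshold is set empirically from the standard deviation of the library size. Empirically, the standard deviation is about 15 bp for a small-fragment library and about 50 bp for a large-fragment library, and a deviation between L-pr and L-lib within 3 standard deviations can be considered appropriate. For a 500 bp library, for example, a suitable range for L-pr is 455 bp to 545 bp.
From *.pair, the distribution of the number of Reads by matched position can be obtained. For example, the number of Reads per unit length, RPU, can be counted, with the unit length set according to L-lib, for example 1.5 to 4 times L-lib. If L-lib is 500 bp, the unit length can be set to 1 Kbp, in which case RPU can be written as RPK. The change of RPU relative to its average, for example whether the change exceeds the preset threshold V-rm, can be used to help judge whether a structural abnormality has occurred and to increase the accuracy of the result analysis. Preferably, V-rm is 10% to 30%, and more preferably 20%. The average RPU can be obtained by counting or by estimation; for example, it can be estimated as: sequencing depth × (unit length / L-lib). If RPU is not needed, *.pair does not have to be obtained.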
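(II) Abnormal match set *.sin: this includes three types of Reads that meet the following descriptions.
In first-type Reads, the two read sequences match different chromosomes of the reference sequence. These Reads are related to translocation-type structural abnormalities, such as balanced and unbalanced translocations. Figure 2 shows a case of balanced translocation: in one pair of Reads, Read1 matches chromosome chra and Read2 matches chromosome chrb, while in another pair of Reads the situation is exactly reversed. The dashed line connecting Read1 and Read2 in the figure indicates their head-to-tail positional relationship in the chromosome fragment (likewise below), and pa and pb indicate the positions of the possible breakpoints. A "breakpoint" is a boundary point at which a chromosomal structural abnormality occurs.
In second-type Reads, the two read sequences match the same chromosome of the reference sequence, but L-pr is negative. These Reads are related to tandem duplication-type structural abnormalities. As shown in Figure 3, Read1 and Read2 of a pair both match chromosome chra, but the head-to-tail relationship of the matched positions is opposite to that of Read1 and Read2 in the chromosome fragment. pa1 and pa2 indicate the start and end positions of the possible duplicated segment, L-sv indicates the length of the duplicated segment, and the dashed line in the middle of chra indicates omitted length (likewise below).
In third-type Reads, the two read sequences match the same chromosome of the reference sequence, but L-pr is greater than L-lib and the deviation exceeds the preset threshold V-lib. These Reads are related to deletion-type structural abnormalities. As shown in Figure 4, Read1 and Read2 of a pair both match chromosome chra, and the head-to-tail relationship of the matched positions is the same as that of Read1 and Read2 in the chromosome fragment, but the distance exceeds the suitable range. pa1 and pa2 indicate the start and end positions of the possible deleted segment, and L-sv indicates the length of the deleted segment.
Because the different types of Reads in the abnormal match set represent different kinds of possible chromosomal structural abnormalities, it is not necessary to obtain all of the above types of abnormally matched Reads. If only translocation-type structural abnormalities need to be detected, for example, only first-type Reads need to be taken from the alignment results. Likewise, the abnormal match set is not limited to the above three types of Reads: any Reads, or any single read sequence of a pair, that does not belong to the normal match set but can still be matched to the reference sequence can be counted in the abnormal match set. A person of ordinary skill in the art can associate the different forms of abnormal matching with the corresponding chromosomal structural abnormalities that may occur. In addition, given the possible influence of noise and other interference, whether the plus/minus strands match need not be distinguished within the abnormal match set.
(III) Unmatched set *.unmap: this includes Reads that cannot be matched to the reference sequence. These Reads may be paired (neither of the two can be matched) or single-ended (the other Read of the pair can be matched).
The single-end Reads in *.unmap can be used for breakpoint assembly after the result clusters are obtained, in order to obtain a more accurate breakpoint range. If breakpoint assembly is not needed, *.unmap does not have to be obtained.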
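The three categories above can be sketched as a small classifier. This is only an illustration under stated assumptions: the `Hit` record, the category labels, and the way the signed L-pr is computed from leftmost coordinates are hypothetical choices, and paired-end data with Read1 and Read2 on opposite strands is assumed. The default values (L-lib 500 bp, V-lib 45 bp, 50 bp reads) follow the parameters used in the experimental examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Alignment of one Read (hypothetical record, not a real aligner type)."""
    chrom: str
    pos: int      # leftmost matched coordinate on the reference
    strand: str   # "+" or "-"


def signed_fragment_length(r1: Hit, r2: Hit, read_len: int) -> int:
    """L-pr: positive if Read2 maps downstream of Read1, negative otherwise."""
    if r2.pos >= r1.pos:
        return r2.pos + read_len - r1.pos
    return -(r1.pos + read_len - r2.pos)


def classify(r1: Optional[Hit], r2: Optional[Hit],
             l_lib: int = 500, v_lib: int = 45, read_len: int = 50) -> str:
    """Assign a pair of Reads to *.pair, *.sin (by type) or *.unmap."""
    if r1 is None and r2 is None:
        return "unmap"        # neither end matches the reference
    if r1 is None or r2 is None:
        return "single_end"   # one end matches; its mate goes to *.unmap
    if r1.chrom != r2.chrom:
        return "sin_type1"    # different chromosomes: translocation-type
    l_pr = signed_fragment_length(r1, r2, read_len)
    if l_pr < 0:
        return "sin_type2"    # negative L-pr: tandem-duplication-type
    if l_pr - l_lib > v_lib:
        return "sin_type3"    # L-pr too large: deletion-type
    if abs(l_pr - l_lib) <= v_lib and r1.strand != r2.strand:
        return "pair"         # normal match
    return "sin_other"        # any other abnormal match
```

Strand information is only checked for the normal match set here, in line with the remark that strand matching need not be distinguished inside the abnormal match set.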
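Step3. Cluster the read sequences in *.sin into clusters according to their matched positions.
Any clustering algorithm can be used; this embodiment does not limit the choice. A simple approach, for example, is to divide clusters by a set minimum inter-cluster distance V-cl: search through the read sequences sorted by position, starting from the first Read. If the distance between the second Read and the first is smaller than V-cl, the two are placed in the same cluster, and the search continues from the second Read. When the distance between the n-th Read and the (n-1)-th Read is greater than V-cl, a second cluster starts at the n-th Read. This process is repeated until all Reads have been traversed. During clustering, the plus and minus strands need not be considered separately; clustering by the position at which each Read matches the chromosome is sufficient.
After clustering, each cluster contains the read sequences from one end of a group of Reads, and the read sequences from the other end lie in another cluster, so the two clusters can be called a pair of clusters. Figure 5 is a schematic diagram of a pair of clusters, cluster1 and cluster2, located on different chromosomes; paired clusters may of course also lie on the same chromosome. For the analysis after clustering to be meaningful, each cluster preferably contains two or more Reads. If a single Read is farther than V-cl from both the preceding and the following Read, it can be discarded as anomalous data.
V-cl should be no less than L-lib. If it is set too low, there will be too many candidate clusters with too few Reads each, which hinders later screening and filtering and may increase false positives. If it is set too high, breakpoints may be harder to determine and the breakpoint range becomes larger. A value of 10 Kbp is therefore preferred. Depending on the clustering algorithm used, V-cl can have different specific meanings; for example, it can be the distance between the centroids of two adjacent clusters, or the distance between the two closest Reads of two adjacent clusters.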
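The simple gap-based division described above can be sketched as follows. It is a minimal illustration, assuming the input is the list of matched positions on a single chromosome and that V-cl is interpreted as the distance between neighbouring Reads; the function name and the choice to split only when the gap is strictly greater than V-cl are assumptions.

```python
def cluster_positions(positions, v_cl=10_000, min_reads=2):
    """Split sorted Read positions into clusters wherever the gap exceeds V-cl.

    Clusters with fewer than min_reads Reads (isolated Reads) are discarded.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        # A gap larger than V-cl closes the current cluster.
        if current and pos - current[-1] > v_cl:
            if len(current) >= min_reads:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_reads:
        clusters.append(current)
    return clusters
```

In practice the function would be called once per chromosome, and the mate positions would then be looked up to determine which clusters form a pair.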
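Step4. Filter the clusters obtained by clustering.
Filtering aims to remove as much potential interference as possible, such as sample contamination, sequencing errors, alignment errors, and noise, so that the results reflect real chromosomal structural abnormalities as closely as possible. The filter conditions can therefore be set according to actual needs and the types of interference that may occur. This embodiment preferably provides the following filtering approaches; in practice, one or several of them can be used alone or in combination:
(I) By cluster compactness: calculate the compactness of each cluster, and filter out any cluster whose compactness does not meet the preset requirement R-va together with its paired cluster. Any available mathematical method can be used to calculate the compactness of a cluster. For example, compactness can be expressed by variance, calculated from the positions of the Reads in the cluster relative to the center or centroid of the cluster; the smaller the variance, the higher the compactness. Preferably, when calculating the compactness of each cluster, the read sequences lying within 5% to 25% of the length range at each end of the cluster, preferably 20%, can be left out in order to reduce the influence of peripheral data on the result. Preferably, R-va can be set as a fixed threshold, for example requiring the variance to be below a fixed threshold, or as an elimination ratio, for example requiring the variance to rank within a preset lowest interval among all clusters; for instance, R-va can be set so that the variance ranks within the lowest 2% to 10% of all clusters, preferably 5%.
Cluster compactness reflects the stability of the Read distribution, that is, whether the Reads are concentrated in a small interval. In general, real structural variation is buried in a large amount of "environmental noise", but the effect of such noise on the whole genome is essentially uniform, so it tends toward a roughly even distribution across the full sequence (although it may also be affected by, for example, GC (guanine and cytosine) content). Where real structural variation occurs, the Reads in a cluster usually tend toward something like a normal distribution. Compactness, for example variance, therefore reflects the differences between clusters well.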
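A variance-based compactness filter could look like the sketch below. Two points are assumptions rather than statements from the text: the trimming is applied to a fraction of the Reads (by count) at each end of the cluster, and the rank cutoff is taken over the variances of all clusters pooled together. The function and parameter names are hypothetical.

```python
import math
from statistics import pvariance


def compactness(positions, trim=0.20):
    """Population variance of Read positions after trimming each end.

    Assumption: `trim` is the fraction of Reads dropped at each end.
    """
    ordered = sorted(positions)
    k = int(len(ordered) * trim)
    core = ordered[k:len(ordered) - k] if k else ordered
    return pvariance(core)


def keep_most_compact(pair_variances, fraction=0.05):
    """Keep cluster pairs whose two variances both rank in the lowest fraction.

    pair_variances maps a pair id to (variance_left, variance_right).
    """
    all_vars = sorted(v for pair in pair_variances.values() for v in pair)
    n_keep = max(1, math.ceil(len(all_vars) * fraction))
    cutoff = all_vars[n_keep - 1]
    # A pair is dropped as soon as either of its clusters fails R-va.
    return {pid for pid, (a, b) in pair_variances.items()
            if a <= cutoff and b <= cutoff}
```

A fixed variance threshold, the other option mentioned for R-va, would simply replace the computed `cutoff` with a constant.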
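(II) By linear correlation of paired clusters: calculate the linear correlation of the two clusters in a pair, and filter out pairs whose linear correlation does not meet the preset requirement R-li. Any available mathematical method can be used to calculate the linear correlation of a pair of clusters, for example the correlation coefficient of the two clusters; the higher the correlation coefficient, the higher the linear correlation. Preferably, R-li can be set as a fixed threshold, for example requiring the correlation coefficient to be above a fixed threshold, or as an elimination ratio, for example requiring the correlation coefficient to rank within a preset highest interval among all clusters; for instance, R-li can be set so that the correlation coefficient ranks within the highest 2% to 10% of all clusters, preferably 5%.
Linear correlation focuses on the consistency of the Reads distribution within paired clusters, that is, whether the distribution trends at the two ends of the Reads are basically the same. It therefore better reflects the distribution inside a pair of clusters.
As a preferred embodiment, using cluster compactness, for example variance, together with the linear correlation of clusters to filter candidate clusters gives good results.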
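One way to compute such a correlation coefficient is sketched below. The text does not specify which quantities are correlated, so this sketch assumes the Pearson coefficient between the matched position of each Read in one cluster and the matched position of its mate in the paired cluster, and it reports the absolute value so that mirrored orientations also count as strongly correlated. Both choices, and the function names, are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for degenerate input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cluster_pair_correlation(mate_positions):
    """mate_positions: list of (position in cluster 1, mate position in cluster 2)."""
    xs = [a for a, _ in mate_positions]
    ys = [b for _, b in mate_positions]
    return abs(pearson(xs, ys))
```

The resulting values can then be ranked across all cluster pairs, or compared with a fixed threshold, in the same way as the variance above.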
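(III) By a control set of normal samples: compare the paired clusters with a preset control set containing multiple normal samples, and filter out pairs for which the number of normal samples hit reaches the preset threshold V-con. A normal sample is the set of result clusters obtained by passing another normal individual of the same species as the target individual through the same "alignment, clustering, filtering" analysis described above. To simplify the comparison, all Reads in a cluster can be fused into one, so that a pair of clusters yields a pair of fused values (similar to a pair of Reads), and the fused value pairs are used for the comparison. By collecting a control set with a large number of normal samples, the frequency at which a result cluster appears in normal individuals can be obtained. If a result cluster appears at high frequency, it may have been caused by sample properties, the experimental procedure, the sequencing procedure, environmental noise, and so on, and does not mean that the sample itself really carries such a structural variation. Such a result cluster is a common false positive obtained when different samples are analyzed with the same method, and it should be removed. Filtering clusters with a control set therefore further reduces the probability of false positives and helps to obtain a true structural variation result. V-con can be determined according to how the normal samples were established and their characteristics. For example, the ratio of V-con to the number of normal samples in the control set can be 3% to 10%, preferably 5% to 6%; if the control set contains 90 normal samples, hitting 5 of them can be regarded as reaching the threshold.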
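The control-set comparison can be sketched as below. Fusing a cluster into its median position and treating two fused pairs as a hit when both ends lie within a tolerance are assumptions made for illustration; the text only says that the Reads of a cluster are fused into one value. All names and the default tolerance are hypothetical.

```python
def fuse(chrom, positions):
    """Fuse all Reads of a cluster into one (chromosome, median position) value."""
    ordered = sorted(positions)
    return chrom, ordered[len(ordered) // 2]


def hits_in_controls(candidate, controls, tolerance=10_000):
    """Count normal samples that contain a fused pair close to the candidate.

    candidate: ((chrom_a, pos_a), (chrom_b, pos_b))
    controls:  one list of fused pairs per normal sample
    """
    (ca, pa), (cb, pb) = candidate
    hits = 0
    for sample in controls:
        for (xa, qa), (xb, qb) in sample:
            same_chroms = xa == ca and xb == cb
            if same_chroms and abs(qa - pa) <= tolerance and abs(qb - pb) <= tolerance:
                hits += 1
                break  # count each normal sample at most once
    return hits


def passes_control_filter(candidate, controls, v_con=5):
    """A pair is filtered out once the number of hit samples reaches V-con."""
    return hits_in_controls(candidate, controls) < v_con
```

With 90 normal samples and V-con set to 5, as in the experimental examples, a candidate seen in five or more control samples would be dropped.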
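(IV) By other auxiliary parameters: auxiliary parameters include any parameters that help to further confirm or distinguish the type of structural abnormality, or help to understand its details. Examples are the number of mismatches produced during alignment, the number of Reads supporting a cluster, the RPU values of the relevant regions obtained from *.pair, and whether the cluster lies in an N region. Auxiliary parameters can be used in two ways. One is as filter conditions: filter requirements related to the auxiliary parameters are set, and clusters that do not meet them are filtered out directly. The other is as reference information for judgment: the auxiliary parameters are provided together with the result clusters and judged by manual analysis. The content of this section can therefore be applied in Step4 (for filtering) or after the next step, Step5 (to assist manual analysis). This embodiment does not limit how the auxiliary parameters are used. Some auxiliary parameters and their relationship to result analysis are listed below. In practice they can be set as filter conditions as described, or used as supporting evidence for manual analysis, and different auxiliary parameters can be used in combination or individually.
(1) Number of mismatches: the average number of mismatches of the Reads in paired clusters generally does not exceed 1 or 2, that is, each pair of Reads is allowed 1 or 2 mismatches, preferably no more than 1. If the matching requirement during alignment was already set this way, this parameter need not be considered again. If the alignment setting was looser, for example allowing 2 mismatches, this parameter can be applied again for filtering or judgment once the result clusters are obtained, for example by allowing an average of only 1 mismatch.
(2) Number of Reads supporting the cluster: that is, the number of Reads contained in the paired clusters. In principle, the larger this parameter the better. The criterion can generally be set as being roughly equal to the normalized sequencing depth, or slightly below it (for example after rounding). The normalized sequencing depth is: sequencing depth × (range over which L-lib influences the breakpoint / L-lib) × (average span of the two ends of the paired clusters / L-lib). The "range over which L-lib influences the breakpoint" is usually larger than the "sum of the spans of the two ends of the paired clusters" and generally fluctuates around a mean of 2 times L-lib, for example between 1 and 4 times L-lib. In a specific setting it can be relaxed or tightened as appropriate.
(3) RPU values of the relevant regions obtained from *.pair: different types of structural abnormality usually affect RPU differently. In a balanced translocation, for example, the RPU on both sides of the breakpoint does not change noticeably, whereas in a deletion-type or duplication-type structural abnormality the RPU of the region between the breakpoints is clearly lower or higher. The RPU values of the relevant regions can therefore be used to further verify, or help judge, the occurrence of a chromosomal structural abnormality. For example:
For clusters containing first-type Reads, if the relationship between the Reads in the cluster indicates a balanced translocation (see Step5, section (I) below), the change of RPU on both sides of the breakpoint relative to the average should not exceed V-rm. If the relationship between the Reads in the cluster indicates an unbalanced translocation (see Step5, section (I) below), the RPU on the side of the breakpoint facing away from the result cluster should be below the average, with a change exceeding V-rm;
For clusters containing second-type Reads, the RPU of the region between the breakpoints should be above the average, with a change exceeding V-rm;
For clusters containing third-type Reads, the RPU of the region between the breakpoints should be below the average, with a change exceeding V-rm.
When RPU is used as supporting evidence for manual analysis, the RPU of the relevant regions can be provided as graphs, tables, or in any other easily readable form, or the RPU changes over the full range can be provided as graphs or tables so that the operator can see the overall picture.
(4) Whether the cluster lies in an N region: empirically, the alignment of Reads near N regions (which include centromere and telomere regions) is more complex than in other regions. If the cluster obtained does not lie in an N region, a judgment can usually be made from the information already obtained. If the cluster lies in an N region, more careful verification may be needed, for example by combining filter conditions and auxiliary parameters, or by using other external data, such as the phenotype of the target individual and/or the results of precise sequencing of the breakpoint (for example Sanger sequencing), to reach a final judgment.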
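The two formulas used by the auxiliary parameters above, the estimated average RPU from Step2 and the normalized sequencing depth in item (2), together with the RPU checks in item (3), can be written out as a short sketch. The function names and the labels for the abnormality types are hypothetical; the default values of 1 Kbp unit length, 500 bp L-lib, and 20% V-rm follow the experimental examples.

```python
def expected_rpu(depth, unit_len=1000, l_lib=500):
    """Estimated average RPU: sequencing depth x (unit length / L-lib)."""
    return depth * unit_len / l_lib


def expected_support(depth, influence_range, avg_span, l_lib=500):
    """Normalized sequencing depth used to judge the number of supporting Reads."""
    return depth * (influence_range / l_lib) * (avg_span / l_lib)


def rpu_consistent(kind, observed, mean, v_rm=0.20):
    """Check whether the observed RPU fits the expected pattern for `kind`.

    For translocations, `observed` is the RPU beside the breakpoint (for the
    unbalanced case, on the side facing away from the result cluster); for
    duplications and deletions it is the RPU between the breakpoints.
    """
    change = (observed - mean) / mean
    if kind == "balanced_translocation":
        return abs(change) <= v_rm
    if kind in ("unbalanced_translocation", "deletion"):
        return change < -v_rm
    if kind == "tandem_duplication":
        return change > v_rm
    raise ValueError(f"unknown abnormality type: {kind}")
```

As a rough illustration, a depth of 2.2 with cluster spans of 618 and 580 and an assumed influence range of 2 times L-lib gives `expected_support(2.2, 1000, 599)` of about 5.3.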
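Step5. Analyze the data of the filtered result clusters.
The existence of result clusters after filtering already indicates that a chromosomal structural abnormality of the corresponding type may have occurred, so this step is not required if the goal is only to find possible structural abnormalities. To obtain more detailed information about the structural abnormality, the result clusters can be analyzed further. Depending on the type of result cluster, the following analysis methods can be used:
(I) Chromosomal translocation-type structural abnormalities (first-type Reads)
Search the result clusters containing first-type Reads. If two adjacent read sequences occupy opposite positions within their respective Reads, take the range between the positions matched by these two read sequences as the breakpoint range. This situation is usually related to balanced translocation, where the Reads in the same cluster lie on both sides of the breakpoint.
If no such Reads exist, obtain the position of the innermost Read and extend a preset length inward from that position as the breakpoint range. The innermost Read is defined as follows: if the cluster contains only left-end Reads, the rightmost Read is the innermost Read; if the cluster contains only right-end Reads, the leftmost Read is the innermost Read. This situation is usually related to unbalanced translocation, where the Reads in the same cluster lie on one side of the breakpoint. The span of the breakpoint range extended from the innermost Read can be determined from L-lib, L-r1/L-r2, the sequencing depth, and so on, for example 0.5 to 2 times L-lib, and generally no more than 2 times L-lib.
Referring to Figure 2, a case of balanced translocation is shown. Suppose a pair of result clusters is distributed as shown in Figure 2 (only two read sequences are drawn in each cluster; the rest are omitted): one result cluster lies near position pa on chromosome chra, and its paired result cluster lies near position pb on chromosome chrb. In the cluster on chra, Read1 is the left-end Read of its chromosome fragment and the adjacent Read2 is the right-end Read of its chromosome fragment, so the breakpoint pa on chra can be taken to lie between Read1 and Read2. The analysis on chrb is similar.
Based on the above analysis, the following result data can be output for a possible translocation-type structural abnormality: the numbers of the two chromosomes involved (the chromosomes on which the result clusters lie), the position ranges of the two ends of the paired result clusters (the boundary positions of the cluster ends on the two chromosomes, from which the spans of the two ends can be obtained), the breakpoint ranges obtained by the analysis, and so on. Parameters generated during filtering and other auxiliary parameters can be output as well, for example the compactness of each cluster in the pair, their mutual linear correlation, the number of Reads supporting the pair of result clusters, and graphs or tables showing the RPU changes on both sides of the breakpoint.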
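The breakpoint search for one cluster of first-type Reads can be sketched as follows. The representation of each Read as a `(position, end)` tuple, with `end` set to "L" for the left-end Read of its fragment and "R" for the right-end Read, is an assumption, as is the use of the first adjacent pair with opposite ends. The default extension of 500 bp corresponds to 1 times L-lib.

```python
def translocation_breakpoint(reads, extend=500):
    """Estimate the breakpoint range on one chromosome from one cluster.

    reads: list of (position, end) tuples, end being "L" or "R".
    Returns (start, stop) of the breakpoint range.
    """
    ordered = sorted(reads)
    # Balanced-translocation pattern: adjacent Reads with opposite ends
    # bracket the breakpoint.
    for (p1, e1), (p2, e2) in zip(ordered, ordered[1:]):
        if e1 != e2:
            return p1, p2
    # Otherwise all Reads lie on one side: extend inward from the innermost Read.
    if ordered[0][1] == "L":
        innermost = ordered[-1][0]   # rightmost left-end Read
        return innermost, innermost + extend
    innermost = ordered[0][0]        # leftmost right-end Read
    return innermost - extend, innermost
```

The same function would be applied separately to the paired cluster on the other chromosome.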
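(II) Chromosomal tandem duplication-type structural abnormalities (second-type Reads)
Search the result clusters containing second-type Reads. In the paired clusters, take the range between the two matched positions that are farthest apart as the range in which the duplication occurs, and extend a preset length outward from each of these two positions, for example 0.5 to 2 times L-lib, as the breakpoint ranges (the start and end points of the duplicated segment).
Referring to Figure 3, a case of tandem duplication is shown. Both ends of the paired result clusters (only one read sequence is drawn in each cluster; the rest are omitted) lie within the range between the start and end points of the duplicated segment. The start and end points of the duplicated segment can therefore be taken to lie within ranges extending outward from the outermost Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
Compared with translocation-type structural abnormalities, the types of result data output for a duplication-type structural abnormality are roughly the same. The differences are that the chromosome numbers at the two ends of the cluster pair are identical, and that data giving the estimated length of the duplicated segment can also be output.
(III) Chromosomal deletion-type structural abnormalities (third-type Reads)
Search the result clusters containing third-type Reads. In the paired clusters, take the range between the two matched positions that are closest together as the range in which the deletion occurs, and extend a preset length inward from each of these two positions, for example 0.5 to 2 times L-lib, as the breakpoint ranges (the start and end points of the deleted segment).
Referring to Figure 4, a case of segment deletion is shown. Both ends of the paired result clusters (only one read sequence is drawn in each cluster; the rest are omitted) lie outside the start and end points of the deleted segment. The start and end points of the deleted segment can therefore be taken to lie within ranges extending inward from the closest Reads at the two ends of the clusters (these two Reads do not necessarily belong to the same pair of Reads).
Compared with duplication-type structural abnormalities, the types of result data output for a deletion-type structural abnormality are roughly the same. The difference is that the output length of the segment between the estimated breakpoints represents the length of the deleted segment.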
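The range rules in sections (II) and (III) differ only in which Reads are used and in the direction of extension, as the following sketch shows. The function names are hypothetical, each cluster is passed as a list of matched positions on the same chromosome, and the default extension of 500 bp corresponds to 1 times L-lib. Possible overlap of the two deletion ranges for very short deletions is not handled.

```python
def duplication_breakpoints(left_cluster, right_cluster, extend=500):
    """Extend outward from the two positions that are farthest apart."""
    start, end = min(left_cluster), max(right_cluster)
    return (start - extend, start), (end, end + extend)


def deletion_breakpoints(left_cluster, right_cluster, extend=500):
    """Extend inward from the two positions that are closest together."""
    inner_left, inner_right = max(left_cluster), min(right_cluster)
    return (inner_left, inner_left + extend), (inner_right - extend, inner_right)
```

For the cluster boundaries reported in Experimental Example 2 (73557040-73557288 and 73670432-73670682), `duplication_breakpoints` returns 73556540-73557040 and 73670682-73671182, which are the breakpoint ranges listed there.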
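Step6. Breakpoint assembly.
To narrow the breakpoint range further, the data in *.unmap can be used for breakpoint assembly. For example, obtain the single-end Reads (Reads that match the reference sequence at one end only; these can be placed in *.sin during alignment) within a set range (for example 0.5 to 2 times L-lib) around the determined breakpoint range, extract their mates from *.unmap as patch sequences, cut every patch sequence into N segments, with N preferably 2, align the subsequences obtained by cutting the patch sequences against the reference sequence again, and assemble the breakpoint region from the results that match normally.
In practice, N can be set reasonably according to the length of L-r1/L-r2. Because the unique alignment rate drops sharply once a sequence is shorter than 25 bp, N should be chosen so that the truncated subsequences are not shorter, or not significantly shorter, than 25 bp.
After breakpoint assembly, the breakpoint range can be effectively narrowed. On this basis, probes can be prepared according to the position range of the breakpoint, and other precise sequencing methods, such as Sanger sequencing, can be used to obtain the exact breakpoint position for further study of the breakpoint. If the breakpoint range does not need to be narrowed, this step can be omitted.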
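The choice of N and the cutting of patch sequences can be sketched as below. The rule of capping N so that every segment keeps at least 25 bp is one possible reading of the guidance above (it does not allow segments slightly shorter than 25 bp), and the function names are hypothetical. Re-alignment of the segments is left to the alignment software.

```python
def choose_n(read_len, min_len=25, preferred=2):
    """Largest N up to `preferred` that keeps every segment at least min_len long."""
    return max(1, min(preferred, read_len // min_len))


def split_patch(seq, n):
    """Cut a patch sequence into n contiguous segments of near-equal length."""
    base, extra = divmod(len(seq), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first segments
        parts.append(seq[start:start + size])
        start += size
    return parts
```

For a 50 bp patch sequence this yields two 25 bp segments, matching the preferred N of 2.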
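A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and so on.
According to another aspect of the present invention, a device for detecting chromosomal structural abnormalities is also provided, including: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, data-connected to the data input unit, the data output unit, and the storage unit, for executing the executable program stored in the storage unit, where execution of the program includes completing all or part of the steps of the methods in the above embodiments.
The results of running the specific detection method according to the present invention are described in detail below for specific target individuals. The parameter settings used in the detection process below are:
1. L-lib is 500 bp, with PE50 sequencing (paired-end sequencing; L-r1 and L-r2 are essentially 50 bp);
2. HG19 from NCBI is chosen as the reference sequence, and the SOAP software is used to align the sequencing results;
3. V-lib is ±45 bp; V-rm for RPK is 20%; V-cl is 10 Kbp (the inter-cluster distance is defined as the distance between the two nearest Reads); the minimum number of Reads in a cluster is 2; R-va is set so that the variance ranks within the lowest 5% of all clusters (when calculating the variance, the read sequences within 20% of the length range at each end of the cluster are ignored); R-li is set so that the correlation coefficient ranks within the lowest 5% of all clusters; the control set includes 90 normal samples; and V-con is 5.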
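Experimental Example 1
This example is a family study of cri-du-chat syndrome. The two target individuals in this example belong to one family, where "FA" denotes the father and "SON" denotes the son.
1. Low-depth whole-genome sequencing was performed on each of the two target individuals; the sequencing depth was 2.2 for "FA" and 3.1 for "SON".
2. The SOAP alignment software was then used to align the sequencing results of the two target individuals against the reference sequence HG19, giving two files, FA.sin and SON.sin.
3. The two files FA.sin and SON.sin were clustered, filtered, and analyzed. The result clusters and related parameters output were as follows:
"FA":
Numbers of the two chromosomes on which the paired result clusters lie: chr12, chr5
Position ranges of the two ends of the paired result clusters: 14779615-14780233, 23314785-23314205
Spans of the two ends of the paired result clusters: 618, 580
Number of Reads supporting the pair of result clusters: 5
Compactness (variance) of the left and right ends: 90.59, 87.01
Located in an N region: no
Breakpoint ranges: chr12: 14779968~14780233, chr5: 23314205~23314455
Change of RPK in the relevant chromosome regions: as shown in Figure 6, the abscissa is the position on the chromosome in units of 10 Kbp, the ordinate is RPK, the curve is drawn from the FA.pair data, and pa and pb mark the breakpoint positions. The figure shows that the RPK of "FA" does not change noticeably.
"SON":
Numbers of the two chromosomes on which the paired result clusters lie: chr12, chr5
Position ranges of the two ends of the paired result clusters: 14779618-14779968, 23314455-23314830
Spans of the two ends of the paired result clusters: 350, 375
Number of Reads supporting the pair of result clusters: 6
Compactness (variance) of the left and right ends: 22.43, 18.44
Located in an N region: no
Breakpoint ranges: chr12: greater than 14779968, chr5: less than 23314455
Change of RPK in the relevant chromosome regions: as shown in Figure 7, the abscissa is the position on the chromosome in units of 10 Kbp, the ordinate is RPK, the curve is drawn from the SON.pair data, and pa and pb mark the breakpoint positions. The figure shows that the RPK of "SON" changes markedly. The calculated RPK values show that the RPK of the short arm of chromosome 5 of SON is only 0.5 times the average, while that of the short arm of chromosome 12 is 0.5 times higher than the average.
The analysis results clearly show that "FA" carries a balanced translocation and "SON" an unbalanced translocation, and that the breakpoint range derived from the "FA" result is already within 300 bp. To study the breakpoint position further, we took the corresponding sequence from the reference sequence HG19, designed primers, and performed qPCR verification and Sanger sequencing, which gave the exact breakpoint positions: Chr12: 14780019, Chr5: 23314435.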
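Experimental Example 2
This example is a study of congenital heart disease. The target individual in this example is a patient with congenital heart disease, denoted "XX".
1. Low-depth whole-genome sequencing was performed on the target individual, with a sequencing depth of 2.7.
2. The SOAP alignment software was then used to align the sequencing results against the reference sequence HG19, giving XX.sin.
3. XX.sin was clustered, filtered, and analyzed. The result clusters and related parameters output were as follows:
"XX":
Numbers of the two chromosomes on which the paired result clusters lie: chr14, chr14
Position ranges of the two ends of the paired result clusters: 73557040-73557288, 73670432-73670682
Estimated length of the duplicated segment: 113392
Spans of the two ends of the paired result clusters: 248, 250
Number of Reads supporting the pair of result clusters: 4
Compactness (variance) of the left and right ends: 100.63, 100.59
Located in an N region: no
Breakpoint ranges: chr14: 73556540-73557040, chr14: 73670682-73671182 (range size estimated as 1 times L-lib, that is, 500 bp)
The analysis results clearly show that a duplication of about 113 Kbp occurred on chromosome 14 of "XX", and that the duplication is in tandem. To study the breakpoint position further, we took the corresponding sequence from the reference sequence HG19, designed primers, and performed qPCR verification and Sanger sequencing. The qPCR amplification ratio was >1, indicating a duplication, and Sanger sequencing gave the exact breakpoint positions: Chr14: 73557008, Chr14: 73670820. This confirmed that a 113812 bp duplication occurred on chromosome 14 of "XX", with the duplicated segment inserted in tandem at the end of the original segment.
The above are only preferred embodiments of the present invention. It should be understood that these embodiments serve only to explain the present invention and do not limit it. A person of ordinary skill in the art can vary the above specific embodiments in accordance with the ideas of the present invention.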

Claims (14)

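  1. A method for detecting chromosomal structural abnormalities, characterized by including the following steps:
    obtaining whole-genome sequencing results of a target individual, the sequencing results including multiple read pairs, each read pair consisting of two read sequences located at the two ends of the sequenced chromosome fragment, where the two read sequences of each pair come from the plus strand and the minus strand of the corresponding chromosome fragment respectively, or both come from the plus strand or both from the minus strand of the corresponding chromosome fragment;
    aligning the sequencing results against a reference sequence to obtain an abnormal match set, the abnormal match set including first-type read pairs as described below, where the two read sequences of a first-type read pair match different chromosomes of the reference sequence;
    clustering the read sequences in the abnormal match set into clusters according to their matched positions, where each cluster contains the read sequences from one end of a group of read pairs and the read sequences from the other end lie in another cluster;
    filtering the clusters obtained by clustering, which includes calculating the compactness of each cluster and filtering out any cluster whose compactness does not meet the preset requirement R-va together with its paired cluster; and
    obtaining filtered result clusters containing first-type read pairs, for use in determining whether a chromosomal translocation-type structural abnormality has occurred.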
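  2. The method of claim 1, characterized in that
    filtering the clusters obtained by clustering further includes
    calculating the linear correlation of the two clusters in a pair and filtering out pairs of clusters whose linear correlation does not meet the preset requirement R-li, and/or
    comparing the paired clusters with a preset control set containing multiple normal samples and filtering out pairs of clusters for which the number of normal samples hit reaches the preset threshold V-con.
  3. The method of claim 1, characterized by further including
    searching the result clusters containing first-type read pairs; if two adjacent read sequences occupy opposite positions within their respective read pairs, taking the range between the positions matched by these two read sequences as the breakpoint range; and if no such read sequences exist, obtaining the position of the innermost read sequence and extending a preset length outward from that position as the breakpoint range.
  4. The method of claim 1, characterized in that
    the abnormal match set further includes second-type read pairs as described below, where the two read sequences of a second-type read pair match the same chromosome of the reference sequence, but the chromosome fragment length L-pr calculated from the matched positions is negative; and
    filtered result clusters containing second-type read pairs are also obtained, for use in determining whether a chromosomal tandem duplication-type structural abnormality has occurred.
  5. The method of claim 4, characterized by further including
    searching the result clusters containing second-type read pairs, taking the range between the two matched positions that are farthest apart in the paired clusters as the range in which the duplication occurs, and extending a preset length outward from each of these two positions as the breakpoint ranges.
  6. The method of claim 1, characterized in that
    the abnormal match set further includes third-type read pairs as described below, where the two read sequences of a third-type read pair match the same chromosome of the reference sequence, but the chromosome fragment length L-pr calculated from the matched positions is greater than the library size L-lib and the deviation exceeds the preset threshold V-lib, V-lib preferably being 5% × L-lib to 15% × L-lib, and more preferably 10% × L-lib; and
    filtered result clusters containing third-type read pairs are also obtained, for use in determining whether a chromosomal deletion-type structural abnormality has occurred.
  7. The method of claim 6, characterized by further including
    searching the result clusters containing third-type read pairs, taking the range between the two matched positions that are closest together in the paired clusters as the range in which the deletion occurs, and extending a preset length inward from each of these two positions as the breakpoint ranges.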
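  8. The method of any one of claims 1-7, characterized in that
    aligning the sequencing results against the reference sequence further includes
    obtaining a normal match set, the normal match set including read pairs that meet the following description: the two read sequences of the read pair match the same chromosome of the reference sequence, the plus/minus strand relationship of the matched positions is consistent with the strand relationship within the read pair, and the deviation between the chromosome fragment length L-pr calculated from the matched positions and the size L-lib of the library used for sequencing is smaller than the preset threshold V-lib, V-lib preferably being 5% × L-lib to 15% × L-lib, and more preferably 10% × L-lib; and
    counting the number RPU of Reads of the normal match set contained per unit length, and obtaining the change of RPU relative to the average for use in helping to judge whether a structural abnormality has occurred; preferably, the change of RPU relative to the average is expressed as whether the change of RPU exceeds the preset threshold V-rm; preferably, V-rm is 10% to 30%, and more preferably 20%.
  9. The method of any one of claims 1-7, characterized in that
    aligning the sequencing results against the reference sequence further includes
    obtaining an unmatched set, the unmatched set including read sequences that cannot be matched to the reference sequence, including read sequences unmatched as a pair or read sequences unmatched at a single end; and
    after the result clusters are obtained, the method further includes
    obtaining the single-end read sequences within a set range around the determined breakpoint range, extracting their mates from the unmatched set as patch sequences, cutting every patch sequence into N segments, N preferably being 2, aligning the subsequences obtained by cutting the patch sequences against the reference sequence again, and assembling the breakpoint region from the results that match normally.
  10. The method of any one of claims 1-7, characterized in that
    when calculating the compactness of each cluster, the 5% to 25% of read sequences located at each of the two ends of the cluster are left out of the calculation, and/or
    compactness is expressed by variance, and R-va is set so that the variance ranks within the lowest 2% to 10% of all clusters, preferably 5%.
  11. The method of claim 2, characterized in that
    when calculating the linear correlation of the two clusters in a pair, the linear correlation is expressed by the correlation coefficient, and R-li is set so that the correlation coefficient ranks within the highest 2% to 10% of all clusters, preferably 5%, and/or
    the ratio of V-con to the number of normal samples in the control set is 3% to 10%, preferably 5% to 6%.
  12. The method of claim 1, characterized in that
    the size L-lib of the library used for sequencing is ≥ 300 bp, preferably 500 bp or 5 Kbp, and/or
    the length of the read sequences is at least 25 bp, preferably 50 bp plus or minus 10%.
  13. A device for detecting chromosomal structural abnormalities, characterized by including:
    a data input unit for inputting data;
    a data output unit for outputting data;
    a storage unit for storing data, including an executable program; and
    a processor, data-connected to the data input unit, the data output unit, and the storage unit, for executing the executable program, where execution of the program includes completing the method of any one of claims 1-12.
  14. A computer-readable storage medium, characterized in that it stores a program to be executed by a computer, where execution of the program includes completing the method of any one of claims 1-12.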
PCT/CN2013/075622 2013-05-15 2013-05-15 一种检测染色体结构异常的方法及装置 WO2014183270A1 (zh)

Priority Applications (8)

Application Number Priority Date Filing Date Title
PL13884613.4T PL2998407T5 (pl) 2013-05-15 2013-05-15 Sposób wykrywania nieprawidłowości strukturalnych chromosomów i urządzenie do tego sposobu
US14/890,989 US11004538B2 (en) 2013-05-15 2013-05-15 Method and device for detecting chromosomal structural abnormalities
EP13884613.4A EP2998407B2 (en) 2013-05-15 2013-05-15 Method for detecting chromosomal structural abnormalities and device therefor
HUE13884613A HUE047501T2 (hu) 2013-05-15 2013-05-15 Eljárás kromoszómális szerkezeti abnormalitások kimutatására, és ennek eszköze
ES13884613T ES2766860T5 (es) 2013-05-15 2013-05-15 Método para detectar anomalías estructurales cromosómicas y dispositivo para ello
CN201380004734.0A CN104302781B (zh) 2013-05-15 2013-05-15 一种检测染色体结构异常的方法及装置
RU2015153453A RU2654575C2 (ru) 2013-05-15 2013-05-15 Способ и устройство для детектирования хромосомных структурных аномалий
PCT/CN2013/075622 WO2014183270A1 (zh) 2013-05-15 2013-05-15 一种检测染色体结构异常的方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/075622 WO2014183270A1 (zh) 2013-05-15 2013-05-15 一种检测染色体结构异常的方法及装置

Publications (1)

Publication Number Publication Date
WO2014183270A1 true WO2014183270A1 (zh) 2014-11-20

Family

ID=51897591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/075622 WO2014183270A1 (zh) 2013-05-15 2013-05-15 一种检测染色体结构异常的方法及装置

Country Status (8)

Country Link
US (1) US11004538B2 (zh)
EP (1) EP2998407B2 (zh)
CN (1) CN104302781B (zh)
ES (1) ES2766860T5 (zh)
HU (1) HUE047501T2 (zh)
PL (1) PL2998407T5 (zh)
RU (1) RU2654575C2 (zh)
WO (1) WO2014183270A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107075564A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 确定肿瘤核酸浓度的方法和装置
CN107077533A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 测序数据处理装置和方法
CN107077538A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 测序数据处理装置和方法
CN111583996A (zh) * 2020-04-20 2020-08-25 西安交通大学 一种模型非依赖的基因组结构变异检测系统及方法
US11004538B2 (en) 2013-05-15 2021-05-11 Bgi Genomics Co., Ltd. Method and device for detecting chromosomal structural abnormalities

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688727B (zh) * 2016-08-05 2020-07-14 深圳华大基因股份有限公司 生物序列聚类和全长转录组中转录本亚型识别方法和装置
CN107058465B (zh) * 2016-10-14 2021-10-01 南方科技大学 一种利用单倍体测序技术检测染色体平衡易位的方法
CN106845155B (zh) * 2016-12-29 2021-11-16 安诺优达基因科技(北京)有限公司 一种用于检测内部串联重复的装置
CN106709276A (zh) * 2017-01-21 2017-05-24 深圳昆腾生物信息有限公司 一种基因变异成因分析方法及系统
CN109280702A (zh) * 2017-07-21 2019-01-29 深圳华大基因研究院 确定个体染色体结构异常的方法和系统
CN108830044B (zh) * 2018-06-05 2020-06-26 序康医疗科技(苏州)有限公司 用于检测癌症样本基因融合的检测方法和装置
CN109887547B (zh) * 2019-03-06 2020-10-02 苏州浪潮智能科技有限公司 一种基因序列比对滤波加速处理方法、系统及装置
CN112687341B (zh) * 2021-03-12 2021-06-04 上海思路迪医学检验所有限公司 一种以断点为中心的染色体结构变异鉴定方法
CN114743594B (zh) * 2022-03-28 2023-04-18 深圳吉因加医学检验实验室 一种用于结构变异检测的方法、装置和存储介质
CN115910199B (zh) * 2022-11-01 2023-07-14 哈尔滨工业大学 一种基于比对框架的三代测序数据结构变异检测方法
CN115831223B (zh) * 2023-02-20 2023-06-13 吉林工商学院 一种挖掘近源物种间染色体结构变异的分析方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561845A (zh) * 2008-12-12 2009-10-21 深圳华大基因研究院 一种染色体同线性同源区域的检测方法和系统
CN101914628A (zh) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 检测基因组目标区域多态性位点的方法及 系统
CN102409099A (zh) * 2011-11-29 2012-04-11 浙江大学 一种利用测序技术分析猪乳腺组织基因表达差异的方法
WO2012097474A1 (zh) * 2011-01-20 2012-07-26 深圳华大基因科技有限公司 检测转基因外源片段插入位点的方法和系统
CN102789553A (zh) * 2012-07-23 2012-11-21 中国水产科学研究院 利用长转录组测序结果装配基因组的方法及装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7943304B2 (en) 2005-01-12 2011-05-17 Ramesh Vallabhaneni Method and apparatus for chromosome profiling
US20140228223A1 (en) * 2010-05-10 2014-08-14 Andreas Gnirke High throughput paired-end sequencing of large-insert clone libraries
CA2822439A1 (en) * 2010-12-23 2012-06-28 Sequenom, Inc. Fetal genetic variation detection
US11004538B2 (en) 2013-05-15 2021-05-11 Bgi Genomics Co., Ltd. Method and device for detecting chromosomal structural abnormalities

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2998407A4 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11004538B2 (en) 2013-05-15 2021-05-11 Bgi Genomics Co., Ltd. Method and device for detecting chromosomal structural abnormalities
CN107075564A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 确定肿瘤核酸浓度的方法和装置
CN107077533A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 测序数据处理装置和方法
CN107077538A (zh) * 2014-12-10 2017-08-18 深圳华大基因研究院 测序数据处理装置和方法
CN107077538B (zh) * 2014-12-10 2020-08-07 深圳华大生命科学研究院 测序数据处理装置和方法
CN107077533B (zh) * 2014-12-10 2021-07-27 深圳华大生命科学研究院 测序数据处理装置和方法
CN111583996A (zh) * 2020-04-20 2020-08-25 西安交通大学 一种模型非依赖的基因组结构变异检测系统及方法
CN111583996B (zh) * 2020-04-20 2023-03-28 西安交通大学 一种模型非依赖的基因组结构变异检测系统及方法

Also Published As

Publication number Publication date
US20160085911A1 (en) 2016-03-24
PL2998407T5 (pl) 2023-01-30
RU2015153453A (ru) 2017-06-20
EP2998407B2 (en) 2022-11-30
CN104302781B (zh) 2016-09-14
EP2998407A1 (en) 2016-03-23
CN104302781A (zh) 2015-01-21
EP2998407B1 (en) 2019-12-04
PL2998407T3 (pl) 2020-05-18
ES2766860T5 (es) 2023-02-23
RU2654575C2 (ru) 2018-05-21
US11004538B2 (en) 2021-05-11
ES2766860T3 (es) 2020-06-15
HUE047501T2 (hu) 2020-04-28
EP2998407A4 (en) 2017-01-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13884613

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14890989

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2013884613

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2015153453

Country of ref document: RU

Kind code of ref document: A