WO2018232580A1 - Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing - Google Patents

Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing

Info

Publication number
WO2018232580A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
snp
haplotype
optimal
region
Prior art date
Application number
PCT/CN2017/089108
Other languages
English (en)
French (fr)
Inventor
周泽
孙宇辉
张涛
章元伟
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to CN201780090335.9A (patent CN110621785B)
Priority to PCT/CN2017/089108 (WO2018232580A1)
Publication of WO2018232580A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6881: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for tissue or cell typing, e.g. human leukocyte antigen [HLA] probes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The invention relates to the technical field of bioinformatics, and in particular to a method and a device for haplotype phasing of a diploid genome based on third-generation capture sequencing.
  • SMRT sequencing is performed in real time and requires no PCR amplification during sequencing, which avoids the base bias introduced by PCR; SMRT sequencing also uses zero-mode waveguides (ZMWs) to produce extremely long sequencing fragments (reads).
  • ZMW: zero-mode waveguide
  • For example, sequencing fragments from the PacBio RS reach a median length of 2,246 bp and a maximum of 23,000 bp, a substantial improvement over the 100 bp fragments produced by Illumina sequencers, the most widely used "second-generation" platform.
  • PacBio sequencers can already be used for whole genome sequencing, targeted sequencing, complex populations, RNA sequencing, and epigenetics. Technical details can be found in Eid, John, et al. "Real-time DNA sequencing from single polymerase molecules." Science 323.5910 (2009): 133-138.
  • The reliability of haplotype phasing (genotype phasing) is poor, especially for the HLA-A gene.
  • The results of phasing with SAMtools show obvious deviations and errors.
  • SAMtools was used to phase the CCS (circular consensus) sequencing fragments near the HLA-A gene.
  • The distribution of the two resulting haplotypes shows that the phased sequencing fragments are very unevenly distributed along the chromosome: depth is extremely low in some regions and extremely high in others. In the SNP bar chart, each bar contains a mixture of colors, indicating that the phasing results are confused.
  • Existing haplotype phasing methods have poor accuracy and low resolution.
  • The main methods include using genotyping information generated by microarray genotyping chips (SNP genotypes) for small-scale SNP typing; and high-throughput sequencing to sequence multiple individuals to obtain a correlation
  • SNP genotypes: genotyping information generated by microarray genotyping chips
  • SAMtools includes the use of a hidden Markov model to phase a single individual.
  • The invention provides a method and a device for haplotype phasing of a diploid genome based on third-generation capture sequencing. It can cluster, with high accuracy, the sequencing fragments in regions with normal sequencing results and uniform coverage, distinguishing the sequencing fragments that correspond to the two haplotypes and thereby achieving haplotype phasing.
  • An embodiment provides a method for haplotype phasing of a diploid genome based on third-generation capture sequencing, comprising:
  • aligning the CCS sequences corresponding to the target gene region to the reference genome to obtain optimally aligned sequencing fragments, where the CCS sequences are obtained by circular correction of third-generation target-region capture sequencing fragments, and then selecting heterozygous SNP markers according to the optimally aligned sequencing fragments;
  • extending the seeds to obtain a set of sequencing fragments;
  • scoring each sequencing fragment against the SNP set of the optimal haplotype described above, and assigning each sequencing fragment to a haplotype according to its score.
  • An embodiment provides an apparatus for haplotype phasing of a diploid genome based on third-generation capture sequencing, which carries out the same steps.
  • An embodiment provides a computer-readable storage medium comprising a program executable by a processor to implement the same steps.
  • The invention uses third-generation target-region capture sequencing data. During sequencing, the third-generation sequencer yields sequencing fragments whose chromosomal positions are relatively random, whose lengths are relatively random, and whose lengths fluctuate around the length of the target region.
  • This combines the ease of assembling long fragments with the high accuracy of short fragments.
  • The haplotype phasing method of the invention is best suited to third-generation sequencing data and makes full use of the advantages of third-generation sequencing. Compared with second-generation sequencing technology, it yields highly reliable full-length gene haplotype phasing information and, in turn, high-precision variant detection.
  • FIG. 1 is a flow chart of a method for haplotype phasing of a diploid genome based on third-generation capture sequencing according to an embodiment of the present invention;
  • FIG. 2 is a length distribution diagram of the subreads obtained after preliminary processing of the sample sequencing data in an embodiment of the present invention, where the abscissa is the subread length and the ordinate is the number of subreads at that length;
  • FIG. 3 is a length distribution diagram of the CCS sequences obtained by applying CCS circular correction to the sample sequencing data in an embodiment of the present invention, where the abscissa is the CCS sequence length and the ordinate is the number of CCS sequences in each length range; the number of CCS sequences is nearly 90% lower than the number of subreads in the corresponding length range;
  • FIG. 5 is a diagram showing the consistency of CCS sequences in one embodiment of the present invention.
  • Each dot represents a CCS sequence in the HLA-A region of the sample; the abscissa is the number of SNPs consistent with the heterozygous SNP markers, and the ordinate is the number of SNPs inconsistent with the heterozygous SNP markers;
  • FIGS. 6 to 12 are Integrative Genomics Viewer (IGV) views of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes, respectively, according to one embodiment of the present invention.
  • Targeted sequencing refers to, for example, processing DNA samples with the Roche NimbleGen SeqCap EZ System and sequencing them on PacBio's RSII sequencer.
  • Second-generation sequencing refers to, for example, sequencing on one of the most widely used sequencers, such as the Illumina HiSeq 4000; see the review (Michael Metzker (2010), Sequencing technologies - the next generation, Nature Genetics).
  • PacBio refers to the PacBio RSII and PacBio Sequel System sequencers released by Pacific Biosciences.
  • Third-generation sequencing refers to, for example, the most mature single-molecule real-time sequencing, based on the SMRT sequencing method of Pacific Biosciences.
  • Polymerase read refers to, for example, a sequencing fragment whose sequence information is converted directly from the optical signal during sequencing on a PacBio sequencer.
  • The term "adapter" means, for example, the single-stranded DNA hairpin structure that must be attached to each end of a DNA fragment before it is sequenced on a PacBio sequencer.
  • The hairpin structure is single-stranded and has a specific sequence.
  • Subreads refers to the one or more sequencing fragments that remain after the adapter sequences have been removed from the polymerase reads described above.
  • CCS: Circular Consensus Sequence
  • Haplotype phasing refers to, for example, the process, for a diploid organism (e.g., a human) whose sequencing fragments derive from two homologous chromosomes, of clustering all sequencing fragments to determine which of the two haplotypes each belongs to.
  • Single nucleotide mutation refers to a single nucleotide polymorphism caused by variation of a single nucleotide in an organism.
  • Heterozygous SNP means that, in a diploid organism such as a human, a single nucleotide mutation occurs at the same position on a pair of chromosomes and the bases on the two chromosomes differ.
  • The term "contig" refers to a longer sequence obtained by joining two or more sequencing fragments that share an overlapping sequence.
  • Seed refers to the starting sequencing fragment from which the analysis of sequencing fragments begins in the haplotype phasing method.
  • Window refers to the length of the coordinate range used when computing statistics within a particular coordinate range of a chromosome in the haplotype phasing method.
  • The invention provides a complete method that addresses the low accuracy of existing haplotype phasing software.
  • The haplotype phasing method can cluster, with high accuracy, the sequencing fragments in regions with normal sequencing results and uniform coverage, distinguishing the fragments that correspond to the two haplotypes and thereby achieving haplotype phasing.
  • The present invention provides a complete method for obtaining accurate, detailed, and complete haplotype variation information using target-region capture sequencing based on "third-generation" sequencing technology, including downstream information analysis of single nucleotide polymorphisms (SNPs), insertion-deletion variations (indels), chromosomal structural variations (SVs), and copy number variations (CNVs). It addresses the current lack of information analysis and data processing procedures for third-generation target-region capture sequencing data.
  • SNP: single nucleotide polymorphism
  • Indel: insertion-deletion variation
  • SV: chromosomal structural variation
  • CNV: copy number variation
  • The invention includes a complete information analysis method that takes PacBio RSII sequencing data from the bax.h5 raw data file, through the FASTQ sequence file of the CCS sequences, the BAM alignment file obtained by alignment, and the FASTA file of the assembled genomic sequence, to the final variant
  • The data required by the analysis method of the present invention come from mature, widely used experimental methods for target-region capture sequencing, such as HLA region capture sequencing.
  • The data preprocessing performed before the haplotype phasing method of the present invention includes:
  • The haplotype phasing method of the present invention is then carried out. As shown in FIG. 1, a method for haplotype phasing of a diploid genome based on third-generation capture sequencing provided in one embodiment comprises:
  • Step S101: Align the CCS sequences corresponding to the target gene region to the reference genome to obtain the chromosomal positions of the optimal alignments, where the CCS sequences are obtained by circular correction of third-generation target-region capture sequencing fragments; then select heterozygous SNP markers at the chromosomal positions covered by the optimally aligned CCS sequences.
  • The best hit read is the sequencing fragment with the highest alignment score.
  • The start and end coordinates of these sequencing fragments, together with the type and coordinates of every SNP they contain, are stored for later retrieval, for example in a dedicated variable structure.
  • The preset cutoff range is 25% to 75%. Frequencies near 0% and 100% are caused by sequencing errors in third-generation sequencing, so the ranges 0% to 25% and 75% to 100% contain more SNPs arising from sequencing errors, and SNPs in these two ranges are not considered when selecting heterozygous SNP markers.
  • The sequencing depth typically needs to be greater than half of the highest sequencing depth; such a region is called a "high sequencing depth region". For example, in such regions the CCS sequence fragments are evenly distributed at a sequencing depth of 75× or more.
  • The window size can be an empirical default, such as 500 bp. Among the high sequencing depth windows, find the most heterozygous windows, i.e., the windows with the largest number of heterozygous SNP markers, and use the positions of these windows as the basis for seed selection.
  • Step S103: Cluster the CCS sequence fragments covering the window, and generate two optimal SNP sets as seeds according to the clustering result.
  • Step S104: Extend the seeds according to the overlap between the position of each seed on the genome and the CCS sequence fragments belonging to the same haplotype, to obtain a set of CCS sequence fragments.
  • Extending the seeds to obtain a set of CCS sequence fragments specifically includes examining all CCS sequence fragments against each seed.
  • At the start of extension, the triple-window region corresponding to each seed is taken as the known region, also called the already extended region. For each CCS sequence fragment, the SNPs in the portion that overlaps the known region are evaluated by comparing their position, type, and sequencing quality value. The CCS sequence fragments belonging to the same haplotype are ranked by their degree of spatial overlap with the known region on the genome and are then added to the known region in that order, until the ends of all CCS sequence fragments are reached, thereby constructing a complete haplotype and recording the set of CCS sequence fragments.
  • Step S105: Find the heterozygous SNP marker set corresponding to the set of CCS sequence fragments, and obtain the SNP set of the optimal haplotype according to the quality value of each SNP.
  • Obtaining the SNP set of the optimal haplotype according to the quality value of each SNP may specifically include: calculating the summed sequencing quality value of each SNP in the heterozygous SNP marker set, and selecting the SNPs with the highest summed quality value to form the SNP set of the optimal haplotype.
  • Step S106: Score each CCS sequence fragment using the SNP set of the optimal haplotype as the standard, and assign each CCS sequence fragment to a haplotype according to its score.
  • The scoring and haplotype assignment may specifically include: computing a consistency ratio, weighted by sequencing quality value, from the degree to which the positions and types of the SNPs on each CCS sequence fragment overlap the SNP set of the optimal haplotype; assigning each CCS sequence fragment to a haplotype according to its score; and selecting the CCS sequence fragments with higher consistency (for example, the top 50%), so as to distinguish the two haplotypes.
  • The method further comprises assembling the CCS sequence fragments of each of the two haplotypes of the target gene into contigs, thereby obtaining the full-length haplotype sequences of the target gene.
  • The program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, etc.
  • The computer executes the program to implement the above functions.
  • The program is stored in the memory of the device, and all or part of the above functions are realized when the processor executes the program in the memory.
  • The program may also be stored on a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash drive, or a portable hard disk, and be downloaded or copied into the memory of the local device, or used to update the system of the local device.
  • Another embodiment of the present invention provides an apparatus for haplotype phasing of a diploid genome based on third-generation capture sequencing, comprising: a memory for storing a program; and a processor for executing the program stored in the memory to implement the following method: aligning the CCS sequence fragments corresponding to the target gene region to the reference genome to obtain the chromosomal positions of the optimal alignments, where the CCS sequence fragments are third-generation target-region capture sequencing fragments obtained through circular correction; selecting heterozygous SNP markers according to the optimally aligned CCS sequence fragments; selecting, for the CCS sequence fragments, regions whose sequencing depth is higher than a preset value, and finding within those regions the window with the largest number of heterozygous SNP markers; clustering the CCS sequence fragments covering the window, and generating two optimal SNP sets as seeds according to the clustering result; extending the seeds according to the overlap between their positions on the genome and the CCS sequence fragments belonging to the same haplotype, to obtain a set of CCS sequence fragments; finding the heterozygous SNP marker set corresponding to the set of CCS sequence fragments and obtaining the SNP set of the optimal haplotype according to the quality value of each SNP; and scoring each CCS sequence fragment using that SNP set as the standard, and assigning each CCS sequence fragment to a haplotype according to its score.
  • Yet another embodiment of the present invention provides a computer-readable storage medium comprising a program executable by a processor to implement the following method: aligning the CCS sequence fragments corresponding to the target gene region to the reference genome to obtain the chromosomal positions of the optimal alignments, where the CCS sequence fragments are third-generation target-region capture sequencing fragments obtained through circular correction; selecting heterozygous SNP markers according to the optimally aligned CCS sequence fragments; selecting regions whose sequencing depth is higher than a preset value, and finding within those regions the window with the largest number of heterozygous SNP markers; clustering the CCS sequence fragments covering the window, and generating two optimal SNP sets as seeds according to the clustering result; extending the seeds according to the overlap between their positions on the genome and the CCS sequence fragments belonging to the same haplotype, to obtain a set of CCS sequence fragments; and finding the heterozygous SNP marker set corresponding to the set of CCS sequence fragments
  • The embodiment of the present invention uses third-generation target-region capture sequencing data. During sequencing, the third-generation sequencer yields sequencing fragments whose chromosomal positions are distributed relatively randomly, whose lengths are relatively random, and whose lengths fluctuate around the length of the target region. This both exploits the ease of assembling long fragments and retains the high accuracy of short fragments.
  • The HLA target region on human chromosome 6 was captured and sequenced, and the full-length regions of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes were analyzed.
  • Full-length HLA region capture was performed on samples of the BGI-YH cell line using mature, published experimental techniques; a 10 kb library was constructed and sequenced on a PacBio RSII sequencer. For the same BGI-YH sample, 5 parallel, independent rounds of capture, library construction, and sequencing were performed.
  • The standardized PacBio RSII sequencing procedure yields polymerase read information, which is stored in binary form in bax.h5 files.
  • Software from the SMRT Analysis package provided by PacBio (https://github.com/PacificBiosciences) was used to remove the adapter sequences added during sequencing, yielding the shorter subreads.
  • The length distribution of the subreads is shown in FIG. 2.
  • The smoothed curve shows a main peak at a fragment length of 2.5 kb and a pronounced tail around 5 kb.
  • CCS (circular consensus sequencing) correction was performed using the RS_ReadsOfInsert.xml protocol in the SMRT Analysis software provided by PacBio to obtain a FASTQ file.
  • The bax.h5 files total about 80 GB, and the ccs.fastq file obtained after CCS correction reaches 290 MB.
  • The length distribution of the CCS sequence fragments is shown in FIG. 3.
  • The smoothed curve shows a main peak at a CCS sequence fragment length of 2.5 kb and a pronounced secondary peak at 5 kb.
  • The CCS sequence file was aligned to the human reference genome (GRCh37.p13) using the MEM algorithm for longer sequences (BWA-MEM) in the open-source BWA alignment software (version 0.5.9-r16).
  • BWA-MEM: the MEM algorithm for aligning longer sequences
  • The CCS sequence fragments corresponding to the target region required for the study, such as the HLA-A gene and its adjacent region (NC_000006.11 (29910247..29913661)), are selected using the positional information of the CCS sequence fragments in the SAM file.
  • The information analysis method provided by the present invention is then used to perform haplotype phasing and distinguish the two haplotypes. The specific process is as follows:
  • Cluster analysis is performed on all CCS sequence fragments covering the window according to SNP position and type.
  • Specifically, the cluster analysis examines the SNPs in the triple-window region formed by a window together with one adjacent window on each side. SNPs at the same position but of different types are distinguished, and the SNP sets corresponding to the two haplotypes are separated.
  • A simulation is then performed on the two haplotype SNP sets obtained: the triple-window region formed by the window and one adjacent window on each side is taken as the seed length, and the most frequent SNP combination is taken as the SNP information carried by the seed. This yields two optimal SNP sets, which serve as the starting seeds for the two haplotypes.
  • All CCS sequence fragments were examined against each seed.
  • The triple-window region corresponding to each seed is the known region, also called the already extended region.
  • The SNPs in the portion of each CCS sequence fragment that overlaps the already extended region are evaluated by comparing their position, type, and sequencing quality value.
  • This process ranks the CCS sequence fragments belonging to the same haplotype by their spatial overlap with the extended region, in descending order of overlap.
  • The fragments are then added to the extended region in that order until the ends of all CCS sequence fragments are reached, constructing a complete haplotype and recording the set of CCS sequence fragments.
  • Each CCS sequence fragment is scored using the SNP set of the optimal haplotype obtained in the step above as the standard. A consistency ratio, weighted by sequencing quality value, is computed from the degree to which the positions and types of the SNPs on each CCS sequence fragment overlap the SNP set obtained in the previous step; the score is recorded, as shown in FIG. 5. Each CCS sequence fragment is assigned to a haplotype according to its score, and the top 50% of CCS sequence fragments by consistency are selected, so as to distinguish the two haplotypes.
  • The Canu assembly software (https://github.com/marbl/canu) was used to assemble the CCS sequence fragments of each haplotype, yielding two highly accurate, fully phased contigs.
  • Details of the number of sequencing fragments and the number of bases at each step are shown in Table 1.
  • Table 1 columns: step; number of sequencing fragments; number of bases.
  • Figures 6 to 12 show Integrative Genomics Viewer (IGV) views of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes in this example, respectively.
  • The haplotype phasing method of this example was used to obtain the sequencing fragments corresponding to the two haplotypes near each gene; these fragments and the contigs assembled by Canu are displayed against the human reference genome. Their distribution shows that sequencing depth over these gene regions is high and coverage is complete.
  • The coverage bar graph shows the base frequencies at each SNP position after haplotype phasing. Most bars are filled with a single color, reflecting the high accuracy of the phasing.
  • The MEM algorithm of the BWA alignment software (version 0.5.9-r16) was used to align the contig sequence file obtained in the previous step to the human reference genome (GRCh37.p13), producing a SAM format file.
  • SNP detection was performed on the SAM file obtained in the previous step using SNP detection software.
  • The SNP variants carried by the two haplotypes of the HLA-A gene and its adjacent region in the BGI-YH sample are shown in Table 2.
  • The results of the comparison with gold-standard Sanger sequencing are shown in Table 3.
  • The processing pipeline and the haplotype phasing method within it achieve the same accuracy as the gold standard, which is better than the variant analysis results of "second-generation" sequencing.
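
The marker selection and window search described in the snippets above (the 25% to 75% cutoff, the half-of-maximum depth rule, and the 500 bp window) can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary inputs, the inclusive cutoff bounds, the fixed non-overlapping windows, the use of mean window depth, and the example data are all hypothetical.

```python
from collections import Counter


def select_het_markers(alt_fraction, low=0.25, high=0.75):
    """Keep SNP positions whose alternate-allele fraction lies in [low, high].

    alt_fraction maps a reference position to the fraction of covering reads
    that carry the non-reference base (hypothetical input format).
    """
    return sorted(pos for pos, frac in alt_fraction.items() if low <= frac <= high)


def densest_window(markers, depth, window=500):
    """Return (window_start, marker_count) for the best high-depth window.

    A window is treated as high depth when its mean depth exceeds half of
    the maximum per-base depth. Returns None if no window qualifies.
    """
    threshold = max(depth.values()) / 2
    counts = Counter(pos // window * window for pos in markers)
    best = None
    for start, n in sorted(counts.items()):
        mean_depth = sum(depth.get(p, 0) for p in range(start, start + window)) / window
        if mean_depth <= threshold:
            continue  # skip windows outside the high sequencing depth region
        if best is None or n > best[1]:
            best = (start, n)
    return best


# Invented example: positions 0-999 are deep (100x), positions 1000-1499 shallow (20x).
alt_fraction = {120: 0.5, 300: 0.1, 450: 0.4, 700: 0.6,
                1100: 0.5, 1200: 0.45, 1300: 0.55, 1400: 0.9}
depth = {p: (100 if p < 1000 else 20) for p in range(1500)}

markers = select_het_markers(alt_fraction)
best_window = densest_window(markers, depth)
print(markers)      # positions kept as heterozygous markers
print(best_window)  # start and marker count of the chosen window
```

In this toy input the window starting at 1000 holds the most markers but is rejected for low depth, which is the situation the depth filter is meant to guard against.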

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Zoology (AREA)
  • Engineering & Computer Science (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Cell Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

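A method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing. The method comprises: aligning the CCS sequences corresponding to a target gene region to a reference genome to obtain best-hit reads, and then selecting heterozygous SNP markers; selecting regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of heterozygous SNP markers; clustering the reads covering the window and generating two optimal SNP sets as seeds; extending the seeds to obtain a set of reads; finding the set of heterozygous SNP markers corresponding to the set of reads, and obtaining the SNP set of the optimal haplotype from the quality value of each SNP; and scoring each read and assigning it to a haplotype according to its score. The invention can cluster, with high accuracy, the reads contained in regions with normal sequencing results and uniform coverage, so as to distinguish the reads belonging to the two haplotypes and thereby achieve haplotype phasing.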

Description

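Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing.
Background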
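Since Pacific Biosciences released the first commercial "third-generation" sequencer, the PacBio RS, in 2011, it has successively released the PacBio RS II and the PacBio Sequel System, and the field of "third-generation" sequencing has developed rapidly in recent years. "Third-generation" sequencing based on single-molecule real-time (SMRT) technology has entirely new technical characteristics. In contrast to the signal amplification strategy of second-generation "sequencing by synthesis", SMRT sequencing reads in real time and requires no PCR amplification during sequencing, which avoids the base bias introduced by PCR. At the same time, SMRT sequencing uses zero-mode waveguides (ZMWs) to produce very long reads: for example, reads from the PacBio RS have a median length of up to 2,246 bp and a maximum of up to 23,000 bp, a great improvement over the 100 bp reads produced by Illumina sequencers, the most widely used "second-generation" platform. PacBio sequencers can already be used for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetic sequencing. For technical details, see Eid, John, et al. "Real-time DNA sequencing from single polymerase molecules." Science 323.5910 (2009): 133-138.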
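While PacBio sequencing offers the prospect of more accurate, more comprehensive, and higher-resolution analysis of animal, plant, and microbial genomes, it still has many technical defects and immature aspects. For example, sequencing has a very high error rate and produces a non-negligible number of short insertions and deletions (indels), which poses serious challenges for downstream analysis. For instance, some HLA regions are sequenced at excessive depth while other target regions are poorly covered. On the one hand, the redundant raw data reach about 80G; on the other hand, preliminary assembly with existing software performs poorly: the assembled contigs are short (N50 of about 5 kbp) and haplotype phasing (genotype phasing) is unreliable. This is especially apparent in the HLA-A gene, where phasing with SAMtools shows obvious bias and errors. When SAMtools is used to phase the CCS-corrected reads near HLA-A, the reads assigned to the two haplotypes are distributed very unevenly along the chromosome, with extremely low depth in some regions and extremely high depth in others. Moreover, in the SNP bar chart every bar is a mixture of several colors, indicating that the phasing result is muddled.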
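Existing haplotype phasing methods have poor accuracy and low resolution. The main approaches include phasing a small number of SNPs using the SNP genotypes produced by microarray genotyping chips, and sequencing multiple individuals with high-throughput methods to obtain a profile of the SNPs in a related population, which are then phased with statistical models. SAMtools, the most commonly used bioinformatics toolkit, includes a tool that phases a single individual using a hidden Markov model (HMM). However, the HMM-based SAMtools phasing tool does not take full advantage of the information carried by long "third-generation" reads and cannot phase accurately; it leads to obvious problems such as downstream assembly errors and the assembly of diploid chimeras, which interfere considerably with downstream analysis. For a review of phasing methods, see Browning, S.R., and Browning, B.L. (2011). Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714.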
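In an existing patent application (publication number CN105112518A, Chinese invention patent application, published 2015.12.02), to avoid the challenges that the inaccuracy of PacBio sequencing poses for downstream analysis, a relatively simple in situ PCR (PAC-PCR) experimental approach is used to amplify, in many copies, the DNA fragments corresponding to selected regions. These fragments are then sequenced on a PacBio RSII sequencer, and primer conservation is exploited to minimize the downstream alignment errors caused by sequencing errors, thereby reducing subsequent analysis errors. That application uses experimental means to try to lower the difficulty of the subsequent bioinformatics analysis and to compensate for its inaccuracy, but in fact it cannot provide high-precision DNA detection covering the entire HLA region; it is a compromise.
From the standpoint of the experiments that generate the data, the method of that application (publication number CN105112518A, Chinese invention patent application, published 2015.12.02) cannot cover 100% of longer genes, and it targets too few genes, concentrating only on a few major genes in the HLA region of human chromosome 6. The targeted region is too narrow and the number of target genes too small to meet growing research needs. The experimental method in that application greatly increases the time, steps, and cost of preparation before sequencing. Primers with specific sequences must be designed and the PCR conditions optimized for them; the experimental procedure is complex and tedious, the number of target genes is limited, and only regions corresponding to existing probes can be targeted. Adding sequencing and phasing of one more gene entails a very long design cycle. From the standpoint of the subsequent data analysis, although the method exploits the long reads of PacBio sequencing and avoids the drawback of its high error rate, it crucially forfeits PacBio's important advantage of being free of PCR and cannot avoid sequencing bias, so that PacBio sequencing serves as little more than the "mate-pair" approach of "second-generation" sequencing.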
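Summary of the Invention
The present invention provides a method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing, which can cluster, with high accuracy, the reads contained in regions with normal sequencing results and uniform coverage, so as to distinguish the reads belonging to the two haplotypes and achieve haplotype phasing.
According to a first aspect, an embodiment provides a method for haplotype phasing of a diploid genome based on third-generation capture sequencing, comprising:
aligning the CCS sequences corresponding to a target gene region to a reference genome to obtain best-hit reads, wherein the CCS sequences are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit reads;
selecting, based on the best-hit reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers;
clustering the reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result;
extending the seeds, based on the positional overlap on the genome between the seeds and the reads belonging to the same haplotype, to obtain a set of reads;
finding the set of heterozygous SNP markers corresponding to the set of reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP;
scoring each read against the SNP set of the optimal haplotype, and assigning each read to a haplotype according to its score.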
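According to a second aspect, an embodiment provides a device for haplotype phasing of a diploid genome based on third-generation capture sequencing, comprising:
a memory for storing a program;
a processor for implementing the following method by executing the program stored in the memory:
aligning the CCS sequences corresponding to a target gene region to a reference genome to obtain best-hit reads, wherein the CCS sequences are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit reads;
selecting, based on the best-hit reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers;
clustering the reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result;
extending the seeds, based on the positional overlap on the genome between the seeds and the reads belonging to the same haplotype, to obtain a set of reads;
finding the set of heterozygous SNP markers corresponding to the set of reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP;
scoring each read against the SNP set of the optimal haplotype, and assigning each read to a haplotype according to its score.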
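According to a third aspect, an embodiment provides a computer-readable storage medium containing a program that can be executed by a processor to implement the following method:
aligning the CCS sequences corresponding to a target gene region to a reference genome to obtain best-hit reads, wherein the CCS sequences are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit reads;
selecting, based on the best-hit reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers;
clustering the reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result;
extending the seeds, based on the positional overlap on the genome between the seeds and the reads belonging to the same haplotype, to obtain a set of reads;
finding the set of heterozygous SNP markers corresponding to the set of reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP;
scoring each read against the SNP set of the optimal haplotype, and assigning each read to a haplotype according to its score.
The invention uses data from third-generation target-region capture sequencing. It exploits the fact that third-generation sequencers yield reads whose chromosomal positions are fairly randomly distributed and whose lengths vary fairly randomly around the length of the target region, so that it benefits both from the ease of assembling long fragments and from the high accuracy of short fragments. The haplotype phasing method of the invention is best suited to third-generation sequencing data and makes full use of the advantages of third-generation sequencing. Compared with second-generation sequencing, it yields highly reliable haplotype phasing information over the full length of a gene and thus enables high-precision variant detection.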
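Brief Description of the Drawings
Figure 1 is a flowchart of a method for haplotype phasing of a diploid genome based on third-generation capture sequencing according to an embodiment of the invention;
Figure 2 shows the length distribution of the subreads obtained after preliminary processing of the raw sequencing data of the sample in an embodiment of the invention; the horizontal axis is subread length and the vertical axis is the number of subreads of a given length;
Figure 3 shows the length distribution of the CCS sequences obtained after CCS correction of the sample sequencing data in an embodiment of the invention; the horizontal axis is CCS sequence length and the vertical axis is the number of CCS sequences in a given length range; the number of CCS sequences is nearly 90% lower than the number of subreads in the corresponding length range;
Figure 4 shows the distribution of the ratio of SNP frequency to sequencing depth in an embodiment of the invention; the horizontal axis is the ratio and the vertical axis is the number of ratios falling within a given range;
Figure 5 is a CCS sequence concordance plot in an embodiment of the invention; each point represents one CCS sequence in the HLA-A region of the sample, the horizontal axis is the number of SNPs consistent with the heterozygous SNP markers, and the vertical axis is the number of SNPs inconsistent with the heterozygous SNP markers;
Figures 6 to 12 are Integrative Genomics Viewer (IGV) views of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes, respectively, in an embodiment of the invention.
Detailed Description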
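The invention is described in further detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that the invention can be better understood. However, those skilled in the art will readily recognize that some of these features may be omitted in different circumstances or replaced by other elements, materials, or methods. In some cases, certain operations related to the invention are not shown or described in the specification, to avoid the core of the invention being obscured by excessive description. For those skilled in the art, a detailed description of these operations is unnecessary, since they can fully understand them from the specification and general technical knowledge in the field.
In the present invention, unless otherwise stated, the scientific and technical terms used herein have the meanings commonly understood by those skilled in the art, and the laboratory procedures used herein are routine procedures widely used in the relevant fields. Definitions and/or illustrative explanations of relevant terms are provided below for a better understanding of the invention, not to limit its scope.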
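As used herein, the term "third-generation capture sequencing" (Targeted Sequencing) refers to, for example, processing DNA samples with Roche's magnetic-bead capture product (Roche NimbleGen SeqCap EZ System) and then sequencing them on PacBio's RSII sequencer.
As used herein, the term "second-generation sequencing" refers to, for example, sequencing on the most widely used sequencers such as the Illumina HiSeq 4000; see the review by Michael Metzker (2010), Sequencing technologies - the next generation, Nature Reviews Genetics.
As used herein, the term "PacBio" refers to the PacBio RS II and PacBio Sequel System sequencers released by Pacific Biosciences.
As used herein, the term "third-generation sequencing" refers to, for example, single-molecule real-time sequencing based on the SMRT method of Pacific Biosciences, currently the most mature such technology.
As used herein, the term "polymerase read" refers to, for example, a read containing sequence information that is converted directly from optical signals during sequencing on a PacBio sequencer.
As used herein, the term "adapter" refers to, for example, the single-stranded DNA hairpin of specific sequence that must be added to each end of a DNA fragment to modify it before sequencing on a PacBio sequencer.
As used herein, the term "subreads" refers to the one or more read segments that remain after the adapter sequences are removed from the polymerase read described above.
As used herein, the term "CCS (Circular Consensus Sequences) correction" refers to the process of merging the subreads that come from the same polymerase read into a single consensus sequence of higher accuracy.
As used herein, the term "haplotype phasing" refers to, for example, for a diploid organism (such as a human), the process of clustering all sequencing reads according to which of the two homologous chromosomes they come from, so as to assign them to the two haplotypes.
As used herein, the term "single nucleotide polymorphism (SNP)" refers to a DNA sequence polymorphism caused by variation of a single nucleotide in an organism.
As used herein, the term "heterozygous SNPs" refers to the case, in a diploid organism such as a human, where a single-nucleotide variation occurs at the same position on a pair of chromosomes and the two bases at that position differ.
As used herein, the term "contig" refers to a longer sequence obtained by joining two or more reads that share some sequence overlap.
As used herein, the term "seed" refers to the starting read used for read analysis in the haplotype phasing method.
As used herein, the term "window" refers to the length of the coordinate range used, in the haplotype phasing method, when computing statistics over a specific coordinate range of a chromosome.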
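To address the low accuracy of existing haplotype phasing software, the invention provides a complete haplotype phasing method that can cluster, with high accuracy, the reads contained in regions with normal sequencing results and uniform coverage, so as to distinguish the reads belonging to the two haplotypes and achieve haplotype phasing.
The invention provides a complete method for obtaining accurate, detailed, and complete haplotype-resolved variant information from target-region capture sequencing based on "third-generation sequencing technology", including downstream analysis of single nucleotide polymorphisms (SNPs), insertion/deletion variants (indels), chromosomal structural variants (SVs), and copy number variants (CNVs). It addresses the current lack of an analysis and data-processing workflow for third-generation target-region capture sequencing data. The invention includes a complete analysis workflow that takes PacBio RSII output from the raw bax.h5 data files, through the FASTQ files of the CCS sequences, the BAM alignment files, and the FASTA files of the assembled sequences, to the final VCF variant files.
The data required by the analysis method of the invention come from mature and widely used experimental methods for target-region capture sequencing, such as HLA region capture sequencing.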
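Data preprocessing before the haplotype phasing method of the invention includes:
1) The standard PacBio RSII sequencing workflow, carried out according to the PacBio RSII standard sequencing instructions.
2) Preliminary processing with SMRT Analysis, specifically:
a) The standard PacBio RSII sequencing workflow yields the polymerase read information, stored in binary form in bax.h5 files.
b) Software from the SMRT Analysis package provided by PacBio (https://github.com/PacificBiosciences) is used to remove the adapter sequences added during library construction, yielding shorter subreads.
c) Software from the analysis package provided by PacBio is then used to perform CCS (Circular Consensus Sequences) correction on these subreads. Because sequencing errors occur randomly and are randomly distributed, the subreads from the same zero-mode waveguide (ZMW) are merged using information such as sequencing quality values and frequencies, which reduces single-nucleotide errors (SNVs) and insertion/deletion errors (indels) in the subreads and yields CCS sequences of higher accuracy.
3) Alignment to the reference genome, specifically:
The Burrows-Wheeler-based aligner BWA (http://bio-bwa.sourceforge.net/) is used to align the high-accuracy CCS sequences to, for example, the human reference genome (GRCh37.p13), to determine where in the human genome these CCS sequences originate. Target gene regions are then selected, for example genes of the HLA region such as HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1, and all CCS sequences corresponding to the full-length region of each gene are extracted.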
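The haplotype phasing method of the invention is then carried out. As shown in Figure 1, a method for haplotype phasing of a diploid genome based on third-generation capture sequencing provided in an embodiment comprises:
Step S101: The CCS sequences corresponding to the target gene region are aligned to the reference genome to obtain their best-hit positions on the corresponding chromosome, wherein the CCS sequences are obtained by circular consensus correction of third-generation target-region capture sequencing reads; heterozygous SNP markers are then selected based on the chromosomal positions of the best-hit CCS sequences.
A best-hit read is the aligned read with the largest alignment score. The start and end coordinates of these reads, together with the types and coordinates of all SNPs they contain, are stored for later use, for example in a dedicated data structure.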
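The best-hit selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alignment` record and the `best_hits` function are hypothetical names that do not come from the patent, and the score is assumed to be whatever the aligner reports (for example the `AS` tag of a SAM record).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_id: str   # name of the CCS read
    chrom: str     # reference sequence the read was aligned to
    start: int     # 0-based start coordinate on the reference
    end: int       # end coordinate on the reference (exclusive)
    score: int     # alignment score, e.g. the AS tag of a SAM record


def best_hits(alignments):
    """Return {read_id: Alignment}, keeping the highest-scoring hit per read."""
    best = {}
    for aln in alignments:
        current = best.get(aln.read_id)
        if current is None or aln.score > current.score:
            best[aln.read_id] = aln
    return best
```

Keying the dictionary by read name mirrors the hash-based lookup that the example later in the description mentions; ties are resolved in favor of the first alignment seen, which is an arbitrary choice of this sketch.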
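In a preferred embodiment of the invention, the step of selecting heterozygous SNP markers based on the best-hit reads specifically comprises:
For each SNP on the best-hit reads, the ratio of the SNP's frequency (AF) to the sequencing depth at that position is calculated, and SNPs whose ratio falls within a preset cutoff range are selected as heterozygous SNP markers, which serve as the basis for haplotype phasing. In a preferred embodiment of the invention, the preset cutoff range is 25% to 75%. Ratios close to 0% and 100% are attributed to sequencing errors in third-generation sequencing, so the ranges 0% to 25% and 75% to 100% contain many erroneous SNPs and are not considered when selecting heterozygous SNP markers.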
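A minimal sketch of this ratio filter follows. The input layout (a dictionary of per-allele read counts and a dictionary of per-position depths) and the name `select_het_markers` are assumptions made for illustration; only the 25% to 75% cutoff comes from the text.

```python
def select_het_markers(snp_counts, depth, low=0.25, high=0.75):
    """Select SNPs whose frequency/depth ratio lies within [low, high].

    snp_counts: {(position, base): number of reads carrying that base}
    depth:      {position: sequencing depth at that position}
    Returns a list of (position, base, ratio) tuples sorted by position.
    """
    markers = []
    for (pos, base), count in sorted(snp_counts.items()):
        d = depth.get(pos, 0)
        if d == 0:
            continue  # no coverage, the ratio is undefined
        ratio = count / d
        if low <= ratio <= high:
            markers.append((pos, base, ratio))
    return markers
```

For instance, a SNP seen in 40 of 100 reads (ratio 0.40) is kept, while one seen in 3 of 100 reads (0.03) or 98 of 100 reads (0.98) is discarded. Whether the boundaries themselves are included is not specified in the text; this sketch includes them.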
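Step S102: Based on the best-hit CCS reads, regions whose sequencing depth is above a preset value are selected, and the window containing the largest number of heterozygous SNP markers is found within these regions.
In a preferred embodiment of the invention, the sequencing depth is usually required to exceed half of the maximum sequencing depth; such regions are called "high-depth regions". For example, in such a region the CCS reads are evenly distributed and the sequencing depth is 75× or more.
The window size can be an empirical default, for example 500 bp. Among the windows in these high-depth regions, those with the highest heterozygosity, that is, the windows containing the most heterozygous SNP markers, are identified and their positions recorded as the basis for seed selection.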
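The window search can be sketched as below. Several details are assumptions of this sketch rather than statements of the patent: windows are taken as non-overlapping tiles, a window counts as high-depth only if every base in it exceeds the threshold, and `densest_window` is a hypothetical name.

```python
def densest_window(depth, marker_positions, window=500, min_depth=None):
    """Find the high-depth window that contains the most heterozygous markers.

    depth:            list of per-base depths over the region (index = offset)
    marker_positions: offsets of the heterozygous SNP markers in the region
    min_depth:        depth threshold; defaults to half of the maximum depth
    Returns (window_start, marker_count), or (None, -1) if no window qualifies.
    """
    if min_depth is None:
        min_depth = max(depth) / 2
    best_start, best_count = None, -1
    for start in range(0, len(depth) - window + 1, window):
        if min(depth[start:start + window]) <= min_depth:
            continue  # some base in this window is not high-depth
        count = sum(start <= p < start + window for p in marker_positions)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count
```

A sliding window with a smaller step would locate the densest stretch more precisely at a higher cost; the tiled version is used here only to keep the example short.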
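Step S103: The CCS reads covering the window are clustered, and two optimal SNP sets are generated as seeds based on the clustering result.
In a preferred embodiment of the invention, clustering the CCS reads covering the window may specifically comprise: taking the SNPs within the triple-window region formed by a window together with its left and right neighboring windows, and tallying separately the SNPs that share a position but differ in type, to obtain the SNP sets corresponding to the two haplotypes.
In a preferred embodiment of the invention, generating two optimal SNP sets as seeds based on the clustering result may specifically comprise: performing an artificial simulation based on the SNP sets of the two haplotypes, taking the triple-window region formed by a window and its left and right neighboring windows as the seed length, and selecting the most frequent SNP combinations as the SNP information carried by the seeds, thereby generating two optimal SNP sets that serve as the starting seeds of the two haplotypes.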
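One possible reading of the seed construction is sketched below. It is an interpretation, not the patent's implementation: each read that covers every marker in the triple-window region contributes one allele combination, and the two most frequent combinations become the seeds. The name `make_seeds` and the input layout are hypothetical.

```python
from collections import Counter


def make_seeds(reads, marker_positions):
    """Build two seed SNP sets from the reads covering the triple-window region.

    reads:            {read_id: {position: base}} restricted to the region
    marker_positions: positions of the heterozygous SNP markers in the region
    Returns up to two dicts {position: base}, most frequent combination first.
    """
    combos = Counter()
    for alleles in reads.values():
        # only reads that cover every marker give a complete combination
        if all(p in alleles for p in marker_positions):
            combos[tuple(alleles[p] for p in marker_positions)] += 1
    top_two = [combo for combo, _ in combos.most_common(2)]
    return [dict(zip(marker_positions, combo)) for combo in top_two]
```

With error-prone reads a third, rarer combination usually appears as well; taking only the two most frequent ones is what lets the seeds tolerate isolated sequencing errors in this sketch.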
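Step S104: Based on the positional overlap on the genome between each seed and the CCS reads belonging to the same haplotype, the seeds are extended to obtain a set of CCS reads.
In a preferred embodiment of the invention, extending the seeds to obtain a set of CCS reads specifically comprises: each seed is checked against all CCS reads. At the start of extension, the triple-window region corresponding to each seed is taken as the known region (detected region), also called the already-extended region. For each CCS read, the SNPs in the part that overlaps the known region are examined by comparing their positions, types, and sequencing quality values. The CCS reads belonging to the same haplotype are ranked by their spatial overlap with the known region on the genome, from largest to smallest, and are added to the known region in that order until the extension reaches the ends of all CCS reads, thereby building the complete haplotype and recording the set of CCS reads.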
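The greedy extension can be illustrated with the following sketch. It simplifies the description in two labeled ways: a read is treated as belonging to the haplotype only if all of its SNPs at already-known positions match exactly (no error tolerance), and quality values are ignored. `extend_seed` and its input layout are hypothetical.

```python
def extend_seed(seed_start, seed_end, seed_alleles, reads):
    """Greedily extend a seed with overlapping reads of the same haplotype.

    reads: {read_id: (start, end, {position: base})}
    Returns (assigned read ids in order, haplotype alleles, (start, end)).
    """
    region_start, region_end = seed_start, seed_end
    hap = dict(seed_alleles)
    assigned, remaining = [], dict(reads)
    while True:
        candidates = []
        for rid, (start, end, alleles) in remaining.items():
            overlap = min(end, region_end) - max(start, region_start)
            shared = [p for p in alleles if p in hap]
            if overlap > 0 and shared and all(alleles[p] == hap[p] for p in shared):
                candidates.append((overlap, rid))
        if not candidates:
            break
        # hierarchical step: the read with the largest overlap is added first
        _, rid = max(candidates)
        start, end, alleles = remaining.pop(rid)
        for pos, base in alleles.items():
            hap.setdefault(pos, base)  # learn alleles at newly reached positions
        region_start = min(region_start, start)
        region_end = max(region_end, end)
        assigned.append(rid)
    return assigned, hap, (region_start, region_end)
```

Recomputing the overlaps after every addition is what makes reads that initially lie outside the seed reachable once the known region has grown toward them.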
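Step S105: The set of heterozygous SNP markers corresponding to the set of CCS reads is found, and the SNP set of the optimal haplotype is obtained based on the quality value of each SNP.
In a preferred embodiment of the invention, obtaining the SNP set of the optimal haplotype based on the quality value of each SNP may specifically comprise: calculating the sequencing quality values of each SNP in the set of heterozygous SNP markers and selecting the SNPs with the highest summed sequencing quality value, to obtain the SNP set of the optimal haplotype.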
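A small sketch of the quality-sum selection, assuming the observations of one haplotype's reads are available as (position, base, quality) triples. The function name and the input layout are hypothetical.

```python
from collections import defaultdict


def best_haplotype_snps(observations):
    """Pick, at each position, the base with the highest summed quality.

    observations: iterable of (position, base, quality) taken from the reads
                  assigned to one haplotype
    Returns {position: base}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for pos, base, qual in observations:
        totals[pos][base] += qual
    return {pos: max(bases, key=bases.get) for pos, bases in totals.items()}
```

Summing qualities rather than counting reads means that two moderately confident observations (30 + 25) can outweigh a single high-quality one (40), which is the behavior the example in the test below exercises.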
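Step S106: Each CCS read is scored against the SNP set of the optimal haplotype, and each CCS read is assigned to a haplotype according to its score.
In a preferred embodiment of the invention, the scoring and haplotype assignment may specifically comprise: computing, for each CCS read, a concordance ratio weighted by sequencing quality values, based on how well the positions and types of the SNPs on the read overlap with the SNP set of the optimal haplotype; assigning each CCS read to a haplotype according to its score; and selecting the CCS reads with higher concordance (for example, the top 50%) so as to distinguish the two haplotypes.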
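The scoring step might look like the sketch below. The exact weighting formula is not given in the text, so this sketch assumes the score is the summed quality of agreeing SNPs divided by the summed quality of all SNPs at marker positions, and it applies the top-50% cut over all reads together rather than per haplotype. Both choices, and the names `concordance` and `assign_reads`, are assumptions.

```python
def concordance(read_snps, hap_snps):
    """Quality-weighted agreement between one read and one haplotype SNP set.

    read_snps: {position: (base, quality)}; hap_snps: {position: base}
    Returns a value in [0, 1], or None if the read covers no marker.
    """
    agree = total = 0
    for pos, (base, qual) in read_snps.items():
        if pos in hap_snps:
            total += qual
            if hap_snps[pos] == base:
                agree += qual
    return agree / total if total else None


def assign_reads(reads, hap1, hap2, keep_fraction=0.5):
    """Assign reads to haplotype 1 or 2 and keep the most concordant fraction."""
    scored = []
    for rid, snps in reads.items():
        s1, s2 = concordance(snps, hap1), concordance(snps, hap2)
        if s1 is None or s2 is None or s1 == s2:
            continue  # uninformative read
        scored.append((max(s1, s2), rid, 1 if s1 > s2 else 2))
    scored.sort(reverse=True)
    kept = scored[:max(1, int(len(scored) * keep_fraction))] if scored else []
    return {rid: hap for _, rid, hap in kept}
```

Setting `keep_fraction=1.0` keeps every informative read, which can be useful for inspecting the full score distribution before choosing a cutoff.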
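In a preferred embodiment of the invention, after the haplotype assignment, the method further comprises: assembling the CCS reads under the two haplotypes of the target gene to build contigs, thereby obtaining the full-length haplotype sequences of the target gene.
Once contigs covering the target gene region have been obtained, standard detection of variants (for example SNPs, indels, SVs, and CNVs) is performed using mature and widely used resequencing analysis workflows.
Those skilled in the art will understand that all or part of the functions of the methods in the above embodiments may be implemented in hardware or by a computer program. When all or part of the functions are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, magnetic disks, optical discs, hard disks, and the like, and the functions are realized when a computer executes the program. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when a processor executes the program in the memory. The program may also be stored on a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash drive, or a removable hard disk, and be downloaded or copied into the memory of a local device, or used to update the system of the local device; all or part of the functions in the above embodiments are realized when a processor executes the program in the memory.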
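Accordingly, another embodiment of the invention provides a device for haplotype phasing of a diploid genome based on third-generation capture sequencing, comprising: a memory for storing a program; and a processor for implementing the following method by executing the program stored in the memory: aligning the CCS reads corresponding to a target gene region to a reference genome to obtain their best-hit chromosomal positions, wherein the CCS reads are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit CCS reads; selecting, based on the best-hit CCS reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers; clustering the CCS reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result; extending the seeds, based on the positional overlap on the genome between the seeds and the CCS reads belonging to the same haplotype, to obtain a set of CCS reads; finding the set of heterozygous SNP markers corresponding to the set of CCS reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP; and scoring each CCS read against the SNP set of the optimal haplotype, and assigning each CCS read to a haplotype according to its score.
Yet another embodiment of the invention provides a computer-readable storage medium containing a program that can be executed by a processor to implement the following method: aligning the CCS reads corresponding to a target gene region to a reference genome to obtain their best-hit positions on the chromosome, wherein the CCS reads are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit CCS reads; selecting, based on the best-hit CCS reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers; clustering the CCS reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result; extending the seeds, based on the positional overlap on the genome between the seeds and the CCS reads belonging to the same haplotype, to obtain a set of CCS reads; finding the set of heterozygous SNP markers corresponding to the set of CCS reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP; and scoring each CCS read against the SNP set of the optimal haplotype, and assigning each CCS read to a haplotype according to its score.
Embodiments of the invention use data from third-generation target-region capture sequencing. They exploit the fact that third-generation sequencers yield reads whose chromosomal positions are fairly randomly distributed and whose lengths vary fairly randomly around the length of the target region, so that they benefit both from the ease of assembling long fragments and from the high accuracy of short fragments. The haplotype phasing method of the invention is best suited to third-generation sequencing data and makes full use of the advantages of third-generation sequencing. Compared with second-generation sequencing, it yields highly reliable haplotype phasing information over the full length of a gene and thus enables high-precision variant detection.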
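The technical solutions and effects of the invention are described in detail below through an example. It should be understood that the example is illustrative only and is not to be construed as limiting the scope of protection of the invention.
Example
In this example, the HLA target region on human chromosome 6 was captured and sequenced, and the full-length regions of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes were analyzed.
Using mature, published experimental techniques, a capture experiment covering the complete full-length HLA region was performed on a sample of BGI's BGI-YH cell line; a 10K library was constructed and sequenced on a PacBio RSII sequencer. For the same BGI-YH sample, capture, library construction, and sequencing were carried out five times in parallel and independently. The standard PacBio RSII sequencing workflow yielded the polymerase read information, stored in binary form in bax.h5 files.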
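Software from the SMRT Analysis package provided by PacBio (https://github.com/PacificBiosciences) was used to remove the adapter sequences added during library construction, yielding shorter subreads. The length distribution of the subreads is shown in Figure 2: the curve is smooth, with a main peak at a read length of 2.5k and a fairly pronounced tail near 5k.
CCS correction (circular consensus sequencing) was performed with the RS_ReadsOfInsert.xml protocol in the SMRT Analysis software provided by PacBio to obtain fastq files. The bax.h5 files totaled about 80G; the ccs.fastq file obtained after CCS correction was up to 290M, and there was also a clr.fastq file of about 240M (this file contains only continuous long reads from a single sequencing pass, which cannot be CCS-corrected). The length distribution of the CCS reads is shown in Figure 3: the curve is smooth, with a main peak at a CCS read length of 2.5k and a fairly pronounced secondary peak at 5k.
The MEM algorithm (BWA-MEM) of the open-source BWA aligner (Version: 0.5.9-r16), which is suited to aligning longer sequences, was used to align this CCS sequence file to the human reference genome (GRCh37.p13), producing a SAM-format alignment file of the CCS reads. Using the position information of the CCS reads in the SAM file, the CCS reads corresponding to the target region of interest were selected, for example the HLA-A gene and its adjacent region (NC_000006.11 (29910247..29913661)).
The view, sort, rmdup, and index commands of the open-source SAMtools software (Version: 0.5.9-r16) were applied to this file in turn: the SAM file was first converted to a binary BAM file and then sorted with the sort command; next, the rmdup command was used to remove CCS reads arising from PCR duplication, and the index command was used to generate the .bai index file.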
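Based on the heterozygous SNPs present, haplotype phasing was performed with the analysis method provided by the invention to distinguish the two haplotypes, as follows:
(a) The BAM file obtained by aligning all CCS reads of the full-length gene region extracted in the previous step to the human reference genome (GRCh37.p13) was read in full, recorded, and loaded into memory. The CCS read information was stored in a hash structure in order to find the aligned CCS read with the largest alignment score, that is, the best-hit CCS read. The start and end coordinates of these CCS reads, together with the types and coordinates of all SNPs they contain, were stored in a dedicated data structure.
(b) A program loop was used to go through all best-hit CCS read information in memory and establish the quantitative relationship between sequencing depth and coverage on the one hand and SNP frequency (AF) on the other. A cutoff range of 25%-75% was used to select heterozygous SNPs as the basis for haplotype phasing, that is, the heterozygous SNP markers. The distribution of the ratio of SNP frequency to sequencing depth is shown in Figure 4.
(c) All best-hit CCS reads in memory were examined to find high-depth regions, that is, regions where the sequencing depth is greater than half of the maximum sequencing depth (regions in which the CCS reads are evenly distributed, with a sequencing depth of 75× or more). A statistics window size was set (empirical default 500 bp), and among these high-depth windows those with the highest heterozygosity, that is, the windows containing the most heterozygous SNP markers, were identified and their positions recorded as the basis for seed selection.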
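(d) Using the windows with the most heterozygous SNP markers found in the previous step, all CCS reads covering a window were clustered according to the positions and types of the SNPs they carry. Specifically, the SNPs within the triple-window region formed by a window and one neighboring window on each side were analyzed: SNPs that share a position but differ in type were tallied separately, separating the SNP sets of the two haplotypes. After clustering, an artificial simulation was performed based on the two haplotype SNP sets: the triple-window region formed by the window and one neighboring window on each side was taken as the seed length, and the most frequent SNP combinations were selected as the SNP information carried by the seeds, generating two optimal SNP sets that served as the starting seeds of the two haplotypes.
(e) Each seed was checked against all CCS reads. At the start of extension, the triple-window region corresponding to each seed is the known region, also called the already-extended region. For each CCS read, the SNPs in the part overlapping the already-extended region were examined by comparing their positions, types, and sequencing quality values. This process is hierarchical: the CCS reads belonging to the same haplotype were ranked by their spatial overlap with the already-extended region on the genome, from largest to smallest, and were added to the already-extended region in that order until the extension reached the ends of all CCS reads, building a complete haplotype and recording its CCS reads.
(f) Using the set of CCS reads obtained by extension in the previous step, the set of heterozygous SNP markers on these CCS reads was found. The quality values of each SNP were calculated and the SNPs with the highest summed quality value were selected, giving the SNP set of the optimal haplotype.
(g) Each CCS read was scored against the SNP set of the optimal haplotype obtained in the previous step. Based on how well the positions and types of the SNPs on each CCS read overlap with that SNP set, a concordance ratio weighted by sequencing quality values was computed, scored, and recorded, as shown in Figure 5. Each CCS read could then be assigned to a haplotype according to its score, and the top 50% of CCS reads by concordance were selected so as to distinguish the two haplotypes.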
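The Canu assembler (https://github.com/marbl/canu) was used to assemble the CCS reads under each haplotype, yielding two accurate, fully phased contigs. The number of reads and bases at each step is detailed in Table 1.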
Table 1. Data volume at each step
Step                         Number of reads   Number of bases
Subreads                     1,405,529         3,902,016,486
Reads after CCS correction   154,307           469,301,593
Reads after Canu correction  31,204            95,842,806
Reads after Canu trimming    27,692            90,664,025
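As a quick worked check on Table 1, the short script below derives the mean read length at each step and the fraction of reads retained relative to the subreads. The figures are copied from the table; the dictionary keys and function names are illustrative only.

```python
# (number of reads, number of bases) per step, copied from Table 1
STEPS = {
    "subreads": (1_405_529, 3_902_016_486),
    "ccs": (154_307, 469_301_593),
    "canu_corrected": (31_204, 95_842_806),
    "canu_trimmed": (27_692, 90_664_025),
}


def mean_length(step):
    """Average read length in bases for one step."""
    reads, bases = STEPS[step]
    return bases / reads


def retained(step, baseline="subreads"):
    """Fraction of the baseline read count that remains at the given step."""
    return STEPS[step][0] / STEPS[baseline][0]


if __name__ == "__main__":
    for name in STEPS:
        print(f"{name}: mean length {mean_length(name):.1f} bp, "
              f"{retained(name):.1%} of subreads")
```

The CCS reads amount to roughly 11% of the subread count, which is consistent with the reduction of nearly 90% noted for Figure 3, while the mean length rises from about 2.8 kb to about 3.0 kb.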
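Figures 6 to 12 show the Integrative Genomics Viewer (IGV) views of the HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, and HLA-DQB1 genes in this example, respectively. They show how, after CCS correction with PacBio SMRT Analysis, the contigs assembled by Canu from the reads assigned to the two haplotypes near each gene by the phasing method of this example are distributed on the human reference genome, and indicate that these gene regions have both high sequencing depth and complete coverage. The coverage bar chart in each figure shows the base frequencies at each SNP position after haplotype phasing; each bar is filled almost entirely with a single color, reflecting the high accuracy of the phasing.
The MEM algorithm of the BWA alignment software (version: 0.5.9-r16) was used to align the contig sequence file obtained in the previous step to the human reference genome (GRCh37.p13), producing a SAM format file.
SNP calling software was used to detect SNPs in the SAM file obtained in the previous step. The SNP variants carried by the two haplotypes over the full length of the HLA-A gene and its adjacent region in the BGI-YH sample are shown in Table 2, and the comparison with gold-standard Sanger sequencing is shown in Table 3. The concordance reached 100% (FP=0.0%, FN=0.0%), showing that the data processing workflow of this embodiment and its haplotype phasing method achieve the same accuracy as the gold standard and outperform variant analysis based on "second-generation" sequencing.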
Table 2. Details of SNP variants over the full length of the HLA-A gene and its adjacent region
[Table 2 is provided as images: Figure PCTCN2017089108-appb-000001 to Figure PCTCN2017089108-appb-000010]
Table 3. Detailed comparison with the gold standard (Sanger sequencing)
[Table 3 is provided as images: Figure PCTCN2017089108-appb-000011 and Figure PCTCN2017089108-appb-000012]
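The concordance, FP, and FN figures quoted for the Sanger comparison are not defined in the text. The sketch below shows one common way such numbers could be computed from two call sets, under the assumptions that the FP rate is taken relative to the calls made, the FN rate relative to the Sanger calls, and concordance relative to the union of both. The function name is hypothetical and the positions in the usage example are invented for illustration, not taken from Table 2 or Table 3.

```python
def compare_calls(called, truth):
    """Compare SNP calls with a gold-standard set.

    called, truth: sets of (position, base)
    Returns (concordance, fp_rate, fn_rate).
    """
    tp = len(called & truth)   # calls confirmed by the gold standard
    fp = len(called - truth)   # calls absent from the gold standard
    fn = len(truth - called)   # gold-standard SNPs that were missed
    union = tp + fp + fn
    concordance = tp / union if union else 1.0
    fp_rate = fp / len(called) if called else 0.0
    fn_rate = fn / len(truth) if truth else 0.0
    return concordance, fp_rate, fn_rate


if __name__ == "__main__":
    # hypothetical call sets, for illustration only
    sanger = {(101, "A"), (250, "G"), (310, "T")}
    print(compare_calls(set(sanger), sanger))
```

When the two sets are identical, as reported for the HLA-A comparison, the function returns a concordance of 1.0 with both error rates at 0.0.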
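The invention has been described above using specific examples, which serve only to aid understanding and are not intended to limit the invention. Those skilled in the art to which the invention pertains may make various simple deductions, variations, or substitutions based on the ideas of the invention.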

Claims (11)

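  1. A method for haplotype phasing of a diploid genome based on third-generation capture sequencing, characterized by comprising:
    aligning the CCS sequences corresponding to a target gene region to a reference genome to obtain best-hit reads, wherein the CCS sequences are obtained by circular consensus correction of third-generation target-region capture sequencing reads; then selecting heterozygous SNP markers based on the best-hit reads;
    selecting, based on the best-hit reads, regions whose sequencing depth is above a preset value, and finding within those regions the window containing the largest number of the heterozygous SNP markers;
    clustering the reads covering the window, and generating two optimal SNP sets as seeds based on the clustering result;
    extending the seeds, based on the positional overlap on the genome between the seeds and the reads belonging to the same haplotype, to obtain a set of reads;
    finding the set of heterozygous SNP markers corresponding to the set of reads, and obtaining the SNP set of the optimal haplotype based on the quality value of each SNP;
    scoring each read against the SNP set of the optimal haplotype, and assigning each read to a haplotype according to its score.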
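  2. The method according to claim 1, characterized in that selecting heterozygous SNP markers based on the best-hit reads specifically comprises:
    for each SNP on the best-hit reads, calculating the ratio of the frequency of the SNP to the sequencing depth at that position, and selecting SNPs whose ratio falls within a preset cutoff range as the heterozygous SNP markers; preferably, the preset cutoff range is 25% to 75%.
  3. The method according to claim 1, characterized in that the regions whose sequencing depth is above a preset value are regions whose sequencing depth is greater than half of the maximum sequencing depth; preferably, the regions whose sequencing depth is above a preset value are regions whose sequencing depth is 75× or more.
  4. The method according to claim 1, characterized in that clustering the reads covering the window specifically comprises: taking the SNPs within the triple-window region formed by a window together with its left and right neighboring windows, and tallying separately the SNPs that share a position but differ in type, to obtain the SNP sets corresponding to the two haplotypes.
  5. The method according to claim 1, characterized in that generating two optimal SNP sets as seeds based on the clustering result specifically comprises: performing an artificial simulation based on the SNP sets corresponding to the two haplotypes, taking the triple-window region formed by a window and its left and right neighboring windows as the seed length, and selecting the most frequent SNP combinations as the SNP information carried by the seeds, thereby generating two optimal SNP sets that serve as the starting seeds of the two haplotypes.
  6. The method according to claim 1, characterized in that extending the seeds to obtain a set of reads specifically comprises:
    at the start of extension, taking the triple-window region corresponding to each seed as the known region, and comparing the positions, types, and sequencing quality values of the SNPs in the part of each read that overlaps the known region; ranking the reads belonging to the same haplotype by their spatial overlap with the known region on the genome, from largest to smallest, and adding them to the known region in that order until the extension reaches the ends of all reads, thereby building the complete haplotype and recording the set of reads.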
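  7. The method according to claim 1, characterized in that obtaining the SNP set of the optimal haplotype based on the quality value of each SNP specifically comprises:
    calculating the sequencing quality values of each SNP in the set of heterozygous SNP markers, and selecting the SNPs with the highest summed sequencing quality value, to obtain the SNP set of the optimal haplotype.
  8. The method according to claim 1, characterized in that the scoring and haplotype assignment specifically comprise:
    computing, for each read, a concordance ratio weighted by sequencing quality values, based on how well the positions and types of the SNPs on the read overlap with the SNP set of the optimal haplotype, and assigning each read to a haplotype according to its score.
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
    after the haplotype assignment, assembling the CCS sequences under the two haplotypes of the target gene to build contigs, thereby obtaining the full-length haplotype sequences of the target gene.
  10. A device for haplotype phasing of a diploid genome based on third-generation capture sequencing, characterized by comprising:
    a memory for storing a program;
    a processor for implementing the method according to any one of claims 1 to 9 by executing the program stored in the memory.
  11. A computer-readable storage medium, characterized by containing a program that can be executed by a processor to implement the method according to any one of claims 1 to 9.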
PCT/CN2017/089108 2017-06-20 2017-06-20 Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing WO2018232580A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
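CN201780090335.9A CN110621785B (zh) 2017-06-20 2017-06-20 Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing
PCT/CN2017/089108 WO2018232580A1 (zh) 2017-06-20 2017-06-20 Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing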

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/089108 WO2018232580A1 (zh) 2017-06-20 2017-06-20 Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing

Publications (1)

Publication Number Publication Date
WO2018232580A1 true WO2018232580A1 (zh) 2018-12-27

Family

ID=64735460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/089108 WO2018232580A1 (zh) 2017-06-20 2017-06-20 Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing

Country Status (2)

Country Link
CN (1) CN110621785B (zh)
WO (1) WO2018232580A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
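CN110592208A (zh) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition for three subtypes of thalassemia, and application method and application device thereof
CN113496760A (zh) * 2020-04-01 2021-10-12 深圳华大基因科技服务有限公司 Polyploid genome assembly method and device based on third-generation sequencing
CN116779035A (zh) * 2023-05-26 2023-09-19 成都基因汇科技有限公司 Subgenome phasing method for polyploid transcriptomes and computer-readable storage medium
CN117577178A (zh) * 2024-01-16 2024-02-20 山东大学 Method and system for detecting precise breakpoint information of structural variants, and application thereof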

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
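CN111583997B (zh) * 2020-05-06 2022-03-01 西安交通大学 Hybrid method for correcting sequencing errors in third-generation sequencing data in the presence of heterozygous variants
CN112210597B (zh) * 2020-09-30 2022-11-11 青岛普泽麦迪生物技术有限公司 Method for sequencing an HLA probe library based on targeted capture of long DNA fragments and MinION long reads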

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
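CN104508144A (zh) * 2012-07-18 2015-04-08 伊鲁米纳剑桥有限公司 Methods and systems for determining haplotypes and phasing of haplotypes
CN105112518A (zh) * 2015-08-18 2015-12-02 北京希望组生物科技有限公司 HLA typing method based on the Pacbio RS II sequencing platform
CN106498050A (zh) * 2016-10-25 2017-03-15 中国医学科学院药用植物研究所 Method for monitoring the biological species composition of Chinese patent medicines based on SMRT sequencing technology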

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
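WO2012083505A1 (zh) * 2010-12-24 2012-06-28 深圳华大基因科技有限公司 HLA-C genotyping method and related primers
CN101921842B (zh) * 2010-06-30 2013-08-07 深圳华大基因科技有限公司 PCR primers for HLA-A and HLA-B genotyping and method of using them
CN103993069B (zh) * 2014-03-21 2020-04-28 深圳华大基因科技服务有限公司 Capture sequencing analysis method for viral integration sites
CN104762406B (zh) * 2015-04-23 2017-08-25 东南大学 Method for analyzing haplotypes of PCR products by two-nucleotide asynchronous sequencing by synthesis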

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
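CN110592208A (zh) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition for three subtypes of thalassemia, and application method and application device thereof
CN110592208B (zh) * 2019-10-08 2022-05-03 北京诺禾致源科技股份有限公司 Capture probe composition for three subtypes of thalassemia, and application method and application device thereof
CN113496760A (zh) * 2020-04-01 2021-10-12 深圳华大基因科技服务有限公司 Polyploid genome assembly method and device based on third-generation sequencing
CN113496760B (zh) * 2020-04-01 2024-01-12 深圳华大基因科技服务有限公司 Polyploid genome assembly method and device based on third-generation sequencing
CN116779035A (zh) * 2023-05-26 2023-09-19 成都基因汇科技有限公司 Subgenome phasing method for polyploid transcriptomes and computer-readable storage medium
CN116779035B (zh) * 2023-05-26 2024-03-15 成都基因汇科技有限公司 Subgenome phasing method for polyploid transcriptomes and computer-readable storage medium
CN117577178A (zh) * 2024-01-16 2024-02-20 山东大学 Method and system for detecting precise breakpoint information of structural variants, and application thereof
CN117577178B (zh) * 2024-01-16 2024-03-26 山东大学 Method and system for detecting precise breakpoint information of structural variants, and application thereof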

Also Published As

Publication number Publication date
CN110621785A (zh) 2019-12-27
CN110621785B (zh) 2023-08-15

Similar Documents

Publication Publication Date Title
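CN109033749B (zh) Tumor mutation burden detection method, device, and storage medium
WO2018232580A1 (zh) Method and device for haplotype phasing of a diploid genome based on third-generation capture sequencing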
Yaari et al. Practical guidelines for B-cell receptor repertoire sequencing analysis
Cortés-Ciriano et al. Computational analysis of cancer genome sequencing data
Yuan et al. CNV_IFTV: an isolation forest and total variation-based detection of CNVs from short-read sequencing data
CN113168886A (zh) Systems and methods for germline and somatic variant calling using neural networks
Babarinde et al. Computational methods for mapping, assembly and quantification for coding and non-coding transcripts
WO2012168815A2 (en) Method for assembly of nucleic acid sequence data
EP3405573A1 (en) Methods and systems for high fidelity sequencing
CN110021355B (zh) Method and device for haplotype phasing and variant detection of diploid genome sequencing reads
Zhang et al. Identification of common carp innate immune genes with whole-genome sequencing and RNA-Seq data
Chen et al. Recent advances in sequence assembly: principles and applications
Ahsan et al. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data
Kõks et al. Sequencing and annotated analysis of full genome of Holstein breed bull
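WO2024051097A1 (zh) Neoantigen identification method, apparatus, device, and medium for tumor-specific circular RNA
KR20140099189A (ko) Method for providing information on gene sequence-based personal markers and apparatus using the same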
WO2019236842A1 (en) Difference-based genomic identity scores
Esim et al. Determination of malignant melanoma by analysis of variation values
JP6902258B2 (ja) Method for determining the allele pair of a subject's HLA gene
Zhang et al. PocaCNV: a tool to detect copy number variants from population-scale genome sequencing data
Gerasimov Analysis of ngs data from immune response and viral samples
Chen et al. DeBreak: Deciphering the exact breakpoints of structural variations using long sequencing reads
CN114882943B (zh) Method and device for analyzing somatic variants
RU2804535C1 (ru) Whole-genome sequencing data processing system
Heller Structural variant calling using third-generation sequencing data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17914239

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17914239

Country of ref document: EP

Kind code of ref document: A1