WO2012116658A2 - Method and apparatus for assembling a genome sequence - Google Patents

Method and apparatus for assembling a genome sequence

Info

Publication number
WO2012116658A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
short
reads
fragment
module
Prior art date
Application number
PCT/CN2012/071876
Inventor
韩长磊
陈文彬
张秀清
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Priority date
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院
Priority to US14/002,374 (US20130345095A1)
Publication of WO2012116658A2

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly

Definitions

  • The present invention relates to the field of bioinformatics, and in particular to a method and apparatus for assembling a genome sequence.
  • Background art
  • With the emergence of next-generation sequencing technologies such as 454 (Roche), Solexa (Illumina), and SOLiD (ABI), sequencing costs have fallen dramatically while sequencing throughput has increased rapidly.
  • Next-generation sequencing technology has greatly advanced the development of genomics. Whole genome sequences of a large number of species were published, including James Watson's personal genome, the first Asian genome, and the genomes of giant pandas and cucumbers.
  • The fragment length N50 is obtained by sorting all assembled sequences from longest to shortest and adding up their lengths in that order; N50 is the length of the assembled sequence at which the running sum reaches 50% of the total length of all assembled sequences. A detailed description of N50 can be found in Miller et al. 2010. Assembly algorithms for next generation sequencing data. Genomics. 95 (6): 315-327, which is incorporated by reference.
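The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `n50` and the example lengths are hypothetical and do not come from the patent.

```python
def n50(lengths):
    """Return the N50 of a collection of assembled sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    summed in that order; N50 is the length of the sequence at which the
    running sum first reaches half of the total assembled length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly with nine fragments (lengths in arbitrary units).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

A larger N50 means that half of the assembled bases sit in longer fragments, which is why the examples later in the description report N50 before and after the additional linking step.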
  • Assembly indicators such as N50 are also limited by the length of the inserts that can be constructed in the experiment.
  • the present invention aims to solve at least one of the technical problems existing in the prior art.
  • The present invention proposes a method and apparatus for assembling a genomic sequence using short fragment sequences obtained by sequencing the ends of a long insert library, thereby improving assembly efficiency and quality.
  • the invention proposes a method of assembling a genomic sequence.
  • The method of assembling a genomic sequence comprises: filtering the short fragment sequences obtained by sequencing the ends of a long insert library to remove unqualified sequences; aligning the filtered short fragment sequences to a reference genome sequence; according to the alignment result, dividing the aligned paired short fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and counting the number of each; using the soap reads sequences, calculating the distance between the two short fragment sequences of each aligned pair on the same fragment of the reference genome sequence, and compiling the distribution of these distances over all aligned pairs; and, when the distance distribution meets the threshold requirement, assembling the genomic sequence using the single reads sequences that align uniquely to different fragments of the reference genome sequence. The efficiency and quality of genome sequence assembly can thereby be improved.
  • the method of assembling a genomic sequence may further have the following additional technical features:
  • The filtered short fragment sequences comprise paired short fragment sequences. The efficiency of genome assembly can thereby be further improved.
  • Before aligning the filtered short fragment sequences to the reference genome sequence, the method further comprises truncating the filtered short fragment sequences to a set length. The efficiency of genome assembly can thereby be further improved.
  • The unqualified sequences comprise at least one selected from the group consisting of: exogenous sequences, short fragment sequences in which the proportion of N bases reaches a predetermined ratio, short fragment sequences containing a polyA structure, short fragment sequences in which the number of low-quality bases reaches a predetermined number, linker-contaminated short fragment sequences, paired short fragment sequences that overlap each other in sequencing, and repeatedly detected short fragment sequences.
  • The soap reads sequences comprise soap reads sequences that align uniquely to the same fragment of the reference genome sequence and soap reads sequences that align at multiple positions to the same fragment of the reference genome sequence. The step of using the soap reads sequences to calculate the distance between the two short fragment sequences of each aligned pair on the same fragment of the reference genome sequence further comprises: calculating that distance using the soap reads sequences that align uniquely to the same fragment of the reference genome sequence.
  • The method further comprises: constructing a long insert library; and sequencing the ends of the long insert library to obtain the short fragment sequences. This facilitates the assembly of longer genomic sequence fragments.
  • the invention proposes an apparatus for assembling a genomic sequence.
  • The apparatus for assembling a genomic sequence comprises: a sequence filtering module, configured to filter the short fragment sequences obtained by sequencing the ends of a long insert library to remove unqualified sequences; a sequence alignment module, connected to the sequence filtering module and configured to align the filtered short fragment sequences to the reference genome sequence, wherein the filtered short fragment sequences comprise paired short fragment sequences; a sequence classification module, connected to the sequence alignment module and configured to divide the aligned paired short fragment sequences, according to the alignment result, into soap reads sequences, single reads sequences and unmap reads sequences, and to count the number of each; a sequence length statistics module, connected to the sequence classification module and configured to use the soap reads sequences to calculate the distance between the two short fragment sequences of each aligned pair on the same fragment of the reference genome sequence, and to compile the distribution of these distances; and a sequence assembly module, connected to the sequence classification module and the sequence length statistics module and configured to assemble the genomic sequence, when the distance distribution meets the threshold requirement, using the single reads sequences that align uniquely to different fragments of the reference genome sequence.
  • The apparatus for assembling a genomic sequence can effectively implement the aforementioned method, so that a genomic sequence can be assembled using short fragment sequences obtained by sequencing the ends of a long insert library, thereby improving assembly efficiency and quality.
  • the apparatus for assembling a genomic sequence may further have the following additional technical features:
  • The apparatus for assembling a genomic sequence of the present invention further comprises a sequence intercepting module, connected to the sequence filtering module and to the sequence alignment module respectively, and configured to truncate the filtered short fragment sequences to a set length before they are aligned to the reference genome sequence.
  • The unqualified sequences comprise at least one selected from the group consisting of: exogenous sequences, short fragment sequences in which the proportion of N bases reaches a predetermined ratio, short fragment sequences containing a polyA structure, short fragment sequences in which the number of low-quality bases reaches a predetermined number, linker-contaminated short fragment sequences, paired short fragment sequences that overlap each other in sequencing, and repeatedly detected short fragment sequences.
  • The soap reads sequences comprise soap reads sequences that align uniquely to the same fragment of the reference genome sequence and soap reads sequences that align at multiple positions to the same fragment of the reference genome sequence, wherein the distance between the two short fragment sequences of each aligned pair on the same fragment of the reference genome sequence is calculated using the uniquely aligned soap reads sequences.
  • The apparatus for assembling a genomic sequence of the present invention further comprises a sequence receiving module, connected to the sequence filtering module and configured to receive the short fragment sequences obtained by sequencing the ends of the long insert library.
  • FIG. 1 is a schematic flow chart of one embodiment of a method for assembling a genomic sequence of the present invention
  • FIG. 2 is a schematic flow chart of another embodiment of the method for assembling a genome sequence of the present invention.
  • FIG. 3 is a schematic flow chart of still another embodiment of the method for assembling a genome sequence of the present invention.
  • FIG. 4 is a schematic flow chart of still another embodiment of the method for assembling a genome sequence of the present invention.
  • FIG. 5 is a schematic diagram of library quality evaluation in still another embodiment of the method for assembling genomic sequences of the present invention
  • FIG. 6 is a schematic structural view of an embodiment of the apparatus for assembling genomic sequences of the present invention
  • FIG. 7 is a schematic structural view of still another embodiment of the apparatus for assembling genomic sequences of the present invention.
  • FIG. 8 is a schematic structural view of still another embodiment of the apparatus for assembling genomic sequences of the present invention.
  • Detailed description of the invention
  • the method of assembling a genome sequence may include the following steps:
  • The term "long insert" used in the present invention is not particularly limited in length and may be any insert length achievable with the prior art, for example up to at least 200 kb, for example from 40 kb to 200 kb, for example from about 100 kb to 200 kb.
  • the above long inserts can be easily obtained by those skilled in the art using existing vectors.
  • Fosmids and Bacterial Artificial Chromosomes (BACs) are large-fragment clones available for genomic studies. BACs can usually carry inserts of about 100 kb to 200 kb, and fosmids can usually carry inserts of about 40 kb. BACs and fosmids not only carry long inserts but are also very stable, so they are important tools for genomics research and play an important role in map-based gene cloning, gene analysis, structural variation studies and genome assembly.
  • the type of the unqualified sequence that needs to be removed is not particularly limited.
  • Examples include linker-contaminated short fragment sequences, paired short fragment sequences that overlap each other in sequencing, and repeatedly detected short fragment sequences, which are defined as repeats.
  • The alignment method is not particularly limited; for example, known alignment software such as soap or bwa may be used.
  • The filtered short fragment sequences comprise paired short fragment sequences;
  • according to the alignment result, the aligned paired short fragment sequences are divided into soap reads sequences, single reads sequences and unmap reads sequences, and the number of each is counted;
  • The term "soap reads sequence" means short sequences that exist in pairs and both align to the same assembled fragment of the reference genome sequence.
  • The term "single reads sequence" means short sequences of a pair in which only one of the two aligns, or in which the two align to different assembled fragments of the reference genome sequence;
  • the term "unmap reads" means short sequences of a pair in which neither of the two aligns to an assembled fragment of the reference genome sequence;
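The three categories can be illustrated with a small Python sketch. It assumes, hypothetically, that the aligner output has already been reduced to the name of the assembled fragment each read of a pair aligned to (or `None` for no alignment), and it treats a pair whose two reads align to different assembled fragments as single reads, which is the case the assembly step later relies on. The function names are illustrative and are not part of soap or bwa.

```python
from collections import Counter
from typing import Optional


def classify_pair(scaffold1: Optional[str], scaffold2: Optional[str]) -> str:
    """Classify one read pair as 'soap', 'single' or 'unmap'.

    scaffold1 / scaffold2 name the assembled fragment each read aligned to,
    or are None when the read did not align (hypothetical input format).
    """
    if scaffold1 is None and scaffold2 is None:
        return "unmap"   # neither read aligned
    if scaffold1 is not None and scaffold1 == scaffold2:
        return "soap"    # both reads aligned to the same assembled fragment
    return "single"      # one read aligned, or the reads hit different fragments


def count_categories(pairs):
    """Count how many pairs fall into each category."""
    return Counter(classify_pair(a, b) for a, b in pairs)


# Hypothetical alignment summary for four read pairs.
pairs = [("scaf1", "scaf1"), ("scaf1", "scaf7"), ("scaf2", None), (None, None)]
print(count_categories(pairs))
```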
  • Since soap reads sequences are paired short sequences that align to the same assembled fragment of the reference genome sequence, they can be used to calculate the distance between the two short sequences of each aligned pair on that fragment (i.e., the length spanned by the soap reads pair), and the distribution of these distances over all aligned pairs on the reference genome sequence is then compiled;
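As a rough sketch of this step, the distance of each soap reads pair can be derived from the two alignment start positions and then binned into a histogram. The convention used here (outer span from the start of the leftmost read to the end of the rightmost read, assuming equal read lengths) and the 1 kb bin size are assumptions made for illustration; the patent does not fix either.

```python
from collections import Counter


def pair_distance(start1: int, start2: int, read_length: int) -> int:
    """Outer span covered by a pair aligned to the same assembled fragment.

    Assumes both reads have the same length and that the distance is
    measured from the start of the leftmost read to the end of the
    rightmost read (an illustrative convention).
    """
    return abs(start2 - start1) + read_length


def distance_histogram(distances, bin_size: int = 1000) -> Counter:
    """Count distances per bin; keys are the lower edge of each bin."""
    return Counter((d // bin_size) * bin_size for d in distances)


# Hypothetical start positions of three soap reads pairs (100 bp reads).
starts = [(1_000, 40_900), (5_000, 44_400), (200, 450)]
distances = [pair_distance(a, b, 100) for a, b in starts]
print(distances)
print(distance_histogram(distances))
```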
  • When the distance distribution meets the threshold requirement (according to an embodiment of the present invention, the specific value of the threshold is not particularly limited and can be determined by a person skilled in the art through a limited number of experiments for a specific sequencing environment), the genomic sequence can be assembled using the single reads sequences that align uniquely to different assembled fragments of the reference genome sequence;
  • Single reads sequences that align uniquely to different assembled fragments of the reference genome sequence can be used to link adjacent genomic sequence fragments according to the intrinsic sequence length and spatial relationship of the sequencing library, which improves the genome assembly result.
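One simple way to picture the linking step is to count how many read pairs connect each pair of assembled fragments and keep only the connections supported by several pairs. The sketch below is an illustration under stated assumptions, not the patent's procedure: the input format, the function names and the `min_pairs` cut-off are hypothetical, and a real scaffolder would also use read orientation and the library insert length.

```python
from collections import Counter


def count_scaffold_links(single_pairs):
    """Count read pairs that connect two different assembled fragments.

    single_pairs is an iterable of (scaffold_a, scaffold_b) names for pairs
    whose reads align uniquely to different fragments (hypothetical input).
    """
    links = Counter()
    for scaffold_a, scaffold_b in single_pairs:
        if scaffold_a != scaffold_b:
            # Sort the names so that (a, b) and (b, a) count as the same link.
            links[tuple(sorted((scaffold_a, scaffold_b)))] += 1
    return links


def supported_links(links, min_pairs: int = 2):
    """Keep links backed by at least min_pairs pairs (illustrative cut-off)."""
    return {pair: n for pair, n in links.items() if n >= min_pairs}


pairs = [("scaf1", "scaf2"), ("scaf2", "scaf1"), ("scaf2", "scaf3")]
print(supported_links(count_scaffold_links(pairs)))
```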
  • The method of assembling a genomic sequence may include the following steps. Specifically, the sequenced short fragment sequences may be aligned against experimentally introduced exogenous sequences (for example, various linker sequences); if a short fragment sequence contains an exogenous sequence, it is considered unqualified and is removed.
  • The unqualified sequences may further include at least one of the following: short fragment sequences in which the proportion of N bases reaches a predetermined ratio; short fragment sequences containing a polyA structure; short fragment sequences in which the number of low-quality bases reaches a certain level (for example, 40 bases); short fragment sequences with linker contamination (for example, at least 10 bp aligned to the linker sequence with no more than 3 mismatches); paired short fragment sequences that overlap each other in sequencing (for example, the two reads of a pair overlap by at least 10 bp with a mismatch ratio below 10%); and repeatedly detected short fragment sequences (pairs whose sequences are identical are defined as repeated short fragment sequences). Finally, short fragment sequences with poor quality at the head or tail are trimmed directly;
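A few of these criteria can be sketched as a per-read check. The figures of 40 low-quality bases, a 10 bp linker match and at most 3 mismatches are the examples given above; the N-ratio limit of 10%, the quality cut-off of 20, the adapter string and all function names are hypothetical values chosen only to make the sketch runnable.

```python
def has_adapter(read: str, adapter: str, min_overlap: int = 10,
                max_mismatches: int = 3) -> bool:
    """Return True if at least min_overlap bases of the read match the start
    of the adapter with no more than max_mismatches mismatches."""
    for offset in range(len(read) - min_overlap + 1):
        overlap = min(len(read) - offset, len(adapter))
        if overlap < min_overlap:
            break
        window = read[offset:offset + overlap]
        mismatches = sum(1 for r, a in zip(window, adapter) if r != a)
        if mismatches <= max_mismatches:
            return True
    return False


def is_unqualified(read: str, qualities, adapter: str,
                   max_n_ratio: float = 0.1,          # hypothetical limit
                   low_quality: int = 20,             # hypothetical cut-off
                   max_low_quality_bases: int = 40) -> bool:
    """Apply three of the filters: N ratio, low-quality bases, adapter."""
    if not read:
        return True
    if read.count("N") / len(read) >= max_n_ratio:
        return True
    if sum(1 for q in qualities if q < low_quality) >= max_low_quality_bases:
        return True
    return has_adapter(read, adapter)


ADAPTER = "GATCGGAAGAGC"  # hypothetical linker sequence
clean = "AC" * 15
print(is_unqualified(clean, [30] * len(clean), ADAPTER))
print(is_unqualified(clean[:18] + ADAPTER, [30] * 30, ADAPTER))
```

PolyA detection, pair-overlap detection and duplicate removal are left out because the description does not give enough detail to implement them without guessing.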
  • The lengths of the sequenced fragments implied by the alignment should be substantially the same, within a certain floating range (which can be set as required); short fragment sequences obtained from sequenced fragments whose length lies within the normal range are referred to as normal short sequences, and the others as abnormal short sequences.
  • The set length is at least 40 bp. If the aligned sequences are too short, the alignment efficiency is reduced on the one hand and the N50 is reduced on the other; the maximum number of mismatches allowed per short sequence during alignment should be as small as possible to ensure the accuracy of the alignment;
  • The filtered short fragment sequences are aligned to the reference genome sequence.
  • The alignment method is not particularly limited; for example, known alignment software such as soap or bwa may be used.
  • The filtered short fragment sequences comprise paired short fragment sequences;
  • according to the alignment result, the aligned paired short fragment sequences are divided into soap reads sequences, single reads sequences and unmap reads sequences, and the number of each is counted;
  • the specific value of the threshold is not particularly limited according to an embodiment of the present invention, and can be obtained by a limited number of experiments by a person skilled in the art for a specific sequencing environment.
  • The threshold is that sequences with a distance between 30 kb and 50 kb account for more than 85%.
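The 85% criterion can be expressed as a short check over the distances of the soap reads pairs. This is a sketch under assumptions: the function names are hypothetical, the bounds are taken as inclusive, and the fraction is computed over whatever list of distances is passed in (the description does not state which pairs form the denominator).

```python
def fraction_in_range(distances, low: int = 30_000, high: int = 50_000) -> float:
    """Fraction of pair distances that fall between low and high (inclusive)."""
    if not distances:
        raise ValueError("at least one distance is required")
    inside = sum(1 for d in distances if low <= d <= high)
    return inside / len(distances)


def library_passes(distances, min_fraction: float = 0.85,
                   low: int = 30_000, high: int = 50_000) -> bool:
    """True when more than min_fraction of the distances lie in [low, high]."""
    return fraction_in_range(distances, low, high) > min_fraction


# Hypothetical distances in bp: nine pairs near 40 kb and one short outlier.
distances = [40_000] * 9 + [5_000]
print(fraction_in_range(distances), library_passes(distances))
```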
  • The genomic sequence is assembled using the single reads sequences extracted in S210 that align uniquely to different fragments of the reference genome sequence.
  • The length of the short fragment sequences to be aligned is restricted to a set range, which ensures the accuracy and efficiency of the alignment.
  • A method of assembling a genomic sequence according to still another embodiment of the present invention is described below with reference to FIG. 3. As shown in FIG. 3, the method of assembling a genome sequence can include the following steps:
  • according to the alignment result, the aligned paired short fragment sequences are divided into soap reads sequences, single reads sequences and unmap reads sequences, and the number of each is counted, wherein the soap reads sequences may include soap reads sequences that align uniquely to the same fragment of the reference genome sequence and soap reads sequences that align at multiple positions to the same fragment of the reference genome sequence.
  • This example uses the soap reads sequences that align uniquely to the same fragment of the reference genome sequence to calculate the distance between the two short fragment sequences of each aligned pair on that fragment, so the quality of the long insert library can be quantified accurately, which improves the accuracy of genomic sequence assembly.
  • a method of assembling a genomic sequence may include the following steps:
  • constructing a library of long insert fragments can employ the following steps:
  • (1) the vector into which the DNA to be tested has been inserted is randomly fragmented to obtain random fragments longer than the vector, and the resulting fragments are end-repaired to make the ends blunt, wherein the vector is a plasmid, specifically a fosmid plasmid, a BAC plasmid or a cosmid plasmid;
  • (2) the end-repaired random fragments from (1) are separated to obtain random fragments longer than the vector;
  • (3) the random fragments obtained in (2) are self-ligated to form circular molecules, and the fragments that did not self-ligate are then removed;
  • the amplification product obtained in (4) above is end-repaired to make the ends blunt, sequencing linkers are then attached, and a next-generation sequencing platform is selected for sequencing; to ensure the required genome coverage, the total number of bases obtained by sequencing needs to be more than three times the size of the genome;
  • This example combines the construction of long insert libraries (e.g., fosmid, BAC, etc.) with next-generation sequencing technology, so that genomes can be constructed quickly and inexpensively. It exploits the fact that the inserts of fosmid and BAC libraries are much longer than those of general library construction methods, and uses the longer-range sequence relationships contained in the sequencing data to construct longer genomic sequence fragments, significantly improving the quality of the genome map.
  • The source of the reference genomic sequence is the National Center for Biotechnology Information, available at http://www.ncbi.nlm.nih.gov/, genome number: gi|116010291|ref|NC_004354.3| Drosophila melanogaster chromosome X, complete sequence.
  • The Drosophila X chromosome can be sequenced in simulation using the Maq simulate software to obtain sequencing data. The following parameters need to be set for Maq simulate: -d, -N, -1, -2, fq1, fq2 and simupars.dat.
  • The -d parameter is the length of the sequenced fragment, set to 500, 2000, 5000 and 40000 respectively;
  • the -N parameter indicates the total number of short fragment sequences to be obtained by sequencing, determined according to the sequencing depth; the simulated sequencing depth of this example is 50X (i.e., 50 times the length of the reference genome sequence), the total length of the reference genome is 22M, and the length of the short fragments is set to 100 bp;
  • the -1 and -2 parameters are the lengths of short fragment sequence 1 and short fragment sequence 2 of the paired-end reads, set to 100 bp in this example;
  • fq1 and fq2 are the output files, which hold the simulated sequencing data (i.e., short fragment sequence 1 and short fragment sequence 2, respectively) in fasta format.
  • simupars.dat is the system file of maq simulate software, which determines the length and quality value of the short segment sequence.
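The read count implied by these settings follows from depth × genome length ÷ read length. The short calculation below uses the figures given in this example (50X depth, a 22M reference, 100 bp reads); the function names are hypothetical, and whether the -N option of Maq counts individual reads or read pairs should be checked against the Maq documentation rather than taken from this sketch.

```python
def reads_for_depth(genome_length: int, depth: int, read_length: int) -> int:
    """Number of reads needed so that read bases total depth x genome length."""
    return (genome_length * depth) // read_length


def pairs_for_depth(genome_length: int, depth: int, read_length: int) -> int:
    """Number of read pairs for paired-end data (two reads per pair)."""
    return reads_for_depth(genome_length, depth, read_length) // 2


genome_length = 22_000_000  # 22M reference, as stated in the example
depth = 50                  # 50X simulated sequencing depth
read_length = 100           # 100 bp short fragment sequences

print(reads_for_depth(genome_length, depth, read_length))
print(pairs_for_depth(genome_length, depth, read_length))
```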
  • Various common short-sequence alignment software can be used to align these sequences to the reference genome sequence of the corresponding species. The lengths of the sequenced fragments implied by the alignment should be substantially the same, within a certain floating range (the floating range can be set as required, for example 10% up or down). Short fragment sequences obtained from sequenced fragments whose length lies within the normal range are called normal short sequences; the others are called abnormal short sequences. The minimum length of an aligned short fragment sequence is 40 bp, and the maximum number of mismatches allowed per short sequence during alignment is kept as small as possible to ensure accurate alignment.
  • The software used for the alignment is soap2, and the following parameters need to be set when performing the alignment: -P, -a, -b, -D, -o, -2, -u, -m, -x, -s, -1, -v.
  • The -P parameter indicates the memory required for the script to run;
  • the -a parameter specifies the input fq1 file from paired-end sequencing (the file containing short fragment sequence 1);
  • the -b parameter specifies the input fq2 file from paired-end sequencing (the file containing short fragment sequence 2);
  • the -D parameter specifies the reference genome sequence, input in fasta file format (in a fasta sequence file, the first line is any text description beginning with ">" or a semicolon ";" and serves as the sequence tag; the sequence itself starts on the second line, and only the specified nucleotide or amino acid coding symbols are allowed);
  • the -o parameter specifies the output of the paired short fragment sequences aligned to the reference genome; the output file is suffixed with .soap;
  • the -2 parameter specifies the output of pairs in which only one of the paired short fragment sequences aligned; the output file is suffixed with .single;
  • the -u parameter specifies the output of the paired short fragment sequences that did not align to the reference genome sequence; the output file is suffixed with .unmap;
  • the -t parameter is not set, so that the original ID of each short fragment sequence is preserved;
  • the -m and -x parameters give the floating range of the insert: -m is the lower floating limit of the sequenced fragment, i.e., the negative percentage × the sequenced fragment length, and -x is the upper floating limit of the sequenced fragment, i.e., the positive percentage × the sequenced fragment length.
  • the floating range of the sequenced fragments is relaxed, and the -m, -X parameters are respectively set to the length of the sequencing fragment ⁇ 0.88 X sequencing fragment length;
  • The -s parameter is the minimum alignment length, set to 40; the -1 parameter is the length of the seed sequence used for the initial alignment (because the error rate at the 3' end of a long fragment sequence is high, a sequence of a certain length starting from the 5' end is used as the seed sequence), set to 32; the -v parameter indicates the maximum number of mismatches allowed in a short fragment sequence during alignment, which is set as small as possible in this embodiment to ensure accurate alignment. In addition, the soap parameter settings need to be kept consistent.
  • The abscissa "insert size (kb)" indicates the length of the inserted fragment.
  • The ordinate "Uniq PE Reads" indicates the number of uniquely aligned paired-end reads.
  • The data are used to evaluate the library insert size.
  • the results show that the insert size is normal and the fluctuation range is within the acceptable range.
  • the N50 of the simulated assembly results of the Drosophila genome was increased from 0.32M to 1.48M using sequence information located on different assembled segments of the reference genome sequence for genomic assembly.
  • the Yunling Black Goat genome is randomly interrupted.
  • the high-throughput sequencing technology can be Illumina GA sequencing technology or other existing high-throughput sequencing technologies.
  • Bioinformatics methods were used to remove the linker sequences introduced during sequencing and the data with poor end quality, and repeatedly detected sequences were then removed, finally yielding 2,611,182 pairs of sequences with unique characteristics.
  • Among the sequences with unique characteristics, a total of 1,589,054 pairs with unique matching sites were mapped to the same scaffold (an assembled fragment of the reference genomic sequence).
  • The number of pairs mapped to the same scaffold at a distance of less than 500 bp is 338,255, and the number of locating to the same scaffold is more than 232,544 pairs, of which 206,697 pairs are 30.
  • the results showed that the insert size was normal and the fluctuation range was within an acceptable range.
  • A total of 18,255 pairs were mapped to different scaffolds. Using these 18,255 pairs to assist genome assembly, the N50 of the Yunling black goat could be increased from 2.2M to 3.1M.
  • The polar bear genomic DNA is randomly fragmented, ensuring that the fragments are not smaller than 36 kb, and the fosmid library of the polar bear is obtained through separation, circularization and amplification. Then, using next-generation sequencing technology, 14.4M pairs of raw short sequences can be obtained.
  • high-throughput sequencing technology can be Illumina GA sequencing technology or other existing high-throughput sequencing technologies.
  • Bioinformatics methods were used to remove the linker sequences and the data with poor end quality, and repeatedly detected sequences were then removed, resulting in 15,225,082 pairs of sequences.
  • Among the 15,225,082 pairs of sequences, 2,865,235 pairs had unique matching sites.
  • Pairs at a distance of less than 500 bp number 209,600; pairs located on the same scaffold at a distance greater than 10 kb number 531,028, of which 520,897 pairs (98.09%) lie between 30 kb and 50 kb; 185,888 pairs are located on different scaffolds.
  • Using these 185,888 pairs to assist genome assembly, the N50 can be increased from 2.3M to 6.5M.
  • the apparatus 10 may include: a sequence filtering module 11, a sequence comparison module 12, a sequence classification module 13, a sequence length statistics module 14, and a sequence assembly module 15.
  • The sequence filtering module 11 is configured to filter the short fragment sequences obtained by sequencing the ends of the long insert library to remove unqualified sequences, where the unqualified sequences include at least one selected from the group consisting of: exogenous sequences, short fragment sequences in which the proportion of N bases reaches a predetermined ratio, and short fragment sequences containing a polyA structure.
  • The sequence alignment module 12 is connected to the sequence filtering module 11 and is used to align the filtered short fragment sequences to the reference genome sequence.
  • The sequence classification module 13 is connected to the sequence alignment module 12 and is used to divide the aligned paired short fragment sequences, according to the alignment result, into soap reads sequences, single reads sequences and unmap reads sequences.
  • Soap reads sequences are short sequences that exist in pairs and both align to the same assembled fragment of the reference genome sequence; single reads sequences are short sequences of a pair in which only one of the two aligns, or in which the two align to different assembled fragments of the reference genome sequence; unmap reads are short sequences of a pair in which neither of the two aligns to an assembled fragment of the reference genome sequence.
  • The sequence length statistics module 14 is connected to the sequence classification module 13 and is used to calculate, using the soap reads sequences, the distance between the two short fragment sequences of each aligned pair on the same fragment of the reference genome sequence, and to compile the distribution of these distances over all aligned pairs on the reference genome sequence.
  • The sequence assembly module 15 is connected to the sequence classification module 13 and the sequence length statistics module 14 and is used, when the distance distribution satisfies the threshold requirement, to connect adjacent genomic sequence fragments using the single reads sequences that align uniquely to different fragments of the reference genome sequence, according to the intrinsic sequence length and spatial relationship of the sequencing library, so as to assemble the genomic sequence.
  • With this apparatus, the aforementioned method of assembling a genomic sequence can be carried out efficiently. Because a long insert library is used, this embodiment can exploit the relationships between more distant sequences contained in the sequencing data to construct longer genomic sequence fragments, which in turn improves the efficiency of genome assembly.
  • The soap reads sequences may include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence.
  • The distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence can then be calculated using only the soap reads sequences uniquely aligned as a pair to that fragment.
  • This calculation can be performed by the sequence length statistics module 14.
  • Because this embodiment calculates the pair distances from soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence, it can accurately assess the quality of the long-insert library, and a high-quality library favors accurate assembly.
  • The device 20 further includes, in addition to the device 10 shown in FIG. 6:
  • A sequence truncation module 21, connected to the sequence filtering module 11 and the sequence alignment module 12, configured to truncate the filtered short-fragment sequences to short-fragment sequences of a set length before sequence alignment, where the minimum alignment length is 40 bp.
  • In this embodiment the length of the fragments to be aligned is constrained: the sequences to be aligned must fall within the set length range, which safeguards the accuracy and efficiency of the alignment.
  • Figure 8 is a schematic view showing the structure of still another embodiment of the device for assembling a genome sequence of the present invention.
  • The apparatus 30 for assembling a genome sequence further includes: a sequence receiving module 31, connected to the sequence filtering module 11, for receiving the sequences obtained by end sequencing of the long-insert library.
  • The method and apparatus for assembling a genome sequence according to an embodiment of the present invention can be effectively used to assemble a genome sequence.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Description

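Method and device for assembling genome sequence
Priority Information
This application claims priority to and the benefit of Patent Application No. 201110049885.0, filed with the State Intellectual Property Office of China on March 2, 2011, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of bioinformatics and, in particular, to a method and device for assembling a genome sequence.
Background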
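With the advent of next-generation sequencing technologies such as 454 (Roche), Solexa (Illumina) and SOLiD (ABI), sequencing throughput has risen rapidly while sequencing cost has fallen sharply. Next-generation sequencing has greatly advanced genomics. Whole-genome sequences of many species have been published, including James Watson's personal genome, the first Asian genome, and the genomes of the giant panda and the cucumber.
Each run of a next-generation sequencer produces millions of short-fragment sequences. Fully sequencing a genome usually requires many such runs, which means that obtaining a complete whole-genome map requires mapping, positioning and assembling millions or even billions of short fragment sequences.
Current means of assembling genome sequences therefore still need improvement.
Summary of the Invention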
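The present invention was completed on the basis of the following findings of the inventors:
At present, sequencing with next-generation technologies produces only short-fragment sequences about 25 bp to 100 bp long, each of which is one part of a larger fragment of the sample under test. Assembling the massive volume of short-fragment sequence data back into large-fragment data poses a major challenge for downstream information analysis. In the prior art, because the fragments produced by sequencing are so short, restoring the large-fragment data requires a very large amount of computation.
Meanwhile, the fragment length N50, one of the measures of genome map quality, is also limited by the insert length of the libraries that can be constructed experimentally. (N50 is obtained by sorting all assembled sequences from longest to shortest and summing their lengths in that order; it is the length of the assembled sequence at which the running sum reaches 50% of the total length of all assembled sequences. For a detailed description of N50 see Miller et al. 2010. Assembly algorithms for next generation sequencing data. Genomics 95(6): 315-327, which is incorporated herein by reference.)
The present invention aims to solve at least one of the technical problems in the prior art.
To this end, the present invention provides a method and device for assembling a genome sequence that can use the short-fragment sequences obtained by end sequencing of a long-insert library to assemble the genome sequence, thereby improving assembly efficiency and results.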
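Because N50 is the yardstick used for every result reported later in this description, the definition given above can be made concrete with a minimal Python sketch. The function name `n50` and the toy scaffold lengths are illustrative assumptions, not part of the patent.

```python
def n50(lengths):
    """Return the N50 of a collection of assembled sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)   # longest sequence first
    half_total = sum(ordered) / 2             # 50% of the total assembled length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # running sum reaches half the total
            return length


# Hypothetical toy assembly: the total is 300, so the target is 150.
# 100 + 80 = 180 reaches the target, so the N50 is 80.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))  # 80
```

Joining two scaffolds into one longer scaffold moves length toward the front of the sorted list, which is why linking information from long-insert pairs can raise the N50.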
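According to one aspect, the present invention provides a method of assembling a genome sequence. According to an embodiment of the invention, the method comprises: filtering the short-fragment sequences output by end sequencing of a long-insert library to remove unqualified sequences; aligning the filtered short-fragment sequences to a reference genome sequence; according to the alignment result, dividing the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and counting the number of each class of sequence; using the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compiling the distribution of these distances over all such pairs on the reference genome sequence; and, when the distance distribution satisfies a threshold requirement, assembling the genome sequence using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence. This can improve the efficiency and results of genome sequence assembly.
According to embodiments of the invention, the method of assembling a genome sequence may also have the following additional technical features:
According to one embodiment of the invention, the filtered short-fragment sequences include paired short-fragment sequences. This can further improve the efficiency of genome assembly.
According to one embodiment of the invention, before the filtered short-fragment sequences are aligned to the reference genome sequence, the method further comprises: truncating the filtered short-fragment sequences to short-fragment sequences of a set length. This can further improve the efficiency of genome assembly.
According to one embodiment of the invention, the unqualified sequences include at least one selected from the following: exogenous sequences, short-fragment sequences in which the number of N bases reaches a predetermined proportion, short-fragment sequences containing a polyA structure, short-fragment sequences in which the number of low-quality bases reaches a predetermined number, adapter-contaminated short-fragment sequences, short-fragment sequences whose paired reads have an overlapping region, and duplicate short-fragment sequences. This can further improve the efficiency of genome assembly.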
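Several of the exclusion criteria listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 10% N-base proportion and the Q20 cutoff of 0.7 are the example values given later in the detailed description, while the `Read` type, the polyA run length `POLYA_RUN` and all function names are hypothetical. Adapter contamination and mate-overlap checks need an alignment step and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    seq: str       # base calls
    quals: tuple   # per-base quality values, same length as seq


MAX_N_FRACTION = 0.10   # example value from the description: N bases reach 10%
MIN_Q20 = 0.7           # example value from the description: Q20 <= 0.7 is rejected
POLYA_RUN = 20          # assumption: the patent does not give a polyA length


def q20(read):
    """Proportion of bases whose quality value is greater than 20."""
    return sum(q > 20 for q in read.quals) / len(read.quals)


def is_unqualified(read):
    seq = read.seq.upper()
    if not seq:
        return True
    if seq.count("N") / len(seq) >= MAX_N_FRACTION:
        return True                      # too many undetermined bases
    if "A" * POLYA_RUN in seq:
        return True                      # contains a polyA stretch
    return q20(read) <= MIN_Q20          # too many low-quality bases


def filter_pairs(pairs):
    """Keep pairs whose mates both pass and whose sequences were not seen before."""
    seen = set()
    kept = []
    for r1, r2 in pairs:
        if is_unqualified(r1) or is_unqualified(r2):
            continue
        key = (r1.seq, r2.seq)
        if key in seen:                  # identical pair counts as a duplicate
            continue
        seen.add(key)
        kept.append((r1, r2))
    return kept
```

A pair is dropped as a whole when either mate fails, because the later steps rely on both mates of a pair being available.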
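According to one embodiment of the method of the invention, the soap reads sequences include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence, and the step of using the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence further comprises: using the soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence to calculate that distance. This can further improve the efficiency of genome assembly.
According to one embodiment of the method of the invention, the method further comprises: constructing a long-insert library; and sequencing the ends of the long-insert library to obtain the output short-fragment sequences. This helps assemble longer genome sequence fragments.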
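According to another aspect, the present invention provides a device for assembling a genome sequence. According to an embodiment of the invention, the device comprises: a sequence filtering module for filtering the short-fragment sequences output by end sequencing of a long-insert library, so as to remove unqualified sequences; a sequence alignment module, connected to the sequence filtering module, for aligning the filtered short-fragment sequences to a reference genome sequence, wherein the filtered short-fragment sequences include paired short-fragment sequences; a sequence classification module, connected to the sequence alignment module, for dividing, according to the alignment result, the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and counting the number of each class of sequence; a sequence length statistics module, connected to the sequence classification module, for using the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compiling the distribution of these distances over all such pairs on the reference genome sequence; and a sequence assembly module, connected to both the sequence classification module and the sequence length statistics module, for assembling the genome sequence, when the distance distribution satisfies a threshold requirement, using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence. This device can effectively carry out the aforementioned method of assembling a genome, so that the short-fragment sequences obtained by end sequencing of a long-insert library can be used to assemble the genome sequence, improving assembly efficiency and results.
According to embodiments of the invention, the device for assembling a genome sequence may also have the following additional technical features:
According to one embodiment of the invention, the device further comprises: a sequence truncation module, connected to both the sequence filtering module and the sequence alignment module, for truncating the filtered short-fragment sequences to short-fragment sequences of a set length before they are aligned to the reference genome sequence.
According to another embodiment of the device of the invention, the unqualified sequences include at least one selected from the following: exogenous sequences, short-fragment sequences in which the number of N bases reaches a predetermined proportion, short-fragment sequences containing a polyA structure, short-fragment sequences in which the number of low-quality bases reaches a predetermined number, adapter-contaminated short-fragment sequences, short-fragment sequences whose paired reads have an overlapping region, and duplicate short-fragment sequences. This can further improve the efficiency of genome assembly.
According to yet another embodiment of the device of the invention, the soap reads sequences include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence, wherein the soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence are further used to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence. This makes it possible to assess library quality and further improve the efficiency of genome assembly.
According to one embodiment of the device of the invention, the device further comprises: a sequence receiving module, connected to the sequence filtering module, for receiving the sequences obtained by end sequencing of the long-insert library. This can further improve the efficiency of genome assembly.
Because the method and device for assembling a genome sequence according to embodiments of the invention sequence the ends of a long-insert library, they can exploit the longer-range sequence relationships contained in the sequencing data, relative to the prior art, to build longer genome sequence fragments, which improves the genome assembly result.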
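Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the drawings, in which:
FIG. 1 is a schematic flowchart of one embodiment of the method of assembling a genome sequence of the invention;
FIG. 2 is a schematic flowchart of another embodiment of the method of assembling a genome sequence of the invention;
FIG. 3 is a schematic flowchart of yet another embodiment of the method of assembling a genome sequence of the invention;
FIG. 4 is a schematic flowchart of still another embodiment of the method of assembling a genome sequence of the invention;
FIG. 5 is a schematic diagram of the library quality assessment in still another embodiment of the method of assembling a genome sequence of the invention;
FIG. 6 is a schematic structural diagram of one embodiment of the device for assembling a genome sequence of the invention;
FIG. 7 is a schematic structural diagram of another embodiment of the device for assembling a genome sequence of the invention; and
FIG. 8 is a schematic structural diagram of still another embodiment of the device for assembling a genome sequence of the invention.
Detailed Description of the Invention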
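Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and should not be construed as limiting it.
The method of assembling a genome sequence of the invention is first described in detail with reference to the drawings.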
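Referring to FIG. 1, according to an embodiment of the invention, the method of assembling a genome sequence may include the following steps:
S102, filter the short-fragment sequences output by end sequencing of a long-insert library to remove unqualified sequences.
The length of the "long insert" referred to in the present invention is not particularly limited and may be any insert length achievable in the prior art, for example up to at least 200 kb, for example 40 kb-200 kb, for example about 100 kb-200 kb. Those skilled in the art can readily obtain such long inserts with existing vectors. For example, fosmids and bacterial artificial chromosomes (BACs) are large-fragment clones available for genome research; a BAC can usually hold an insert of about 100 kb-200 kb and a fosmid an insert of about 40 kb. BACs and fosmids not only carry long inserts but are also very stable, so they are important tools in genomics research and play important roles in map-based gene cloning, gene analysis, structural variation and genome assembly.
According to embodiments of the invention, the types of unqualified sequence to be removed are not particularly limited. According to some examples of the invention, at least one selected from the following may be removed: exogenous sequences (for example, exogenous sequences introduced by the experiment, such as various adapter sequences); short-fragment sequences in which the number of N bases reaches a predetermined proportion (for example, at least 10%); short-fragment sequences containing a polyA structure; short-fragment sequences in which the number of low-quality bases reaches a predetermined number (a base with a quality value of 20 or less reported at sequencing is a low-quality base; these are sequences whose Q20, the proportion of bases with a quality value above 20 among all bases, is at most 0.7); adapter-contaminated short-fragment sequences (for example, at least 10 bp aligned to the adapter sequence with no more than 3 mismatches); short-fragment sequences whose paired reads have an overlapping region; and duplicate short-fragment sequences (paired short-fragment sequences that are completely identical are defined as duplicates). The term "paired short-fragment sequences" as used herein means the two facing sequences obtained by sequencing inward from each end of the same fragment;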
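S104, align the filtered short-fragment sequences to the reference genome sequence. According to embodiments of the invention, the means of alignment is not particularly limited; for example, known methods such as soap and bwa and their associated software may be used. According to embodiments of the invention, the filtered short-fragment sequences include paired short-fragment sequences;
S106, according to the alignment result, divide the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and count the number of each class of sequence;
In the present invention, the term "soap reads sequence" refers to short sequences that exist in pairs and both align to the same assembled fragment of the reference genome sequence. The term "single reads sequence" refers to paired short sequences of which only one aligns, to a different assembled fragment of the reference genome sequence. The term "unmap reads" refers to paired short sequences neither of which aligns to an assembled fragment of the reference genome sequence;
S108, because soap reads sequences exist in pairs and both align to the same assembled fragment of the reference genome sequence, the soap reads sequences can be used to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence (that is, to calculate the length of the soap reads sequences), and to compile the distribution of these distances over all such pairs on the reference genome sequence;
S110, when the distance distribution satisfies the threshold requirement, the genome sequence can be assembled using the single reads sequences uniquely aligned as pairs to different assembled fragments of the reference genome sequence. According to embodiments of the invention, the specific value of the threshold is not particularly limited and can be obtained by those skilled in the art through a limited number of experiments for the specific sequencing environment. For example, when the library is built with fosmids, the threshold is that more than 85% of the sequences have a distance between 30 kb and 50 kb;
Specifically, the single reads sequences uniquely aligned as pairs to different assembled fragments of the reference genome sequence can be used to connect adjacent genome sequence fragments according to the inherent sequence length and spatial relationships of the sequencing library, so as to improve the genome assembly result.
Because this embodiment sequences the ends of a long-insert library, it can exploit the longer-range sequence relationships contained in the sequencing data, relative to the prior art, to build longer genome sequence fragments, which improves the genome assembly result.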
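The classification in S106 and the linking idea in S110 can be illustrated with a short Python sketch. It rests on assumptions that the patent does not spell out: each mate is represented by a hypothetical `(scaffold, position)` tuple for a unique alignment or `None` if it did not align, and every pair that is neither soap nor unmap is counted as single, with only pairs whose mates lie on two different scaffolds contributing linking evidence.

```python
from collections import Counter


def classify_pair(hit1, hit2):
    """Classify one read pair from the unique alignment of its two mates."""
    if hit1 is None and hit2 is None:
        return "unmap"                      # neither mate aligned
    if hit1 is not None and hit2 is not None and hit1[0] == hit2[0]:
        return "soap"                       # both mates on the same scaffold
    return "single"                         # everything else (assumption)


def scaffold_links(pairs):
    """Count pairs whose two mates aligned to two different scaffolds."""
    links = Counter()
    for hit1, hit2 in pairs:
        if hit1 is not None and hit2 is not None and hit1[0] != hit2[0]:
            links[tuple(sorted((hit1[0], hit2[0])))] += 1
    return links


# Toy alignment results with made-up coordinates, for illustration only.
pairs = [
    (("s1", 100), ("s1", 40_100)),   # same scaffold
    (("s1", 500), ("s2", 900)),      # different scaffolds
    (("s2", 10), ("s1", 77)),        # different scaffolds
    (None, ("s3", 5)),               # only one mate aligned
    (None, None),                    # nothing aligned
]
class_counts = Counter(classify_pair(a, b) for a, b in pairs)
links = scaffold_links(pairs)
print(class_counts)
print(links)
```

In a real scaffolder the link counts would be combined with the library's insert-size distribution to decide the order, orientation and gap size between scaffolds; the sketch stops at counting the evidence.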
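A method of assembling a genome sequence according to another embodiment of the invention is described below with reference to FIG. 2.
As shown in FIG. 2, the method of assembling a genome sequence according to an embodiment of the invention may include the following steps:
S202, filter the short-fragment sequences output by end sequencing of a long-insert library to remove unqualified sequences. Specifically, the sequenced short-fragment sequences may be aligned against the exogenous sequences introduced by the experiment (for example, various adapter sequences); a sequence that contains an exogenous sequence is deemed unqualified and is removed. In addition, unqualified sequences may also include at least one of the following: short-fragment sequences in which the number of N bases reaches a predetermined proportion; short-fragment sequences containing a polyA structure; short-fragment sequences in which the number of low-quality bases reaches a certain level (for example, 40 bases); adapter-contaminated short-fragment sequences (for example, at least 10 bp aligned to the adapter sequence with no more than 3 mismatches); short-fragment sequences whose paired reads overlap (for example, an overlapping region of at least 10 bp with a mismatch rate below 10%); and duplicate short-fragment sequences (paired short-fragment sequences that are completely identical are defined as duplicates). Finally, short-fragment sequences whose head or tail is of poor quality are trimmed directly;
S204, truncate the filtered short-fragment sequences to short-fragment sequences of a set length;
Specifically, to improve alignment accuracy, the fragments being aligned should be of essentially the same length, with a certain tolerance allowed (the tolerance can be set as needed). Short-fragment sequences obtained by sequencing fragments whose length lies within the normal range are called normal short sequences; otherwise they are called abnormal short sequences. According to embodiments of the invention, the set length is at least 40 bp (if the sequences being aligned are too short, alignment efficiency drops on the one hand and the N50 is lowered on the other), and the maximum number of mismatches allowed on one short sequence during alignment should be as small as possible to ensure alignment accuracy;
S206, align the filtered short-fragment sequences to the reference genome sequence. According to embodiments of the invention, the means of alignment is not particularly limited; for example, known methods such as soap and bwa and their associated software may be used. According to embodiments of the invention, the filtered short-fragment sequences include paired short-fragment sequences;
S208, according to the alignment result, divide the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and count the number of each class of sequence;
S210, according to the alignment result, extract the single reads of which only one short sequence aligns to the reference genome sequence and which align to the reference genome sequence only once, to ensure the specificity of the alignment result;
S212, use the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compile the distribution of these distances over all such pairs on the reference genome sequence;
S214, when the distance distribution satisfies the threshold requirement, assemble the genome sequence using the single reads sequences extracted in S210 that are uniquely aligned as pairs to different fragments of the reference genome sequence. According to embodiments of the invention, the specific value of the threshold is not particularly limited and can be obtained by those skilled in the art through a limited number of experiments for the specific sequencing environment. For example, when the library is built with fosmids, the threshold is that more than 85% of the sequences have a distance between 30 kb and 50 kb.
In this embodiment the length of the fragments to be aligned is constrained: the sequences to be aligned must fall within the set length range, which safeguards the accuracy and efficiency of the alignment.
A method of assembling a genome sequence according to yet another embodiment of the invention is described below with reference to FIG. 3. As shown in FIG. 3, the method may include the following steps:
S302, filter the short-fragment sequences output by end sequencing of a long-insert library to remove unqualified sequences;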
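S304, align the filtered short-fragment sequences to the reference genome sequence;
S306, according to the alignment result, divide the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and count the number of each class of sequence, where the soap reads sequences may in turn include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence;
S308, use the soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compile the distribution of these distances over all such pairs on the reference genome sequence;
S310, when the distance distribution satisfies the threshold requirement, assemble the genome sequence using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence.
Because this embodiment calculates the pair distances from soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence, it can accurately assess the quality of the long-insert library and thereby improve the accuracy of genome sequence assembly.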
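The library quality check described above (pair distances from uniquely aligned soap reads, then a threshold on their distribution) might be sketched in Python as follows. The 30 kb-50 kb window and the 85% threshold are the fosmid example from this description; the function names, the `exclude_below` parameter and the toy distances are assumptions. Excluding short-distance pairs before taking the proportion mirrors how the worked examples report the percentage among pairs more than 10 kb apart, but the patent does not fix the denominator.

```python
def pair_distance(pos1, pos2):
    """Distance between two mates aligned to the same scaffold (simplified)."""
    return abs(pos2 - pos1)


def fraction_in_window(distances, low=30_000, high=50_000, exclude_below=0):
    """Share of pair distances inside [low, high], ignoring very short pairs."""
    considered = [d for d in distances if d >= exclude_below]
    if not considered:
        return 0.0
    inside = sum(1 for d in considered if low <= d <= high)
    return inside / len(considered)


def library_passes(distances, threshold=0.85, **window):
    """True when more than `threshold` of the distances fall in the window."""
    return fraction_in_window(distances, **window) > threshold


# Toy distances in bp (made up for illustration).
distances = [100, 35_000, 40_000, 45_000, 38_000, 42_000, 41_000, 39_000, 12_000]
print(fraction_in_window(distances, exclude_below=10_000))  # 0.875
print(library_passes(distances, exclude_below=10_000))      # True
```

Only when this check passes would the pairs spanning different scaffolds be trusted as long-range links for assembly.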
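A method of assembling a genome sequence according to still another embodiment of the invention is described below with reference to FIG. 4.
As shown in FIG. 4, the method of assembling a genome sequence according to an embodiment of the invention may include the following steps:
S402, construct a long-insert library. According to embodiments of the invention, the method of constructing the long-insert library is not particularly limited. According to a specific embodiment of the invention, the long-insert library may be constructed by the following steps:
(1) Random fragmentation:
Randomly fragment the vector into which the DNA to be sequenced has been inserted, to obtain random fragments longer than the vector, then end-repair the resulting random fragments to blunt their ends. The vector is a plasmid, specifically a fosmid plasmid, a BAC plasmid, a cosmid plasmid or the like;
(2) Separation:
Separate the end-repaired random fragments from (1) to obtain random fragments longer than the vector;
(3) Circularization:
Self-ligate the random fragments obtained in (2) to form circular molecules, then remove the fragments that did not self-ligate;
(4) Amplification:
Design primers based on the vector sequence and amplify the nucleic acid fragments of the gene under test retained in the circular molecules, that is, the end sequences of the nucleic acid fragment under test described in (1);
S404, sequence the ends of the long-insert library;
Specifically, end-repair the amplification products obtained in (4) above to blunt their ends, add sequencing adapters, and sequence on a chosen next-generation sequencing platform. To ensure the required genome coverage, the total number of bases sequenced must be more than 3 times the genome size;
S406, filter the short-fragment sequences output by end sequencing of the long-insert library to remove unqualified sequences;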
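S408, align the filtered short-fragment sequences to the reference genome sequence;
S410, according to the alignment result, divide the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and count the number of each class of sequence;
S412, use the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compile the distribution of these distances over all such pairs on the reference genome sequence;
S414, when the distance distribution satisfies the threshold requirement, assemble the genome sequence using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence.
This embodiment combines the construction of long-insert libraries (for example, fosmid, BAC and the like) with next-generation sequencing. It exploits the speed and low cost of next-generation sequencing in building genomes, the much greater insert length of fosmid and BAC libraries compared with ordinary library construction methods, and the longer-range sequence topology contained in the sequencing data to build longer genome sequence fragments, significantly improving the quality of the genome map.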
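In still another embodiment of the method of assembling a genome sequence of the invention, the X chromosome of the Drosophila genome is taken as an example. The reference genome sequence comes from The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/), with genome identifier gi|116010291|ref|NC_004354.3| Drosophila melanogaster chromosome X, complete sequence.
The Maq simulate software can be used to simulate sequencing of the Drosophila X chromosome, and the result is used as the sequencing data. The following parameters need to be set for Maq simulate: -d, -N, -1, -2, fq1, fq2 and simupars.dat. Each is explained in detail below. The -d parameter is the length of the sequenced fragment, set to 500, 2000, 5000 and 40000 respectively. The -N parameter is the total number of short-fragment sequences to be obtained by sequencing; it is determined from the sequencing depth, one of the indicators of sequencing quality, which is the ratio of the total number of bases sequenced (bp) to the genome size, using the formula N = sequencing depth × total reference genome length / (2 × read length). In this embodiment the simulated sequencing depth is 50× (that is, 50 times the length of the reference genome sequence), the total reference genome length is 22M, and the short-fragment sequence length is set to 100 bp. The -1 and -2 parameters are the lengths of paired-end short-fragment sequence 1 and short-fragment sequence 2 used for alignment, set to 100 bp in this example. fq1 and fq2 are the output files: the simulated sequencing data (that is, short-fragment sequence 1 and short-fragment sequence 2) are stored in fasta format in the files fa1 and fa2 respectively. simupars.dat is a system file of the maq simulate software that determines the length and quality values of the short-fragment sequences.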
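The value passed as -N follows directly from the formula above. As a small worked example in Python (the function name is an assumption; the numbers are those stated for this embodiment):

```python
def read_pairs_needed(depth, genome_length, read_length):
    """N = sequencing depth x reference length / (2 x read length)."""
    return depth * genome_length // (2 * read_length)


# 50x depth, 22M reference, 100 bp reads, as stated for this embodiment.
n_pairs = read_pairs_needed(depth=50, genome_length=22_000_000, read_length=100)
print(n_pairs)  # 5500000
```

The factor of 2 in the denominator reflects that each pair contributes two reads of the given length.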
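In this embodiment, various common short-read alignment programs (such as soap and bwa) can be used to align these sequences by similarity to the reference genome sequence of the corresponding species. The sequenced fragments being aligned should be of essentially the same length, with a certain tolerance allowed (the tolerance can be set as needed, for example plus or minus 10%). Short-fragment sequences obtained by sequencing fragments whose length lies within the normal range are called normal short sequences; otherwise they are called abnormal short sequences. The minimum length of the short-fragment sequences being aligned is 40 bp, and the maximum number of mismatches allowed on one short sequence during alignment should be as small as possible to ensure accurate alignment.
In this embodiment the software used for alignment is soap2, and the following parameters need to be set for the alignment: -P, -a, -b, -D, -o, -2, -u, -m, -x, -s, -l, -v.
Each parameter is explained in detail below. The -P parameter is the memory required to run the script. The -a parameter gives, for paired-end sequencing, the input fq1 file obtained by resequencing (the file containing short-fragment sequence 1). The -b parameter gives, for paired-end sequencing, the input fq2 file obtained by resequencing (the file containing short-fragment sequence 2). The -D parameter gives the reference genome sequence, input in fasta format (the first line of a fasta sequence file is free text beginning with a greater-than sign ">" or a semicolon ";" and labels the sequence; the sequence itself starts on the second line and may use only the established nucleotide or amino acid codes). There are three output parameters: the -o parameter outputs the paired short-fragment sequences aligned to the reference genome, in a file with the suffix .soap; the -2 parameter outputs the pairs of which only one short-fragment sequence aligns to the reference genome sequence, in a file with the suffix .single; the -u parameter outputs the paired short-fragment sequences that did not align to the reference genome sequence, in a file with the suffix .unmap. The -t parameter is not set, so that the original IDs of the short-fragment sequences are kept. The -m and -x parameters give the tolerance range of the insert: -m is the lower bound for the sequenced fragment, that is, a negative percentage × the sequenced fragment length, and -x is the upper bound, that is, a positive percentage × the sequenced fragment length. In this embodiment, to find as many qualifying short-fragment sequences as possible, the tolerance range is widened and -m and -x are set to the sequenced fragment length ± 0.88 × the sequenced fragment length. The -s parameter is the minimum alignment length, set to 40. The -l parameter is the length of the seed sequence for the initial alignment (the error rate is higher at the 3' end of a sequence, so a sequence of a set length starting from the 5' end is used as the seed), set to 32. The -v parameter is the maximum number of mismatches allowed on one short-fragment sequence during alignment; in this embodiment it should be set as small as possible to ensure accurate alignment. In addition, care must be taken to keep the soap parameter settings consistent.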
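To make the -m and -x setting concrete, the following Python sketch computes the bounds "fragment length ± 0.88 × fragment length" for the four fragment lengths simulated in this embodiment. The function name and the rounding to whole base pairs are assumptions; the sketch only does the arithmetic and does not invoke soap2.

```python
def insert_bounds(insert_size, tolerance=0.88):
    """Lower and upper insert-size bounds: size -/+ tolerance * size."""
    delta = round(insert_size * tolerance)   # rounded to whole base pairs
    return insert_size - delta, insert_size + delta


for size in (500, 2000, 5000, 40000):
    low, high = insert_bounds(size)
    print(size, low, high)
# 500 60 940
# 2000 240 3760
# 5000 600 9400
# 40000 4800 75200
```

With such a wide window nearly every same-scaffold pair is reported, and the quality of the library is then judged afterwards from the distance distribution rather than by the aligner.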
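As shown in FIG. 5, the horizontal axis "insert size (kb)" is the insert length and the vertical axis "Uniq PE Reads" is the unique paired-end sequencing results. Analysis of the library insert size with these data shows that the insert size is normal and its variation is within an acceptable range. Using the sequence information mapped to different assembled fragments of the reference genome sequence for assisted genome assembly raised the N50 of the simulated Drosophila genome assembly from 0.32M to 1.48M.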
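In still another embodiment of the method of assembling a genome sequence of the invention, the genomic DNA of the Yunling black goat is first randomly fragmented, ensuring that the fragments are no smaller than 36 kb, and a Yunling black goat fosmid library is obtained through separation, circularization and amplification. Next-generation sequencing is then used to obtain 14.4M pairs of raw short reads. The high-throughput sequencing technology may be Illumina GA sequencing or any other existing high-throughput sequencing technology.
Next, bioinformatic methods are used to remove the adapter sequences introduced at sequencing and the data with poor end quality, and duplicate sequences are then removed, finally yielding 2,611,182 pairs of unique sequences. Among these, 1,589,054 pairs have unique matching sites mapped to the same scaffold (an assembled fragment of the reference genome sequence). Of those, 338,255 pairs map to the same scaffold less than 500 bp apart and 232,544 pairs map to the same scaffold more than 10 kb apart, of which 206,697 pairs are 30 kb-50 kb apart, accounting for 86.42%. Analysis of the library insert size with these data shows that the insert size is normal and its variation is within an acceptable range. 18,255 pairs map to different scaffolds; using these 18,255 pairs for assisted genome assembly can raise the N50 of the Yunling black goat assembly from 2.2M to 3.1M.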
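In still another embodiment of the method of assembling a genome sequence of the invention, polar bear genomic DNA is first randomly fragmented, ensuring that the fragments are no smaller than 36 kb, and a polar bear fosmid library is obtained through separation, circularization and amplification. Next-generation sequencing is then used to obtain 14.4M pairs of raw short reads. The high-throughput sequencing technology may be Illumina GA sequencing or any other existing high-throughput sequencing technology.
Next, bioinformatic methods are used to remove the adapter sequences introduced at sequencing and the data with poor end quality, and duplicate sequences are then removed, finally yielding 15,225,082 pairs of sequences. Among these 15,225,082 pairs, 2,865,235 pairs have unique matching sites mapped to the same scaffold. Of those, 209,600 pairs are less than 500 bp apart and 531,028 pairs map to the same scaffold more than 10 kb apart, of which 520,897 pairs are 30 kb-50 kb apart, accounting for 98.09%. 185,888 pairs map to different scaffolds; using these 185,888 pairs for assisted genome assembly can raise the N50 from 2.3M to 6.5M.
A device for assembling a genome sequence according to an embodiment of the invention is described below with reference to FIG. 6. As shown in FIG. 6, the device 10 may include: a sequence filtering module 11, a sequence alignment module 12, a sequence classification module 13, a sequence length statistics module 14 and a sequence assembly module 15. According to embodiments of the invention, the sequence filtering module 11 is used to filter the short-fragment sequences output by end sequencing of a long-insert library so as to remove unqualified sequences, where the unqualified sequences include at least one selected from the following: exogenous sequences, short-fragment sequences in which the number of N bases reaches a predetermined proportion, short-fragment sequences containing a polyA structure, short-fragment sequences in which the number of low-quality bases reaches a predetermined number, adapter-contaminated short-fragment sequences, short-fragment sequences whose paired reads have an overlapping region, and duplicate short-fragment sequences. According to embodiments of the invention, the sequence alignment module 12 is connected to the sequence filtering module 11 and aligns the filtered short-fragment sequences to the reference genome sequence. According to embodiments of the invention, the sequence classification module 13 is connected to the sequence alignment module 12 and, according to the alignment result, divides the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences and counts the number of each class of sequence, where a soap reads sequence refers to short sequences that exist in pairs and both align to the same assembled fragment of the reference genome sequence; a single reads sequence refers to paired short sequences of which only one aligns, to a different assembled fragment of the reference genome sequence; and unmap reads refers to paired short sequences neither of which aligns to an assembled fragment of the reference genome sequence. According to embodiments of the invention, the sequence length statistics module 14 is connected to the sequence classification module 13 and uses the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compiles the distribution of these distances over all such pairs on the reference genome sequence. The sequence assembly module 15 is connected to the sequence classification module 13 and the sequence length statistics module 14; when the distance distribution satisfies the threshold requirement, it uses the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence to connect adjacent genome sequence fragments, according to the inherent sequence length and spatial relationships of the sequencing library, and thereby assembles the genome sequence.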
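With the device for assembling a genome sequence according to an embodiment of the invention, the aforementioned method of assembling a genome sequence can be carried out effectively. Because this embodiment uses a long-insert library, it can exploit the longer-range sequence relationships contained in the sequencing data, relative to the prior art, to build longer genome sequence fragments, which improves the genome assembly result.
According to embodiments of the invention, the soap reads sequences may include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence. The soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence can therefore be used to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence. According to embodiments of the invention, this calculation can be performed by the sequence length statistics module 14.
Because this embodiment calculates the pair distances from soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence, it can accurately assess the quality of the long-insert library, and a high-quality library favors accurate assembly.
A device for assembling a genome sequence according to another embodiment of the invention is described with reference to FIG. 7. As shown in FIG. 7, the device 20 further includes, in addition to the device 10 shown in FIG. 6:
A sequence truncation module 21, connected to the sequence filtering module 11 and the sequence alignment module 12, for truncating the filtered short-fragment sequences to short-fragment sequences of a set length before sequence alignment, where the minimum alignment length is 40 bp.
In this embodiment the length of the fragments to be aligned is constrained: the sequences to be aligned must fall within the set length range, which safeguards the accuracy and efficiency of the alignment.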
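The behavior of the truncation module might look like the following Python sketch. The patent states only that filtered reads are truncated to a set length and that the minimum alignment length is 40 bp; keeping the 5' prefix (on the grounds that the description notes a higher error rate toward the 3' end) and discarding reads shorter than the set length are both assumptions made here, as is the function name.

```python
MIN_ALIGN_LEN = 40   # minimum alignment length stated in the description


def truncate_reads(reads, set_length):
    """Cut every read to `set_length` bases, keeping the 5' prefix.

    Reads shorter than `set_length` are dropped (assumption: the patent
    does not say how such reads are handled).
    """
    if set_length < MIN_ALIGN_LEN:
        raise ValueError("set length must be at least 40 bp")
    return [seq[:set_length] for seq in reads if len(seq) >= set_length]


# Toy reads of 100, 50 and 30 bases; only the first two survive.
reads = ["A" * 100, "C" * 50, "G" * 30]
print([len(r) for r in truncate_reads(reads, 50)])  # [50, 50]
```

After this step every read entering the aligner has the same length, which is the uniformity the embodiment relies on.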
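FIG. 8 is a schematic structural diagram of still another embodiment of the device for assembling a genome sequence of the invention. As shown in FIG. 8, the device 30 for assembling a genome sequence further includes, in addition to the device 10 shown in FIG. 6: a sequence receiving module 31, connected to the sequence filtering module 11, for receiving the sequences obtained by end sequencing of the long-insert library.
It should be noted that the term "connected" as used herein is to be understood broadly: the connection may be direct or indirect, provided that the modules are functionally linked.
It should also be noted that several embodiments of the method and of the device for assembling a genome sequence have been described in the present invention. Those skilled in the art will understand that the technical features of a particular embodiment can be applied to other embodiments directly or after suitable adaptation. For brevity, the combinations of features of the embodiments are not described again here.
Industrial Applicability
The method and device for assembling a genome sequence according to embodiments of the invention can be effectively used to assemble genome sequences.
Although specific embodiments of the invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions can be made to those details, and all such changes fall within the scope of protection of the invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.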

Claims

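Claims
1. A method of assembling a genome sequence, characterized by comprising:
filtering the short-fragment sequences output by end sequencing of a long-insert library, so as to remove unqualified sequences;
aligning the filtered short-fragment sequences to a reference genome sequence, wherein the filtered short-fragment sequences include paired short-fragment sequences;
according to the alignment result, dividing the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and counting the number of each class of sequence;
using the soap reads sequences, calculating the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compiling the distribution of these distances over all such pairs on the reference genome sequence; and
when the distance distribution satisfies a threshold requirement, assembling the genome sequence using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence.
2. The method according to claim 1, characterized in that, before the filtered short-fragment sequences are aligned to the reference genome sequence, the method further comprises:
truncating the filtered short-fragment sequences to short-fragment sequences of a set length.
3. The method according to claim 1, characterized in that the unqualified sequences include at least one selected from the following: exogenous sequences, short-fragment sequences in which the number of N bases reaches a predetermined proportion, short-fragment sequences containing a polyA structure, short-fragment sequences in which the number of low-quality bases reaches a predetermined number, adapter-contaminated short-fragment sequences, short-fragment sequences whose paired reads have an overlapping region, and duplicate short-fragment sequences.
4. The method according to claim 1, characterized in that the soap reads sequences include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence, and the step of using the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence further comprises:
using the soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence.
5. The method according to claim 1, characterized in that the method further comprises:
constructing a long-insert library; and
sequencing the ends of the long-insert library to obtain the output short-fragment sequences.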
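6. A device for assembling a genome sequence, characterized by comprising:
a sequence filtering module for filtering the short-fragment sequences output by end sequencing of a long-insert library, so as to remove unqualified sequences;
a sequence alignment module, connected to the sequence filtering module, for aligning the filtered short-fragment sequences to a reference genome sequence, wherein the filtered short-fragment sequences include paired short-fragment sequences;
a sequence classification module, connected to the sequence alignment module, for dividing, according to the alignment result, the aligned paired short-fragment sequences into soap reads sequences, single reads sequences and unmap reads sequences, and counting the number of each class of sequence;
a sequence length statistics module, connected to the sequence classification module, for using the soap reads sequences to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and compiling the distribution of these distances over all such pairs on the reference genome sequence; and
a sequence assembly module, connected to both the sequence classification module and the sequence length statistics module, for assembling the genome sequence, when the distance distribution satisfies a threshold requirement, using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence.
7. The device according to claim 6, characterized in that the device further comprises:
a sequence truncation module, connected to both the sequence filtering module and the sequence alignment module, for truncating the filtered short-fragment sequences to short-fragment sequences of a set length before they are aligned to the reference genome sequence.
8. The device according to claim 6, characterized in that the unqualified sequences include at least one selected from the following: exogenous sequences, short-fragment sequences in which the number of N bases reaches a predetermined proportion, short-fragment sequences containing a polyA structure, short-fragment sequences in which the number of low-quality bases reaches a predetermined number, adapter-contaminated short-fragment sequences, short-fragment sequences whose paired reads have an overlapping region, and duplicate short-fragment sequences.
9. The device according to claim 6, characterized in that the soap reads sequences include soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence and soap reads sequences aligned as a pair multiple times to the same fragment of the reference genome sequence,
wherein
the sequence length statistics module is further configured to use the soap reads sequences uniquely aligned as a pair to the same fragment of the reference genome sequence to calculate the distance between the short-fragment sequences aligned as a pair on the same fragment of the reference genome sequence, and to compile the distance distribution, on the reference genome sequence, of the soap reads sequences aligned as pairs to the same fragment of the reference genome sequence;
the sequence assembly module is further configured, when the distance distribution satisfies the threshold requirement, to assemble the genome sequence using the single reads sequences uniquely aligned as pairs to different fragments of the reference genome sequence.
10. The device according to claim 6, characterized in that the device further comprises:
a sequence receiving module, connected to the sequence filtering module, for receiving the sequences obtained by end sequencing of the long-insert library.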
PCT/CN2012/071876 2011-03-02 2012-03-02 组装基因组序列的方法和装置 WO2012116658A2 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/002,374 US20130345095A1 (en) 2011-03-02 2012-03-02 Method and device for assembling genome sequence

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2011100498850A CN102206704B (zh) 2011-03-02 2011-03-02 组装基因组序列的方法和装置
CN201110049885.0 2011-03-02

Publications (1)

Publication Number Publication Date
WO2012116658A2 true WO2012116658A2 (zh) 2012-09-07

Family

ID=44695763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/071876 WO2012116658A2 (zh) 2011-03-02 2012-03-02 组装基因组序列的方法和装置

Country Status (3)

Country Link
CN (1) CN102206704B (zh)
HK (1) HK1162614A1 (zh)
WO (1) WO2012116658A2 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN113724788A (zh) * 2021-07-29 2021-11-30 哈尔滨医科大学 一种鉴定肿瘤细胞的染色体外环状dna组成基因的方法
CN116403647A (zh) * 2023-06-08 2023-07-07 上海精翰生物科技有限公司 一种检测慢病毒整合位点的生物信息检测方法及其应用

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103160937B (zh) * 2011-12-15 2015-02-18 深圳华大基因科技服务有限公司 对高等植物复杂基因组基因进行富集建库和snp分析的方法
CN102682226B (zh) * 2012-04-18 2015-09-30 盛司潼 一种核酸测序信息处理系统及方法
CN102789553B (zh) * 2012-07-23 2015-04-15 中国水产科学研究院 利用长转录组测序结果装配基因组的方法及装置
CN102867134B (zh) * 2012-08-16 2016-05-18 盛司潼 一种对基因序列片段进行拼接的系统和方法
KR101508817B1 (ko) * 2012-10-29 2015-04-08 삼성에스디에스 주식회사 염기 서열 정렬 시스템 및 방법
WO2015062184A1 (en) 2013-11-01 2015-05-07 Accurascience, Llc Method and apparatus for calling single-nucleotide variations and other variations
WO2015062183A1 (en) * 2013-11-01 2015-05-07 Origenome, Llc Method and apparatus for separating quality levels in sequence data and sequencing longer reads
CN103810402B (zh) * 2014-02-25 2017-01-18 北京诺禾致源生物信息科技有限公司 用于基因组的数据处理方法和装置
CN105989249B (zh) * 2014-09-26 2019-03-15 南京无尽生物科技有限公司 用于组装基因组序列的方法、系统及装置
CN104484558B (zh) * 2014-12-08 2018-04-24 深圳华大基因科技服务有限公司 生物信息项目的分析报告自动生成方法及系统
CN105219765A (zh) * 2015-11-09 2016-01-06 中国水产科学研究院 利用蛋白质序列构建基因组的方法和装置
WO2017143585A1 (zh) * 2016-02-26 2017-08-31 深圳华大基因研究院 对分隔长片段序列进行组装的方法和装置
CN109817280B (zh) * 2016-04-06 2023-04-14 晶能生物技术(上海)有限公司 一种测序数据组装方法
CN107858408A (zh) * 2016-09-19 2018-03-30 深圳华大基因科技服务有限公司 一种基因组二代序列组装方法和系统
CN108629156B (zh) * 2017-03-21 2020-08-28 深圳华大基因科技服务有限公司 三代测序数据纠错的方法、装置和计算机可读存储介质
CN108866173A (zh) * 2017-05-16 2018-11-23 深圳华大基因科技服务有限公司 一种标准序列的验证方法、装置及其应用
CN110021359B (zh) * 2017-07-24 2021-05-04 深圳华大基因科技服务有限公司 一种二代和三代序列联合组装结果去冗余的方法和装置
US20210217493A1 (en) * 2018-07-27 2021-07-15 Seekin, Inc. Reducing noise in sequencing data
CN115273984B (zh) * 2022-09-30 2022-11-29 北京诺禾致源科技股份有限公司 鉴定基因组串联重复区域的方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504697B (zh) * 2008-12-12 2010-09-08 深圳华大基因研究院 一种片段连接支架的构建方法和系统
CN101894211B (zh) * 2010-06-30 2012-08-22 深圳华大基因科技有限公司 一种基因注释方法和系统
CN101967684B (zh) * 2010-09-01 2013-02-27 深圳华大基因科技有限公司 一种测序文库及其制备方法、一种末端测序方法和装置

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN111564182B (zh) * 2020-05-12 2024-02-09 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN113724788A (zh) * 2021-07-29 2021-11-30 哈尔滨医科大学 一种鉴定肿瘤细胞的染色体外环状dna组成基因的方法
CN113724788B (zh) * 2021-07-29 2023-09-12 哈尔滨医科大学 一种鉴定肿瘤细胞的染色体外环状dna组成基因的方法
CN116403647A (zh) * 2023-06-08 2023-07-07 上海精翰生物科技有限公司 一种检测慢病毒整合位点的生物信息检测方法及其应用
CN116403647B (zh) * 2023-06-08 2023-08-15 上海精翰生物科技有限公司 一种检测慢病毒整合位点的生物信息检测方法及其应用

Also Published As

Publication number Publication date
HK1162614A1 (en) 2012-08-31
CN102206704B (zh) 2013-11-20
CN102206704A (zh) 2011-10-05

Similar Documents

Publication Publication Date Title
WO2012116658A2 (zh) 组装基因组序列的方法和装置
US20240120021A1 (en) Methods and systems for large scale scaffolding of genome assemblies
Zerbino et al. Velvet: algorithms for de novo short read assembly using de Bruijn graphs
WO2013097257A1 (zh) 一种检验融合基因的方法及系统
WO2012034251A2 (zh) 一种基因组结构性变异检测方法和系统
CN115206436A (zh) 用于多重分类学分类的方法和系统
WO2013127049A1 (zh) 一种检测染色体sts区域微缺失的方法及其装置
CA2965849A1 (en) Sequencing controls
CN107798216B (zh) 采用分治法进行高相似性序列的比对方法
WO2015003427A1 (zh) 基于克隆dna混合池的全基因组测序方法
CN106845152B (zh) 一种基因组胞嘧啶位点表观基因型分型方法
CN103114150A (zh) 基于酶切建库测序与贝叶斯统计的单核苷酸多态性位点鉴定的方法
Sotcheff et al. ViReMa: a virus recombination mapper of next-generation sequencing data characterizes diverse recombinant viral nucleic acids
WO2012097474A1 (zh) 检测转基因外源片段插入位点的方法和系统
US20130345095A1 (en) Method and device for assembling genome sequence
US10395757B2 (en) Parental genome assembly method
CN114420213A (zh) 一种生物信息分析方法及装置、电子设备及存储介质
WO2013097143A1 (zh) 估计基因组杂合率的方法和装置
Babak Identification of imprinted loci by transcriptome sequencing
WO2013097149A1 (zh) 估计基因组重复序列含量的方法和装置
CN114424288A (zh) 用于确定与两个突变的序列读段衍生自包含突变的相同序列的概率相关的量度的方法
Lian et al. A repetitive sequence assembler based on next-generation sequencing
WO2016141516A1 (zh) 获取子代特异性序列、检测子代新突变的方法和装置
Stander De novo assembly of the rooibos genome
Goel et al. SyRI: identification of syntenic and rearranged regions from whole-genome assemblies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12752959

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 14002374

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12752959

Country of ref document: EP

Kind code of ref document: A2