WO2012097474A1 - Method and system for detecting insertion sites of transgenic exogenous fragments - Google Patents

Method and system for detecting insertion sites of transgenic exogenous fragments

Publication number
WO2012097474A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
short
short segment
fragment
exogenous
Prior art date
Application number
PCT/CN2011/000095
Other languages
English (en)
French (fr)
Inventor
郭小森
陈帅
张雪梅
郎继东
李俊
胡雪松
陈聪群
Original Assignee
深圳华大基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司
Priority to PCT/CN2011/000095 (WO2012097474A1)
Priority to CN201180062256.XA (CN103270175B)
Publication of WO2012097474A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1082: Preparation or screening gene libraries by chromosomal integration of polynucleotide sequences, HR-, site-specific-recombination, transposons, viral vectors

Definitions

  • The present invention relates to the field of bioinformatics, and in particular to a method and system for detecting the insertion sites of transgenic exogenous fragments.
  • Background art
  • The safety of genetically modified products mainly concerns two aspects. The first is food safety: whether genetically modified products adversely affect the organism after being eaten by humans (or livestock). The second is environmental safety, where one of the main concerns is whether, after a genetically modified organism is released into the environment, its exogenous genes drift into the environment and persist and spread there, causing genetic variation and creating ecological risk, that is, adverse ecological consequences for individuals, populations, communities, ecosystems and the environment as a whole, damaging the ecological environment and upsetting the balance of the existing ecological populations. Safety assessment and control of genetically modified organisms in circulation is therefore particularly important, and the detection of genetic variation is a very important part of it.
  • Whole-genome resequencing sequences the genomes of individuals of a species whose genome sequence is known and analyses the differences at the individual or population level. Resequencing is an important way to detect genetic variation.
  • Current variation detection covers single nucleotide polymorphisms (SNP), insertions and deletions (Indel), structural variation (SV) and copy number variation (CNV).
  • SNP: single nucleotide polymorphism
  • Indel: insertion and deletion
  • SV: structural variation
  • CNV: copy number variation
  • These four variation detection techniques provide powerful tools for detecting variation at the genome level. However, they detect large variations (> 1 kb) with poor accuracy.
  • Summary of the invention
  • One technical problem to be solved by an aspect of the present disclosure is to provide a method for detecting the insertion site of a transgenic exogenous fragment with good accuracy.
  • A method for detecting the insertion site of a transgenic exogenous fragment, comprising:
  • aligning paired short fragment sequences with the exogenous insert sequence to determine exogenous one-sided short fragments (single reads), the paired short fragment sequences being obtained by paired-end resequencing of the sequenced fragments of the sample to be tested;
  • aligning the paired short fragment sequences with the reference genome sequence to determine genomic one-sided short fragments;
  • determining the insertion site of the exogenous fragment in the genome sequence based on the intersection of the exogenous one-sided short fragments and the genomic one-sided short fragments.
  • The exogenous one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence; the genomic one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence.
  • The exogenous one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence, and only at a single location; the genomic one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence, and only at a single location.
  • The method further comprises filtering the paired short fragment sequences to remove unqualified short fragment sequences.
  • Filtering the paired short fragment sequences to remove unqualified paired short fragment sequences comprises: removing paired short fragment sequences in which the number of bases whose sequencing quality is below a predetermined threshold exceeds a set proportion of the number of bases in the short fragment sequence.
  • The length of the sequenced fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-1000 bp; and/or the length of the short fragment sequences is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequenced fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases of the reference genome sequence.
  • The method aligns the short fragment sequences obtained by paired-end resequencing with the exogenous insert to obtain the one-sided short fragments (single reads) enriched at both ends of the exogenous insert, and also aligns the same short fragment sequences with the reference genome sequence to obtain the single reads at both ends of the exogenous insert. The intersection of the two sets is then taken and the positions of the intersecting sequences on the reference sequence are determined. The support of the intersecting sequences at each corresponding site on the reference sequence is counted, and the position of the insertion site is determined from it. The method makes full use of the positional characteristics of the exogenous insert and has good accuracy.
  • A system for detecting the insertion site of a transgenic exogenous fragment, comprising:
  • a sequencing unit for performing paired-end resequencing of the sequenced fragments of the sample to be tested to obtain paired short fragment sequences;
  • an exogenous one-sided short fragment determining unit configured to align the paired short fragment sequences with the exogenous insert sequence to determine exogenous one-sided short fragments (single reads);
  • a genomic one-sided short fragment determining unit configured to align the paired short fragment sequences with the reference genome sequence to determine genomic one-sided short fragments;
  • an insertion site determining unit configured to determine the insertion site of the exogenous fragment in the genome sequence based on the intersection of the exogenous one-sided short fragments and the genomic one-sided short fragments.
  • The exogenous one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence;
  • the genomic one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence.
  • The exogenous one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence, and only at a single location;
  • the genomic one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence, and only at a single location.
  • The system further includes a filtering unit for filtering the paired short fragment sequences to remove unqualified short fragment sequences.
  • The filtering unit removes paired short fragment sequences in which the number of bases whose sequencing quality is below a predetermined value exceeds 50% of the number of bases of the short fragment sequence; and/or removes paired short fragment sequences in which the number of bases whose sequencing result is uncertain exceeds 10% of the number of bases of the paired short fragment sequence; and/or removes linker sequences in the paired short fragment sequences.
  • The length of the sequenced fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-1000 bp; and/or the length of the short fragment sequences is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequenced fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases of the reference genome sequence.
  • The exogenous one-sided short fragment determining unit aligns the short fragment sequences obtained by paired-end resequencing with the exogenous insert to obtain the single reads enriched at both ends of the exogenous insert.
  • The genomic one-sided short fragment determining unit aligns the same short fragment sequences with the reference genome sequence to obtain the single reads at both ends of the exogenous insert. The insertion site determining unit takes the intersection of the two sets, determines the positions of the intersecting sequences on the reference sequence, counts the support of the intersecting sequences at each corresponding site on the reference sequence, and determines the position of the insertion site from it. The system makes full use of the positional characteristics of the exogenous insert and has good accuracy.
  • Fig. 1 is a schematic diagram of the transgenic exogenous fragment insertion site detection method of the present invention.
  • Fig. 2 is a flow chart of an embodiment of the transgenic exogenous fragment insertion site detection method of the present invention.
  • Fig. 3 is a flow chart of an application example of the transgenic exogenous fragment insertion site detection method of the present invention.
  • Fig. 4 is a schematic view of an example of determining an insertion site according to the present invention.
  • Fig. 5 shows the structure of one embodiment of the transgenic exogenous fragment insertion site detection system of the present invention.
  • Fig. 6 shows the structure of another embodiment of the transgenic exogenous fragment insertion site detection system of the present invention.
  • FIG. 1 is a schematic view showing the method for detecting an insertion site of a transgenic foreign fragment of the present invention.
  • The paired short fragment sequences obtained by paired-end resequencing of the sequenced fragments of the sample to be tested are aligned with the exogenous insert sequence to determine the exogenous one-sided short fragments (single reads).
  • The exogenous one-sided short fragments comprise paired short fragment sequences of which only one of the two short fragment sequences aligns to the exogenous insert sequence.
  • When the sequenced fragments (the sequencing data) are aligned with the exogenous insert sequence (insertsize), a large number of unpaired short fragment sequences that align to the exogenous fragment are enriched at both ends of the insert; these are labeled single reads 1.
  • The exogenous insert generally refers to the exogenous fragment or transposon sequence introduced by the transgenic technique; its sequence can be obtained, for example, by cloning.
  • In step S104, the paired short fragment sequences are aligned with the reference genome sequence to determine the genomic one-sided short fragments.
  • The genomic one-sided short fragments comprise paired short fragment sequences of which only one of the two short fragment sequences aligns to the reference genome sequence.
  • When the sequencing data are aligned with the reference genome sequence, a large number of unpaired short fragment sequences are likewise enriched at both ends of the insert; these are labeled single reads 2.
  • The reference genome sequence usually refers to the genome sequence of the species, which can be obtained by sequencing and assembly.
  • The insertion site of the exogenous fragment in the genome sequence is determined based on the intersection of the exogenous one-sided short fragments and the genomic one-sided short fragments: the intersection of the single reads in the single reads 1 and single reads 2 sets is taken, and the positions of the intersecting sequences on the reference sequence are determined.
  • Two alignments are performed. The short fragment sequences obtained by paired-end resequencing are aligned with the exogenous insert to obtain the single reads enriched at both ends of the exogenous insert, and the same short fragment sequences are aligned with the reference genome sequence to obtain the single reads at both ends of the exogenous insert. The intersection of the two sets is then taken and the positions of the intersecting sequences on the reference sequence are determined. The support of the intersecting sequences at each corresponding site on the reference sequence is counted to determine the position of the insertion site. This makes full use of the positional characteristics of the exogenous insert, and the accuracy is good.
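The intersection step can be illustrated with a minimal Python sketch. It assumes each single read is reduced to a record holding the shared pair ID, the target it aligned to and its position; the `Hit` type, the function name and the sample records are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One aligned mate of a pair (hypothetical record layout)."""
    pair_id: str  # ID shared by the two short fragment sequences of a pair
    target: str   # chromosome name, or the name of the exogenous insert
    pos: int      # leftmost aligned position on the target


def intersect_single_reads(insert_singles: List[Hit],
                           genome_singles: List[Hit]) -> Dict[str, Hit]:
    """Keep the genome-side hits whose pair ID also occurs on the insert side."""
    insert_ids = {hit.pair_id for hit in insert_singles}
    return {hit.pair_id: hit for hit in genome_singles
            if hit.pair_id in insert_ids}


# Made-up records: pairs p1 and p2 have one mate on the insert and one on chr1.
single_reads_1 = [Hit("p1", "insert", 12), Hit("p2", "insert", 9950),
                  Hit("p7", "insert", 400)]
single_reads_2 = [Hit("p1", "chr1", 10480), Hit("p2", "chr1", 10620),
                  Hit("p9", "chr3", 77)]

anchors = intersect_single_reads(single_reads_1, single_reads_2)
print(sorted(anchors))
```

The genome-side positions of the surviving pairs are what the later support counting works on.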
  • Fig. 2 is a flow chart showing an embodiment of the transgenic foreign fragment insertion site detecting method of the present invention.
  • In step 202, the sequenced fragments of the sample to be tested are subjected to paired-end resequencing to obtain paired short fragment sequences.
  • The DNA sample to be tested is randomly broken into fragments of a certain length; these fragments are referred to as sequenced fragments.
  • The length of the sequenced fragments is, for example, 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-1000 bp.
  • Sequencing is performed from both ends of each sequenced fragment to obtain a pair of short fragment sequences, one from each end. An ID number is assigned to each pair, and the two short fragment sequences of a pair share the same ID number.
  • The total number of bases obtained by sequencing the sequenced fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases of the reference genome sequence, to ensure the required genome coverage.
  • The total number of bases sequenced is, for example, more than 20 times the size of the genome.
  • The paired short fragment sequences obtained by paired-end resequencing are filtered.
  • Unqualified short fragment sequences are removed by filtering. For example, short fragment sequences in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds, for example, 50% of the bases of the whole sequence are removed. The low-quality threshold depends on the specific sequencing technology and sequencing environment; for example, a base quality value of B (ASCII value) is treated as low quality.
  • Short fragment sequences in which the number of bases with an uncertain sequencing result (such as N in Illumina GA sequencing results) exceeds 10% of the bases of the whole sequence are removed. To remove linker sequences, the short fragment sequences are compared with the other exogenous sequences introduced by the experiment, such as the various linker sequences; a sequence that contains such an exogenous sequence is considered unqualified and removed. Removing unqualified short fragment sequences by filtering improves the accuracy of the detection.
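The three filtering rules can be sketched in Python as follows. The 50% and 10% limits and the use of quality character B as the low-quality mark come from the text; the function names, the tuple layout and the linker string are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

Read = Tuple[str, str]  # (bases, quality string), a hypothetical layout


def passes_filter(seq: str, qual: str, linkers: Iterable[str] = (),
                  max_low_q: float = 0.5, max_n: float = 0.1,
                  low_chars: str = "B") -> bool:
    """Return True if one short fragment sequence survives all three rules."""
    if not seq or len(seq) != len(qual):
        return False
    # Rule 1: too many low-quality bases (more than 50% by default).
    if sum(q in low_chars for q in qual) / len(qual) > max_low_q:
        return False
    # Rule 2: too many uncertain bases, written as N (more than 10%).
    if seq.upper().count("N") / len(seq) > max_n:
        return False
    # Rule 3: the read contains a linker sequence introduced by the experiment.
    if any(linker and linker in seq for linker in linkers):
        return False
    return True


def keep_pair(read1: Read, read2: Read, linkers: Iterable[str] = ()) -> bool:
    """A pair is kept only if both of its short fragment sequences pass."""
    linkers = tuple(linkers)
    return (passes_filter(*read1, linkers=linkers)
            and passes_filter(*read2, linkers=linkers))


LINKER = "GATCGGAAGAGC"  # placeholder linker used only in this example
good = ("ACGTACGTAC", "IIIIIIIIII")
low_quality = ("ACGTACGTAC", "BBBBBBIIII")
print(keep_pair(good, good, [LINKER]), keep_pair(good, low_quality, [LINKER]))
```

Dropping the whole pair when either mate fails is one reading of "removing paired short fragment sequences"; a different policy would only change `keep_pair`.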
  • In step 206, the paired short fragment sequences are aligned with the exogenous insert sequence to obtain the exogenous one-sided fragments.
  • The sequenced fragments should be approximately the same length, within a certain floating range.
  • Short sequences obtained from sequenced fragments whose length lies within the normal range are called normal short sequences; short sequences obtained from sequenced fragments whose length lies outside the normal range are called abnormal short sequences.
  • The floating range can be set as required. During alignment, the minimum aligned length of a resequencing short sequence is 40 bp, and the maximum number of mismatches allowed in a short sequence should be as small as possible to ensure accurate alignment. For example, for a short fragment sequence length of 90 bp the maximum number of mismatches is set to 1 or 2. The actual value can vary with the length of the short fragment sequence.
  • According to the alignment results, the short fragment sequences obtained by resequencing fall into three types:
  • (1) soap reads: both short sequences of a normal pair align to the exogenous insert sequence;
  • (2) single reads: only one of the two normal short sequences of a pair aligns to the exogenous insert sequence; this type of short sequence is marked as single reads. In addition, the short sequence alignment software may mark paired abnormal short sequences as single reads, in which case a filtering step can be added to remove these abnormal short fragment sequences;
  • (3) unmap reads: neither of the two short sequences of a pair aligns to the exogenous insert sequence; this type of short sequence is marked as unmap reads.
  • From the alignment results, the pairs in which only one short sequence aligns to the exogenous insert sequence are extracted, and only single reads that align to the exogenous insert sequence at a single location are kept, which ensures the specificity of the alignment results.
  • The single reads obtained are stored, for example, in a file named single file 1.
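The three-way classification can be written as a small decision function. This is a sketch under the assumption that the aligner reports, for each mate, how many locations it aligned to, plus whether the pair spans a normal fragment length; the names and the extra `ABNORMAL` class are illustrative and are not the output format of any particular aligner.

```python
from enum import Enum


class PairClass(Enum):
    PAIRED = "soap reads"       # both mates align, normal fragment length
    SINGLE = "single reads"     # exactly one mate aligns
    ABNORMAL = "abnormal pair"  # both mates align, abnormal fragment length
    UNMAPPED = "unmap reads"    # neither mate aligns


def classify_pair(hits1: int, hits2: int, normal_span: bool = True) -> PairClass:
    """Classify a pair from the number of alignment locations of each mate."""
    mapped = (hits1 > 0) + (hits2 > 0)
    if mapped == 2:
        return PairClass.PAIRED if normal_span else PairClass.ABNORMAL
    if mapped == 1:
        return PairClass.SINGLE
    return PairClass.UNMAPPED


def is_specific_single(hits1: int, hits2: int) -> bool:
    """True when one mate is unaligned and the other aligns at exactly one location."""
    return sorted((hits1, hits2)) == [0, 1]


print(classify_pair(1, 0).value, is_specific_single(1, 0))
```

Pairs classified as `ABNORMAL` correspond to the abnormal pairs that the text says may end up among the single reads and need to be filtered out.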
  • In step 208, the paired short fragment sequences are aligned with the reference genome sequence to obtain the genomic one-sided fragments.
  • The alignment can be performed with any of the common short sequence alignment programs, such as soap, bwa, and the like.
  • According to the alignment results, the short sequences obtained by resequencing fall into three types: (1) soap reads: both short sequences of a normal pair align to the reference genome; (2) single reads: only one of the two normal short sequences of a pair aligns to the reference genome.
  • This type of short sequence is marked as single reads. In addition, the short sequence alignment software may mark paired abnormal short sequences as single reads, in which case a filtering step can be added to remove these abnormal short sequences; (3) unmap reads: neither of the two short sequences of a pair aligns to the reference genome, and this type of short sequence is marked as unmap reads.
  • From the alignment results, the pairs in which only one short sequence aligns to the reference genome are extracted, and only single reads that align to the reference genome at a single location are kept, to ensure the specificity of the alignment results; they are stored in a file named single file 2.
  • The unpaired single reads in single file 2 are sorted by sample number, and the alignment results are separated by chromosome.
  • In step 210, the exogenous fragment insertion site is determined based on the exogenous one-sided fragments and the genomic one-sided fragments.
  • The single reads that have the same ID number in the two files and pair with each other are the single reads of the intersection (of the two short fragment sequences of a pair, one aligns to the reference genome and the other aligns to the exogenous insert).
  • The support of the short fragment sequences at each corresponding site on the reference sequence follows a trough-shaped curve, with a lowest point in the middle and a peak at each end. A peak of enriched sequences appears on each side of the insertion site; the closer a site is to the insertion site, the lower the number of supporting short sequences, and a gap in short fragment sequence coverage appears near the insertion site (that is, the lowest point of the support count).
  • The range of this gap (the lowest point) can be taken as the insertion position of the exogenous fragment.
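One way to locate the trough is sketched below. It assumes the positions of the intersecting single reads on one chromosome are binned into fixed windows, takes the two best-supported windows as the peaks and reports the least-supported stretch between them. The window size, the function name and the sample positions are illustrative choices, not values from the patent.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def locate_trough(positions: Iterable[int],
                  window: int = 50) -> Optional[Tuple[int, int]]:
    """Return the (start, end) range of the support gap between the two peaks."""
    support = Counter(p // window for p in positions)
    if len(support) < 2:
        return None  # a trough needs a peak on each side
    (w1, _), (w2, _) = support.most_common(2)
    left, right = sorted((w1, w2))
    inner = list(range(left + 1, right))
    if not inner:
        # The peaks are in adjacent windows: report their shared boundary.
        return (right * window, right * window)
    lowest = min(support.get(w, 0) for w in inner)
    lows = [w for w in inner if support.get(w, 0) == lowest]
    return (lows[0] * window, (lows[-1] + 1) * window - 1)


# Made-up positions: support falls from each side towards an empty stretch.
positions = ([1010] * 8 + [1060] * 5 + [1110] * 2
             + [1260] * 1 + [1310] * 6 + [1360] * 9)
print(locate_trough(positions))
```

Picking the two highest windows as peaks is a simplification: if both fall on the same side of the insertion site the result is wrong, so a real implementation would want a more robust peak finder.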
  • Whether a sequenced fragment is a normal fragment or an abnormal fragment is judged from the set length and its floating range.
  • The degree of dispersion (SD value) of the insert size (the sequenced fragment length) is calculated and the appropriate range of insert sizes is determined from it, which improves the accuracy of the alignment and gives reasonable alignment results.
  • The insert sizes and the number of occurrences of each size are counted, and the size that occurs most frequently is recorded as M. Each insert size not equal to M is recorded as x and its number of occurrences as n; the SD value is calculated from these quantities by a formula.
  • The insert sizes should approximately follow a normal distribution whose highest point is the average insert size, and the insert size has an upper floating limit and a lower floating limit.
  • The values between the floating limits of the insert size form the reasonable range of insert sizes.
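The formula for the SD value is not reproduced in this text, so the sketch below is only an assumption: it takes the SD as the frequency-weighted root-mean-square deviation of the lengths x from the most frequent length M, and takes the floating limits as M plus or minus k times that SD. The function names and the choice k = 3 are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Iterable, Tuple


def mode_and_sd(lengths: Iterable[int]) -> Tuple[int, float]:
    """Most frequent length M and an assumed SD of the other lengths about M."""
    counts = Counter(lengths)
    m, _ = counts.most_common(1)[0]
    others = {x: n for x, n in counts.items() if x != m}
    total = sum(others.values())
    if total == 0:
        return m, 0.0
    # Assumed form: sqrt(sum(n * (x - M)^2) / sum(n)) over lengths x != M.
    sd = sqrt(sum(n * (x - m) ** 2 for x, n in others.items()) / total)
    return m, sd


def normal_range(lengths: Iterable[int], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper floating limits, assumed here to be M -/+ k * SD."""
    m, sd = mode_and_sd(lengths)
    return m - k * sd, m + k * sd


sizes = [500] * 6 + [490] * 2 + [510] * 2
print(mode_and_sd(sizes), normal_range(sizes))
```

A sequenced fragment whose length falls outside the returned range would then be treated as abnormal.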
  • Fig. 3 is a flow chart showing an application example of the transgenic foreign fragment insertion site detecting method of the present invention.
  • This example simulates the insertion and sequencing of an exogenous fragment; the biological sample to be tested is a transgenic Arabidopsis carrying an inserted exogenous gene fragment.
  • A 10 kb exogenous fragment was randomly inserted into the Arabidopsis reference genome to serve as the transgenic Arabidopsis thaliana sample, sequencing of this sample was simulated with the Maq simulate software, and the simulated sequencing results were used as the sequencing data.
  • Transgenic exogenous fragment: a 10 kb fragment from the mouse genome
  • Source database: UCSC Genome Browser, http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/
  • Reference genome: Arabidopsis genome
  • Source database: Ensembl Genome Browser, website: http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index
  • The -d parameter is the length of the sequenced fragments, set to 500.
  • The -N parameter is the total number of short fragment sequences to be obtained by sequencing, which is determined by the sequencing depth. Sequencing depth is one of the indicators for evaluating sequencing, and is the ratio of the total number of bases (bp) obtained by sequencing to the genome size.
  • In this example the simulated sequencing depth is 20 times, the total length of the reference genome is 121 M, the short fragment sequence length is 75 bp, and -N is set to 16 M.
  • The -1 and -2 parameters are the lengths of short fragment sequence 1 and short fragment sequence 2 in paired-end resequencing, both set to 75 in this example.
  • fq1 and fq2 are the output files: the simulated sequencing data, that is, short fragment sequence 1 and short fragment sequence 2, are stored in fastq format in the fq1 and fq2 files respectively.
  • Simupars.dat is the system file of the maq simulate software, which determines the length and quality values of the short fragment sequences.
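The value of about 16 M for -N follows from the definition of sequencing depth if -N is read as the number of read pairs, which is an assumption here. A short arithmetic sketch, with a hypothetical function name:

```python
def read_pairs_for_depth(genome_size_bp: int, depth: int, read_len: int) -> int:
    """Number of read pairs needed so that total bases = depth * genome size."""
    total_bases = genome_size_bp * depth
    # Each pair contributes two short fragment sequences of read_len bases.
    return total_bases // (2 * read_len)


# Values from the example: 121 M genome, 20x depth, 75 bp short sequences.
n_pairs = read_pairs_for_depth(121_000_000, 20, 75)
print(n_pairs)  # roughly 16 million pairs
```

That is, 20 × 121 M / 75 gives about 32 M short fragment sequences, or about 16 M pairs.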
  • In step S301, the sequencing data are received, preprocessed (no preprocessing is done here because simulated data are used) and stored in fastq file format.
  • Step S302 consists of two parts; the specific steps are as follows:
  • The -P parameter indicates the memory required to run the script.
  • The -a parameter indicates that the input file is the fq1 file from paired-end resequencing (the file containing short fragment sequence 1).
  • The -b parameter indicates that the input file is the fq2 file from paired-end resequencing (the file containing short fragment sequence 2).
  • The -D parameter indicates the input reference genome sequence in fasta file format (the first line of a fasta sequence file is any text description beginning with a greater-than sign ">" or a semicolon ";" and labels the sequence; the sequence itself starts on the second line, and only the specified nucleotide or amino acid code symbols are allowed).
  • The -o parameter: the output is the paired short fragment sequences aligned to the reference genome or to the exogenous fragment; the output file is suffixed with .so
  • The -2 parameter: the output is a pair of
  • a short segment sequence that satisfies the condition is found for the widest range.
  • the number of mismatches; in the present invention this parameter should be set as small as possible to ensure accurate alignment. The soap parameter settings must be kept consistent.
  • In step S303, the short fragment sequences of the sequencing data that align to the exogenous insert sequence are extracted and stored in the file single file 1.
  • In step S304, the short fragment sequences of the sequencing data that align to the reference genome are extracted and stored in the file single file 2.
  • In step S305, the short fragment sequences in each single file are processed: the paired short fragment sequences in the single file are removed and only the single reads with an alignment count of 1 are kept, to ensure the specificity of the short fragment sequence alignment.
  • Paired short fragment sequences appear in the single file because a floating range for the sequenced fragment length is set during alignment, for example ±X × the insert length, and pairs of abnormal short fragment sequences that fall outside this range are also written to the single file. To ensure the specificity of the short fragment sequence alignment, the paired short fragment sequences in the single file must therefore be removed and only the single reads with an alignment count of 1 kept.
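Step S305 can be sketched as a filter over the records of a single file. The record layout `(pair ID, mate number, alignment count)` is an assumption made for illustration and does not reproduce the real output columns of the aligner.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, int, int]  # (pair ID, mate number, alignment count)


def specific_single_reads(records: List[Record]) -> List[Record]:
    """Drop pairs present with both mates, then keep alignment count == 1."""
    mates_per_pair = Counter(pair_id for pair_id, _, _ in records)
    return [rec for rec in records
            if mates_per_pair[rec[0]] == 1 and rec[2] == 1]


records = [
    ("p1", 1, 1),  # true single read, aligned once: kept
    ("p2", 1, 1),  # p2 appears with both mates: an abnormal pair, removed
    ("p2", 2, 1),
    ("p3", 2, 4),  # aligned at four locations: not specific, removed
    ("p4", 2, 1),  # kept
]
print(specific_single_reads(records))
```

The same function can be applied to single file 1 and to single file 2 before the intersection is taken.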
  • In step S306, the single reads in the single file 2 obtained in step S305 are sorted by sample number and separated by chromosome.
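A minimal sketch of the sorting and splitting in step S306 follows, assuming each single read is a `(pair ID, chromosome, position)` tuple and that numbered chromosomes should sort numerically before named ones; both the layout and the ordering rule are assumptions.

```python
import re
from itertools import groupby
from typing import Dict, List, Tuple

Record = Tuple[str, str, int]  # (pair ID, chromosome, position)


def chrom_key(name: str) -> Tuple[int, int, str]:
    """Sort key: numbered chromosomes in numeric order, then named ones."""
    match = re.fullmatch(r"(?:chr)?(\d+)", name, flags=re.IGNORECASE)
    if match:
        return (0, int(match.group(1)), name)
    return (1, 0, name)


def split_by_chromosome(records: List[Record]) -> Dict[str, List[Record]]:
    """Sort by chromosome, position and pair ID, then group per chromosome."""
    ordered = sorted(records, key=lambda r: (chrom_key(r[1]), r[2], r[0]))
    return {chrom: list(group)
            for chrom, group in groupby(ordered, key=lambda r: r[1])}


reads = [("p3", "chr2", 50), ("p1", "chr10", 5),
         ("p2", "chr2", 7), ("p4", "chrC", 1)]
by_chrom = split_by_chromosome(reads)
print(list(by_chrom))
```

Each per-chromosome list is then ready for counting support along that chromosome.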
  • In step S307, to determine where on the reference genome the single reads obtained by alignment with the exogenous fragment sequence are located, the intersection of single file 1 with the single reads in the single file 2 sorted in step S306 is taken, and the position of the intersection on the reference genome is determined.
  • In step S308, the intersection obtained is mapped onto the reference genome.
  • The invention can accurately determine the range in which the exogenous fragment insertion site lies (from over one hundred to several hundred bases). Although it cannot locate the insertion site of the exogenous fragment exactly, the insertion site can be found exactly by combining it with a traditional PCR (polymerase chain reaction) experiment.
  • The detection performance of the invention was verified by a large number of simulation experiments: the detection rate was 92.15% ± 0.01%, the false positive rate 3.87% ± 0.013%, and the false negative rate 4.4% ± 0.05%.
  • The invention detects the insertion site by bioinformatics means, with a short cycle and low cost, and overcomes the low detection efficiency and high cost of current purely experimental techniques.
  • Fig. 5 is a view showing the configuration of an embodiment of the transgenic foreign fragment insertion site detecting system of the present invention.
  • The system includes a sequencing unit 51, an exogenous one-sided short fragment determining unit 52, a genomic one-sided short fragment determining unit 53, and an insertion site determining unit 54.
  • The sequencing unit 51 performs paired-end resequencing of the sequenced fragments of the sample to be tested to obtain paired short fragment sequences; the exogenous one-sided short fragment determining unit 52 aligns the paired short fragment sequences with the exogenous insert sequence to determine the exogenous one-sided short fragments; the genomic one-sided short fragment determining unit 53 aligns the paired short fragment sequences with the reference genome sequence to determine the genomic one-sided short fragments; the insertion site determining unit 54 determines the insertion site of the exogenous fragment in the genome sequence based on the intersection of the exogenous one-sided short fragments and the genomic one-sided short fragments.
  • The exogenous one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence; the genomic one-sided short fragments comprise paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence.
  • The exogenous one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the exogenous insert sequence, and only at a single location; the genomic one-sided short fragments comprise normal paired short fragment sequences of which only one short fragment sequence aligns to the reference genome sequence, and only at a single location.
  • the length of the sequenced fragment of the sample to be tested is
  • The length of the short fragment sequences is 40-75 bp or 75-200 bp; the total number of bases obtained by sequencing the sequenced fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases of the reference genome sequence.
  • Fig. 6 is a view showing the configuration of another embodiment of the transgenic foreign fragment insertion site detecting system of the present invention.
  • The system of this embodiment further includes a filtering unit 65 for filtering the paired short fragment sequences to remove unqualified short fragment sequences.
  • The filtering unit 65 removes paired short fragment sequences in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the number of bases of the short fragment sequence; removes paired short fragment sequences in which the number of bases whose sequencing result is uncertain exceeds 10% of the number of bases of the paired short fragment sequence; and removes the linker sequences in the paired short fragment sequences.
  • Removing the unqualified short fragment sequences with the filtering unit improves the accuracy of the detection.
  • The detection method and system provided by the invention are based on whole-genome resequencing. They address the problem that the four existing variation detection techniques cannot accurately detect such variation sites and that other experimental techniques cannot resolve it. The accuracy is good, the method is simple and fast, and the cost is low.
  • The functional units shown in Figs. 5 to 6 can be implemented by separate computing devices or integrated into a single device. They are drawn as boxes in Figs. 5 to 6 to illustrate their function.
  • These functional blocks can be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof.
  • For example, one or more of the functional blocks can be implemented by code running on a microprocessor, digital signal processor (DSP), or any other suitable computing device.
  • Code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements.
  • The code can be located on a computer readable medium.
  • The computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable hard disk, CD-ROM, or any other form of storage medium known in the art.
  • The computer readable medium can also include a carrier wave that encodes a data signal.


Description

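Method and System for Detecting Insertion Sites of Transgenic Exogenous Fragments
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and system for detecting the insertion site of a transgenic exogenous fragment.
Background
As genetically modified products reach the market and are adopted on a large scale, their safety has drawn growing attention. In general, the safety concerns fall into two areas. The first is food safety: whether a transgenic product adversely affects the organism after being eaten by humans or livestock. The second is environmental safety: one of the main questions is whether, after a transgenic organism is released, the exogenous gene will drift into the environment, persist and spread there, cause genetic variation, and create ecological risk, that is, adverse ecological consequences for individuals, populations, communities, ecosystems, and the environment as a whole, damaging the ecological environment and upsetting the balance of existing populations. Safety assessment and control of transgenic organisms in circulation is therefore particularly important, and detection of genetic variation is a key part of it.
Before high-throughput, high-resolution genomics technologies emerged, genetic variation was detected mainly by traditional cytogenetic techniques, especially high-resolution chromosome banding. With the development of DNA sequencing, detection methods evolved into high-throughput analysis based on microarray and DNA sequencing technologies and targeted analysis based on PCR (polymerase chain reaction). These methods are time-consuming and labor-intensive, have high experimental costs, and offer low resolution for certain genetic variations. A new variation detection technique with higher sensitivity, specificity, and resolution and lower cost is therefore urgently needed.
Whole-genome resequencing sequences the genomes of individuals from a species whose genome sequence is known and analyzes differences at the individual or population level. Resequencing is an important route to detecting genetic variation. Current variation detection comprises four major techniques: single nucleotide polymorphism (SNP), insertion and deletion (Indel), structural variation (SV), and copy number variation (CNV) detection, which provide powerful tools for detecting variation at the genome level. However, these four techniques are not very accurate when detecting variation involving large fragments (> 1 kb).
Summary of the Invention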
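One technical problem addressed by one aspect of the present disclosure is to provide a method for detecting the insertion site of a transgenic exogenous fragment with good accuracy.
According to one aspect of the present invention, a method for detecting the insertion site of a transgenic exogenous fragment is provided, comprising:
aligning paired short reads against the exogenous insert sequence to determine exogenous single reads, the paired short reads being obtained by paired-end resequencing of the sequencing fragments of the sample to be tested;
aligning the paired short reads against the reference genome sequence to determine genomic single reads;
determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genomic single reads.
According to one embodiment of the method, the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence; the genomic single reads comprise paired short reads in which only one read aligns to the reference genome sequence.
According to one embodiment of the method, the exogenous single reads comprise normal paired short reads in which only one read aligns to the exogenous insert sequence, and aligns to it only once; the genomic single reads comprise normal paired short reads in which only one read aligns to the reference genome sequence, and aligns to it only once.
According to one embodiment of the method, the method further comprises: filtering the paired short reads to remove unqualified short reads.
According to one embodiment of the method, filtering the paired short reads to remove unqualified paired short reads comprises: removing paired short reads in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the bases of the short read; and/or removing paired short reads in which the number of bases with an undetermined sequencing result exceeds 10% of the bases of the paired short reads; and/or removing adapter sequences from the paired short reads.
According to one embodiment of the method, the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp; and/or the length of the short reads is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
In the detection method provided by the embodiments of the present invention, the short reads obtained by paired-end resequencing are aligned against the exogenous insert to obtain the single reads enriched at both ends of the exogenous insert, and the same short reads are aligned against the reference genome sequence to obtain the single reads at both ends of the exogenous insert. The two sets are then intersected, and the positions of the intersecting reads on the reference sequence are determined. The insertion site is located by counting the read support at each site of the reference sequence covered by the intersecting reads. This makes full use of the positional characteristics of the exogenous insert and gives good accuracy.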
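One technical problem addressed by another aspect of the present disclosure is to provide a system for detecting the insertion site of a transgenic exogenous fragment with good accuracy.
According to another aspect of the present invention, a system for detecting the insertion site of a transgenic exogenous fragment is provided, comprising:
a sequencing unit, for performing paired-end resequencing on the sequencing fragments of the sample to be tested to obtain paired short reads;
an exogenous single-read determining unit, for aligning the paired short reads against the exogenous insert sequence to determine exogenous single reads;
a genomic single-read determining unit, for aligning the paired short reads against the reference genome sequence to determine genomic single reads;
an insertion site determining unit, for determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genomic single reads.
According to one embodiment of the system, the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence;
the genomic single reads comprise paired short reads in which only one read aligns to the reference genome sequence.
According to one embodiment of the system, the exogenous single reads comprise normal paired short reads in which only one read aligns to the exogenous insert sequence, and aligns to it only once; the genomic single reads comprise normal paired short reads in which only one read aligns to the reference genome sequence, and aligns to it only once.
According to one embodiment of the system, the system further comprises a filtering unit for filtering the paired short reads to remove unqualified short reads.
According to one embodiment of the system, the filtering unit removes paired short reads in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the bases of the short read; and/or removes paired short reads in which the number of bases with an undetermined sequencing result exceeds 10% of the bases of the paired short reads; and/or removes adapter sequences from the paired short reads.
According to one embodiment of the system, the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp; and/or the length of the short reads is 40-75 bp or 75-200 bp; and/or the total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
In the detection system provided by the embodiments of the present invention, the exogenous single-read determining unit aligns the short reads obtained by paired-end resequencing against the exogenous insert to obtain the single reads enriched at both ends of the exogenous insert, and the genomic single-read determining unit aligns the same short reads against the reference genome sequence to obtain the single reads at both ends of the exogenous insert. The insertion site determining unit intersects the two sets and determines the positions of the intersecting reads on the reference sequence. The insertion site is located by counting the read support at each site of the reference sequence covered by the intersecting reads. This makes full use of the positional characteristics of the exogenous insert and gives good accuracy.
Brief Description of the Drawings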
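Figure 1 shows the principle of the transgenic exogenous fragment insertion site detection method of the present invention;
Figure 2 shows a flowchart of one embodiment of the transgenic exogenous fragment insertion site detection method of the present invention;
Figure 3 shows a flowchart of one application example of the transgenic exogenous fragment insertion site detection method of the present invention;
Figure 4 shows a schematic diagram of an example of determining an insertion site according to the present invention;
Figure 5 shows the structure of one embodiment of the transgenic exogenous fragment insertion site detection system of the present invention;
Figure 6 shows the structure of another embodiment of the transgenic exogenous fragment insertion site detection system of the present invention.
Detailed Description
The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention.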
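Figure 1 shows the principle of the transgenic exogenous fragment insertion site detection method of the present invention.
As shown in Figure 1, in step S102, the paired short reads obtained by paired-end resequencing of the sequencing fragments of the sample to be tested are aligned against the exogenous insert sequence to determine the exogenous single reads. The exogenous single reads comprise paired short reads in which only one of the two reads aligns to the exogenous insert sequence. When the sequencing data (labeled "transgenosis" in the figure) are aligned against the exogenous insert sequence (labeled "insertsize"), a large number of unpaired short reads aligned to the exogenous fragment accumulate at both ends of the insert; these are labeled single reads 1. The exogenous insert usually refers to an exogenous fragment introduced by transgenic technology or to a transposon sequence, and can be obtained, for example, by cloning.
In step S104, the paired short reads are aligned against the reference genome sequence to determine the genomic single reads. The genomic single reads comprise paired short reads in which only one of the two reads aligns to the reference genome sequence. When the sequencing data are aligned against the reference genome sequence (labeled "reference"), a large number of unpaired short reads aligned to the reference genome likewise accumulate at both ends of the insert; these are labeled single reads 2. The reference genome sequence usually refers to the genome sequence of a given species, which can be obtained by sequencing and assembly.
In step S106, the insertion site of the exogenous fragment in the genome sequence is determined from the intersection of the exogenous single reads and the genomic single reads. The single reads in the sets single reads 1 and single reads 2 are intersected, and the positions of the intersecting reads on the reference sequence are determined.
In the above embodiment, two alignments are required. The short reads obtained by paired-end resequencing are first aligned against the exogenous insert to obtain the single reads enriched at both ends of the insert, and the same short reads are aligned against the reference genome sequence to obtain the single reads at both ends of the insert. The two sets are then intersected, and the positions of the intersecting reads on the reference sequence are determined. Counting the read support at each corresponding site of the reference sequence then locates the insertion site. This makes full use of the positional characteristics of the exogenous insert and gives good accuracy.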
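The intersection in step S106 can be sketched as a join on pair ID. This is a minimal illustration, not the patent's implementation: the function name and the in-memory dictionaries are hypothetical, and it assumes each alignment result has already been reduced to one record per read pair.

```python
def intersect_single_reads(insert_hits, genome_hits):
    """Join single reads 1 and single reads 2 on pair ID.

    insert_hits: {pair_id: mate} where mate (1 or 2) is the read of the
                 pair that aligned to the exogenous insert (hypothetical layout).
    genome_hits: {pair_id: (mate, chrom, pos)} for the read of the pair that
                 aligned to the reference genome.
    Returns {pair_id: (chrom, pos)} for pairs whose two mates are split
    between the insert and the genome.
    """
    shared = {}
    for pair_id, (mate, chrom, pos) in genome_hits.items():
        insert_mate = insert_hits.get(pair_id)
        # Keep the pair only if the *other* mate is the one on the insert.
        if insert_mate is not None and insert_mate != mate:
            shared[pair_id] = (chrom, pos)
    return shared


insert_hits = {"r1": 2, "r2": 1, "r3": 1}
genome_hits = {"r1": (1, "chr1", 100), "r3": (1, "chr2", 50), "r4": (2, "chr1", 7)}
print(intersect_single_reads(insert_hits, genome_hits))
```

Pair `r3` is dropped in this example because the same mate appears in both sets, so the two records do not complement each other.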
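Figure 2 shows a flowchart of one embodiment of the transgenic exogenous fragment insertion site detection method of the present invention.
As shown in Figure 2, in step 202, paired-end resequencing is performed on the sequencing fragments of the sample to be tested to obtain paired short reads.
The DNA sample to be tested is randomly sheared into fragments of a certain length, called sequencing fragments. The sequencing fragment length is taken, for example, from 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp. Each sequencing fragment is sequenced from both ends, giving a pair of short reads from the two ends of the fragment. Each pair is assigned an ID number, and the two reads of a pair share the same ID number. The total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence, to ensure the required genome coverage. Preferably, the total number of sequenced bases is more than 20 times the genome size.
In step 204, the paired short reads obtained by paired-end resequencing are filtered.
After the short reads from paired-end resequencing are received, unqualified short reads are removed by filtering. For example, reads are removed if the number of bases whose sequencing quality is below a predetermined low-quality threshold exceeds, for example, 50% of the bases of the whole read. The low-quality threshold depends on the specific sequencing technology and sequencing environment; for example, a base quality value of B (ASCII value) can be used as the low-quality threshold. Reads are also removed if the number of bases with an undetermined sequencing result (such as N in Illumina GA output) exceeds 10% of the bases of the whole read. Adapter sequences in the short reads are removed. The adapter-trimmed short reads are then aligned against other exogenous sequences introduced during the experiment, such as various adapter sequences; a read that contains such an exogenous sequence is considered unqualified and is removed. Filtering out unqualified short reads improves the accuracy of detection.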
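The two numeric filters above (more than 50% low-quality bases, more than 10% undetermined bases) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, quality strings are compared character by character, and a base with quality character `B` itself is counted as low quality, which the text does not specify. Adapter trimming is not shown.

```python
def keep_read(seq, qual, low_qual_char="B", max_low_frac=0.5, max_n_frac=0.1):
    """Return True if a single read passes both numeric filters."""
    if not seq or len(seq) != len(qual):
        return False
    # Assumption: quality characters at or below the threshold count as low.
    low = sum(1 for q in qual if q <= low_qual_char)
    n_count = seq.upper().count("N")
    return low <= max_low_frac * len(seq) and n_count <= max_n_frac * len(seq)


def keep_pair(read1, read2, **kwargs):
    """A pair is kept only if both of its (seq, qual) reads pass."""
    return keep_read(*read1, **kwargs) and keep_read(*read2, **kwargs)


good = ("ACGT" * 10, "h" * 40)
noisy = ("ACGT" * 10, "B" * 21 + "h" * 19)  # 21 of 40 bases are low quality
print(keep_pair(good, good), keep_pair(good, noisy))
```

Dropping the whole pair when either read fails keeps the pair IDs consistent for the later intersection step.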
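In step 206, the paired short reads are aligned against the exogenous insert sequence to obtain the exogenous single reads.
Any common short-read alignment software, such as soap or bwa, can be used. The sequencing fragments should be of essentially the same length, with a certain allowed range of variation. Short reads obtained from sequencing fragments whose length is within the normal range are called normal short reads; short reads obtained from sequencing fragments outside the normal range are called abnormal short reads. The allowed range can be set as required. During alignment, the minimum alignment length for the resequenced short reads is 40 bp, and the maximum number of mismatches allowed per read should be as small as possible to ensure accurate alignment. For example, for a short read length of 90 bp, the maximum number of mismatches is set to 1 or 2. The actual setting can vary with the short read length.
After alignment, the resequenced short reads fall into three types. ① soap reads: normal short reads that exist as a pair and both align to the exogenous insert sequence. ② single reads: only one of the two normal short reads of a pair aligns to the exogenous insert sequence; short reads of this type are marked as single reads. In addition, the alignment software may mark paired abnormal short reads as single reads; in that case, a filtering step can be added to remove these abnormal short reads. ③ unmap reads: neither of the two short reads of a pair aligns to the exogenous insert sequence; short reads of this type are marked as unmap reads. From the alignment result, single reads are extracted for which only one read of the pair aligns to the exogenous insert sequence and aligns to it only once, which ensures the specificity of the alignment. The single reads obtained are stored, for example in a file named single file 1.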
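The three-way classification and the "aligned only once" rule can be sketched as below. The function names and the hit-count representation are hypothetical; a real aligner reports this information in its own output format, and the extra `"abnormal"` label is an assumption used here to keep out-of-range pairs separate from true single reads.

```python
def classify_pair(hits1, hits2, within_range=True):
    """Classify a read pair from the hit counts of its two reads.

    hits1, hits2: number of alignment positions found for read 1 and read 2.
    within_range: whether the implied fragment length is in the allowed range
                  (only meaningful when both reads align).
    """
    if hits1 == 0 and hits2 == 0:
        return "unmap"
    if hits1 > 0 and hits2 > 0:
        # Both reads align; out-of-range pairs are the "abnormal" pairs that
        # an aligner may otherwise report alongside single reads.
        return "soap" if within_range else "abnormal"
    return "single"


def is_unique_single(hits1, hits2):
    """True if exactly one read aligns, and it aligns exactly once."""
    return sorted((hits1, hits2)) == [0, 1]


for pair in [(1, 1), (1, 0), (0, 3), (0, 0)]:
    print(pair, classify_pair(*pair), is_unique_single(*pair))
```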
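In step 208, the paired short reads are aligned against the reference genome sequence to obtain the genomic single reads.
Any common short-read alignment software, such as soap or bwa, can be used. After alignment, the resequenced short reads fall into three types. ① soap reads: normal short reads that exist as a pair and both align to the reference genome. ② single reads: only one of the two normal short reads of a pair aligns to the reference genome; short reads of this type are marked as single reads. In addition, the alignment software may mark paired abnormal short reads as single reads; in that case, a filtering step can be added to remove these abnormal short reads. ③ unmap reads: neither of the two short reads of a pair aligns to the reference genome; short reads of this type are marked as unmap reads. From the alignment result, single reads are extracted for which only one read of the pair aligns to the reference genome and aligns to it only once, which ensures the specificity of the alignment; they are placed in a file named single file 2. The unpaired single reads in single file 2 are sorted in order of sample number, and the alignment results are separated by chromosome.
In step 210, the insertion site of the exogenous fragment is determined from the exogenous single reads and the genomic single reads.
Single reads with the same ID number are extracted from single file 1 and single file 2. The intersection is taken on the ID numbers of the single reads in the two files: single reads that have the same ID number in both files and pair with each other are the qualifying intersected single reads. (When one of the two short reads of a pair aligns to the reference genome, the other may align to the exogenous insert; the two reads of such a pair share the same ID number.) The intersected reads are sorted by their corresponding position on the reference genome (the start position at which the short read aligns to the reference genome). Taking the length of each short read into account, the support from the intersected single reads at each corresponding site of the reference sequence is counted, and the insertion site of the transgenic exogenous fragment is determined. The short-read support across the corresponding sites of the reference sequence forms a valley-shaped curve, with a lowest point in the middle and raised peaks at the two ends. There is a peak of enriched short reads on each side of the insertion site; support decreases gradually toward the insertion site, and a gap in the short reads (the lowest point of short-read support) appears near the insertion site. The extent of this gap (the lowest point) can be taken as the insertion position of the exogenous fragment.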
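The support counting and valley search described above can be sketched as follows. This is one possible reading, not the patent's implementation: the function names are hypothetical, reads are assumed to have a fixed length, and the valley is taken to be the longest stretch of minimum support lying between positions whose support reaches at least half of the maximum (the half-maximum rule is an assumption made here so that the low-support outer edges are ignored).

```python
from collections import Counter


def support_profile(starts, read_len):
    """Count how many intersected reads cover each reference position."""
    depth = Counter()
    for start in starts:
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


def find_gap(starts, read_len):
    """Return (first, last) of the lowest-support stretch between the peaks."""
    depth = support_profile(starts, read_len)
    if not depth:
        return None
    peak = max(depth.values())
    high = [pos for pos, d in depth.items() if 2 * d >= peak]
    window = range(min(high), max(high) + 1)
    floor = min(depth.get(pos, 0) for pos in window)
    best, run_start = None, None
    for pos in window:
        if depth.get(pos, 0) == floor:
            if run_start is None:
                run_start = pos
            # Track the longest run of minimum support seen so far.
            if best is None or pos - run_start > best[1] - best[0]:
                best = (run_start, pos)
        else:
            run_start = None
    return best


# Two clusters of 50 bp reads on either side of a candidate insertion site.
starts = [100, 105, 110, 115, 300, 305, 310, 315]
print(find_gap(starts, 50))
```

With these toy coordinates the reported range is the uncovered stretch between the two read clusters, which corresponds to the gap described in the text.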
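In the above embodiment, whether a sequencing fragment is normal or abnormal is determined from the set length and its allowed range. According to one embodiment of the present invention, before alignment, the degree of deviation (SD value) of the insert size (the length of the sequencing fragment) is first calculated to determine a suitable insert size range, which improves alignment accuracy and gives a reasonable alignment result. The insert sizes and the number of occurrences of each size are tallied, and the most frequent size is recorded as M. Each insert size not equal to M is recorded as x, with its number of occurrences recorded as n. The SD value is then calculated as:
$$\mathrm{SD} = \sqrt{\frac{\sum_{i} n_i \,(x_i - M)^2}{\sum_{i} n_i}}$$
The insert size distribution should be approximately normal, with its peak at the mean insert size, and the insert size has an upper and a lower limit. From the mean insert size and the occurrence counts of the inserts shorter than the mean, a left SD value (L-SD) is calculated, and the lower limit is obtained as lower limit = mean length - L-SD. From the mean insert size and the occurrence counts of the inserts longer than the mean, a right SD value (R-SD) is calculated, and the upper limit is obtained as upper limit = R-SD + mean length. Values between the lower and upper limits form the reasonable insert size range.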
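The left and right SD calculation can be sketched as below. The function names are hypothetical, and the most frequent length M is used as the center on both sides, on the assumption that for an approximately normal distribution it coincides with the mean length mentioned in the text.

```python
import math
from collections import Counter


def one_sided_sd(counts, center, side):
    """SD of the lengths on one side of `center`, weighted by occurrence count."""
    if side == "left":
        chosen = {x: n for x, n in counts.items() if x < center}
    else:
        chosen = {x: n for x, n in counts.items() if x > center}
    total = sum(chosen.values())
    if total == 0:
        return 0.0
    squares = sum(n * (x - center) ** 2 for x, n in chosen.items())
    return math.sqrt(squares / total)


def insert_size_bounds(lengths):
    """Return (lower, upper) = (M - L-SD, M + R-SD) for observed insert sizes."""
    counts = Counter(lengths)
    center = counts.most_common(1)[0][0]  # most frequent length M
    lower = center - one_sided_sd(counts, center, "left")
    upper = center + one_sided_sd(counts, center, "right")
    return lower, upper


lengths = [500] * 10 + [490] * 2 + [480] * 2 + [520] * 4
print(insert_size_bounds(lengths))
```

Computing the two sides separately lets the allowed range be asymmetric when the observed distribution is skewed.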
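Figure 3 shows a flowchart of one application example of the transgenic exogenous fragment insertion site detection method of the present invention. In this example, insertion of the exogenous fragment and sequencing are both simulated. The biological sample to be tested is transgenic Arabidopsis thaliana carrying an inserted exogenous gene fragment: a 10 kb exogenous fragment is inserted at random into the Arabidopsis reference genome to serve as the transgenic Arabidopsis sample, the sample is sequenced in simulation with the maq simulate software, and the result is used as the sequencing data.
Transgenic exogenous fragment: a 10 kb fragment from the mouse genome. Source database: UCSC Genome Browser, URL: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/
Reference genome: the Arabidopsis thaliana genome. Source database: Ensembl Genome Browser, URL: http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index
A 10 kb exogenous fragment is inserted into the genome in simulation, and simulated sequencing is then performed. The simulation program is maq simulate, which requires the following parameters: -d, -N, -1, -2, fq1, fq2, and simupars.dat. Each is described below. The -d parameter is the sequencing fragment length, set to 500. The -N parameter is the total number of short-read pairs to be produced. It is determined from the sequencing depth, which is one indicator of sequencing quality and is the ratio of the total number of sequenced bases (bp) to the genome size. It is calculated as N = sequencing depth × total reference genome length / (2 × read length). In this example the simulated sequencing depth is 20×, the total reference genome length is 121 Mb, and the short read length is 75 bp, so -N is set to about 16M (16 million). The -1 and -2 parameters are the lengths of short read 1 and short read 2 in paired-end resequencing, both set to 75 here. fq1 and fq2 are the output files: the simulated sequencing data, short read 1 and short read 2, are written in fastq format to the fq1 and fq2 files respectively. simupars.dat is a system file of the maq simulate software that determines the length and quality values of the short reads.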
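The value of -N follows directly from the formula N = sequencing depth × total reference genome length / (2 × read length). A small worked check, using a hypothetical helper name:

```python
def read_pairs_needed(depth, genome_len, read_len):
    """Number of read pairs so that 2 * read_len * N is about depth * genome_len."""
    return round(depth * genome_len / (2 * read_len))


# 20x depth, 121 Mb reference genome, 75 bp reads, as in the example above.
n_pairs = read_pairs_needed(20, 121_000_000, 75)
print(n_pairs)  # roughly 16 million pairs
```

The factor of 2 in the denominator reflects that each pair contributes two reads of the given length.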
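As shown in Figure 3, in step S301, the sequencing data are received, preprocessed (no preprocessing was performed here because simulated data were used), and stored in fastq format.
Step S302 has two parts:
(1) soap alignment of the sequencing data against the exogenous insert sequence;
(2) soap alignment of the sequencing data against the reference genome sequence.
For both soap alignments, the following parameters must be set: -p, -a, -b, -D, -o, -2, -u, -m, -x, -s, -l, -v. Each is described below. The -p parameter is the memory required to run the script. The -a parameter gives the input file for paired-end sequencing that is the fq1 file from resequencing (the file containing short read 1). The -b parameter gives the input file that is the fq2 file from resequencing (the file containing short read 2). The -D parameter gives the reference genome sequence in fasta format (the first line of a fasta sequence file is free text beginning with a greater-than sign ">" or a semicolon ";" and labels the sequence; the sequence itself starts on the second line and may use only the established nucleotide or amino acid codes). There are three output parameters. The -o parameter outputs the paired short reads that align to the reference genome or the exogenous fragment, in a file with the suffix .soap. The -2 parameter outputs the pairs in which only one short read aligns to the reference genome or the exogenous fragment, in a file with the suffix .single. The -u parameter outputs the paired short reads that do not align to the reference genome or the exogenous fragment, in a file with the suffix .unmap. The -t parameter is left unset so that the original ID numbers of the short reads are kept. The -m and -x parameters give the allowed insert size range: -m is the lower limit of the sequencing fragment length, that is, (negative percentage × sequencing fragment length), and -x is the upper limit, that is, (positive percentage × sequencing fragment length). In the present invention, to find as many qualifying short reads as possible, the allowed range is widened, and -m and -x are set to sequencing fragment length ± 0.88 × sequencing fragment length. The -s parameter is the minimum alignment length, set to 40. The -l parameter is the length of the seed used for the initial alignment (long reads have a high error rate at the 3' end, so a sequence of a set length from the 5' end is used as the seed), set to 32. The -v parameter is the maximum number of mismatches allowed per short read during alignment; in the present invention it should be set as small as possible to ensure accurate alignment. Note that the soap parameter settings must be kept consistent between the two alignments.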
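As a rough sketch, the -m and -x limits and the argument list described above could be assembled as follows. The flags are taken only from the description in this section and have not been checked against any particular soap release; the executable name, the file names, the helper names, and the use of `-l` for the seed-length flag are assumptions. The sketch only builds the argument list and does not run anything.

```python
def insert_size_window(fragment_len, fraction=0.88):
    """Lower and upper limits: fragment length -/+ fraction * fragment length."""
    delta = round(fraction * fragment_len)
    return fragment_len - delta, fragment_len + delta


def build_alignment_args(fq1, fq2, reference, prefix, fragment_len,
                         max_mismatch, exe="soap"):
    """Assemble the argument list for one alignment run (hypothetical helper)."""
    low, high = insert_size_window(fragment_len)
    return [
        exe,
        "-a", fq1, "-b", fq2,          # the two paired-end input files
        "-D", reference,               # reference genome or exogenous insert
        "-o", prefix + ".soap",        # pairs where both reads align
        "-2", prefix + ".single",      # pairs where only one read aligns
        "-u", prefix + ".unmap",       # pairs where neither read aligns
        "-m", str(low), "-x", str(high),
        "-s", "40",                    # minimum alignment length
        "-l", "32",                    # seed length (assumed flag spelling)
        "-v", str(max_mismatch),
    ]


print(insert_size_window(500))
print(build_alignment_args("reads_1.fq", "reads_2.fq", "insert.fa",
                           "vs_insert", 500, 2))
```

Calling the same helper twice, once with the exogenous insert and once with the reference genome as `reference`, is one way to keep the settings identical between the two alignments.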
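In step S303, the short reads from the sequencing data that aligned to the exogenous insert sequence are extracted and stored in the file single file 1.
In step S304, the short reads from the sequencing data that aligned to the reference genome are extracted and stored in the file single file 2.
In step S305, the short reads in each single file are processed: paired short reads are removed from the single file, and single reads with an alignment count of 1 are retained, to ensure the specificity of the alignment. Paired short reads appear because an allowed range for the sequencing fragment length was set during alignment, for example ± X × insert size; abnormal read pairs that fall outside this range in the alignment result are also placed in the single file. Therefore, to ensure alignment specificity, paired short reads must be removed from the single file and only single reads with an alignment count of 1 retained.
In step S306, the single reads in single file 2 obtained in step S305 are sorted in order of sample number and separated by chromosome.
In step S307, to determine where the single reads obtained by alignment against the exogenous fragment sequence lie on the reference genome, the single reads in single file 1 are intersected with those in single file 2 as sorted in step S306, which gives the position of the intersection on the reference genome.
In step S308, the intersection is sorted by corresponding position on the reference genome and, taking the short read length into account, the short-read support at each site of the reference sequence is counted.
There is a peak of enriched short reads on each side of the insertion site; support decreases gradually toward the insertion site, and a gap in the short reads appears near the insertion site. The extent of this gap is taken as the insertion position of the exogenous fragment (the position inside the dashed circle indicated by the arrow in Figure 4).
The present invention can determine, with fairly good accuracy, a range containing the insertion site of the exogenous fragment (from a few bases to more than a hundred bases). Although it cannot pinpoint the insertion site exactly, the site can be located accurately when the method is combined with a conventional PCR (polymerase chain reaction) experiment. Verified by a large number of simulation experiments, the invention has a high detection efficiency: the detection rate is 92.15% ± 0.01%, the false positive rate is 3.87% ± 0.013%, and the false negative rate is 4.4% ± 0.05%. The invention detects the insertion site by bioinformatic means, with a short turnaround and low cost, and addresses the low efficiency and high cost of current purely experimental techniques.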
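Figure 5 shows the structure of one embodiment of the transgenic exogenous fragment insertion site detection system of the present invention. As shown in Figure 5, the system includes a sequencing unit 51, an exogenous single-read determining unit 52, a genomic single-read determining unit 53, and an insertion site determining unit 54. The sequencing unit 51 obtains paired short reads by paired-end resequencing of the sequencing fragments of the sample to be tested. The exogenous single-read determining unit 52 aligns the paired short reads against the exogenous insert sequence to determine the exogenous single reads. The genomic single-read determining unit 53 aligns the paired short reads against the reference genome sequence to determine the genomic single reads. The insertion site determining unit 54 determines the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genomic single reads. Here, the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence, and the genomic single reads comprise paired short reads in which only one read aligns to the reference genome sequence. According to one embodiment of the present invention, the exogenous single reads comprise normal paired short reads in which only one read aligns to the exogenous insert sequence, and aligns to it only once; the genomic single reads comprise normal paired short reads in which only one read aligns to the reference genome sequence, and aligns to it only once.
According to one embodiment of the present invention, the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp; the length of the short reads is 40-75 bp or 75-200 bp; the total number of bases obtained by sequencing the sequencing fragments of the sample is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
Figure 6 shows the structure of another embodiment of the transgenic exogenous fragment insertion site detection system of the present invention. Compared with Figure 5, the system of this embodiment further includes a filtering unit 65 for filtering the paired short reads to remove unqualified short reads. For example, the filtering unit 65 removes paired short reads in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the bases of the short read; removes paired short reads in which the number of bases with an undetermined sequencing result exceeds 10% of the bases of the paired short reads; and removes adapter sequences from the paired short reads.
In the above embodiment, removing unqualified short reads with the filtering unit improves the accuracy of detection.
For the functions of the devices and units in Figures 5 and 6, refer to the corresponding parts of the method embodiments described above; for brevity, they are not described again here.
The detection method and system provided by the present invention are based on whole-genome resequencing. They address the problem that the four existing variation detection techniques cannot accurately locate the variation site, as well as problems that other experimental techniques cannot solve. They are accurate, simple, fast, and low in cost, and they provide a scientific, accurate, and reliable means of detecting large-fragment variation in the testing and regulation of transgenic organisms and their products.
Those skilled in the art will understand that the devices in Figures 5 and 6 can each be implemented by a separate computing device or integrated into a single standalone device. They are shown as blocks in Figures 5 and 6 to illustrate their functions. These functional blocks can be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, one or both of the functional blocks can be implemented by code running on a microprocessor, a digital signal processor (DSP), or any other suitable computing device. Code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements. The code can be located on a computer-readable medium. The computer-readable medium can include one or more storage devices, including, for example, RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. The computer-readable medium can also include a carrier wave that encodes a data signal.
Those skilled in the art will recognize the interchangeability of hardware, firmware, and software configurations in these circumstances, and how best to implement this functionality for each particular application.
The description of the present invention is given for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments with various modifications suited to particular uses.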

Claims

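1. A method for detecting the insertion site of a transgenic exogenous fragment, characterized by comprising: aligning paired short reads against an exogenous insert sequence to determine exogenous single reads, the paired short reads being obtained by paired-end resequencing of the sequencing fragments of a sample to be tested;
aligning the paired short reads against a reference genome sequence to determine genomic single reads;
determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genomic single reads.
2. The detection method according to claim 1, characterized in that the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence;
the genomic single reads comprise paired short reads in which only one read aligns to the reference genome sequence.
3. The detection method according to claim 2, characterized in that the exogenous single reads comprise normal paired short reads in which only one read aligns to the exogenous insert sequence, and aligns to it only once; the genomic single reads comprise normal paired short reads in which only one read aligns to the reference genome sequence, and aligns to it only once.
4. The detection method according to claim 1, characterized by further comprising: filtering the paired short reads to remove unqualified short reads.
5. The detection method according to claim 4, characterized in that filtering the paired short reads to remove unqualified paired short reads comprises: removing paired short reads in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the bases of the short read;
and/or removing paired short reads in which the number of bases with an undetermined sequencing result exceeds 10% of the bases of the paired short reads;
and/or
removing adapter sequences from the paired short reads.
6. The detection method according to claim 1, characterized in that the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp;
and/or
the length of the short reads is 40-75 bp or 75-200 bp;
and/or
the total number of bases obtained by sequencing the sequencing fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.
7. A system for detecting the insertion site of a transgenic exogenous fragment, characterized by comprising: a sequencing unit, for performing paired-end resequencing on the sequencing fragments of a sample to be tested to obtain paired short reads;
an exogenous single-read determining unit, for aligning the paired short reads against an exogenous insert sequence to determine exogenous single reads;
a genomic single-read determining unit, for aligning the paired short reads against a reference genome sequence to determine genomic single reads;
an insertion site determining unit, for determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the exogenous single reads and the genomic single reads.
8. The detection system according to claim 7, characterized in that the exogenous single reads comprise paired short reads in which only one read aligns to the exogenous insert sequence;
the genomic single reads comprise paired short reads in which only one read aligns to the reference genome sequence.
9. The detection system according to claim 8, characterized in that the exogenous single reads comprise normal paired short reads in which only one read aligns to the exogenous insert sequence, and aligns to it only once; the genomic single reads comprise normal paired short reads in which only one read aligns to the reference genome sequence, and aligns to it only once.
10. The detection system according to claim 8, characterized by further comprising: a filtering unit, for filtering the paired short reads to remove unqualified short reads.
11. The detection system according to claim 10, characterized in that the filtering unit removes paired short reads in which the number of bases whose sequencing quality is below a predetermined threshold exceeds 50% of the bases of the short read;
and/or
removes paired short reads in which the number of bases with an undetermined sequencing result exceeds 10% of the bases of the paired short reads;
and/or
removes adapter sequences from the paired short reads.
12. The detection system according to claim 8, characterized in that the length of the sequencing fragments of the sample to be tested is 170-500 bp, 500-1000 bp, 1000-2000 bp, or 2000-10000 bp;
and/or
the length of the short reads is 40-75 bp or 75-200 bp;
and/or
the total number of bases obtained by sequencing the sequencing fragments of the sample to be tested is 5-10 times, 10-20 times, or more than 20 times the total number of bases in the reference genome sequence.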
PCT/CN2011/000095 2011-01-20 2011-01-20 Method and system for detecting insertion sites of transgenic exogenous fragments WO2012097474A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
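PCT/CN2011/000095 WO2012097474A1 (zh) 2011-01-20 2011-01-20 Method and system for detecting insertion sites of transgenic exogenous fragments
CN201180062256.XA CN103270175B (zh) 2011-01-20 2011-01-20 Method and system for detecting insertion sites of transgenic exogenous fragments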

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/000095 WO2012097474A1 (zh) 2011-01-20 2011-01-20 Method and system for detecting insertion sites of transgenic exogenous fragments

Publications (1)

Publication Number Publication Date
WO2012097474A1 true WO2012097474A1 (zh) 2012-07-26

Family

ID=46515067

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/000095 WO2012097474A1 (zh) 2011-01-20 2011-01-20 Method and system for detecting insertion sites of transgenic exogenous fragments

Country Status (2)

Country Link
CN (1) CN103270175B (zh)
WO (1) WO2012097474A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
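CN103993069B (zh) * 2014-03-21 2020-04-28 深圳华大基因科技服务有限公司 Capture-sequencing analysis method for viral integration sites
CN107630079B (zh) * 2016-07-19 2020-07-28 中国农业科学院作物科学研究所 Method for determining the sequence, insertion position, and flanking sequences of exogenous DNA fragments in transgenic organisms
CN108959853B (zh) * 2018-05-18 2020-01-17 广州金域医学检验中心有限公司 Analysis method, analysis apparatus, device, and storage medium for copy number variation
CN110556165B (zh) * 2019-09-12 2022-03-18 浙江大学 Method for rapidly identifying transgenic or gene-edited materials and their insertion sites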

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999053095A2 (en) * 1998-04-09 1999-10-21 Whitehead Institute For Biomedical Research Biallelic markers
CN101646782A (zh) * 2007-01-29 2010-02-10 科学公共卫生研究所(Iph) Transgenic plant event detection
CN101914628A (zh) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphic sites in a target genomic region

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAO SUQIN ET AL.: "Detection Method for Transgenic Food.", PESTICIDE SCIENCE AND ADMINISTRATION, vol. 23, no. 3, 2002, pages 26 - 28 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
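WO2014183270A1 (zh) * 2013-05-15 2014-11-20 深圳华大基因科技有限公司 Method and device for detecting chromosomal structural abnormalities
CN104302781A (zh) * 2013-05-15 2015-01-21 深圳华大基因科技有限公司 Method and device for detecting chromosomal structural abnormalities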
US11004538B2 (en) 2013-05-15 2021-05-11 Bgi Genomics Co., Ltd. Method and device for detecting chromosomal structural abnormalities

Also Published As

Publication number Publication date
CN103270175B (zh) 2015-06-24
CN103270175A (zh) 2013-08-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11855930

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11855930

Country of ref document: EP

Kind code of ref document: A1