WO2014005329A1 - Method and system for determining integration manner of foreign gene in human genome - Google Patents

Method and system for determining integration manner of foreign gene in human genome

Publication number
WO2014005329A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
human genome
assembly
data
result
Prior art date
Application number
PCT/CN2012/078311
Other languages
French (fr)
Chinese (zh)
Inventor
曾玺
李伟阳
陈盛培
蒋慧
汪建
王俊
杨焕明
张秀清
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院
Priority to PCT/CN2012/078311 (WO2014005329A1)
Priority to CN201280074522.5A (CN104428423A)
Publication of WO2014005329A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6869: Methods for sequencing
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Definitions

  • The present invention relates to the field of biotechnology, in particular to a bioinformatics analysis method for detecting the integration of a pathogen genome into the human genome, and more particularly to a method, a system, and a computer readable medium for determining the manner in which a foreign gene is integrated in the human genome.
  • Background
  • HBV, HIV, and HPV can integrate their own DNA into the human genome. After HBV infection, viral replication causes inflammation at the infected site, which in turn triggers cancerous transformation of the cells.
  • Among HPV types, HPV16 causes inflammation after infecting the cervix and promotes abnormal growth of cervical cells, resulting in canceration. Integration of HPV16 is clearly observed in the cancerous tissue.
  • the present invention aims to solve at least one of the technical problems existing in the prior art.
  • The present invention aims to propose a method for efficiently and precisely identifying the integration sites of pathogen genomic fragments across the whole genome.
  • the invention proposes a method of determining the manner in which a foreign gene is integrated in the human genome.
  • The method comprises: capturing, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment using a capture probe; sequencing the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data; performing a first impurity removal on the sequencing result to obtain a first-impurity-removed sequencing result; performing a first alignment of the first-impurity-removed sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain an integrated fragment of the foreign gene; assembling the obtained sequencing data to obtain an assembly result consisting of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain a second-impurity-removed assembly result; performing a second alignment of the second-impurity-removed assembly result against the known human genome sequence and the foreign gene sequence
  • The invention also provides a system for determining the manner in which a foreign gene is integrated in the human genome.
  • The system comprises: a capture device adapted to capture, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment using a capture probe; a sequencing device, connected to the capture device and adapted to sequence the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data; a first impurity removal device, connected to the sequencing device and adapted to perform a first impurity removal on the sequencing result to obtain a first-impurity-removed sequencing result; a first alignment device, connected to the first impurity removal device and adapted to perform a first alignment of the first-impurity-removed sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain an integrated fragment of the foreign gene; an assembly device, the assembly
  • the invention provides a computer readable medium.
  • Instructions are stored on the computer readable medium and are adapted to be executed by a processor to determine the manner in which the foreign gene is integrated in the human genome by the following steps: performing a first impurity removal on the sequencing result to obtain a first-impurity-removed sequencing result; performing a first alignment of the first-impurity-removed sequencing result against a known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain an integrated fragment of the foreign gene; assembling the obtained sequencing data to obtain an assembly result consisting of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain a second-impurity-removed assembly result; and performing a second alignment of the second-impurity-removed assembly result against the known human genome sequence and the foreign gene sequence, and determining, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome, wherein
  • FIG. 1 is a flow chart of a method for determining a manner of integration of a foreign gene in a human genome according to an embodiment of the present invention.
  • FIG. 2 is a flow chart showing a method of determining a manner of integration of a foreign gene in a human genome according to still another embodiment of the present invention.
  • FIG. 3 is a schematic illustration of the structure of a system for determining the manner in which foreign genes are integrated in the human genome, in accordance with one embodiment of the present invention.
  • Detailed description
  • PER assembly: as described herein, PER refers to the assembly of paired-end (PE) sequencing data. That is, each pair of PE sequencing data obtained by paired-end sequencing is assembled according to the overlap between the two sequences.
  • Substitution: as described herein, a substitution occurs when a fragment of pathogen genomic DNA is inserted into the human genome and the human genomic DNA at the insertion site is deleted.
  • PCR repeat: repeated amplification of the same fragment during PCR.
  • Adaptor: the sequencing linker. Adaptor sequence appears in some of the raw (off-machine) sequencing data.
  • BWA: short for Burrows-Wheeler Aligner, a sequence alignment program.
  • SOAP: short for Short Oligonucleotide Analysis Package, a sequence alignment program.
  • The term "connection" should be understood broadly unless otherwise explicitly stated and defined: it may be, for example, a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; and it may be a direct connection, an indirect connection through an intermediate medium, or internal communication between two elements.
  • The specific meaning of the above terms in the present invention can be understood on a case-by-case basis.
  • the invention proposes a method of determining the manner in which a foreign gene is integrated in the human genome.
  • the method includes:
  • Capture step S100: capture, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment using a capture probe.
  • The type of foreign gene that can be analyzed using the method of the present invention is not particularly limited, as long as it can integrate into the human genome and its gene sequence can be obtained or is already known.
  • the foreign gene that can be studied is the pathogen genome. Further, according to a specific example of the present invention, the pathogen is HBV. Thereby, the integration of pathogens such as HBV with the human genome can be effectively analyzed.
  • Sequencing step S200: sequence the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data.
  • the manner of sequencing the captured DNA fragments is not particularly limited.
  • sequencing is performed by a second generation sequencing platform.
  • the whole genome sequencing library can be sequenced using at least one selected from the group consisting of Hiseq2000, SOLiD, 454 and single molecule sequencing devices.
  • the efficiency of determining the integration mode of the foreign gene in the human genome can be further improved by utilizing the characteristics of high-throughput and deep sequencing of these sequencing devices.
  • other sequencing methods and devices can be used for whole genome sequencing, such as third generation sequencing techniques, as well as more advanced sequencing techniques that may be developed in the future.
  • The length of the sequencing data obtained by whole genome sequencing is not particularly limited. According to an embodiment of the present invention, the sequencing length is preferably 100 bp, whereby the analysis effect can be further improved.
  • First impurity removal step S300: perform a first impurity removal on the sequencing result to obtain a first-impurity-removed sequencing result.
  • the type of the first impurity removal is not particularly limited.
  • The first impurity removal may further include at least one of the following: removing PCR duplicates, removing low-quality sequencing data, and removing linker-containing sequencing data. Thereby, the analysis efficiency can be further improved.
  • First alignment step S400: perform a first alignment of the first-impurity-removed sequencing result against the known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain an integrated fragment of the foreign gene.
  • the first alignment can be performed using SOAP. Thereby, the analysis efficiency can be further improved.
  • Assembly step S500: assemble the obtained sequencing data, which may contain an integrated fragment of the foreign gene, to obtain an assembly result consisting of a plurality of assembly data. According to an embodiment of the invention, the assembly is based on the overlap between the sequencing data.
  • Second impurity removal step S600: perform a second impurity removal on the assembly result to obtain a second-impurity-removed assembly result.
  • the second impurity removal further comprises removing duplicate assembly data.
  • Second alignment step S700: perform a second alignment of the second-impurity-removed assembly result against the known human genome sequence and the foreign gene sequence.
  • the second alignment is performed using BWA.
  • Determining the manner in which the foreign gene is integrated in the human genome based on the second alignment result further comprises: selecting assembly data that align simultaneously to the known human genome sequence and to the foreign gene sequence, such assembly data containing human genome breakpoint information and foreign gene breakpoint information.
  • Referring to FIG. 2, the method for determining the manner in which a foreign gene is integrated in the human genome according to an embodiment of the present invention is explained in detail below, taking HBV as an example. As shown in FIG. 2, it specifically includes the following steps:
  • Methods for obtaining a pathogen nucleic acid integration sequence include, but are not limited to, the following methods: A DNA fragment that may contain a pathogen genomic fragment is captured from a sample using a capture probe technique, and the resulting sequence is sequenced.
  • Strategy for removing low-quality sequencing data: when the number of bases with a sequencing quality value less than or equal to 5 accounts for more than 50% of the total number of bases in a piece of sequencing data, that sequencing data is considered low quality. When either read of a PE pair is low quality, the whole pair is removed.
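The low-quality rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, quality values are assumed to be already decoded to integers, and treating an empty read as low quality is an assumption.

```python
from typing import List, Tuple

LOW_Q_THRESHOLD = 5        # a base is "low quality" when its score is <= 5 (from the text)
MAX_LOW_Q_FRACTION = 0.5   # a read is low quality when MORE than 50% of bases are low quality

Quals = List[int]


def is_low_quality(quals: Quals) -> bool:
    """Return True when more than half of the bases have quality <= 5."""
    if not quals:
        return True  # assumption: an empty read is treated as low quality
    low = sum(1 for q in quals if q <= LOW_Q_THRESHOLD)
    return low / len(quals) > MAX_LOW_Q_FRACTION


def filter_pairs(pairs: List[Tuple[Quals, Quals]]) -> List[Tuple[Quals, Quals]]:
    """Keep a PE pair only when neither mate is low quality."""
    return [
        (q1, q2)
        for q1, q2 in pairs
        if not (is_low_quality(q1) or is_low_quality(q2))
    ]
```

Note that a read with exactly 50% low-quality bases is kept, because the text requires the fraction to exceed 50%.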
  • The processed sequencing data were aligned separately to the human genome hg19 and to the pathogen reference genome. Because a pathogen genome generally has multiple subtypes, the appropriate subtype should be chosen as the pathogen reference genome according to need. After the alignment is completed, the pairing relationship between the two alignment results is analyzed to select sequencing data that may contain an integrated pathogen genome fragment. The proportion of useful sequences in the original sequencing data, and the proportions of the useful sequences that align to the human genome and to the pathogen genome, are calculated separately.
  • the sequencing data obtained in the third step which may contain the integrated fragments of the pathogen genomic fragments, was subjected to PER assembly.
  • PER refers to the assembly of pair end sequencing data. That is, each pair of PE sequencing data obtained by pair end sequencing is assembled according to the overlapping relationship between the sequences.
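As an illustration of overlap-based assembly of one PE pair, here is a minimal sketch. It rests on several assumptions not stated in the source: the second read is reverse-complemented before merging, the overlap must match exactly, and the `min_overlap` default of 10 is arbitrary. The function names are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a PE pair when a suffix of read1 equals a prefix of the
    reverse-complemented read2; return None when no overlap is found."""
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first, down to min_overlap.
    for k in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None
```

For example, two reads taken from opposite ends of a 25 bp fragment with an 8 bp overlap merge back into the full fragment when `min_overlap` is set to 5. A production assembler would also tolerate mismatches in the overlap, which this sketch does not.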
  • The deduplication strategy for SE sequencing data is used here: when a sequence is duplicated, the duplicate sequencing data is removed.
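A minimal sketch of this deduplication, assuming that "duplicated" means an identical sequence string and that the first copy is kept (both are assumptions; the function name is hypothetical):

```python
from typing import Iterable, List


def remove_duplicate_sequences(sequences: Iterable[str]) -> List[str]:
    """Drop repeated sequences, keeping the first occurrence and input order."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique
```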
  • A sequence set is thus obtained. This sequence set is then aligned again to the human genome hg19 and the pathogen genome using BWA. By analyzing the two alignment results, sequences that align to both the human genome hg19 and the pathogen genome are selected. These sequences contain breakpoint information. Their alignments to the human genome hg19 and to the pathogen genome are analyzed separately to obtain the distribution of the pathogen integrated fragments on the human genome and on the pathogen genome.
  • the distribution here includes, but is not limited to, the alignment position, the number of sequencing data supporting the left endpoint of an insertion breakpoint, the number of sequencing data supporting the right endpoint of an insertion breakpoint, the total number of supported sequencing data, and the normalized amount of data.
  • the normalization strategy here is normalized based on the number of valid sequencing data.
  • the presence of a substitution type variation is examined by analyzing the intrinsic link between the human genome breakpoint and the pathogen genome breakpoint.
  • The result information of the re-alignment is combined with the initial sequence information to calculate the actual efficiency of the probe capture.
  • the method for determining the manner in which foreign genes are integrated in the human genome according to an embodiment of the present invention enables the discovery of precise pathogen genomic fragment insertion positions within the human genome.
  • the method of determining the manner in which a foreign gene is integrated in the human genome according to an embodiment of the present invention can give some possible types of substitution, as well as the type of partial pathogen genomic insert.
  • The method for determining the manner in which a foreign gene is integrated in the human genome according to an embodiment of the present invention is quick and convenient to use. Taking a starting data volume of 5G as an example, the analysis can be completed within two days.
  • After extensive and intensive research, the present inventors have for the first time constructed a bioinformatics method for detecting the integration manner of a pathogen genome in a sample to be tested, together with its application. Specifically, the inventors aligned, screened, assembled, and re-aligned the sequences obtained by sequence capture, and established a complete bioinformatics analysis pipeline. The present invention was completed on the basis of using this pipeline to detect signals of pathogen genome integration in the human genome.
  • The type of sample that can be processed is not particularly limited as long as it contains nucleic acid. The type of nucleic acid is likewise not particularly limited and may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), preferably DNA.
  • the source of the sample is not particularly limited.
  • a cancer tissue sample can be used as a test sample, whereby a DNA sequence of a pathogen genomic insert can be extracted therefrom, and the insertion of the pathogen genomic fragment can be detected and analyzed.
  • samples that may be used in accordance with embodiments of the invention include, but are not limited to, patient plasma, cancer tissue cells, paracancerous tissue cells.
  • DNA library preparation is required prior to application of the probe for sequence capture, and methods for preparing sample libraries are well known to those skilled in the art.
  • the term "DNA library preparation" refers to disrupting a target fragment of a genome to obtain a mixture of DNA fragments of a certain size.
  • the method and apparatus for capturing a specific sequence from a sample are also not particularly limited and can be carried out using a commercially available probe.
  • the sequencing sequence refers to a sequence fragment output by the sequencer, that is, sequencing data (reads).
  • the DNA fragments used for sequencing are captured by a particular probe. The probe must be pure and not affected by other different sequence nucleic acids.
  • Typical probes are cloned DNA sequences or DNA obtained by PCR amplification; synthetic oligonucleotides, or RNA obtained by in vitro transcription of cloned DNA sequences, can also be used as probes.
  • The length of the probe may range from 20 to 500 mers, preferably from 50 to 300 mers, more preferably 250 mers.
  • Probe design and synthesis methods are well known to those skilled in the art; probes can be synthesized or obtained commercially.
  • The sequencing data can be obtained from the sample by any sequencing method, including but not limited to the dideoxy chain termination method; a high-throughput sequencing method is preferred, including but not limited to second generation sequencing technology or single molecule sequencing technology.
  • The second generation sequencing platforms described in the present invention include, but are not limited to, the sequencing platforms provided by Illumina (e.g. GA series, HiSeq series), Life Technologies (e.g. SOLiD series, semiconductor sequencing series) and Roche (e.g. GS series).
  • The types of sequencing described in the present invention include, but are not limited to, paired-end sequencing, and the sequencing length includes, but is not limited to, 100 bp.
  • The sequencing platform is Illumina/Solexa.
  • The sequencing type is paired-end sequencing, which yields 100 bp DNA sequences having a bidirectional positional relationship.
  • The main fragment length of the sequencing library should be less than 200 bp. This gives a sufficient assembly success rate and ensures enough data for subsequent studies.
  • the plasma sample has a sequencing yield of about 5G and the tissue cell sample has a sequencing data volume of about 1G. The larger the amount of data, the more comprehensive the information that can be detected.
  • pathogen genomic insertion signals can be determined by data generated by next generation sequencing techniques, including but not limited to location, genotype, number, and length.
  • The human genome sequence used for the SOAP2 alignment and the BWA alignment is the human genome reference sequence of version 37 (hg19; NCBI Build 37) in the NCBI database.
  • the pathogen reference sequence for performing SOAP2 alignment and BWA alignment is selected from 23 genomic sequences of the 8 subtypes of the pathogen.
  • The alignment includes PER pre-assembly alignment and PER post-assembly alignment, wherein the PER pre-assembly alignment allows up to 5 mismatched bases.
  • The PER pre-assembly sequence alignment can be performed with any sequence alignment program available to those skilled in the art, such as the Short Oligonucleotide Analysis Package (SOAP) or BWA (Burrows-Wheeler Aligner 0.5.8c (r1536)). The sequencing data are aligned to the reference genome sequences and classified according to how they align to the pathogen genome and the human genome.
  • the sequence alignment can be performed using default parameters provided by the program, or the parameters can be selected by those skilled in the art as needed.
  • the comparison software employed is SOAPaligner/soap2.
  • Another alignment is post-assembly alignment of the PER.
  • Sequence alignment after PER assembly can be performed by any sequence alignment program that can be set to allow a sufficient length of gap.
  • the sequence alignment can be performed using default parameters provided by the program, or the parameters can be selected by those skilled in the art as needed.
  • The alignment is performed with the BWASW algorithm of BWA (Burrows-Wheeler Aligner), which is available to those skilled in the art, and the alignment positions are determined from the alignment.
  • the conditions for extracting sequencing data that may contain information on the pathogen genomic insert are:
  • the required sequencing data is selected, and the comparison ratio is calculated:
  • This step outputs a file including, but not limited to, the following items: yield, alignment rate, contamination rate, and proportion of valid sequencing data.
  • Valid sequencing data refers to the sequencing data remaining after contaminated sequencing data is removed from the raw off-machine sequencing data.
  • Contaminated sequencing data here refers to PCR duplicate sequencing data, linker-containing sequencing data, and low-quality sequencing data.
  • The required sequencing data is selected and the alignment ratio is calculated, with the insert size standard deviation (sd) value of the SOAP alignment set to 30. This ensures the greatest possible utilization of the sequencing data.
  • The number of sequencing data falling in a given cell is denoted $V_{num}$.
  • The human genome alignment rate is denoted alnRate.
  • The required sequencing data is selected and the alignment ratio is calculated. A pair of PE sequencing data is extracted as possibly containing information on the pathogen genomic insert when, allowing at most 5 mismatches, it satisfies one of the following conditions:
  • 1) one read of the pair aligns to the human genome and the other aligns to the pathogen genome;
  • 2) one read of the pair aligns to the human genome and the other aligns to no reference sequence;
  • 3) one read of the pair aligns to the pathogen genome and the other aligns to no reference sequence;
  • 4) neither read of the pair aligns to any reference sequence.
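The four extraction conditions above can be sketched as a small classifier. This is illustrative only: it assumes each read has already been assigned a single alignment target by an upstream aligner run with at most 5 mismatches, and the `Target` enum and function name are hypothetical.

```python
from enum import Enum


class Target(Enum):
    HUMAN = "human"
    PATHOGEN = "pathogen"
    NONE = "unmapped"


# Unordered mate combinations that satisfy conditions 1) to 4) in the text.
_CANDIDATE_PAIRS = {
    frozenset({Target.HUMAN, Target.PATHOGEN}),   # condition 1
    frozenset({Target.HUMAN, Target.NONE}),       # condition 2
    frozenset({Target.PATHOGEN, Target.NONE}),    # condition 3
    frozenset({Target.NONE}),                     # condition 4: both unmapped
}


def may_contain_integration(mate1: Target, mate2: Target) -> bool:
    """Return True when the pair matches one of the four extraction conditions."""
    return frozenset({mate1, mate2}) in _CANDIDATE_PAIRS
```

Pairs in which both reads align to the human genome, or both align to the pathogen genome, match none of the conditions and are not extracted.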
  • In the SOAP alignment step, "calculating useful sequences in raw sequencing data" means removing PE sequencing data containing PCR duplicates, PE sequencing data containing low-quality sequencing data, and PE sequencing data containing adaptor sequence.
  • Items 5, 6, 8, and 9 of Table 1 below are the sequences selected in some embodiments of the present invention.
  • The rows in Table 1 indicate the alignment of the sequences with the human genome. PE indicates that both reads of a PE pair align to the reference sequence within the set insert length; SE indicates that only one read of the pair aligns to the reference sequence, or that both align but their alignment positions are not within the set insert length; Unmap indicates that neither read of the pair aligns to the reference sequence.
  • Regarding the step of removing duplicate sequences again in FIG. 2: although PCR duplicates in the PE sequencing data were already removed in the first impurity removal (step 2), that filtering is not thorough. Some pairs have mismatches in their overlapping parts yet can still be assembled, so the same sequence may result after assembly.
  • The breakpoint information is extracted using BWA 0.5.8c (r1536).
  • The re-alignment uses BWASW, which is suited to long-sequence alignment and supports alignment with high error tolerance. It uses a heuristic Smith-Waterman-like algorithm to search for high-scoring alignment positions. The alignment uses the software's default parameter values throughout. For details, see http://bio-bwa.sourceforge.net/bwa.shtml
  • After the BWA re-alignment is completed, the alignment results must be processed to extract the breakpoint information. The alignment result selected for each sequence must meet the following conditions:
  • the length of the sequence aligned with the reference sequence must be greater than or equal to 30 bp.
  • sequence length of the insert portion may be greater than or equal to
  • Regarding extraction of the breakpoint information: BWA alignment follows a basic strategy of clipping. For example, when the first half of a piece of sequencing data cannot be aligned to the reference sequence, the alignment software directly clips off the unalignable portion and continues aligning the remainder.
  • The breakpoint information is extracted according to the structure of each sequence, where p denotes the alignment position given by the software:
  • 1) The upstream part of the sequence is pathogen genome sequence and the downstream part is human DNA sequence, with no other type of bases. The insertion position of the pathogen genome sequence fragment is taken as the alignment position p given by the software.
  • 2) The upstream part of the sequence is human DNA sequence and the downstream part is pathogen genome sequence, with no other type of bases. The insertion position is offset from the alignment position given by the software: assuming the upstream human DNA sequence has length x, the insertion position of the pathogen genome sequence fragment is taken as p + x.
  • 3) Both the upstream and downstream parts of the sequence are human genome sequence and the middle part is the pathogen genome insert. Assuming the upstream human DNA sequence portion has length x and the downstream human DNA sequence portion has length [illegible], the pathogen genomic insertion position is taken as p + x.
  • 4) Both the upstream and downstream parts of the sequence are pathogen genomic sequence and the middle part is human genome sequence. Such a sequence carries the signal of two integration points of pathogen genomic fragments, so two insertion positions can be extracted from its alignment result.
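The four sequence layouts can be sketched as a lookup from segment order to insertion position(s). This is a hedged illustration: the segment representation and function name are hypothetical, the value used for the pathogen-then-human layout (p itself) is an interpretation of the text, and the two positions returned for the pathogen-human-pathogen layout (p and p + x, with x the length of the middle human part) are an assumption, since the source only states that two positions can be extracted.

```python
from typing import List, Tuple

# One assembled sequence as ordered segments: ("H" or "P", length in bp).
Segment = Tuple[str, int]


def insertion_positions(segments: List[Segment], p: int) -> List[int]:
    """Return insertion position(s) on the human genome.

    p is the alignment position reported for the (first) human segment.
    """
    layout = "".join(source for source, _ in segments)
    if layout == "PH":
        # Pathogen upstream, human downstream: interpreted as p itself.
        return [p]
    if layout in ("HP", "HPH"):
        # Human upstream of the insert: offset by the upstream human length x.
        x = segments[0][1]
        return [p + x]
    if layout == "PHP":
        # Assumption: one breakpoint at each end of the middle human segment.
        x = segments[1][1]
        return [p, p + x]
    raise ValueError(f"unsupported segment layout: {layout}")
```

Whether p + x needs an off-by-one adjustment depends on the coordinate convention of the aligner output, which the source does not specify.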
  • The output file of the breakpoint extraction step gives the number of left support sequencing data, the number of right support sequencing data, the total number of support sequencing data, and the corresponding normalized numbers of sequencing data.
  • The number of left support sequencing data is the number of sequencing data supporting the left endpoint of a given insertion breakpoint.
  • The number of right support sequencing data is the number of sequencing data supporting the right endpoint of a given insertion breakpoint.
  • the total number of supported sequencing data is equal to the sum of the number of left support sequences and the number of right support sequences.
  • Let the number of left support sequencing data before normalization be $V_l$ and the number of useful sequencing data pairs be $b$ M pairs; the number of left support sequencing data after normalization is then $V_l / b$.
  • Likewise, let the number of right support sequencing data before normalization be $V_r$ and the number of useful sequencing data pairs be $b$ M pairs; the number of right support sequencing data after normalization is then $V_r / b$.
  • the normalized total number of support sequences is the sum of the number of normalized left support sequences and the number of normalized right support sequences.
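The normalization above amounts to dividing each support count by the number of useful pairs expressed in M pairs. A minimal sketch, with a hypothetical function name and dictionary keys, and assuming "M" denotes millions:

```python
from typing import Dict


def normalize_support(left: int, right: int, useful_pairs_m: float) -> Dict[str, float]:
    """Normalize left/right support counts by the useful pair count (in M pairs)."""
    if useful_pairs_m <= 0:
        raise ValueError("useful_pairs_m must be positive")
    norm_left = left / useful_pairs_m
    norm_right = right / useful_pairs_m
    return {
        "total": left + right,                  # sum of left and right support
        "norm_left": norm_left,
        "norm_right": norm_right,
        "norm_total": norm_left + norm_right,   # sum of the normalized counts
    }
```

With 12 left and 8 right support sequencing data and 4 M useful pairs, this gives normalized values of 3.0 and 2.0 and a normalized total of 5.0.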
  • In the step of checking for substitutions, the specific method is as follows: when two breakpoints on the human genome each have only left support sequencing data or only right support sequencing data, the sequence between the two breakpoint positions is the human genome sequence that has been substituted.
  • the alignment positions of the left support sequencing data and the right support sequence data on the pathogen genome were found separately, and the sequence between the two aligned positions found was the pathogen genome sequence in which the substitution occurred.
  • all displacements will be output in accordance with a random combination of eligible breakpoints on the human genome.
  • the step of calculating the length and type of the pathogen genomic insert the specific method is to find the left-end support sequence and the right-end support sequence of a pathogen genomic insert, and then find the support sequences at the two ends, respectively. In the alignment position on the pathogen genome, the sequence between the two positions is the insert.
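As a rough sketch of the insert calculation just described, the snippet below takes the two alignment positions on the pathogen genome and returns the sequence between them together with its length. The function name `pathogen_insert`, the 0-based half-open coordinates, and the treatment of the pathogen genome as a linear string are all assumptions made for illustration; a circular genome such as HBV would need extra handling for inserts that span the origin.

```python
def pathogen_insert(pathogen_genome, left_pos, right_pos):
    """Return (insert_sequence, insert_length).

    left_pos / right_pos: 0-based positions on the pathogen genome where the
    left-end and right-end supporting sequences align (hypothetical inputs).
    The genome is treated as linear; wrap-around is not handled.
    """
    # The two positions may come in either order, so sort them first.
    start, end = sorted((left_pos, right_pos))
    segment = pathogen_genome[start:end]
    return segment, len(segment)
```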
  • the step of calculating the actual capture efficiency of the present invention: the number of sequences in the BWA re-alignment that align to both the human genome hg19 and the pathogen genome is counted and denoted A.
  • the number of useful pairs of PE sequencing data in the original sequencing data is denoted B.
  • the effective capture efficiency is calculated as A/B.
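The ratio A/B is simple enough to state as code. The helper below is a hypothetical illustration (the name `capture_efficiency` and its parameter names are not from the source); it only encodes the definition given above.

```python
def capture_efficiency(chimeric_sequences, useful_pairs):
    """Effective capture efficiency A/B.

    chimeric_sequences (A): sequences in the re-alignment that align to both
    the human genome and the pathogen genome.
    useful_pairs (B): useful PE read pairs in the original sequencing data.
    """
    if useful_pairs <= 0:
        raise ValueError("useful_pairs must be positive")
    return chimeric_sequences / useful_pairs
```

For instance, 50 such sequences out of 1,000,000 useful pairs would give an efficiency of 5e-05; these numbers are invented for the example and are not the patent's data.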
  • the last breakpoint information that can be found includes breakpoint information of the pathogen genome fragment on the human genome and breakpoint information of the pathogen genome fragment on the pathogen genome.
  • the biological information analysis process involved can be theoretically applied to signal detection in the case where all pathogen genetic material DNA or RNA is inserted into the human genome.
  • the present invention provides a system for determining the manner in which a foreign gene is integrated in the human genome.
  • the system includes a capture device 100, a sequencing device 200, a first impurity removal device 300, a first alignment device 400, an assembly device 500, a second impurity removal device 600, a second alignment device 700, and an analysis device 800. Referring to FIG. 3, these devices are connected sequentially in the process flow.
  • the capture device 100 utilizes a capture probe to capture a DNA fragment that may contain an integration of a foreign gene fragment from a human genomic nucleic acid sample.
  • the type of foreign gene that can be analyzed using the method of the present invention is not particularly limited, as long as it can integrate with the human genome and its gene sequence can be obtained or is already known.
  • the foreign gene that can be studied is the pathogen genome.
  • the pathogen is HBV. Thereby, the integration of pathogens such as HBV with the human genome can be effectively analyzed.
  • the sequencing device 200 performs sequencing on the captured DNA fragments to obtain sequencing results composed of a plurality of sequencing data.
  • the manner of sequencing the captured DNA fragments is not particularly limited.
  • sequencing is performed by a second generation sequencing platform.
  • the whole genome sequencing library can be sequenced using at least one selected from the group consisting of Hiseq2000, SOLiD, 454, and a single molecule sequencing device. Thereby, the efficiency of determining the manner in which the foreign gene is integrated in the human genome can be further improved by exploiting the high-throughput, deep-sequencing characteristics of these sequencing devices.
  • the length of the sequencing data obtained by whole genome sequencing is not particularly limited. According to an embodiment of the present invention, the sequencing length is preferably 100 bp, whereby the analysis effect can be further improved.
  • the first impurity removing device 300 performs the first impurity removal on the sequencing result to obtain the sequencing result of the first impurity removal.
  • the manner in which the first impurity removal is performed is not particularly limited.
  • the first impurity removal may further include at least one of removing PCR repeats, removing low-quality sequencing data, and removing linker-containing sequencing data. Thereby, the analysis efficiency can be further improved.
  • the first alignment device 400 performs a first alignment of the sequencing result after the first impurity removal against a known human genome sequence and a foreign gene sequence, to obtain sequencing data that may contain a foreign gene integration fragment.
  • the first alignment can be performed using SOAP. Thereby, the analysis efficiency can be further improved.
  • the assembly device 500 assembles the obtained sequencing data that may contain a foreign gene integration fragment to obtain an assembly result composed of a plurality of assembly data. According to an embodiment of the invention, the assembly is performed based on the overlapping relationship between the sequencing data.
  • the second impurity removing device 600 performs second impurity removal on the assembly result to obtain an assembly result of the second impurity removal.
  • the second impurity removal further comprises removing duplicate assembly data.
  • the second alignment device 700 performs a second alignment of the second impurity-free assembly result with a known human genomic sequence and a foreign gene sequence.
  • the second alignment is performed using BWA.
  • the analysis device 800 determines the manner in which the foreign gene is integrated in the human genome based on the second alignment result. According to an embodiment of the present invention, this determination further comprises selecting the assembly data that align simultaneously to both the known human genome sequence and the foreign gene sequence; these assembly data contain human genome breakpoint information and foreign gene breakpoint information.
  • the invention provides a computer readable medium.
  • instructions are stored on the computer readable medium, the instructions being adapted to be executed by a processor to determine the manner in which the foreign gene is integrated in the human genome by the following steps: performing a first impurity removal on the sequencing result to obtain a sequencing result after the first impurity removal; performing a first alignment of the sequencing result after the first impurity removal against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling the obtained sequencing data that may contain a foreign gene integration fragment to obtain an assembly result composed of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain an assembly result after the second impurity removal; and performing a second alignment of the assembly result after the second impurity removal against the known human genome sequence and the foreign gene sequence, and determining, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome, wherein
  • a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CDROM).
  • the computer readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise suitably processing it, and can then be stored in computer memory.
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, it can be implemented by any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may also be stored in a computer readable storage medium.
  • the above-mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.
  • the source of the sample is a patient's liver cancer tissue, and the patient's liver cancer tissue has genome-wide sequencing information and breakpoint information found by genome-wide data.
  • the preliminary experimental section includes the following steps:
  • the library was constructed according to Illumina's Paired-End Sample Preparation Guide.
  • the genomic DNA was fragmented with a Covaris S2, followed by end repair, addition of an A to the ends, adapter (linker) ligation, and PCR amplification of the adapter-ligated fragments.
  • Sample library was constructed according to Illumina's Paired-End Sample Preparation Guide.
  • the genomic DNA was fragmented with a Covaris S2, followed by end repair, addition of an A to the ends, adapter (linker) ligation, and PCR amplification of the adapter-ligated fragments.
  • Probes were designed for the b and c types of HBV. Each probe is 60 bp long, and adjacent probes are designed to overlap by 55 bp.
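A 60 bp probe with a 55 bp overlap between neighbours implies that consecutive probes start 5 bp apart. The sketch below tiles a reference sequence in that way. It is an illustration only: `tile_probes` is a hypothetical helper, and the reference is treated as a linear string, so probes spanning the origin of a circular genome such as HBV are not generated.

```python
def tile_probes(reference, probe_len=60, overlap=55):
    """Return a list of (start, probe_sequence) tiles over `reference`.

    Consecutive probes start `probe_len - overlap` bases apart, so adjacent
    probes share `overlap` bases. Linear reference assumed (no wrap-around).
    """
    step = probe_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than probe_len")
    last_start = len(reference) - probe_len
    return [
        (start, reference[start:start + probe_len])
        for start in range(0, last_start + 1, step)
    ]
```

With the defaults, a 70 bp reference yields three probes starting at positions 0, 5 and 10.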
  • the steps for preparing the HBV probes are as follows: primer design, PCR reaction, PCR purification and electrophoresis detection, PCR product fragmentation, electrophoresis detection of the fragmentation products, and probe storage.
  • (4) Hybridization and sequencing of HBV capture probes with sample libraries
  • the HBV capture probes were hybridized with the sample library using Nimblegen's hybridization platform, followed by elution and PCR amplification. The PCR products were then sequenced on the Hiseq 2000 sequencing platform, in accordance with the c-Bot and Hiseq 2000 (PE sequencing) specifications officially published by Illumina/Solexa.
  • the main insert size of the library was 170 bp, the read length was 100 bp, and the sequencing yield was 1 Gbp.
  • PCR repeats, low-quality sequencing data, and sequencing data containing the linker are removed from these data.
  • the pair of sequencing data is removed.
  • Strategy for removing sequencing data containing linkers: when a piece of sequencing data contains the linker sequence, it is considered linker-containing sequencing data. When either member of a pair of PE sequencing data is linker-containing, the whole pair is removed.
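The pair-level rule above (drop the whole pair if either read contains the linker) might look like the following. This is a simplified sketch: `drop_linker_pairs` is a hypothetical name, and the linker is detected by exact substring match, whereas a real pipeline would normally tolerate mismatches and partial linker sequence at read ends.

```python
def drop_linker_pairs(read_pairs, linker):
    """Keep only the read pairs in which neither read contains `linker`.

    read_pairs: iterable of (read1, read2) sequence strings.
    linker: linker/adapter sequence, matched exactly (simplifying assumption).
    """
    return [
        (read1, read2)
        for read1, read2 in read_pairs
        if linker not in read1 and linker not in read2
    ]
```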
  • the processed sequencing data were aligned to the human genome hg19 and the HBV genome, respectively. Because the HBV virus has multiple subtypes, the HBV genome here includes 23 subtypes (AB014381.1, AB032431.1, AB033554.1, AB036910.1, AB064310.1, AF090842.1, AF100309.1, AF160501.
  • the sequencing data obtained in the third step which may contain the viral integration fragment, was subjected to PER assembly.
  • PER refers to the assembly of pair end sequencing data: each pair of PE sequencing data obtained by pair end sequencing is assembled according to the overlapping relationship between the two sequences. The assembly success rate was 94.33%.
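To make the idea of overlap-based pair assembly concrete, here is a minimal sketch. It is not the PER tool used in the patent: `merge_pair` and `reverse_complement` are hypothetical helpers, the second read is assumed to come from the opposite strand, only exact overlaps are accepted, and the `min_overlap` default of 10 is an arbitrary illustrative value.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair into one sequence using their exact overlap.

    read2 is reverse-complemented first, then the longest suffix of read1
    that equals a prefix of that mate is used as the overlap.
    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases exists (assembly failure for this pair).
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None
```

With a 170 bp insert and 100 bp reads, as in this example run, the two reads of a pair would be expected to overlap by roughly 30 bp, which is what makes this kind of merging possible.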
  • sequence set is deduplicated again.
  • the strategy here is the deduplication strategy for SE sequencing data: when a sequence is duplicated, the duplicate is removed. As a result, 1.823% of the sequences were removed as repeats, leaving 850,696 usable sequences after assembly.
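A single-end style deduplication of the assembled sequences can be sketched as below. The function name `deduplicate` is hypothetical, and the sketch assumes that one copy of each repeated sequence is kept (the first one seen); the source does not say whether a representative is retained.

```python
def deduplicate(sequences):
    """Remove exact duplicate sequences, keeping the first occurrence.

    Returns (unique_sequences, removed_fraction), where removed_fraction is
    the share of input sequences that were dropped as duplicates.
    """
    # dict preserves insertion order, so the first occurrence is kept.
    unique = list(dict.fromkeys(sequences))
    if not sequences:
        return unique, 0.0
    removed_fraction = 1 - len(unique) / len(sequences)
    return unique, removed_fraction
```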
  • a sequence set is obtained.
  • This sequence set was then aligned again to the human genome hg19 and the HBV viral genomes using BWA software.
  • sequences that aligned simultaneously to both the human genome hg19 and the HBV viral genomes were selected. These sequences contain breakpoint information.
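The selection of sequences that hit both references can be illustrated on a simplified hit list. This sketch does not parse real BWA output: the input format (pairs of sequence ID and reference name), the function name `select_chimeric`, and the idea of passing the viral reference names as a set are all assumptions made for the example.

```python
from collections import defaultdict


def select_chimeric(hits, viral_refs):
    """Return sorted IDs of sequences aligned to both human and viral references.

    hits: iterable of (sequence_id, reference_name) alignment records
          (hypothetical simplified format, not actual BWA output).
    viral_refs: set of reference names belonging to the viral genomes;
          every other reference name is treated as human.
    """
    kinds_by_sequence = defaultdict(set)
    for sequence_id, reference in hits:
        kind = "virus" if reference in viral_refs else "human"
        kinds_by_sequence[sequence_id].add(kind)
    return sorted(
        sequence_id
        for sequence_id, kinds in kinds_by_sequence.items()
        if kinds == {"human", "virus"}
    )
```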
  • the alignments of these sequences to the human genome hg19 and to the HBV viral genomes were analyzed separately, giving the integration positions of the HBV virus on the human genome and their distribution on the HBV viral genome. In total, 33 breakpoints were found, of which 8 exceeded the threshold.
  • a de-duplication operation is performed on the obtained result to obtain a final result.
  • Table 5 shows the most prominent viral insertion sites found by the present invention for insertion of the HBV virus into the human genome hg19.
  • the presence of a substitution type variation is examined by analyzing the intrinsic link between the human genome breakpoint and the HBV viral genome breakpoint.
  • the specific method is as follows: when two breakpoints on the human genome lie within 500 bp of each other, and each of them has only left-end supporting sequencing data or only right-end supporting sequencing data, a substitution is likely to have occurred between the two breakpoints. The HBV viral genome alignment information for the two breakpoints is then found to determine the type and location of the substitution. Table 6 below shows the seven substitution cases found by the present invention.
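The substitution check described above can be sketched as a pairwise scan over breakpoints. The `Breakpoint` record, the helper names, and the reading of "only left-end or only right-end support" as "exactly one side has a non-zero support count" are assumptions made for illustration; the 500 bp window is taken from the text.

```python
from itertools import combinations
from typing import NamedTuple


class Breakpoint(NamedTuple):
    """Hypothetical record for one breakpoint on the human genome."""
    chrom: str
    pos: int
    left_support: int
    right_support: int


def is_one_sided(bp):
    """True if the breakpoint has support on exactly one side."""
    return (bp.left_support > 0) != (bp.right_support > 0)


def candidate_substitutions(breakpoints, max_distance=500):
    """Return pairs of one-sided breakpoints within max_distance bp.

    Every eligible pair on the same chromosome is reported.
    """
    ordered = sorted(breakpoints, key=lambda bp: (bp.chrom, bp.pos))
    return [
        (first, second)
        for first, second in combinations(ordered, 2)
        if first.chrom == second.chrom
        and second.pos - first.pos <= max_distance
        and is_one_sided(first)
        and is_one_sided(second)
    ]
```

Each reported pair would still need its viral-genome alignment positions looked up to decide the type and location of the substitution, as the text goes on to describe.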
  • By analyzing the intrinsic link between the human genome breakpoints and the HBV viral genome breakpoints, the breakpoint information of an HBV viral insert can be found, and the length and type of the HBV insert at that breakpoint can be calculated. Specifically, the left-end and right-end supporting sequences of a viral insert are found, the alignment positions of these two sequences on the HBV viral genome are then found, and the sequence between the two positions is the insert. Only one insertion breakpoint for which the insert type could be determined was found here. Table 7 below shows the insert type found by the present invention. Table 7. HBV virus insert type
  • the result of the combined re-alignment and the initial sequence information are used to calculate the actual efficiency of the probe capture.
  • the specific method is to count the number of sequences in the BWA re-alignment that align to both the human genome hg19 and the HBV viral genome, and denote it A.
  • the number of useful pairs of PE sequencing data in the raw sequencing data is denoted B.
  • the effective capture efficiency is calculated as A/B.
  • the effective capture efficiency calculated here is 0.0001059.
  • the method, system and computer readable medium of the present invention for determining the manner in which a foreign gene is integrated in the human genome can be effectively used to determine the manner in which a foreign gene, such as a pathogen genome, is integrated in the human genome.
  • the description of the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" and the like means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention.
  • the use of the above terms does not necessarily refer to the same embodiment or example.
  • the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.


Abstract

Provided are a method, system, and readable medium for determining an integration manner of a foreign gene in a human genome. The method for determining an integration manner of a foreign gene in a human genome comprises: capturing, by using a capture probe, an integrated DNA fragment possibly containing a foreign gene fragment from a human genome nucleic acid sample; performing sequencing for the captured DNA fragment to obtain a sequencing result; performing first purification on the sequencing result; performing first comparison on the sequencing result obtained through the first purification and a known human genome sequence and a foreign gene sequence to obtain sequencing data possibly containing a foreign gene integration fragment; assembling the sequencing data possibly containing a foreign gene integration fragment to obtain an assembling result; performing second purification on the assembling result; and performing second comparison on the assembling result obtained through the second purification, and determining an integration manner of a foreign gene in a human genome based on a second comparison result.

Description

Method and system for determining the manner in which foreign genes are integrated in the human genome
Priority information
None
Technical field
The present invention belongs to the field of biotechnology. Specifically, it relates to a bioinformatics analysis method for detecting the manner in which a pathogen genome is integrated in the human genome, and more specifically to a method, system and computer readable medium for determining the manner in which a foreign gene is integrated in the human genome.
Background
It is known that the HBV, HIV and HPV viruses can integrate their own DNA into the human genome. After HBV infection, the virus causes inflammation at the infected site through replication, which in turn triggers carcinogenesis of the cells. The high-risk type HPV16 causes inflammation after cervical infection in HPV patients and promotes abnormal growth of cervical cells, leading to carcinogenesis; integration of HPV16 is evident in the cancerous tissue.
For many years, methods for studying gene integration have remained at the level of PCR detection of the integration status of the HPV16 virus in cervical cancer and precancerous lesions. Among them, methods such as alu were once widely used. With the development of high-throughput sequencing technology, improvements in high-throughput sequencing and information analysis methods have provided a basis for analyzing pathogen integration positions. The bioinformatics analysis commonly used today is still mainly limited to pair end sequence alignment, which determines an approximate insertion position from the alignment positions of the PE reads but cannot determine the exact position. Therefore, current methods for such research still need to be improved.
Summary of the invention
The present invention aims to solve at least one of the technical problems existing in the prior art. It aims to provide a method that can effectively find precise integration sites of pathogen genomic fragments across the whole genome.
In one aspect, the present invention provides a method for determining the manner in which a foreign gene is integrated in the human genome. According to an embodiment of the present invention, the method comprises: capturing, with a capture probe, DNA fragments that may contain an integrated foreign gene fragment from a human genomic nucleic acid sample; sequencing the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data; performing a first impurity removal on the sequencing result to obtain a sequencing result after the first impurity removal; performing a first alignment of the sequencing result after the first impurity removal against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling the obtained sequencing data that may contain a foreign gene integration fragment to obtain an assembly result composed of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain an assembly result after the second impurity removal; and performing a second alignment of the assembly result after the second impurity removal against the known human genome sequence and the foreign gene sequence, and determining, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome. This method can effectively determine the manner in which a foreign gene, such as a pathogen genome, is integrated in the human genome.
In a second aspect, the present invention also provides a system for determining the manner in which a foreign gene is integrated in the human genome. According to an embodiment of the present invention, the system comprises: a capture device adapted to capture, with a capture probe, DNA fragments that may contain an integrated foreign gene fragment from a human genomic nucleic acid sample; a sequencing device connected to the capture device and adapted to sequence the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data; a first impurity removal device connected to the sequencing device and adapted to perform a first impurity removal on the sequencing result to obtain a sequencing result after the first impurity removal; a first alignment device connected to the first impurity removal device and adapted to perform a first alignment of the sequencing result after the first impurity removal against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; an assembly device connected to the first alignment device and adapted to assemble the obtained sequencing data that may contain a foreign gene integration fragment to obtain an assembly result composed of a plurality of assembly data; a second impurity removal device connected to the assembly device and adapted to perform a second impurity removal on the assembly result to obtain an assembly result after the second impurity removal; a second alignment device connected to the second impurity removal device and adapted to perform a second alignment of the assembly result after the second impurity removal against the known human genome sequence and the foreign gene sequence; and an analysis device adapted to determine, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome. With the system according to an embodiment of the present invention, the method described above can be implemented effectively, so that the manner in which a foreign gene, such as a pathogen genome, is integrated in the human genome can be determined effectively.
In yet another aspect, the present invention provides a computer readable medium. According to an embodiment of the invention, instructions are stored on the computer readable medium, the instructions being adapted to be executed by a processor to determine the manner in which a foreign gene is integrated in the human genome by the following steps: performing a first impurity removal on a sequencing result to obtain a sequencing result after the first impurity removal; performing a first alignment of the sequencing result after the first impurity removal against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain a foreign gene integration fragment; assembling the obtained sequencing data that may contain a foreign gene integration fragment to obtain an assembly result composed of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain an assembly result after the second impurity removal; and performing a second alignment of the assembly result after the second impurity removal against the known human genome sequence and the foreign gene sequence, and determining, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome, wherein the sequencing result is obtained as follows: DNA fragments that may contain an integrated foreign gene fragment are captured from a human genomic nucleic acid sample with a capture probe, and the captured DNA fragments are sequenced to obtain a sequencing result composed of a plurality of sequencing data. The computer readable medium can be used to effectively determine the manner in which a foreign gene, such as a pathogen genome, is integrated in the human genome.
Additional aspects and advantages of the invention will be set forth in part in the description below, will in part become apparent from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a method for determining the manner in which a foreign gene is integrated in the human genome according to one embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for determining the manner in which a foreign gene is integrated in the human genome according to another embodiment of the present invention; and
FIG. 3 is a schematic structural diagram of a system for determining the manner in which a foreign gene is integrated in the human genome according to one embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are illustrated in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting it.
The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more, unless specifically defined otherwise. As used herein, the terms "or more" and "or less" include the stated number; for example, "80% or more" means ≥ 80%, and "2% or less" means ≤ 2%.
PER assembly: as used herein, PER refers to the assembly of pair end sequencing data, that is, each pair of PE sequencing data obtained by pair end sequencing is assembled according to the overlapping relationship between the sequences.
Substitution: as used herein, the phenomenon in which a pathogen genomic DNA fragment is inserted into the human genome and the human genomic DNA at the insertion position is deleted at the same time.
PCR repeat: repeated amplification during PCR.
Adaptor: sequencing adaptor; adaptor sequence appears in some of the sequence data coming off the sequencer.
BWA: short for Burrows-Wheeler Aligner, a sequence alignment software.
SOAP: short for Short Oligonucleotide Analysis Package, an alignment software.
In the present invention, unless expressly specified and defined otherwise, terms such as "connected" and "connection" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
Method for determining the integration manner of a foreign gene in the human genome
According to a first aspect, the present invention provides a method for determining the integration manner of a foreign gene in the human genome. According to an embodiment of the present invention, and referring to Fig. 1, the method comprises:
Capture step S100: capture probes are used to capture, from a human genomic nucleic acid sample, DNA fragments that may contain an integrated foreign gene fragment. According to an embodiment of the present invention, the type of foreign gene that can be analyzed with the method is not particularly limited, provided that it can integrate into the human genome and that its gene sequence can be obtained or is already known. According to an embodiment of the present invention, the foreign gene under study is a pathogen genome. According to a specific example of the present invention, the pathogen is HBV. The integration of a pathogen such as HBV into the human genome can thus be analyzed effectively.
Sequencing step S200: the captured DNA fragments are sequenced to obtain a sequencing result consisting of a plurality of sequencing reads. According to an embodiment of the present invention, the manner of sequencing the captured DNA fragments is not particularly limited. According to an embodiment of the present invention, sequencing is performed on a second-generation sequencing platform. According to an embodiment of the present invention, the whole-genome sequencing library may be sequenced with at least one device selected from HiSeq 2000, SOLiD, 454, and single-molecule sequencing devices. The high-throughput, deep-sequencing characteristics of these devices can thereby be used to further improve the efficiency of determining the integration manner of the foreign gene in the human genome. Those skilled in the art will of course appreciate that other sequencing methods and devices can also be used for whole-genome sequencing, such as third-generation sequencing technologies and more advanced sequencing technologies that may be developed in the future. According to an embodiment of the present invention, the length of the reads obtained by whole-genome sequencing is not particularly limited. According to an embodiment of the present invention, the preferred read length is 100 bp, which can further improve the analysis.
First filtering (impurity removal) step S300: the sequencing result is subjected to a first filtering to obtain a filtered sequencing result. According to an embodiment of the present invention, the type of first filtering is not particularly limited; for example, it may comprise at least one of removing PCR duplicates, removing low-quality reads, and removing adaptor-containing reads. The analysis efficiency can thereby be further improved.
First alignment step S400: the filtered sequencing result is aligned against the known human genome sequence and the foreign gene sequence in a first alignment, to obtain reads that may contain a foreign gene integration fragment. According to an embodiment of the present invention, the first alignment may be performed with SOAP. The analysis efficiency can thereby be further improved.
Assembly step S500: the reads that may contain a foreign gene integration fragment are assembled to obtain an assembly result consisting of a plurality of assembled sequences. According to an embodiment of the present invention, the assembly is based on the overlap between reads.
Second filtering (impurity removal) step S600: the assembly result is subjected to a second filtering to obtain a filtered assembly result. According to an embodiment of the present invention, the second filtering comprises removing duplicate assembled sequences.
Second alignment step S700: the filtered assembly result is aligned against the known human genome sequence and the foreign gene sequence in a second alignment. According to an embodiment of the present invention, the second alignment is performed with BWA.
Analysis step S800: the integration manner of the foreign gene in the human genome is determined from the second alignment result. According to an embodiment of the present invention, this determination further comprises selecting the assembled sequences that align to both the known human genome sequence and the foreign gene sequence; these assembled sequences carry human genome breakpoint information and foreign gene breakpoint information. According to an embodiment of the present invention, the human genome breakpoint information and the foreign gene breakpoint information may further be used to judge whether a substitution variation is present, or to determine at least one of the length and the type of the foreign gene insertion in the human genome, for example for at least a portion of the foreign gene insertions in the human genome.
With reference to Fig. 2 and taking HBV as an example, the system for determining the integration manner of a foreign gene in the human genome according to an embodiment of the present invention is explained in detail below. As shown in Fig. 2, it comprises the following steps:
1. Obtaining and sequencing nucleic acid sequences containing integrated pathogen genome fragments
Methods for obtaining sequences containing integrated pathogen nucleic acid include, but are not limited to, the following: capture probe technology is used to capture from the sample the DNA fragments that may contain an integrated pathogen genome fragment, and the captured sequences are then sequenced.
2. Removing PCR duplicates, low-quality reads, and adaptor-containing reads
Strategy for removing PCR duplicates: two sequences that are exactly identical are regarded as duplicates. When one read of a PE pair is a duplicate, the whole pair is removed.
Strategy for removing low-quality reads: a read is regarded as low quality when the bases with a sequencing quality value less than or equal to 5 account for 50% or more of its total bases. When one read of a PE pair is low quality, the whole pair is removed.
Strategy for removing adaptor-containing reads: a read that contains a stretch of adaptor sequence is regarded as adaptor-containing. When one read of a PE pair is adaptor-containing, the whole pair is removed.
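The three pair-level filters above can be sketched as follows. This is a minimal illustration, not the implementation used in the invention: the in-memory `Read`/`Pair` representation, the function names, the choice to keep the first copy of a duplicated sequence, and the order in which the filters run are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Read = Tuple[str, List[int]]   # (bases, per-base quality values); hypothetical layout
Pair = Tuple[Read, Read]


def is_low_quality(read: Read, q_cutoff: int = 5, min_fraction: float = 0.5) -> bool:
    """True when bases with quality <= q_cutoff make up >= min_fraction of the read."""
    _, quals = read
    if not quals:
        return True  # assumption: an empty read is treated as unusable
    low = sum(1 for q in quals if q <= q_cutoff)
    return low / len(quals) >= min_fraction


def filter_pairs(pairs: Iterable[Pair], adapter: str) -> Iterator[Pair]:
    """Yield only the pairs in which neither mate is a duplicate, low quality,
    or adaptor-containing; a failing mate removes the whole pair."""
    seen: Set[str] = set()
    for r1, r2 in pairs:
        # PCR duplicates: drop the pair if either mate repeats an earlier sequence.
        # Assumption: the first occurrence of a sequence is kept.
        if r1[0] in seen or r2[0] in seen:
            continue
        seen.update((r1[0], r2[0]))
        # Low-quality reads: >= 50% of bases with quality <= 5.
        if is_low_quality(r1) or is_low_quality(r2):
            continue
        # Adaptor-containing reads: simple exact substring match (assumption).
        if adapter in r1[0] or adapter in r2[0]:
            continue
        yield (r1, r2)
```

A real pipeline would read FASTQ files and would probably match adaptors approximately; exact substring matching is used here only to keep the sketch short.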
3. Soap alignment, selection of the required reads, and calculation of alignment rates
The processed reads are aligned separately to the human genome hg19 and to the pathogen genome. Because a pathogen genome generally has several subtypes, suitable subtypes are generally chosen as the pathogen reference genome according to need. After alignment, the pairing relationship between the two alignment results is analyzed to select the reads that may contain an integrated pathogen genome fragment. The proportion of useful sequences in the raw sequencing data is then calculated, as are the human genome alignment rate and the pathogen genome alignment rate among the useful sequences.
4. PER assembly
The reads obtained in step 3 that may contain an integrated pathogen genome fragment are subjected to PER assembly. PER refers to the assembly of paired-end (pair end) sequencing data; that is, each pair of PE reads obtained by paired-end sequencing is assembled according to the overlap between the two sequences.
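The idea of assembling a read pair from its overlap can be illustrated with a small sketch. It assumes that read 2 is given as sequenced (so it is reverse-complemented before merging) and that the longest acceptable suffix/prefix overlap is used; the function names, `min_overlap`, and `max_mismatch_frac` are hypothetical parameters chosen for the example, not values taken from the text.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1) -> Optional[str]:
    """Merge a paired-end read pair into one fragment, or return None.

    The 3' end of read1 is compared with the 5' end of the reverse
    complement of read2, trying the longest overlap first.
    """
    r2 = reverse_complement(read2)
    for ov in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-ov:], r2[:ov]))
        if mismatches <= max_mismatch_frac * ov:
            # Overlapping bases are taken from read1 (a simplification; a real
            # tool would pick the base with the higher quality value).
            return read1 + r2[ov:]
    return None
```

Because mismatches are tolerated inside the overlap, two pairs that differ only there can merge into identical sequences, which is why a further duplicate-removal pass after assembly is useful.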
5. Removing duplicate sequences again
PER assembly yields a set of assembled sequences. This set is deduplicated once more, using the deduplication strategy for SE (single-end) reads: when a sequence is a duplicate, it is removed.
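A single-end style deduplication of the assembled sequences could look like the sketch below. Keeping the first copy of each sequence and preserving the input order are assumptions; the text only states that duplicated sequences are removed.

```python
from typing import Iterable, List


def dedup_sequences(sequences: Iterable[str]) -> List[str]:
    """Return the sequences with exact duplicates removed, first copy kept."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique
```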
6. BWA realignment and extraction of breakpoint information
The deduplication in step 5 yields a set of sequences. BWA is then used to align this set once more, separately, to the human genome hg19 and to the pathogen genome. The two alignment result files are analyzed to select the sequences that align to both the human genome hg19 and the pathogen genome; these sequences carry breakpoint information. The alignments of these sequences to the human genome hg19 and to the pathogen genome are analyzed separately to obtain the distribution of the integrated pathogen genome fragments on the human genome and on the pathogen genome.
The distribution information here includes, but is not limited to: the alignment position; the number of reads supporting the left endpoint of an insertion breakpoint; the number of reads supporting the right endpoint of an insertion breakpoint; the total number of supporting reads; the normalized number of reads supporting the left endpoint; the normalized number of reads supporting the right endpoint; the normalized total number of supporting reads; and the IDs of all reads supporting a given insertion breakpoint. Normalization is performed on the basis of the number of valid reads.
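The text states only that support counts are normalized by the number of valid reads and does not give the formula. The sketch below therefore assumes one common convention, supporting reads per million valid reads; the `BreakpointSupport` record, the `scale` parameter, and the definition of the total as left plus right are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BreakpointSupport:
    """Read support for one insertion breakpoint on the human genome."""
    chrom: str
    position: int
    left_reads: int    # reads supporting the left endpoint
    right_reads: int   # reads supporting the right endpoint

    @property
    def total_reads(self) -> int:
        # Assumption: total support is the sum of left and right support.
        return self.left_reads + self.right_reads


def normalize(count: int, valid_reads: int, scale: float = 1_000_000) -> float:
    """Scale a support count to `scale` valid reads (assumed convention)."""
    if valid_reads <= 0:
        raise ValueError("valid_reads must be positive")
    return count * scale / valid_reads
```

Normalizing in this way makes support counts comparable between samples sequenced to different depths, which appears to be the purpose of the normalization described above.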
The "reads supporting the left endpoint of an insertion breakpoint" and the "reads supporting the right endpoint of an insertion breakpoint" mentioned here are defined only relative to insertion breakpoints in the human genome; for the pathogen genome there is no insertion breakpoint position.
7. Checking for substitution-type variation
The intrinsic relationship between the human genome breakpoints and the pathogen genome breakpoints is analyzed to check whether a substitution-type variation is present.
8. Calculating the length and type of the inserted pathogen genome fragment
The intrinsic relationship between the human genome breakpoints and the pathogen genome breakpoints is analyzed to identify the breakpoints from which the inserted pathogen genome fragment can be computed, and the length and type of the inserted pathogen genome fragment at those breakpoints are calculated.
9. Calculating the actual capture efficiency
The actual efficiency of probe capture is calculated by combining the realignment results with the initial sequence information.
The method for determining the integration manner of a foreign gene in the human genome according to embodiments of the present invention can locate precise insertion positions of pathogen genome fragments across the whole human genome. It can report some of the possible substitution types, as well as the type of some of the inserted pathogen genome fragments. The method is fast and convenient to use: with a starting data volume of 5 G, for example, the analysis can be completed within two days.
After extensive and intensive research, the present inventors have for the first time constructed a bioinformatic detection method, and applications thereof, for detecting the integration manner of a pathogen genome in a sample under test. Specifically, the inventors established a complete bioinformatic detection workflow by aligning, screening, assembling, and realigning sequences obtained through sequence capture technology. Applying this workflow, signals relating to the integration manner of the pathogen genome in the human genome were detected, and the present invention was completed on that basis.
According to an embodiment of the present invention, the type of sample that can be processed is not particularly limited, provided that it contains nucleic acid. The type of nucleic acid is not particularly limited either: it may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), preferably DNA. Those skilled in the art will understand that RNA can be converted by conventional means into cDNA of the corresponding sequence for subsequent detection and analysis. According to an embodiment of the present invention, the source of the sample is not particularly limited. According to an example of the present invention, a cancer tissue sample may be used as the test sample, so that DNA sequences containing inserted pathogen genome fragments can be extracted from it and the insertion of pathogen genome fragments can be detected and analyzed. According to embodiments of the present invention, examples of samples that may be used include, but are not limited to, patient plasma, cancer tissue cells, and para-cancerous tissue cells.
In one embodiment of the present invention, a DNA library must be prepared before probes are applied for sequence capture; methods for preparing sample libraries are well known to those skilled in the art. The term "DNA library preparation" refers to fragmenting the target regions of the genome to obtain a mixture of DNA fragments of a certain size. According to an embodiment of the present invention, the method and equipment for capturing specific sequences from the sample are not particularly limited, and commercially available probes may be used. In the present invention, a sequencing sequence is a sequence fragment output by the sequencer, that is, a read. In one embodiment of the present invention, the DNA fragments used for sequencing have been captured with specific probes. The probe must be pure and free from contamination by nucleic acids of other, different sequences. Typical probes are cloned DNA sequences or DNA obtained by PCR amplification; synthetic oligonucleotides, or RNA obtained by in vitro transcription of cloned DNA sequences, may also be used as probes. The probe length may be 20-500 mer, preferably 50-300 mer, and more preferably 250 mer. Probe design and synthesis methods are well known to those skilled in the art; probes may be synthesized chemically, or commercially available probes may be used.
In the present invention, sequencing reads can be obtained from the sample by any sequencing method, including but not limited to the dideoxy chain termination method. High-throughput sequencing methods are preferred, including but not limited to second-generation sequencing and single-molecule sequencing. The second-generation sequencing platforms referred to in the present invention (Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010 Jan;11(1):31-46) include, but are not limited to, the platforms provided by Illumina (e.g. the GA series and the HiSeq series), Life Technologies (e.g. the SOLiD series and the semiconductor sequencing series), and Roche (e.g. the GS series).
The sequencing types referred to in the present invention include, but are not limited to, paired-end (Pair-end) sequencing, and the read lengths include, but are not limited to, 100 bp. In one embodiment of the present invention, the sequencing platform is Illumina/Solexa and the sequencing type is paired-end sequencing, yielding DNA sequences of 100 bp with a paired positional relationship.
In the present invention, the main library fragment length of the sequenced library should be less than 200 bp. Only then is the assembly success rate high enough to guarantee a sufficient amount of data for the subsequent analysis. In some embodiments of the present invention, the sequencing yield is about 5 G for plasma samples and about 1 G for tissue cell samples. The larger the data volume, the more complete the insertion information that can be detected.
With the method of the present invention, pathogen genome insertion signals can be determined from data produced by next-generation sequencing; the signals here include, but are not limited to, position, genotype, number, and length.
In the present invention, the human genome sequence used for the SOAP2 alignment and the BWA alignment is the human genome reference sequence of build 37 in the NCBI database (hg19; NCBI Build 37).
In some embodiments of the present invention, the pathogen genome reference sequences used for the SOAP2 alignment and the BWA alignment are selected from 23 genome sequences covering the 8 subtypes of the pathogen.
In the present invention, the alignments comprise an alignment before PER assembly and an alignment after PER assembly. The alignment before PER assembly tolerates up to 5 mismatched bases. It can be performed with any sequence alignment program, for example the Short Oligonucleotide Analysis Package (SOAP) or BWA (Burrows-Wheeler Aligner 0.5.8c (r1536)), both available to those skilled in the art: the reads are aligned to the reference genome sequences and are classified according to how they align to the pathogen genome and the human genome. The alignment may be run with the default parameters provided by the program, or with parameters chosen by those skilled in the art as needed. In one embodiment of the present invention, the alignment software used is SOAPaligner/soap2. The alignment after PER assembly can be performed with any sequence alignment program that can be configured to tolerate sufficiently long gaps. Again, the alignment may be run with the program's default parameters or with parameters chosen by those skilled in the art as needed, for example with the BWASW mode of BWA (Burrows-Wheeler Aligner), available to those skilled in the art; the alignment positions are determined from this alignment.
In some embodiments of the present invention, for step 3, the conditions for extracting reads that may contain information on an inserted pathogen genome fragment (satisfying any one of them is sufficient) are:
1) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the human genome and the other aligns to the pathogen genome;
2) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the human genome and the other does not align to any reference sequence;
3) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the pathogen genome and the other does not align to any reference sequence;
4) in a pair of PE reads, with at most 5 mismatches allowed, neither read aligns to any reference sequence.
In some embodiments of the present invention, the step "Soap alignment, selection of the required reads, and calculation of alignment rates" in Fig. 2 outputs a file containing, without limitation, the following items: yield, alignment rate, contamination rate, and proportion of valid reads. Valid reads are the reads that remain after contaminated reads are removed from the raw sequencer output. Contaminated reads here means PCR-duplicate reads, adaptor-containing reads, and low-quality reads.
In some embodiments of the present invention, in the step "Soap alignment, selection of the required reads, and calculation of alignment rates" in Fig. 2, the standard deviation (sd) of the insert size for the Soap alignment is set to 30. This keeps read utilization as high as possible. With reference to Table 1 below, let $V_{num}$ be the number of reads falling in cell $num$ of the grid, let $\mathrm{alnRate}_{human}$ be the human genome alignment rate, let $\mathrm{alnRate}_{virus}$ be the pathogen genome alignment rate, and let $V_{all} = V_1 + V_2 + V_3 + V_4 + V_5 + V_6 + V_7 + V_8 + V_9$. The alignment rates are then calculated as:
$\mathrm{alnRate}_{virus} = (V_1 + V_2 + V_4 + V_5 + V_3 + V_6) / V_{all}$
$\mathrm{alnRate}_{human} = (V_1 + V_2 + V_4 + V_5 + V_7 + V_8) / V_{all}$
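As a worked illustration of the two rate formulas, the sketch below computes both rates from the nine cell counts of Table 1. The cell index sets are copied from the formulas as written in the text; taking V_all as the sum over all nine cells is an assumption, and the function name and dictionary layout are hypothetical.

```python
from typing import Dict, Tuple

VIRUS_CELLS = (1, 2, 3, 4, 5, 6)   # cells summed in the pathogen-rate formula
HUMAN_CELLS = (1, 2, 4, 5, 7, 8)   # cells summed in the human-rate formula


def alignment_rates(cell_counts: Dict[int, int]) -> Tuple[float, float]:
    """Return (pathogen alignment rate, human alignment rate).

    `cell_counts` maps the cell number (1..9) of the nine-cell grid to the
    number of reads that fall in that cell.
    """
    v_all = sum(cell_counts[i] for i in range(1, 10))  # assumption: all nine cells
    if v_all == 0:
        raise ValueError("no reads in the grid")
    virus_rate = sum(cell_counts[i] for i in VIRUS_CELLS) / v_all
    human_rate = sum(cell_counts[i] for i in HUMAN_CELLS) / v_all
    return virus_rate, human_rate
```

For example, with 100 reads distributed as 5, 5, 10, 5, 20, 15, 10, 10, 20 over cells 1 to 9, the two sums are 60 and 55 reads respectively.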
In some embodiments of the present invention, in the step "Soap alignment, selection of the required reads, and calculation of alignment rates" in Fig. 2, the conditions for extracting reads that may contain information on an inserted pathogen genome fragment (satisfying any one of them is sufficient) are: 1) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the human genome and the other aligns to the pathogen genome; 2) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the human genome and the other does not align to any reference sequence; 3) in a pair of PE reads, with at most 5 mismatches allowed, one read aligns to the pathogen genome and the other does not align to any reference sequence; 4) in a pair of PE reads, with at most 5 mismatches allowed, neither read aligns to any reference sequence.
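The four extraction conditions can be expressed as a small lookup. The sketch assumes that the first alignment (with at most 5 mismatches) has already assigned each mate one of three labels, "human", "virus", or "none"; these labels and the function name are hypothetical, and a mate that aligns to both references is not modeled.

```python
from typing import Optional

# Sorted (mate, mate) label pairs mapped to the condition number they satisfy.
_CONDITIONS = {
    ("human", "virus"): 1,  # one mate on the human genome, one on the pathogen
    ("human", "none"): 2,   # one mate on the human genome, one unaligned
    ("none", "virus"): 3,   # one mate on the pathogen genome, one unaligned
    ("none", "none"): 4,    # neither mate aligned
}


def candidate_condition(mate1: str, mate2: str) -> Optional[int]:
    """Return the number of the condition a pair satisfies, or None."""
    key = tuple(sorted((mate1, mate2)))
    return _CONDITIONS.get(key)


def is_candidate(mate1: str, mate2: str) -> bool:
    """True when the pair may span an integration site and should be kept."""
    return candidate_condition(mate1, mate2) is not None
```

Pairs in which both mates align to the same reference match none of the four conditions and are not extracted in this sketch.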
In some embodiments of the present invention, the "useful sequences in the raw sequencing data" calculated in the SOAP alignment step are the sequences that remain after PE pairs containing PCR duplicates, PE pairs containing low-quality reads, and PE pairs containing adaptor sequences are removed from the raw sequencing data. Cells 5, 6, 8, and 9 of Table 1 below are the sequences selected in some embodiments of the present invention. The rows of Table 1 describe how a pair aligns to the human genome: PE means that the pair of PE reads aligns to the reference sequence within the configured insert-size range; SE means that only one read of the pair aligns to the reference sequence, or that both reads align but their positions are not within the configured insert-size range; Unmap means that neither read of the pair aligns to the reference sequence.
Table 1. Nine-cell grid for classifying sequencing reads
Figure imgf000009_0001
In the present invention, regarding the step of removing duplicate sequences again in Fig. 2: although the PCR duplicates among the PE reads were already filtered in the first filtering (step 2), that filtering is not exhaustive. Some pairs have mismatches in their overlapping region and can nevertheless be assembled, so assembly may produce sequences that are exactly identical.
In some embodiments of the present invention, for the step of BWA realignment and extraction of breakpoint information, the realignment with BWA 0.5.8c (r1536) uses the BWASW mode, which is suited to long sequences and supports alignment with a high error tolerance. It applies a heuristic Smith-Waterman-like algorithm to search for high-scoring alignment positions. The alignment uses the default parameter values of this software version throughout; for details see http://bio-bwa.sourceforge.net/bwa.shtml.
In the BWA re-alignment and breakpoint extraction step of the present invention: after the BWA re-alignment is complete, the alignment results must be processed to extract breakpoint information. The alignment result selected for each sequence must satisfy the following conditions:
1) The sequence aligns to both the human genome and the pathogen genome.
2) Whichever reference genome is used, the portion of the sequence that aligns to the reference sequence must be at least 30 bp long.
3) Whichever reference genome is used, the portion of the sequence that may belong to the inserted fragment must be at least 5 bp long.
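The three conditions above can be sketched as a single predicate. The `Hit` fields are hypothetical, and treating the clipped (unaligned) part of a sequence as "the portion that may belong to the inserted fragment" is an assumption made for illustration, not something the text states.

```python
from dataclasses import dataclass
from typing import Optional

MIN_ALIGNED_BP = 30   # condition 2: minimum aligned length
MIN_INSERT_BP = 5     # condition 3: minimum length of the candidate insert part


@dataclass
class Hit:
    aligned_len: int   # bases of the sequence aligned to this reference
    clipped_len: int   # bases clipped off (assumed to be the candidate insert part)


def keep_for_breakpoints(human: Optional[Hit], pathogen: Optional[Hit]) -> bool:
    """Return True if a sequence satisfies conditions 1) to 3)."""
    if human is None or pathogen is None:
        return False   # condition 1: must align to both references
    return all(
        hit.aligned_len >= MIN_ALIGNED_BP and hit.clipped_len >= MIN_INSERT_BP
        for hit in (human, pathogen)
    )
```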
In some embodiments of the present invention, regarding the BWA re-alignment and breakpoint extraction step: BWA alignment follows a basic strategy of clipping the head and the tail of a read. That is, when the first half of a read cannot be aligned to the reference sequence, the alignment software simply clips off the unalignable part of the read and continues aligning the remainder.
In some embodiments of the present invention, regarding the BWA re-alignment and breakpoint extraction step:
When breakpoint information is extracted from the alignment results, several situations may arise. They are listed below together with the way the present invention handles each of them. All of the situations below concern the alignment of a sequence to the human genome. It should be understood that the alignment position reported by BWA is the leftmost position of the aligned sequence. Let the alignment position reported by the software be p.
1) When the upstream part of a sequence is pathogen genome sequence and the downstream part is human DNA sequence, with no other type of bases present, the insertion position of the pathogen genome sequence is the alignment position p reported by the alignment software.
2) When the upstream part of a sequence is human DNA sequence and the downstream part is pathogen genome sequence, with no other type of bases present, the insertion position is derived from the alignment position reported by the software: if the upstream human DNA part has length x, the insertion position of the pathogen genome fragment is p + x.
3) When both the upstream and the downstream parts of a sequence are human genome sequence and the middle part is a pathogen genome insert, let x_up be the length of the upstream human DNA part and x_down the length of the downstream human DNA part; the insertion position of the pathogen genome is p + x_up.
4) When both the upstream and the downstream parts of a sequence are pathogen genome sequence and the middle part is human genome sequence, the sequence carries the signal of two integration points of pathogen genome fragments, so two insertion positions can be extracted from its alignment result. Let y_up and y_down be the lengths of the upstream and downstream flanking parts and y the length of the middle human genome part; the insertion positions of the pathogen genome fragments are p and p + y.
In one embodiment of the present invention, regarding the BWA re-alignment and breakpoint extraction step: the output file of this step reports the number of left-supporting reads, the number of right-supporting reads, the total number of supporting reads, and the corresponding normalized counts. These items are explained one by one below. The number of left-supporting reads is the number of reads supporting the left endpoint of an insertion breakpoint; specifically, it is the number of sequences located upstream of the pathogen genome insert relative to the reference sequence they align to. The number of right-supporting reads is the number of reads supporting the right endpoint of an insertion breakpoint; specifically, it is the number of sequences located downstream of the pathogen genome insert relative to the reference sequence they align to. The total number of supporting reads is the sum of the left-supporting and right-supporting counts. Normalized number of left-supporting reads: let V_L be the number of left-supporting reads before normalization and let the number of useful read pairs be b M (million) pairs; the normalized number of left-supporting reads is then V_L / b. Normalized number of right-supporting reads: let V_R be the number of right-supporting reads before normalization and let the number of useful read pairs be b M pairs; the normalized number of right-supporting reads is then V_R / b. The normalized total number of supporting reads is the sum of the normalized left-supporting and normalized right-supporting counts.
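The breakpoint cases 1) to 4) and the normalization above can be sketched as follows. The layout codes ("PH", "HP", "HPH", "PHP"), the function names and the parameter names are hypothetical labels introduced for this illustration; p is the leftmost alignment position reported by the aligner, as in the text.

```python
from typing import List


def insertion_positions(p: int, layout: str,
                        human_up: int = 0, human_mid: int = 0) -> List[int]:
    """Insertion position(s) on the human genome for one assembled sequence.

    layout describes the sequence from upstream to downstream:
      "PH"  pathogen + human            -> case 1
      "HP"  human + pathogen            -> case 2 (human_up = x)
      "HPH" human + pathogen + human    -> case 3 (human_up = x_up)
      "PHP" pathogen + human + pathogen -> case 4 (human_mid = y)
    """
    if layout == "PH":
        return [p]
    if layout in ("HP", "HPH"):
        return [p + human_up]
    if layout == "PHP":
        return [p, p + human_mid]
    raise ValueError(f"unknown layout: {layout}")


def normalized_support(count: int, useful_pairs_millions: float) -> float:
    """Supporting-read count divided by b, the useful read pairs in millions."""
    if useful_pairs_millions <= 0:
        raise ValueError("useful_pairs_millions must be positive")
    return count / useful_pairs_millions
```

For example, with 12 left-supporting and 8 right-supporting reads and b = 4 (4 M useful pairs), the normalized counts are 3.0 and 2.0, and the normalized total is their sum, 5.0.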
In some embodiments of the present invention, regarding the BWA re-alignment and breakpoint extraction step:
The sequences in the resulting output file must be de-duplicated once more, because the de-duplication operations in step 2 and step 5 do not take into account that a reverse-complement sequence may also be a duplicate. A final screening of the sequences in the result is therefore needed to remove redundant sequences and obtain the best result.
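A minimal sketch of strand-aware de-duplication, assuming that two sequences count as duplicates when one equals the other or its reverse complement. Keeping the first occurrence, and the function names, are assumptions of this example.

```python
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def dedup_strand_aware(seqs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each sequence, treating a sequence
    and its reverse complement as the same item."""
    seen = set()
    kept = []
    for seq in seqs:
        # Canonical key: the lexicographically smaller of the two strands.
        key = min(seq, reverse_complement(seq))
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept
```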
In some embodiments of the present invention, regarding the step of checking whether a replacement-type event exists: specifically, when two breakpoints on the human genome each show only left-supporting reads or only right-supporting reads, the sequence between the two breakpoint positions is the human genome sequence that has been replaced. The alignment positions of the left-supporting reads and of the right-supporting reads on the pathogen genome are then located, and the sequence between these two alignment positions is the pathogen genome sequence involved in the replacement. In the present invention, all replacement cases are output for arbitrary combinations of eligible breakpoints on the human genome.
In some embodiments of the present invention, regarding the step of calculating the length and type of the pathogen genome insert: specifically, the left-end supporting sequence and the right-end supporting sequence of a pathogen genome insert are identified, the alignment positions of these two supporting sequences on the pathogen genome are located, and the sequence between the two positions is the inserted fragment.
Regarding the step of calculating the actual capture efficiency in the present invention: specifically, the number of sequences in the BWA re-alignment that align both to the human genome hg19 and to the pathogen genome is counted and denoted A. The number of useful PE read pairs in the raw sequencing data is denoted B. The effective capture efficiency is then calculated as A/B.
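The two calculations above can be sketched as follows. The coordinate convention (0-based, end-exclusive, on a linear reference string) and the function names are assumptions of this example; wrap-around on a circular pathogen genome is not handled.

```python
def inserted_fragment(pathogen_genome: str, left_pos: int, right_pos: int) -> str:
    """Sequence between the pathogen-genome alignment positions of the
    left-end and right-end supporting sequences."""
    start, end = sorted((left_pos, right_pos))
    return pathogen_genome[start:end]


def capture_efficiency(a_both_aligned: int, b_useful_pairs: int) -> float:
    """Effective capture efficiency A/B, where A is the number of sequences
    aligning to both genomes and B the number of useful PE read pairs."""
    if b_useful_pairs <= 0:
        raise ValueError("B must be positive")
    return a_both_aligned / b_useful_pairs
```

The length of the insert is simply `len(inserted_fragment(...))`, that is, the distance between the two positions.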
In the present invention, the breakpoint information that can finally be found includes the breakpoint information of the pathogen genome fragment on the human genome and the breakpoint information of the pathogen genome fragment on the pathogen genome.
In the present invention, the bioinformatics analysis workflow involved is in theory applicable to signal detection for the insertion of the genetic material (DNA or RNA) of any pathogen into the human genome.
System for determining the manner of integration of a foreign gene in the human genome
According to a further aspect of the present invention, the present invention provides a system for determining the manner in which a foreign gene is integrated in the human genome. Referring to Fig. 3, the system comprises a capture device 100, a sequencing device 200, a first impurity removal device 300, a first alignment device 400, an assembly device 500, a second impurity removal device 600, a second alignment device 700 and an analysis device 800. As shown in Fig. 3, these devices are connected in sequence along the process flow.
According to an embodiment of the present invention, the capture device 100 uses capture probes to capture, from a human genomic nucleic acid sample, DNA fragments that may contain integrated foreign gene fragments. According to an embodiment of the present invention, the type of foreign gene that can be analysed with the method of the present invention is not particularly limited, provided that it can integrate into the human genome and that its gene sequence is obtainable or already known. According to an embodiment of the present invention, the foreign gene to be studied may be a pathogen genome. In addition, according to a specific example of the present invention, the pathogen is HBV. The integration of a pathogen such as HBV into the human genome can thus be analysed effectively.
According to an embodiment of the present invention, the sequencing device 200 sequences the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data. According to an embodiment of the present invention, the manner of sequencing the captured DNA fragments is not particularly limited. According to an embodiment of the present invention, the sequencing is performed on a second-generation sequencing platform. According to an embodiment of the present invention, the whole-genome sequencing library may be sequenced using at least one selected from Hiseq2000, SOLiD, 454 and single-molecule sequencing devices. The high-throughput, deep-sequencing characteristics of these devices can thus be exploited to further improve the efficiency of determining single-cell chromosomal aneuploidy. Of course, those skilled in the art will understand that other sequencing methods and devices may also be used for whole-genome sequencing, for example third-generation sequencing technology and more advanced sequencing technologies that may be developed in the future. According to an embodiment of the present invention, the length of the sequencing data obtained by whole-genome sequencing is not particularly limited. According to an embodiment of the present invention, the read length is preferably 100 bp, which can further improve the analysis.
According to an embodiment of the present invention, the first impurity removal device 300 performs a first impurity removal on the sequencing result to obtain a sequencing result that has undergone the first impurity removal. According to an embodiment of the present invention, the type of the first impurity removal is not particularly limited; for example, it may comprise at least one of removing PCR duplicates, removing low-quality sequencing data and removing adaptor-containing sequencing data. The analysis efficiency can thus be further improved.
According to an embodiment of the present invention, the first alignment device 400 performs a first alignment of the sequencing result that has undergone the first impurity removal against the known human genome sequence and the foreign gene sequence, to obtain sequencing data that may contain foreign gene integration fragments. According to an embodiment of the present invention, SOAP may be used for the first alignment. The analysis efficiency can thus be further improved.
According to an embodiment of the present invention, the assembly device 500 assembles the obtained sequencing data that may contain foreign gene integration fragments to obtain an assembly result composed of a plurality of assembly data. According to an embodiment of the present invention, the assembly is based on the overlap between sequencing data.
According to an embodiment of the present invention, the second impurity removal device 600 performs a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal. According to an embodiment of the present invention, the second impurity removal further comprises removing duplicate assembly data.
According to an embodiment of the present invention, the second alignment device 700 performs a second alignment of the assembly result that has undergone the second impurity removal against the known human genome sequence and the foreign gene sequence. According to an embodiment of the present invention, the second alignment is performed using BWA.
The analysis device 800 determines, based on the second alignment result, the manner in which the foreign gene is integrated in the human genome. According to an embodiment of the present invention, this determination further comprises selecting the assembly data that align both to the known human genome sequence and to the foreign gene sequence; such assembly data contain human genome breakpoint information and foreign gene breakpoint information. According to an embodiment of the present invention, it is further possible, based on the human genome breakpoint information and the foreign gene breakpoint information, to judge whether a replacement variation exists, or to determine at least one of the length and the type of the foreign gene insertion in the human genome, for example the insertion length and type of at least a part of the foreign genes in the human genome.
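The sequential connection of the computational devices 300 to 800 can be sketched as a simple chain of stages, each consuming the output of the previous one. This is a structural illustration only: the stage names and the `run_pipeline` helper are hypothetical, and the real stages would wrap the filtering, alignment, assembly and analysis steps described above.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]

# Order of the computational stages, mirroring devices 300 to 800.
STAGE_ORDER = [
    "first_impurity_removal",    # device 300
    "first_alignment",           # device 400
    "assembly",                  # device 500
    "second_impurity_removal",   # device 600
    "second_alignment",          # device 700
    "analysis",                  # device 800
]


def run_pipeline(data: Any, stages: List[Stage]) -> Any:
    """Feed the output of each stage into the next one, in order."""
    for _name, stage in stages:
        data = stage(data)
    return data
```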
It should be noted that, as those skilled in the art will understand, the features and advantages of the method described above for determining the manner in which a foreign gene is integrated in the human genome also apply to the system for determining the manner in which a foreign gene is integrated in the human genome; for brevity they are not repeated here.
Computer-readable medium
In a further aspect of the present invention, the present invention provides a computer-readable medium. According to an embodiment of the present invention, instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor so as to determine the manner in which a foreign gene is integrated in the human genome by the following steps: performing a first impurity removal on a sequencing result to obtain a sequencing result that has undergone the first impurity removal; performing a first alignment of that sequencing result against the known human genome sequence and the foreign gene sequence to obtain sequencing data that may contain foreign gene integration fragments; assembling the obtained sequencing data to obtain an assembly result composed of a plurality of assembly data; performing a second impurity removal on the assembly result to obtain an assembly result that has undergone the second impurity removal; and performing a second alignment of that assembly result against the known human genome sequence and the foreign gene sequence and, based on the second alignment result, determining the manner in which the foreign gene is integrated in the human genome. The sequencing result is obtained by capturing, with capture probes, DNA fragments that may contain integrated foreign gene fragments from a human genomic nucleic acid sample, and sequencing the captured DNA fragments to obtain a sequencing result composed of a plurality of sequencing data. With this computer-readable medium, the manner in which a foreign gene, for example a pathogen genome, is integrated in the human genome can be determined effectively.
For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fibre device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the embodiments described above, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, may each exist physically on their own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.
It should be noted that, as those skilled in the art will understand, the features and advantages of the method described above for determining the manner in which a foreign gene is integrated in the human genome also apply to the computer-readable medium; for brevity they are not repeated here.
The solution of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the present invention and should not be regarded as limiting its scope. Where no specific technique or condition is indicated in the examples, the techniques or conditions described in the literature in the field (for example J. Sambrook et al., "Molecular Cloning: A Laboratory Manual", 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions were followed. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products, for example purchasable from Illumina.
Example 1
Sample library preparation
1. Sample source
The sample was liver cancer tissue from one patient, for which whole-genome sequencing data and breakpoint information derived from the whole-genome data were available.
2. Preliminary experiments
The preliminary experiments comprised the following steps:
(1) DNA extraction.
(2) Preparation of the sample library
The library was constructed according to Illumina's standard library preparation protocol (Paired-End Sample Preparation Guide): genomic DNA was sheared with a Covaris s2, the ends were blunted and repaired, an A was added to the ends, adaptors were ligated, and the adaptor-ligated fragments were amplified by PCR to obtain the sample library.
(3) Preparation of HBV capture probes
Probes were designed for HBV types b and c. Each probe is 60 bp long, and the design principle is that two adjacent probes overlap by 55 bp. Specifically, the HBV probes were prepared by the following steps: primer design, PCR, PCR purification and electrophoresis check, fragmentation of the PCR products, electrophoresis check of the fragmented products, and probe storage.
(4) Hybridization of the HBV capture probes with the sample library, and sequencing
Using the Nimblegen hybridization platform, the HBV capture probes were hybridized with the sample library, followed by post-hybridization elution, PCR amplification and sequencing of the PCR products. Sequencing was performed on the Hiseq 2000 platform, following the c-Bot and Hiseq 2000 (PE sequencing) manuals officially published by Illumina/Solexa. The main sequencing library had a length of 170 bp, the read length was 100 bp, and the sequencing yield was 1 Gbp.
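The probe design rule stated in step (3) above (60 bp probes, with adjacent probes overlapping by 55 bp) amounts to tiling the target with a 5 bp step. The sketch below illustrates that tiling on a linear target string; the function name and the handling of the target ends are assumptions, and it does not model the wet-lab probe preparation.

```python
from typing import List


def tile_probes(target: str, probe_len: int = 60, overlap: int = 55) -> List[str]:
    """Return probes of probe_len bases whose neighbours share `overlap` bases."""
    step = probe_len - overlap   # 60 - 55 = 5 bp between probe start positions
    if step <= 0:
        raise ValueError("overlap must be smaller than probe_len")
    last_start = len(target) - probe_len
    return [target[i:i + probe_len] for i in range(0, last_start + 1, step)]
```

With these defaults, a target of length L (L >= 60) yields (L - 60) // 5 + 1 probes.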
3. Bioinformatics analysis
(1) Removing PCR duplicates, low-quality sequencing data and adaptor-containing sequencing data
After the raw sequencer output was obtained, PCR duplicates, low-quality sequencing data and adaptor-containing sequencing data were removed from it.
Strategy for removing PCR duplicates: two sequences that are completely identical are regarded as duplicates. When one read of a PE pair is a duplicate, the whole pair is removed.
Strategy for removing low-quality sequencing data: a read is regarded as low quality when the bases with a sequencing quality value less than or equal to 5 account for 50% or more of its total bases. When one read of a PE pair is low quality, the whole pair is removed.
Strategy for removing adaptor-containing sequencing data: a read is regarded as adaptor-containing when it contains a stretch of adaptor sequence. When one read of a PE pair is adaptor-containing, the whole pair is removed.
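The three pair-level filtering strategies above can be sketched as follows. Several details are assumptions made for illustration: reads are (sequence, quality list) tuples, the adaptor is detected by exact substring match, the first occurrence of a duplicated read is kept, and "50% or more" is applied as an inclusive threshold.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]   # (bases, per-base quality values)
Pair = Tuple[Read, Read]


def is_low_quality(quals: Sequence[int], q_cutoff: int = 5,
                   min_fraction: float = 0.5) -> bool:
    """True if bases with quality <= q_cutoff make up min_fraction or more."""
    if not quals:
        return True
    low = sum(1 for q in quals if q <= q_cutoff)
    return low >= min_fraction * len(quals)


def filter_pairs(pairs: List[Pair], adaptor: str) -> List[Pair]:
    """Drop a whole pair if either read is low quality, contains the
    adaptor, or is identical to a read that was already kept."""
    seen = set()
    kept = []
    for (seq1, q1), (seq2, q2) in pairs:
        if is_low_quality(q1) or is_low_quality(q2):
            continue
        if adaptor in seq1 or adaptor in seq2:
            continue
        if seq1 in seen or seq2 in seen:
            continue
        seen.update((seq1, seq2))
        kept.append(((seq1, q1), (seq2, q2)))
    return kept
```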
4. SOAP alignment, selection of the required sequencing data, and calculation of alignment rates
The processed sequencing data were aligned separately to the human genome hg19 and to the HBV genome. Because HBV has multiple subtypes, the HBV genome used here comprises the genome sequences of 23 subtypes (AB014381.1, AB032431.1, AB033554.1, AB036910.1, AB064310.1, AF090842.1, AF100309.1, AF160501.1, AF223965.1, AF405706.1, AY090454.1, AY090457.1, AY090460.1, AY123041.1, D00329.1, M32138.1, X02763.1, X04615.1, X51970.1, X65259.1, X69798.1, X75657.1, X85254.1). After alignment, the pairing relationship between the two alignment results was analysed to select the sequencing data that may contain viral integration fragments. The proportion of useful sequences in the raw sequencing data, and the human genome alignment rate and HBV genome alignment rate among the useful sequences, were also calculated. SOAP alignment parameters: -m 138 -x 198 -p 8 -l 40 -v 5 -r 1. Table 2 below is the data quality report, Table 3 is the classification of the sequencing data based on the preliminary SOAP alignment results, and Table 4 gives the alignment rate statistics.
Table 2. Data quality report
Figure imgf000015_0001
GC content  48.78; 49.08
Proportion of adaptor-containing sequencing data  0.07%
Proportion of low-quality sequencing data  6.32%
PCR duplication rate  5.14%
Proportion of effective data  88.47%
Table 3. Nine-cell grid classification of the sequencing data
Figure imgf000016_0001
Table 4. Alignment rate statistics
Figure imgf000016_0002
5. PER assembly
The sequencing data obtained in the third step that may contain viral integration fragments were subjected to PER assembly. PER refers to the assembly of paired-end sequencing data: based on the overlap between the sequences, the two reads of each PE pair obtained by paired-end sequencing are assembled into one sequence. The assembly success rate was 94.33%.
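Overlap-based assembly of a read pair, as described above, can be sketched as follows. This is not the PER tool itself: the longest-overlap-first search, the `min_overlap` and `max_mismatch` parameters and their default values are assumptions of this illustration. With 100 bp reads from a 170 bp fragment, the expected overlap is 2 x 100 - 170 = 30 bp.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch: int = 1) -> Optional[str]:
    """Merge a read pair into one sequence using the overlap between
    read1 and the reverse complement of read2; None if no overlap fits."""
    mate = reverse_complement(read2)
    # Try the longest candidate overlap first, then shorter ones.
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olap:], mate[:olap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch:
            return read1 + mate[olap:]
    return None
```

Allowing a small number of mismatches in the overlap means that pairs differing only there can assemble to the same sequence, which is why a further de-duplication pass after assembly is useful.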
6. Removing duplicate sequences again
PER assembly yields a set of assembled sequences, and this set is deduplicated once more. The strategy used here is the deduplication strategy for SE sequencing data: when a sequence appears as a duplicate, that sequence is removed. This removed 1.823% of the sequences as duplicates, leaving 850,696 usable assembled sequences.
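The single-end style deduplication can be sketched in Python as follows. The sketch assumes that duplicates are exact string matches and that the first occurrence of each sequence is kept; neither detail is specified in the text, and reverse-complement duplicates are not handled.

```python
def deduplicate(sequences):
    """Remove exact duplicate sequences, keeping the first occurrence.

    Returns the unique sequences in input order and the fraction of
    input sequences that were removed.
    """
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    removed = 1 - len(unique) / len(sequences) if sequences else 0.0
    return unique, removed


unique, removed = deduplicate(["ACGT", "TTGA", "ACGT", "GGCC"])
```

In this toy input one of four sequences is a duplicate, so a quarter of the input is removed.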
7. BWA realignment and extraction of breakpoint information
The deduplication in the fifth step yields a set of sequences. BWA was then used to realign this set separately to the human genome hg19 and to the HBV genome. By analyzing the result files of the two alignments, the sequences that align to both the human genome hg19 and the HBV genome were selected; these sequences contain breakpoint information. Their alignments to hg19 and to the HBV genome were analyzed separately to obtain the integration of HBV in the human genome and the distribution of the breakpoints on the HBV genome. In total 33 breakpoints were found, 8 of which passed the threshold. The results were deduplicated once more to give the final result. Table 5 below shows the most significant HBV insertion sites in the human genome hg19 found by the present invention.
Table 5. HBV insertion breakpoint information
Figure imgf000017_0001
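The selection of sequences that align to both references, followed by thresholding of breakpoints, can be sketched in Python. The input format (one breakpoint coordinate per sequence, already extracted from the alignment), the function name and the `min_support` value are hypothetical; the text does not state the threshold that was actually used.

```python
from collections import Counter


def find_breakpoints(human_hits, hbv_hits, min_support=2):
    """Count supporting sequences per human breakpoint.

    human_hits: dict seq id -> (chromosome, breakpoint position)
    hbv_hits:   dict seq id -> breakpoint position on the HBV genome
    Only sequences present in both dicts (aligned to both references)
    are treated as carrying breakpoint information.
    """
    shared = set(human_hits) & set(hbv_hits)
    support = Counter(human_hits[seq_id] for seq_id in shared)
    passing = {bp: n for bp, n in support.items() if n >= min_support}
    return support, passing


human_hits = {
    "s1": ("chr1", 1000),
    "s2": ("chr1", 1000),
    "s3": ("chr2", 500),
    "s4": ("chr3", 42),  # aligns to human only, so it is ignored
}
hbv_hits = {"s1": 1800, "s2": 1805, "s3": 300}
support, passing = find_breakpoints(human_hits, hbv_hits)
```

With these toy inputs two breakpoints are found and one of them passes the assumed support threshold, mirroring the distinction between all breakpoints and those above the threshold.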
8. Checking for replacement-type events
The relationship between human genome breakpoints and HBV genome breakpoints is analyzed to check for replacement-type variation. Specifically, when two breakpoints on the human genome lie within 500 bp of each other and each of them is supported by sequencing data on only its left side or only its right side, a replacement has very likely occurred between the two breakpoints. The HBV genome alignment information corresponding to the two breakpoints is then retrieved to determine the type and position of the replacement. Table 6 below shows the 7 replacement events found by the present invention.
Table 6. Replacement types
Figure imgf000017_0002
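The replacement check can be sketched in Python as a pairwise scan over breakpoints. The `Breakpoint` record and its fields are hypothetical. The sketch applies only the two conditions stated in the text (distance within 500 bp, one-sided support at both breakpoints) and does not require the two breakpoints to be supported on opposite sides, since the text does not say so.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Breakpoint:
    chrom: str
    pos: int
    left_support: int   # number of sequences supporting the left side
    right_support: int  # number of sequences supporting the right side

    def one_sided(self) -> bool:
        """True if exactly one side of the breakpoint has support."""
        return (self.left_support > 0) != (self.right_support > 0)


def candidate_replacements(breakpoints, max_distance=500):
    """Return breakpoint pairs that may flank a replacement event."""
    ordered = sorted(breakpoints, key=lambda bp: (bp.chrom, bp.pos))
    pairs = []
    for a, b in combinations(ordered, 2):
        close = a.chrom == b.chrom and abs(a.pos - b.pos) <= max_distance
        if close and a.one_sided() and b.one_sided():
            pairs.append((a, b))
    return pairs


bps = [
    Breakpoint("chr1", 1000, 5, 0),
    Breakpoint("chr1", 1300, 0, 4),
    Breakpoint("chr1", 5000, 3, 0),
    Breakpoint("chr2", 1100, 2, 2),
]
pairs = candidate_replacements(bps)
```

Only the two chr1 breakpoints at 1000 and 1300 satisfy both conditions in this toy input. The pairwise scan is quadratic, which is acceptable for the few dozen breakpoints reported here.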
9. Calculating the length and type of HBV inserts
By analyzing the relationship between human genome breakpoints and HBV genome breakpoints, the breakpoints from which an HBV insert can be computed are identified, and the length and type of the HBV insert at these breakpoints are calculated. Specifically, the left-end and right-end supporting sequences of a viral insert are found, their alignment positions on the HBV genome are located, and the sequence between the two positions is the insert. Only one insertion breakpoint here met the requirements for determining the insert type. Table 7 below shows the type of the insert found by the present invention.
Table 7. HBV insert type
Figure imgf000018_0001
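The insert calculation can be sketched in Python. The sketch assumes 1-based inclusive coordinates, that both supporting sequences align to the same HBV reference, and that the insert does not wrap around the origin of the circular HBV genome. The function name, the `references` dictionary and the `demo_ref` identifier are hypothetical.

```python
def extract_insert(references, accession, left_pos, right_pos):
    """Return the HBV segment between two alignment positions.

    references: dict mapping a reference id to its sequence
    left_pos / right_pos: 1-based positions on that reference where the
    left-end and right-end supporting sequences align (assumed inclusive)
    """
    start, end = sorted((left_pos, right_pos))
    sequence = references[accession][start - 1:end]
    return {
        "type": accession,  # the subtype is taken to be the matched reference
        "start": start,
        "end": end,
        "length": len(sequence),
        "sequence": sequence,
    }


references = {"demo_ref": "ACGTACGTAC"}
insert = extract_insert(references, "demo_ref", 3, 7)
```

For the toy reference the positions 3 and 7 delimit a 5 bp insert. A real implementation would need to decide how to treat the boundary bases and inserts that span the origin of the circular genome.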
10. Calculating the effective capture efficiency
The actual efficiency of probe capture is calculated from the realignment results together with the initial sequence information. Specifically, the number of sequences in the BWA realignment that align to both the human genome hg19 and the HBV genome is denoted A, and the number of useful PE read pairs in the raw sequencing data is denoted B. The effective capture efficiency is then A/B. The effective capture efficiency calculated here is 0.0001059.
Industrial applicability
The method, system and computer-readable medium of the present invention for determining the manner of integration of a foreign gene in the human genome can be used effectively to determine how a foreign gene, for example a pathogen genome, is integrated in the human genome.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for determining the manner of integration of a foreign gene in the human genome, characterized by comprising:
capturing, with capture probes, DNA fragments that may contain integrated foreign gene fragments from a human genomic nucleic acid sample;
sequencing the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data;
performing a first filtering on the sequencing result to obtain a first-filtered sequencing result;
performing a first alignment of the first-filtered sequencing result against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain foreign gene integration fragments;
assembling the obtained sequencing data that may contain foreign gene integration fragments to obtain an assembly result consisting of a plurality of assembled data;
performing a second filtering on the assembly result to obtain a second-filtered assembly result; and
performing a second alignment of the second-filtered assembly result against the known human genome sequence and the foreign gene sequence, and determining, based on the result of the second alignment, the manner of integration of the foreign gene in the human genome.
2. The method according to claim 1, characterized in that the foreign gene is a pathogen genome.
3. The method according to claim 2, characterized in that the pathogen is HBV.
4. The method according to claim 1, characterized in that the sequencing is performed on a second-generation sequencing platform.
5. The method according to claim 1, characterized in that the first filtering further comprises at least one of removing PCR duplicates, removing low-quality sequencing data and removing adapter-containing sequencing data.
6. The method according to claim 1, characterized in that the first alignment is performed using SOAP.
7. The method according to claim 1, characterized in that the assembly is performed based on the overlap between sequencing data.
8. The method according to claim 1, characterized in that the second filtering further comprises removing duplicate assembled data.
9. The method according to claim 1, characterized in that the second alignment is performed using BWA.
10. The method according to claim 9, characterized in that determining, based on the result of the second alignment, the manner of integration of the foreign gene in the human genome further comprises:
selecting the assembled data that align to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
11. The method according to claim 10, characterized by further comprising:
determining, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement variation is present; or
determining, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insertion in the human genome.
12. A system for determining the manner of integration of a foreign gene in the human genome, characterized by comprising:
a capture device adapted to capture, with capture probes, DNA fragments that may contain integrated foreign gene fragments from a human genomic nucleic acid sample;
a sequencing device connected to the capture device and adapted to sequence the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data;
a first filtering device connected to the sequencing device and adapted to perform a first filtering on the sequencing result to obtain a first-filtered sequencing result;
a first alignment device connected to the first filtering device and adapted to perform a first alignment of the first-filtered sequencing result against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain foreign gene integration fragments;
an assembly device connected to the first alignment device and adapted to assemble the obtained sequencing data that may contain foreign gene integration fragments to obtain an assembly result consisting of a plurality of assembled data;
a second filtering device connected to the assembly device and adapted to perform a second filtering on the assembly result to obtain a second-filtered assembly result;
a second alignment device connected to the second filtering device and adapted to perform a second alignment of the second-filtered assembly result against the known human genome sequence and the foreign gene sequence; and
an analysis device adapted to determine, based on the result of the second alignment, the manner of integration of the foreign gene in the human genome.
13. The system according to claim 12, characterized in that the sequencing device is a second-generation sequencing platform.
14. The system according to claim 12, characterized in that the first filtering device further comprises at least one of:
a unit adapted to remove PCR duplicates;
a unit adapted to remove low-quality sequencing data; and
a unit adapted to remove adapter-containing sequencing data.
15. The system according to claim 12, characterized in that the first alignment device is adapted to perform the alignment using SOAP.
16. The system according to claim 12, characterized in that the assembly device is adapted to perform the assembly based on the overlap between sequencing data.
17. The system according to claim 12, characterized in that the second filtering device further comprises a unit adapted to remove duplicate assembled data.
18. The system according to claim 12, characterized in that the second alignment device is adapted to perform the alignment using BWA.
19. The system according to claim 18, characterized in that the analysis device is adapted to select the assembled data that align to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
20. The system according to claim 19, characterized in that the analysis device is adapted to:
determine, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement variation is present; or
determine, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insertion in the human genome.
21. A computer-readable medium, characterized in that instructions are stored on the computer-readable medium, the instructions being adapted to be executed by a processor so as to determine the manner of integration of a foreign gene in the human genome by the following steps:
performing a first filtering on a sequencing result to obtain a first-filtered sequencing result;
performing a first alignment of the first-filtered sequencing result against a known human genome sequence and a foreign gene sequence to obtain sequencing data that may contain foreign gene integration fragments;
assembling the obtained sequencing data that may contain foreign gene integration fragments to obtain an assembly result consisting of a plurality of assembled data;
performing a second filtering on the assembly result to obtain a second-filtered assembly result; and
performing a second alignment of the second-filtered assembly result against the known human genome sequence and the foreign gene sequence, and determining, based on the result of the second alignment, the manner of integration of the foreign gene in the human genome,
wherein the sequencing result is obtained by the following steps:
capturing, with capture probes, DNA fragments that may contain integrated foreign gene fragments from a human genomic nucleic acid sample; and
sequencing the captured DNA fragments to obtain a sequencing result consisting of a plurality of sequencing data.
22. The computer-readable medium according to claim 21, characterized in that the foreign gene is a pathogen genome.
23. The computer-readable medium according to claim 22, characterized in that the pathogen is HBV.
24. The computer-readable medium according to claim 21, characterized in that the sequencing is performed on a second-generation sequencing platform.
25. The computer-readable medium according to claim 21, characterized in that the first filtering further comprises at least one of removing PCR duplicates, removing low-quality sequencing data and removing adapter-containing sequencing data.
26. The computer-readable medium according to claim 21, characterized in that the first alignment is performed using SOAP.
27. The computer-readable medium according to claim 21, characterized in that the assembly is performed based on the overlap between sequencing data.
28. The computer-readable medium according to claim 21, characterized in that the second filtering further comprises removing duplicate assembled data.
29. The computer-readable medium according to claim 21, characterized in that the second alignment is performed using BWA.
30. The computer-readable medium according to claim 29, characterized in that determining, based on the result of the second alignment, the manner of integration of the foreign gene in the human genome further comprises:
selecting the assembled data that align to both the known human genome sequence and the foreign gene sequence, the assembled data containing human genome breakpoint information and foreign gene breakpoint information.
31. The computer-readable medium according to claim 30, characterized by further comprising:
determining, based on the human genome breakpoint information and the foreign gene breakpoint information, whether a replacement variation is present; or
determining, based on the human genome breakpoint information and the foreign gene breakpoint information, at least one of the length and the type of the foreign gene insertion in the human genome.
PCT/CN2012/078311 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome WO2014005329A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2012/078311 WO2014005329A1 (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome
CN201280074522.5A CN104428423A (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/078311 WO2014005329A1 (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome

Publications (1)

Publication Number Publication Date
WO2014005329A1 true WO2014005329A1 (en) 2014-01-09

Family

ID=49881273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/078311 WO2014005329A1 (en) 2012-07-06 2012-07-06 Method and system for determining integration manner of foreign gene in human genome

Country Status (2)

Country Link
CN (1) CN104428423A (en)
WO (1) WO2014005329A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3056428A1 (en) * 2017-03-20 2018-09-27 Illumina, Inc. Methods and compositions for preparing nucleic acid libraries
TW202020165A (en) * 2018-06-29 2020-06-01 美商格瑞爾公司 Nucleic acid rearrangement and integration analysis

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199772A (en) * 2019-12-27 2020-05-26 上海派森诺生物科技股份有限公司 PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing
CN111199772B (en) * 2019-12-27 2023-05-23 上海派森诺生物科技股份有限公司 PEDV (porcine reproductive and respiratory syndrome Virus) genome analysis method based on second-generation sequencing
CN111584003A (en) * 2020-04-10 2020-08-25 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration
CN111584003B (en) * 2020-04-10 2022-05-10 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration
CN113957130A (en) * 2021-09-27 2022-01-21 江汉大学 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment
CN113957130B (en) * 2021-09-27 2023-12-22 江汉大学 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment

Also Published As

Publication number Publication date
CN104428423A (en) 2015-03-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12880603

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 03/07/2015)

122 Ep: pct application non-entry in european phase

Ref document number: 12880603

Country of ref document: EP

Kind code of ref document: A1