WO2013097149A1 - Method and device for estimating repeating sequence content of genome - Google Patents

Method and device for estimating repeating sequence content of genome Download PDF

Info

Publication number
WO2013097149A1
WO2013097149A1 PCT/CN2011/084928 CN2011084928W WO2013097149A1 WO 2013097149 A1 WO2013097149 A1 WO 2013097149A1 CN 2011084928 W CN2011084928 W CN 2011084928W WO 2013097149 A1 WO2013097149 A1 WO 2013097149A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
sequence
sequences
depth
repeat
Prior art date
Application number
PCT/CN2011/084928
Other languages
French (fr)
Chinese (zh)
Inventor
郑泽群
陶晔
冯子浩
汪健
杨焕明
王俊
Original Assignee
深圳华大基因科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技服务有限公司 filed Critical 深圳华大基因科技服务有限公司
Priority to PCT/CN2011/084928 priority Critical patent/WO2013097149A1/en
Publication of WO2013097149A1 publication Critical patent/WO2013097149A1/en

Links

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to the field of bioinformatics technology, and more particularly to a method and apparatus for estimating the content of a genomic repeat sequence. Background technique
  • a repeat sequence refers to a segment of DNA that has multiple copies on an individual's genome.
  • the second generation of DNA sequencing technology is a high-throughput, low-cost sequencing technology.
  • the basic principle is sequencing while synthesizing. Taking the solexa sequencing method as an example, the DNA strand is randomly interrupted by physical means, and then a specific linker is added to both ends of the fragment, and an amplified primer sequence is added to the linker.
  • DNA polymerase synthesizes the complementary strand of the fragment to be tested, and reads the sequence of the fragment to be tested by detecting the fluorescent signal carried by the newly synthesized base. These sequences are called sequencing fragments or reads ( Http://www.illumina.com ).
  • Sequencing and re-engineering a species of DNA molecules generally requires a general understanding of the sequence of the species. Since sequence assembly is to restore the sequence information of the genome by overlapping the overlapping segments. In this case, if the repeat sequence content is too high, the sequencing data obtained by the whole genome shotgun method will not be ideal for Denovo assembly. Therefore, it is often necessary to perform a Genome Survey prior to assembly in Denovo to understand the repeat content of the genome.
  • the kmer frequency distribution map is obtained by using the reads data, thereby estimating the genomic heterozygosity rate.
  • the specific method is to assume that there is a complete continuous sequence and randomly select the length of the segment. For K, the fragment is called kmer. Therefore, when the length of reads is L and the length of kmer is taken as K, then L-K+1 kmer can be obtained on one read. Then, by counting the frequency of occurrence of different types of kmer on all reads, the kmer frequency distribution map can be obtained.
  • the specific process is shown in Figure 1.
  • the frequency distribution of the genome kmer can be approximated as ⁇ from the Poisson distribution.
  • the depth of sequencing corresponding to the peak is the average sequencing depth of the genome.
  • the genomic repeat sequence is relatively high, repeated peaks or tailing will occur behind the main peak of the kmer distribution.
  • the artificial reads are artificially set to generate simulated reads that are consistent with the depth of sequencing of the target genome, and then the kmer frequency distribution map is obtained by simulating reads.
  • different repeat sequences are set to estimate the repeat sequence content of the target genome, as shown in Figure 2.
  • the present invention has been made in view of the above problems.
  • a first aspect of the invention provides a method of estimating the content of a genomic repeat sequence, comprising: obtaining a RAD single-end sequencing sequence of a somatic genome (reads); filtering the RAD single-ended sequencing sequence to remove the unqualified sequencing sequence; counting the sequencing sequences with the same sequence to obtain the depth information of each sequencing sequence; filtering out the sequencing sequence with a sequencing depth of 1; Each pair of sequencing sequences is subjected to a pairwise alignment of unallowable gaps, and all sequencing sequences satisfying the alignment conditions are clustered; a threshold is selected according to the depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeat sequence; the repeat sequence content of the individual genome is obtained based on the determined repeat sequence.
  • the number of allowable mismatches for the pairwise alignment of the non-allowing gaps is determined according to the length of the sequence of the sequences, i.e., the alignment conditions of the pairwise alignments that do not allow the gaps are determined based on the length of the sequencing sequence.
  • the selecting a threshold according to the depth value of each clustering result comprises: counting a depth value of each clustering result; and further counting the number of sequencing sequences in each sequencing depth; selecting according to the depth value and the number of sequencing sequences a certain depth value as a threshold,
  • the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results; in one embodiment of the present invention, less than the The number of sequencing sequences of depth values accounted for 95% of the number of sequencing sequences in all clustering results.
  • the unsuccessful sequencing sequence comprises: a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the sequencing result in the sequencing sequence is uncertain a sequence in which the number of bases exceeds 10% of the number of bases of the entire sequencing sequence; and/or a sequence in which the exogenous sequence is present; and/or a sequence in which the first few bases are not the ends of the restriction endonuclease .
  • obtaining the repeat sequence content of the individual genome according to the determined repeat sequence comprises: counting the number of clustering results above the threshold, multiplying the length of the sequencing sequence to obtain the total length of the single copy of the repeat sequence; using the average sequencing of the repeat sequence Depth is divided by the average sequencing depth of the non-repetitive sequence to obtain the copy number of the repeat sequence; The total length of the single copy is multiplied by the copy number of the repeat sequence to obtain the total length of the repeat sequence of the RAD sequencing position; the total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat and non-repetitive sequences is the RAD The repeat sequence content at the sequencing position, i.e., approximately the repeat sequence content of the individual genome.
  • Another aspect of the present invention provides an apparatus for estimating a genomic repeat sequence content, comprising: a sequencing sequence acquisition device for obtaining a RAD single-end sequencing sequence of a certain genome; a sequencing sequence filtering device for obtaining an RAD single The end sequencing sequence is filtered to remove the unqualified sequencing sequence; the sequencing depth determining device is used to count the sequencing sequence with the same sequence to obtain the depth information of each sequencing sequence; the sequence depth filtering device is used to filter out the sequencing depth 1 a sequencing sequence; a clustering device for performing a pairwise alignment of unacceptable gaps between each of the obtained sequencing sequences, clustering all sequencing sequences satisfying the alignment conditions; a repeating sequence determining device for each The depth value of the clustering result is selected as a threshold, and the sequencing sequence above the threshold is determined as a repeating sequence; and the repeating sequence content obtaining device is configured to obtain the repeating sequence content of the individual genome according to the determined re-listing.
  • the number of allowable mismatches for the pairwise alignment of the non-allowing gaps is determined according to the length of the sequence of the sequences, i.e., the alignment conditions of the pairwise alignments that do not allow the gaps are determined based on the length of the sequencing sequence.
  • the repeating sequence determining device comprises: a clustering result depth statistic unit for counting the depth value of each clustering result; a sequencing depth distribution statistic unit for counting the number of sequencing sequences at each sequencing depth; threshold selection a unit, configured to select a depth value as a threshold according to the depth value and the number of sequencing sequences, and the sequencing sequence above the threshold is determined as a repeated sequence; the depth value to be satisfied as the threshold is: a sequence smaller than the depth value The number of sequences accounted for 94%-96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences less than the depth value is 95% of the number of sequencing sequences in all clustering results.
  • the unqualified sequencing sequence comprises: the sequencing quality is lower than a predetermined low quality
  • the number of bases of the threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the number of bases whose sequencing results are undefined in the sequencing sequence exceeds 10% of the number of bases of the entire sequencing sequence a sequencing sequence; and/or a sequencing sequence in which a foreign sequence is present; and/or a starting sequence of several bases that are not a restriction endonuclease sequence.
  • An advantage of the present invention is that the partial sequence of the genome allows for easy estimation of the replied sequence content of the genome, reduces sequencing costs and computational resource costs, and does not require known genomic data for simulation, simplifying the processing steps.
  • 1 is a schematic flow chart showing a kmer frequency distribution map by sequencing reads in the prior art
  • the abscissa in the figure represents the sequencing depth of kmer, and the ordinate represents the percentage of the kmer species with a certain sequencing depth as a percentage of the total kmer species;
  • Figure 2 is a schematic diagram showing the hybridization rate of a target genome by the Arabidopsis genome in the prior art
  • Figure 3 shows a schematic diagram of the various steps of the RAD sequencing technique
  • Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention.
  • Figure 5 is a schematic diagram showing an example of RAD single-end sequencing of a genome
  • Figure 6 is a schematic diagram showing the depth information of a sequencing sequence
  • Figure 7 is a schematic diagram showing the depth information storage of a sequencing sequence
  • Figure 8 is a flow chart showing an example of the repeat sequence determination of the present invention
  • Figure 9 is a view showing a sequencing depth profile of an application of the method for estimating the genomic repeat sequence of the present invention
  • Figure 10 shows one of the methods of estimating the genomic repeat content of the present invention. Schematic diagram of an application example
  • Figure 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention.
  • Fig. 12 is a view showing the structure of another embodiment of the apparatus for estimating the genomic repeat sequence of the present invention. detailed description
  • the present disclosure provides a new bioinformatics analysis program that processes RAD (Eestriction-site Associated ⁇ , P ⁇ ⁇ ⁇ ⁇ ) )) data to find duplicates on RAD sequencing fragments Sequence letter
  • RAD Engineriction-site Associated ⁇ , P ⁇ ⁇ ⁇ ⁇ )
  • RAD sequencing technology adopts a new database construction method.
  • the specific process of sequencing is shown in Figure 3.
  • the specific site of DNA is cleaved by restriction endonuclease, and the DNA molecules after digestion are randomly interrupted by physical methods.
  • the agarose gel DNA separation technique selects a DNA molecule of a specific length, and then adds a specific amplification linker and a sequencing linker at the end of the selected DNA to construct a library for high-throughput sequencing.
  • the repeat sequence content refers to the total length of the repeat sequences in the sequencing sequence divided by the sum of the sum of the total length of the repeat and non-repetitive sequences.
  • the pairwise alignment that does not allow the gap means that the vacancy is not allowed when the alignment is performed. That is, the situation of the open space alignment is not considered.
  • the comparison result of the following two sequences does not satisfy the two-two comparison condition that does not allow the gap:
  • Reads refers to the sequencing fragments produced by the sequencer.
  • Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention.
  • step 402 a RAD single-end sequencing sequence of a certain body genome is obtained.
  • Figure 5 shows a schematic of an example of RAD single-ended sequencing.
  • the palindrome sequence of "G A AATTC” on the DNA molecule is identified by the P-type endonuclease Ecorl, and the DNA molecule is cleaved between G and A, and the digested DNA molecule is used.
  • the physical method is interrupted into a short sequence fragment, and a ligation end is added to the end of the restriction enzyme, and the DNA fragment is subjected to single-end sequencing.
  • the sequencing read length is generally 50 nt or 100 nt.
  • the RAD single-ended sequencing sequence is filtered to remove unqualified sequencing sequences. For example, after receiving a high-throughput RAD single-ended sequencing sequence, sequencing The column is filtered to remove the unqualified sequence. High-throughput sequencing technology can be
  • Illumina GA sequencing technology can also be used for other high-throughput sequencing technologies available.
  • the unqualified sequencing sequence includes, for example, that the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequence is considered to be a defective sequence.
  • the low quality threshold is determined by the specific sequencing technology and sequencing environment, for example, the single base sequencing quality is lower than 20; the number of bases with undefined sequencing results in the sequencing sequence (such as N in Illumina GA sequencing results) exceeds the whole number. 10% of the number of bases in the sequencing sequence is considered to be a non-conforming sequence; in addition to the sample linker sequence, it is aligned with other exogenous sequences introduced by experiments, such as various linker sequences.
  • exogenous sequence is present in the sequence, it is considered to be a non-conforming sequence; in the sequencing sequence, if the first few bases are not the end-cutting sequence, then it is filtered out (such as the restriction endonuclease Ecorl, if the sequencing sequence is not at the beginning) AATTC "filters out the entire sequencing sequence.”
  • Step 406 performs statistics on the sequencing sequences of the same sequence to obtain depth information of each sequencing sequence. For example, sequencing sequences with the same sequence are counted statistically, and each sequencing sequence is assembled into a stack, so that the sequencing depth information of each sequencing sequence can be obtained.
  • the specific process is shown in Figure 6.
  • the information of the heap can be in the manner of FIG. 7.
  • the first column indicates the RAD sequencing sequence information
  • the second column indicates the number of times the sequence is sequenced, that is, the depth information
  • the third column indicates The ID of the sequence information.
  • Step 408 filters out the sequencing sequence with a sequencing depth of one. Sequences with a depth of 1 are usually caused by sequencing errors, filtering out sequence information with a depth of 1 and reducing SNPs due to sequencing errors.
  • Step 410 performs a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and clusters all the sequencing sequences satisfying the alignment conditions.
  • the number of mismatches allowed during the alignment is determined by the length of the sequence. For example, if the sequencing length is less than 50 bases (nt), the allowable number of mismatches is 1, and the sequencing length is lOOnt. The number of mismatches is 2. Specifically, for example, when the sequencing length is less than 50 nt In this case, only two sequences with different Wichi are grouped together.
  • Step 412 selects a threshold based on the depth value of each clustering result, and the sequencing sequence above the threshold is determined as a repeating sequence.
  • Step 414 obtains the repeat sequence content of the individual genome based on the determined repeat sequence.
  • obtaining the repeat sequence content of the individual genome according to the determined repeat sequence comprises: counting the number of clustering results higher than the threshold, multiplying the length of the sequencing sequence, to obtain the total length of the single copy of the repeated sequence; The average sequencing depth of the repeated sequence is divided by the average sequencing depth of the non-repetitive sequence to obtain the copy number of the repeated sequence; and the total length of the single copy of the repeated sequence is multiplied by the copy number of the repeated sequence to obtain the total length of the repeated sequence of the RAD sequencing position; The total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat sequence and the non-repetitive sequence is the repeat sequence content of the RAD sequencing position, that is, the repeat sequence content of the individual genome is approximately obtained.
  • RAD sequencing by directly processing the RAD sequencing sequence data, searching for the repetitive sequence on the RAD fragment, further obtaining the information of the repetitive sequence content, and not relying on the data information of the known genome, overcoming some technical bottlenecks of the traditional method for obtaining the repetitive sequence content.
  • RAD sequencing specific regions of the genome will be enriched and sequenced, which will reduce the amount of data sequencing, and reduce the computational resources and sequencing costs required for analysis due to different analytical methods and reduced data volume.
  • an embodiment of the present invention proposes a new method for determining a repeat sequence, the basic idea of which is: performing a pairwise alignment of unallowable gaps between each sequencing sequence, using an alignment
  • the software can be any sequence alignment software, such as blast, blat, etc.; all the sequencing sequences satisfying the matching condition of the mismatch can be clustered; the threshold is selected according to the depth value of each clustering result, above the threshold
  • the sequencing sequence was determined to be a repeat sequence.
  • the depth value according to each clustering result The selection threshold includes: counting the depth value of each clustering result, and then counting the number of sequencing sequences at each sequencing depth, and selecting a certain depth value as the threshold according to the depth value and the number of sequencing sequences.
  • the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for the number of sequencing sequences in all clustering results.
  • the number of sequencing sequences less than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.
  • the specific process is shown in Figure 8:
  • step 802 a pairwise alignment of the unallowable gaps is performed between each of the sequencing sequences, and all the sequencing sequences satisfying the alignment conditions are clustered.
  • Step 804 calculating a depth value of each clustering result.
  • Step 806 counting the number of sequencing sequences at each sequencing depth.
  • Step 808 Select a depth value as the threshold according to the depth value and the number of sequencing sequences.
  • the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results; in an embodiment of the present invention The number of sequencing sequences smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.
  • the sequencing sequence above the threshold is determined to be a repeat sequence.
  • the amount of calculation is small, the speed is fast, and the efficiency is high, which simplifies the processing steps in the conventional manner.
  • the RAD sequencing method is equivalent to randomly extracting some fragments of the genomic DNA sequence, and analyzing the RAD sequencing fragments to obtain the repeat sequence content of the RAD sequencing fragment. Since the RAD sequencing method is capable of measuring sequence information from three to six percent of the genome, the sample size of the sample is large. This allows the repeat sequence content at the sequencing position to approximate the repeat content of the entire genome.
  • Figure 10 shows a method for estimating the genomic re-listing content of the present invention. A schematic of the use case. The data of this example used wild ⁇ white, flowering ⁇ white, common ⁇ white RAD sequencing sequence data (ie, reads data). Among them, the RAD sequencing method is a method well known in the art, for example, the following documents can be referred to:
  • step 1002 the three kinds of white sequencing read data are filtered according to the sequencing quality value, the N content, and whether the end-cut sequence is included, and the unqualified sequencing sequence is removed.
  • the valid data statistics are shown in Table 1. Table 1 Three kinds of white RAD sequencing effective data statistics
  • step 1004 the sequencing sequences with the same sequence are statistically counted to obtain the depth of each sequencing sequence, and the sequencing sequence with a sequencing depth of 1 is filtered out.
  • Table 2 three kinds of white reading data statistics
  • step 1006 the sequencing sequence data with the same sequence are subjected to pairwise alignment, and all the sequencing sequences satisfying the alignment conditions are clustered.
  • the mismatch number allowed for the alignment is, for example, 1, that is, the alignment condition is that only one base is different between the two sequences, and the two sequences are classified into one class. If there is only one radii between the A sequence and the B sequence, and there is only one other base between B and C, then the three sequences are grouped into one class, and so on, through the alignment between all sequencing sequences. , all sequencing sequences that satisfy the alignment conditions can be clustered.
  • step 1008 the depth of each clustering result is counted, and the number of sequencing sequences at each sequencing depth is counted.
  • a sequencing depth profile of the sequencing sequence is made. As shown in Figure 9.
  • Step 1010 Select a depth value as a threshold according to the depth value and the number of sequencing sequences. Specifically, the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results. The sequencing sequence in the clustering result above the threshold is determined to be a repeating sequence.
  • the threshold selected in this embodiment is a depth value of 100. Accordingly, sequencing sequences below the threshold are determined to be non-repetitive sequences.
  • Step 1012 Obtain a sequence repeat content of the genome according to the determined repeat sequence. Specifically, the method includes: counting the number of clustering results above the threshold, multiplying the length of the sequencing sequence, and obtaining the total length of the single copy of the repeated sequence; dividing the average sequencing depth of the repeated sequence by the average sequencing depth of the non-repetitive sequence to obtain the repeated sequence Copy number; multiply the total length of the single copy of the repeat sequence by the copy number of the repeat sequence to obtain the total length of the repeat sequence of the RAD sequencing position; the total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat and non-repetitive sequences The percentage obtained is the repeat sequence content of the RAD sequencing position, i.e., the heavy f-column content of the genome is approximately obtained.
  • the average sequencing depth of the repeated sequence refers to the number of all sequencing sequences corresponding to the repeated sequence divided by the number of clustering results of the repeated sequence; the average sequencing depth of the non-repetitive sequence refers to the number of all sequencing sequences corresponding to the non-repetitive sequence divided by the non-repetitive sequence The number of clustering results. The number of clustering results below the threshold is counted, multiplied by the length of the sequencing sequence to obtain the total length of the non-repetitive sequence.
  • FIG. 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention.
  • the apparatus comprises: a sequencing sequence acquisition device 111 for obtaining a RAD single-end sequencing sequence of a certain body genome.
  • the sequencing sequence filtering device 112 filters the obtained RAD single-ended sequencing sequence to remove unqualified sequencing sequences.
  • the unqualified sequencing sequence includes, for example, a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the base in which the sequencing result is indeterminate in the sequencing sequence. a sequencing sequence having a number exceeding 10% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which the exogenous sequence is present; and/or a starting sequence of several bases that is not a restriction endonuclease sequence.
  • the sequencing depth determining device 113 performs statistics on the sequencing sequences having the same sequence to obtain depth information of each sequencing sequence.
  • a sequence depth filtering device 114 is used to filter out sequencing sequences with a sequencing depth of one.
  • the clustering device 115 is configured to perform a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and cluster all the sequencing sequences satisfying the alignment conditions.
  • a repeating sequence determining device 116 configured to select a threshold according to a depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeating sequence;
  • a repeating sequence content obtaining device 117 configured to obtain the individual genome according to the determined repeating sequence Repeat sequence content.
  • Figure 12 is a block diagram showing another embodiment of the apparatus for estimating the content of a genomic repeat of the present invention.
  • the repeated sequence determining device 116 includes: a cluster depth statistic unit 1161 for counting the depth of each clustering result; a sequencing depth distribution statistic unit 1162 for counting the sequencing at each sequencing depth
  • the threshold number selecting unit 1163 is configured to select a certain depth value as a threshold according to the depth value and the number of sequencing sequences, and the sequencing sequence higher than the threshold is determined as a repeated sequence.
  • the condition that the depth value as the threshold needs to be satisfied is: The number of sequencing sequences smaller than the depth value accounts for 94% - 96% of the number of sequencing sequences in all clustering results.
  • the number of sequencing sequences less than the depth value is 95% of the number of sequencing sequences in all clustering results.
  • Figures 11 and 12 can be implemented by separate computing processing devices or integrated into a single device implementation. They are shown in boxes in Figures 11 and 12 to illustrate their function. These functional blocks can be implemented in hardware, software, firmware, middleware, microcode, hardware description speech, or any combination thereof. For example, one or both of the functional blocks can be implemented with code running on a microprocessor, digital signal processor (DSP), or any other suitable computing device.
  • a code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements.
  • the code can be located on a computer readable medium.
  • the computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, mobile hard disk, CD-ROM, or any other form known in the art. Storage medium.
  • the computer readable medium can also include a carrier wave that encodes the data signal.
  • the method and device for estimating the content of genomic repeat sequences use RAD sequencing method to sequence part of the genome, determine the repeated sequence by cluster analysis, achieve the purpose of genome survey, and do not need other genomic data to simulate, simplifying The complexity of genome analysis processing, while saving computing resources and reducing sequencing costs.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Zoology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Wood Science & Technology (AREA)
  • Bioethics (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed is a method for estimating the repeating sequence content of a genome. The method is carried out by filtration, statistical analysis, alignment, clustering and threshold selection of single-end RAD sequences for sequencing of an individual genome. Further disclosed is a device for estimating the repeating sequence content of a genome.

Description

估计基因组重复序列含量的方法和装置 技术领域  Method and apparatus for estimating genomic repeat content
本发明涉及生物信息学技术领域, 尤其涉及一种估计基因组 重复序列含量的方法和装置。 背景技术  The present invention relates to the field of bioinformatics technology, and more particularly to a method and apparatus for estimating the content of a genomic repeat sequence. Background technique
重复序列指的是这样一段 DNA片段, 该序列在个体基因组上 存在多个拷贝。  A repeat sequence refers to a segment of DNA that has multiple copies on an individual's genome.
第二代 DNA测序技术是一种高通量低成本的测序技术, 基本 原理是边合成边测序。 以 solexa测序方法为例, 先用物理方法将 DNA链随机打断, 然后在片段两端加上特定接头, 接头上有扩增 引物序列。 测序时, DNA聚合酶合成待测片段的互补链, 通过检 测新合成碱基所携带的荧光信号读取 序列, 从而获得待测片 段的序列, 这些序列称为测序片段或测序序列 ( reads ) ( http://www.illumina.com ) 。  The second generation of DNA sequencing technology is a high-throughput, low-cost sequencing technology. The basic principle is sequencing while synthesizing. Taking the solexa sequencing method as an example, the DNA strand is randomly interrupted by physical means, and then a specific linker is added to both ends of the fragment, and an amplified primer sequence is added to the linker. When sequencing, DNA polymerase synthesizes the complementary strand of the fragment to be tested, and reads the sequence of the fragment to be tested by detecting the fluorescent signal carried by the newly synthesized base. These sequences are called sequencing fragments or reads ( Http://www.illumina.com ).
对一个物种的 DNA分子进行测序并进行重新(Denovo )组 装一般需要对物种的序列情况先有个大概的了解。 由于序列组装 是通过测序片段之间的重叠关系来还原基因组的序列信息。 在这 种情况下, 如果重复序列含量过高的话, 利用全基因组鸟枪法获 得的测序数据进行 Denovo组装的效果不会太理想。 因此通常需 要在 Denovo组装前进行基因组勘测 ( Genome Survey ) , 以了 解基因组的重复序列含量。  Sequencing and re-engineering a species of DNA molecules generally requires a general understanding of the sequence of the species. Since sequence assembly is to restore the sequence information of the genome by overlapping the overlapping segments. In this case, if the repeat sequence content is too high, the sequencing data obtained by the whole genome shotgun method will not be ideal for Denovo assembly. Therefore, it is often necessary to perform a Genome Survey prior to assembly in Denovo to understand the repeat content of the genome.
对基因组进行 Survey的传统方式需要进行全基因组测序, 测 序深度大概在 20~30x的数据之间。 在得到测序数据之后, 利用 reads数据得到 kmer频数分布图, 从而进行基因组杂合率的估 计。 具体方法为, 假设存在完整的连续序列, 随机选取片段长度 为 K, 该片段称为 kmer。 因此, 当 reads长度为 L, kmer长度 取为 K时, 则一个 reads上面可以得到 L-K+1个 kmer。 接着统 计所有 reads上不同种类 kmer出现的频数, 就可以得到 kmer频 率分布图。 具体过程如图 1所示。 The traditional approach to Surveying the genome requires whole-genome sequencing, with sequencing depths between approximately 20 and 30x. After the sequencing data is obtained, the kmer frequency distribution map is obtained by using the reads data, thereby estimating the genomic heterozygosity rate. The specific method is to assume that there is a complete continuous sequence and randomly select the length of the segment. For K, the fragment is called kmer. Therefore, when the length of reads is L and the length of kmer is taken as K, then L-K+1 kmer can be obtained on one read. Then, by counting the frequency of occurrence of different types of kmer on all reads, the kmer frequency distribution map can be obtained. The specific process is shown in Figure 1.
根据 Lander-Waterman统计, 基因组 kmer的频数分布可以 近似地认为 ^从泊松分布。 根据泊松分布的理论, 峰值对应的 测序深度即为基因组的平均测序深度。 对于二倍体而言, 如果基 因组重复序列含量比较高的话, 会在 kmer分布主峰后面出现重 复峰或者拖尾现象。 要估计基因组的重复序列含量话, 需要用其 他基因组的数据来进行模拟, 比如通过拟南芥的基因组 ^拟目 标基因组的重复序列含量。 在拟南芥中, 通过人为设置特定杂合 率, 生成与目标基因组测序深度一致的模拟 reads, 接着通过模 拟 reads得到 kmer频数分布图。 通过比较模拟生成的 kmer频数 分布与目标基因组 kmer分布的一致性, 设置不同的重复序列含 量, 从而估计出目标基因组的重复序列含量, 具体如图 2所示。  According to Lander-Waterman statistics, the frequency distribution of the genome kmer can be approximated as ^ from the Poisson distribution. According to the Poisson distribution theory, the depth of sequencing corresponding to the peak is the average sequencing depth of the genome. For diploids, if the genomic repeat sequence is relatively high, repeated peaks or tailing will occur behind the main peak of the kmer distribution. To estimate the repetitive sequence content of the genome, it is necessary to use other genomic data to simulate, for example, the repetitive sequence content of the genome of the Arabidopsis genome. In Arabidopsis, the artificial reads are artificially set to generate simulated reads that are consistent with the depth of sequencing of the target genome, and then the kmer frequency distribution map is obtained by simulating reads. By comparing the consistency of the kmer frequency distribution generated by the simulation with the kmer distribution of the target genome, different repeat sequences are set to estimate the repeat sequence content of the target genome, as shown in Figure 2.
由于这种传统的基因组勘测方法需要进行全基因组测序, 测 序深度大概在 20~30x的数据之间, 因此成本比较高; 由于测序数 据量大, 在处理数据的时候需要较多的计算资源; 而且需要已知 基因组的数据进行模拟, 进一步增加了处理步骤和数据处理量。 因此亟需一种新的基因组勘测方法, 利用较少的测序数据量即可 方便地估计出基因组的重复序列含量, 以降低传统方法所需要的 极高的测序成本和计算资源成本。 发明内容  Because this traditional genomic survey method requires whole-genome sequencing, the sequencing depth is between 20~30x, so the cost is relatively high; because of the large amount of sequencing data, more computing resources are needed when processing data; Data from known genomes are required for simulation, further increasing processing steps and data throughput. Therefore, there is a need for a new genomic survey method that can easily estimate the repeat content of the genome with less sequencing data to reduce the high sequencing cost and computational resource cost required by traditional methods. Summary of the invention
鉴于以上问题提出本发明。  The present invention has been made in view of the above problems.
本发明的第一方面提供了一种估计基因组重复序列含量的 方法, 包括: 获得某个体基因组的 RAD 单端测序序列 ( reads ); 对 RAD 单端测序序列进行过滤以去除不合格的测序 序列; 对序列相同的测序序列进行统计, 得到每种测序序列的深 度信息; 过滤掉测序深度为 1 的测序序列; 在得到的每种测序序 列之间进行不容许空隙的两两比对, 将所有满足比对条件的测序 序列进行聚类; 根据每个聚类结果的深度值选取阈值, 高于阈值 的测序序列确定为重复序列; 根据确定的重复序列得到该个体基 因组的重复序列含量。 A first aspect of the invention provides a method of estimating the content of a genomic repeat sequence, comprising: obtaining a RAD single-end sequencing sequence of a somatic genome (reads); filtering the RAD single-ended sequencing sequence to remove the unqualified sequencing sequence; counting the sequencing sequences with the same sequence to obtain the depth information of each sequencing sequence; filtering out the sequencing sequence with a sequencing depth of 1; Each pair of sequencing sequences is subjected to a pairwise alignment of unallowable gaps, and all sequencing sequences satisfying the alignment conditions are clustered; a threshold is selected according to the depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeat sequence; the repeat sequence content of the individual genome is obtained based on the determined repeat sequence.
优选地, 所述不容许空隙的两两比对的容许的错配数根据测 序序列的长度确定, 即根据测序序列的长度确定不容许空隙的两 两比对的比对条件。  Preferably, the number of allowable mismatches for the pairwise alignment of the non-allowing gaps is determined according to the length of the sequence of the sequences, i.e., the alignment conditions of the pairwise alignments that do not allow the gaps are determined based on the length of the sequencing sequence.
优选地, 所述根据每个聚类结果的深度值选取阈值包括: 统 计每一个聚类结果的深度值; 进而统计每一个测序深度下的测序 序列个数; 根据深度值和测序序列个数选取某一深度值作为阈 值,  Preferably, the selecting a threshold according to the depth value of each clustering result comprises: counting a depth value of each clustering result; and further counting the number of sequencing sequences in each sequencing depth; selecting according to the depth value and the number of sequencing sequences a certain depth value as a threshold,
所述作为阈值的深度值需要满足的条件是: 小于该深度值的 测序序列个数占所有聚类结果中测序序列个数的 94%-96%; 在 本发明的一个实施方案中, 小于该深度值的测序序列个数占所有 聚类结果中测序序列个数的 95%。  The condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results; in one embodiment of the present invention, less than the The number of sequencing sequences of depth values accounted for 95% of the number of sequencing sequences in all clustering results.
优选地, 不合格的测序序列包括: 测序质量低于预定的低质 量阈值的碱基个数超过整条测序序列碱基个数的 50%的测序序 列; 和 /或测序序列中测序结果不确定的碱基个数超过整条测序序 列碱基个数的 10%的测序序列; 和 /或存在外源序列的测序序 列; 和 /或起始的几个碱基不是酶切末端序列的测序序列。  Preferably, the unsuccessful sequencing sequence comprises: a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the sequencing result in the sequencing sequence is uncertain a sequence in which the number of bases exceeds 10% of the number of bases of the entire sequencing sequence; and/or a sequence in which the exogenous sequence is present; and/or a sequence in which the first few bases are not the ends of the restriction endonuclease .
优选地, 根据确定的重复序列得到该个体基因组的重复序列 含量包括: 统计高于阈值的聚类结果数, 乘以测序序列的长度, 得到重复序列单拷贝的总长度; 用重复序列的平均测序深度除以 非重复序列的平均测序深度, 得到重复序列的拷贝数; 用重复序 列单拷贝的总长度乘以重复序列的拷贝数, 得到 RAD 测序位置 的重复序列总长度; 测序序列中重复序列的总长度除以重复序列 和非重复序列总长度的和所得到的百分比为 RAD 测序位置的重 复序列含量, 即近似地得到该个体基因组的重复序列含量。 Preferably, obtaining the repeat sequence content of the individual genome according to the determined repeat sequence comprises: counting the number of clustering results above the threshold, multiplying the length of the sequencing sequence to obtain the total length of the single copy of the repeat sequence; using the average sequencing of the repeat sequence Depth is divided by the average sequencing depth of the non-repetitive sequence to obtain the copy number of the repeat sequence; The total length of the single copy is multiplied by the copy number of the repeat sequence to obtain the total length of the repeat sequence of the RAD sequencing position; the total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat and non-repetitive sequences is the RAD The repeat sequence content at the sequencing position, i.e., approximately the repeat sequence content of the individual genome.
本发明的另一方面提供了一种估计基因组重复序列含量的 装置, 包括: 测序序列获取设备, 用于获得某个体基因组的 RAD 单端测序序列; 测序序列过滤设备, 用于对获得的 RAD单端测 序序列进行过滤以去除不合格的测序序列; 测序深度确定设备, 用于统计序列相同的测序序列, 得到每种测序序列的深度信息; 序列深度过滤设备, 用于过滤掉测序深度为 1 的测序序列; 聚类 设备, 用于在得到的每种测序序列之间进行不容许空隙的两两比 对, 将所有满足比对条件的测序序列进行聚类; 重复序列确定设 备, 用于根据每个聚类结果的深度值选取阈值, 高于阈值的测序 序列确定为重复序列; 重复序列含量获取设备, 用于根据确定的 重 列得到该个体基因组的重复序列含量。  Another aspect of the present invention provides an apparatus for estimating a genomic repeat sequence content, comprising: a sequencing sequence acquisition device for obtaining a RAD single-end sequencing sequence of a certain genome; a sequencing sequence filtering device for obtaining an RAD single The end sequencing sequence is filtered to remove the unqualified sequencing sequence; the sequencing depth determining device is used to count the sequencing sequence with the same sequence to obtain the depth information of each sequencing sequence; the sequence depth filtering device is used to filter out the sequencing depth 1 a sequencing sequence; a clustering device for performing a pairwise alignment of unacceptable gaps between each of the obtained sequencing sequences, clustering all sequencing sequences satisfying the alignment conditions; a repeating sequence determining device for each The depth value of the clustering result is selected as a threshold, and the sequencing sequence above the threshold is determined as a repeating sequence; and the repeating sequence content obtaining device is configured to obtain the repeating sequence content of the individual genome according to the determined re-listing.
优选地, 所述不容许空隙的两两比对的容许的错配数根据测 序序列的长度确定, 即根据测序序列的长度确定不容许空隙的两 两比对的比对条件。  Preferably, the number of allowable mismatches for the pairwise alignment of the non-allowing gaps is determined according to the length of the sequence of the sequences, i.e., the alignment conditions of the pairwise alignments that do not allow the gaps are determined based on the length of the sequencing sequence.
优选地, 重复序列确定设备包括: 聚类结果深度统计单元, 用于统计每一个聚类结果的深度值; 测序深度分布统计单元, 用 于统计每一个测序深度下的测序序列个数; 阈值选取单元, 用于 根据深度值和测序序列个数选取某一深度值作为阈值, 高于阈值 的测序序列确定为重复序列; 所述作为阈值的深度值需要满足的 条件是: 小于该深度值的测序序列个数占所有聚类结果中测序序 列个数的 94%-96%。 在本发明的一个实施方案中, 小于该深度 值的测序序列个数占所有聚类结果中测序序列个数的 95%。  Preferably, the repeating sequence determining device comprises: a clustering result depth statistic unit for counting the depth value of each clustering result; a sequencing depth distribution statistic unit for counting the number of sequencing sequences at each sequencing depth; threshold selection a unit, configured to select a depth value as a threshold according to the depth value and the number of sequencing sequences, and the sequencing sequence above the threshold is determined as a repeated sequence; the depth value to be satisfied as the threshold is: a sequence smaller than the depth value The number of sequences accounted for 94%-96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences less than the depth value is 95% of the number of sequencing sequences in all clustering results.
优选地, 不合格的测序序列包括: 测序质量低于预定的低质 量阈值的碱基个数超过整条测序序列碱基个数的 50%的测序序 列; 和 /或测序序列中测序结果不确定的碱基个数超过整条测序序 列碱基个数的 10%的测序序列; 和 /或存在外源序列的测序序 列; 和 /或起始的几个碱基不是酶切末端序列的测序序列。 Preferably, the unqualified sequencing sequence comprises: the sequencing quality is lower than a predetermined low quality The number of bases of the threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the number of bases whose sequencing results are undefined in the sequencing sequence exceeds 10% of the number of bases of the entire sequencing sequence a sequencing sequence; and/or a sequencing sequence in which a foreign sequence is present; and/or a starting sequence of several bases that are not a restriction endonuclease sequence.
本发明的一个优点在于, 通过基因组的部分测序即可方便 地估计出基因组的重复序列含量, 降低了测序成本和计算资源成 本, 同时不需要已知的基因组数据进行模拟, 简化了处理步骤。  An advantage of the present invention is that the partial sequence of the genome allows for easy estimation of the replied sequence content of the genome, reduces sequencing costs and computational resource costs, and does not require known genomic data for simulation, simplifying the processing steps.
通过以下参照附图对本发明的示例性实施例的详细描述, 本发明的其它特征及其优点将会变得清楚。 附图说明  Other features and advantages of the present invention will become apparent from the Detailed Description of the <RTI DRAWINGS
图 1示出现有技术中通过测序 reads得到 kmer频数分布图 的流程示意图;  1 is a schematic flow chart showing a kmer frequency distribution map by sequencing reads in the prior art;
图中的横坐标表示 kmer 的测序深度, 纵坐标表示具有某一 特定测序深度的 kmer种类数占总的 kmer种类数的百分比;  The abscissa in the figure represents the sequencing depth of kmer, and the ordinate represents the percentage of the kmer species with a certain sequencing depth as a percentage of the total kmer species;
图 2 示出现有技术中通过拟南芥基因组模拟目标基因组杂合 率的示意图;  Figure 2 is a schematic diagram showing the hybridization rate of a target genome by the Arabidopsis genome in the prior art;
图 3示出 RAD测序技术的各个步骤的示意图;  Figure 3 shows a schematic diagram of the various steps of the RAD sequencing technique;
图 4示出本发明的估计基因组重复序列含量的方法的一个实 施例的流程图;  Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention;
图 5示出基因组的 RAD单端测序的一个例子的示意图; 图 6示出测序序列的深度信息统计示意图;  Figure 5 is a schematic diagram showing an example of RAD single-end sequencing of a genome; Figure 6 is a schematic diagram showing the depth information of a sequencing sequence;
图 7示出测序序列的深度信息存储示意图;  Figure 7 is a schematic diagram showing the depth information storage of a sequencing sequence;
图 8示出本发明的重复序列确定的一个例子的流程图; 图 9示出本发明的估计基因组重复序列含量的方法的一个应 用例的测序深度分布图;  Figure 8 is a flow chart showing an example of the repeat sequence determination of the present invention; Figure 9 is a view showing a sequencing depth profile of an application of the method for estimating the genomic repeat sequence of the present invention;
图 10 示出本发明的估计基因组重复序列含量的方法的一个 应用例的示意图; Figure 10 shows one of the methods of estimating the genomic repeat content of the present invention. Schematic diagram of an application example;
图 11 示出本发明的估计基因组重复序列含量的装置的一个 实施例的结构图;  Figure 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention;
图 12 示出本发明的估计基因组重复序列含量的装置的另一 个实施例的结构图。 具体实施方式  Fig. 12 is a view showing the structure of another embodiment of the apparatus for estimating the genomic repeat sequence of the present invention. detailed description
现在将参照附图来详细描述本发明的各种示例性实施例。 应注意到: 除非另外具体说明, 否则在这些实施例中阐述的部 件和步驟的相对布置、 数字表达式和数值不限制本发明的范 围。  Various exemplary embodiments of the present invention will now be described in detail with reference to the drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in the embodiments are not intended to limit the scope of the invention unless otherwise specified.
同时, 应当明白, 为了便于描述, 附图中所示出的各个部 分的尺寸并不是按照实际的比例关系绘制的。  In the meantime, it should be understood that the dimensions of the various parts shown in the drawings are not drawn in the actual scale relationship for the convenience of the description.
以下对至少一个示例性实施例的描述实际上仅仅是说明性 的, 决不作为对本发明及其应用或使用的任何限制。  The following description of the at least one exemplary embodiment is merely illustrative and is in no way
对于相关领域普通技术人员已知的技术、 方法和设备可能 不作详细讨论, 但在适当情况下, 技术、 方法和设备应当被视 为授权说明书的一部分。  Techniques, methods and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods and apparatus should be considered as part of the authorization specification.
在这里示出和讨论的所有示例中, 任何具体值应被解释为 仅仅是示例性的, 而不是作为限制。 因此, 示例性实施例的其 它示例可以具有不同的值。  In all of the examples shown and discussed herein, any specific values are to be construed as illustrative only and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
应注意到: 相似的标号和字母在下面的附图中表示类似 项, 因此, 一旦某一项在一个附图中被定义, 则在随后的附图 中不需要对其进行进一步讨论。  It should be noted that similar reference numerals and letters indicate similar items in the following figures, and therefore, once an item is defined in a drawing, it is not required to be further discussed in the subsequent drawings.
针对现有技术的问题, 本公开提供了一种新的生物信息学分 析方案, 处理 RAD ( Eestriction-site Associated ΝΑ, P艮制性内 切位点相关 DNA )数据, 寻找 RAD 测序片段上的重复序列信 息, 以计算重复序列含量, 简化了现有技术中的处理步骤, 也降 低了测序成本和计算资源成本。 In response to the problems of the prior art, the present disclosure provides a new bioinformatics analysis program that processes RAD (Eestriction-site Associated ΝΑ, P艮 内 相关 相关 ) )) data to find duplicates on RAD sequencing fragments Sequence letter In order to calculate the repeat sequence content, the processing steps in the prior art are simplified, and the sequencing cost and the calculation resource cost are also reduced.
下面介绍几个本发明的技术方案涉及的概念。  Several concepts related to the technical solution of the present invention are described below.
RAD测序技术采用了新的建库方式, 其测序具体过程如图 3 所示, 用限制性内切酶切断 DNA特定的位点, 再用物理方法将 酶切之后的 DNA分子随机打断, 通过琼脂糖胶 DNA分离技术挑 选特定长度的 DNA分子, 然后在挑选出来的 DNA末端添加特定 的扩增接头与测序接头, 从而构建上机文库进行高通量测序。  RAD sequencing technology adopts a new database construction method. The specific process of sequencing is shown in Figure 3. The specific site of DNA is cleaved by restriction endonuclease, and the DNA molecules after digestion are randomly interrupted by physical methods. The agarose gel DNA separation technique selects a DNA molecule of a specific length, and then adds a specific amplification linker and a sequencing linker at the end of the selected DNA to construct a library for high-throughput sequencing.
重复序列含量是指测序序列中重复序列的总长度除以重复序 列和非重复序列总长度的和所得到的百分比。  The repeat sequence content refers to the total length of the repeat sequences in the sequencing sequence divided by the sum of the sum of the total length of the repeat and non-repetitive sequences.
不容许空隙的两两比对是指比对的时候不容许开空位。 即不 考虑开空位比对上的情况, 例如以下两条序列的比对结果就不满 足不容许空隙的两两比对条件:  The pairwise alignment that does not allow the gap means that the vacancy is not allowed when the alignment is performed. That is, the situation of the open space alignment is not considered. For example, the comparison result of the following two sequences does not satisfy the two-two comparison condition that does not allow the gap:
序列 1: AATTCATCGAC  Sequence 1: AATTCATCGAC
序列 2: AA CATCGTC。  Sequence 2: AA CATCGTC.
Reads是指测序仪产生的测序片段。  Reads refers to the sequencing fragments produced by the sequencer.
图 4示出本发明的估计基因组重复序列含量的方法的一个实 施例的流程图。  Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention.
如图 4所示, 步骤 402, 获得某个体基因组的 RAD单端测序 序列。 图 5示出了 RAD单端测序的一个例子的示意图。 在图 5 中显示了用 P艮制性内切酶 Ecorl , 识别 DNA分子上 "GA AATTC" 的回文序列, 并在 G与 A之间将 DNA分子切断, 将酶切后的 DNA分子用物理方法打断成短的序列片段, 并在其中酶切的一端 加上接头并对 DNA 片段进行单末端测序, 测序读长一般为 50nt, 也可以为 100nt。 As shown in Figure 4, step 402, a RAD single-end sequencing sequence of a certain body genome is obtained. Figure 5 shows a schematic of an example of RAD single-ended sequencing. In Figure 5, the palindrome sequence of "G A AATTC" on the DNA molecule is identified by the P-type endonuclease Ecorl, and the DNA molecule is cleaved between G and A, and the digested DNA molecule is used. The physical method is interrupted into a short sequence fragment, and a ligation end is added to the end of the restriction enzyme, and the DNA fragment is subjected to single-end sequencing. The sequencing read length is generally 50 nt or 100 nt.
步骤 404, 对 RAD单端测序序列进行过滤以去除不合格的测 序序列。 例如, 接收到高通量 RAD单端测序序列后, 对测序序 列进行过滤, 去除不合格的序列。 其中高通量测序技术可以为In step 404, the RAD single-ended sequencing sequence is filtered to remove unqualified sequencing sequences. For example, after receiving a high-throughput RAD single-ended sequencing sequence, sequencing The column is filtered to remove the unqualified sequence. High-throughput sequencing technology can be
Illumina GA 测序技术, 也可以为现有的其他高通量测序技术。 不合格测序序列例如包括: 测序质量低于预定的低质量阈值的碱 基个数超过整条序列碱基个数的 50%则认为是不合格序列。 低质 量阈值由具体测序技术及测序环境而定, 例如设定为单碱基测序 质量低于 20; 测序序列中测序结果不确定的碱基(如 Illumina GA测序结果中的 N )个数超过整条测序序列碱基个数的 10%则 认为是不合格序列; 除样本接头序列外, 与其它实验引入的外源 序列比对, 如各种接头序列。 若序列中存在外源序列则认为是不 合格序列; 在测序序列中, 若起始的几个碱基不是酶切末端序列 则过滤掉 (如限制性内切酶 Ecorl , 测序序列开头若不是 "AATTC"则过滤掉整个测序序列)。 Illumina GA sequencing technology can also be used for other high-throughput sequencing technologies available. The unqualified sequencing sequence includes, for example, that the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequence is considered to be a defective sequence. The low quality threshold is determined by the specific sequencing technology and sequencing environment, for example, the single base sequencing quality is lower than 20; the number of bases with undefined sequencing results in the sequencing sequence (such as N in Illumina GA sequencing results) exceeds the whole number. 10% of the number of bases in the sequencing sequence is considered to be a non-conforming sequence; in addition to the sample linker sequence, it is aligned with other exogenous sequences introduced by experiments, such as various linker sequences. If the exogenous sequence is present in the sequence, it is considered to be a non-conforming sequence; in the sequencing sequence, if the first few bases are not the end-cutting sequence, then it is filtered out (such as the restriction endonuclease Ecorl, if the sequencing sequence is not at the beginning) AATTC "filters out the entire sequencing sequence."
步骤 406对序列相同的测序序列进行统计, 得到每种测序序 列的深度信息。 例如, 将序列相同的测序序列进行统计计数, 每 种测序序列集合为一堆(Stack ), 这样就可以得到每一种测序序 列的测序深度信息。 具体过程如图 6所示。 堆的信息可以以图 7 的方式^ ·, 在图 7 中, 第一列表示的是 RAD测序序列信息; 第二列表示的是该序列被测序的次数, 即深度信息; 第三列是该 序列信息的 ID。  Step 406 performs statistics on the sequencing sequences of the same sequence to obtain depth information of each sequencing sequence. For example, sequencing sequences with the same sequence are counted statistically, and each sequencing sequence is assembled into a stack, so that the sequencing depth information of each sequencing sequence can be obtained. The specific process is shown in Figure 6. The information of the heap can be in the manner of FIG. 7. In FIG. 7, the first column indicates the RAD sequencing sequence information; the second column indicates the number of times the sequence is sequenced, that is, the depth information; the third column indicates The ID of the sequence information.
步骤 408过滤掉测序深度为 1的测序序列。 深度为 1的测序 序列通常是由测序错误导致的, 过滤掉深度为 1 的测序序列信 息, 减少由于测序错误引起的 的 SNP位点。  Step 408 filters out the sequencing sequence with a sequencing depth of one. Sequences with a depth of 1 are usually caused by sequencing errors, filtering out sequence information with a depth of 1 and reducing SNPs due to sequencing errors.
步骤 410在得到的每种测序序列之间进行不容许空隙的两两 比对, 将所有满足比对条件的测序序列进行聚类。 比对的时候容 许的错配数随测序的长度来定, 例如在测序长度小于 50 个碱基 ( nt )的情况下, 容许的错配数为 1 , 测序长度在 lOOnt 的情况 下, 容许的错配数为 2。 具体地, 例如在测序长度小于 50nt的情 况下, 只有 1个威基不同的两条序列聚为一类。 Step 410 performs a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and clusters all the sequencing sequences satisfying the alignment conditions. The number of mismatches allowed during the alignment is determined by the length of the sequence. For example, if the sequencing length is less than 50 bases (nt), the allowable number of mismatches is 1, and the sequencing length is lOOnt. The number of mismatches is 2. Specifically, for example, when the sequencing length is less than 50 nt In this case, only two sequences with different Wichi are grouped together.
步骤 412根据每个聚类结果的深度值选取阈值, 高于阈值的 测序序列确定为重复序列。  Step 412 selects a threshold based on the depth value of each clustering result, and the sequencing sequence above the threshold is determined as a repeating sequence.
步骤 414根据确定的重复序列得到该个体基因组的重复序列 含量。  Step 414 obtains the repeat sequence content of the individual genome based on the determined repeat sequence.
在本发明的一个实施例中 , 根据确定的重复序列得到该个体 基因组的重复序列含量包括: 统计高于阈值的聚类结果数, 乘以 测序序列的长度, 得到重复序列单拷贝的总长度; 用重复序列的 平均测序深度除以非重复序列的平均测序深度, 得到重复序列的 拷贝数; 用重复序列单拷贝的总长度乘以重复序列的拷贝数, 得 到 RAD 测序位置的重复序列总长度; 测序序列中重复序列的总 长度除以重复序列和非重复序列总长度的和所得到的百分比为 RAD测序位置的重复序列含量, 即近似地得到该个体基因组的重 复序列含量。 上述实施例中, 通过直接处理 RAD 测序序列数据, 寻找 RAD片段上的重复序列, 进一步获得重复序列含量信息, 不依赖 于已知基因组的数据信息, 克服了传统获得重复序列含量方法的 一些技术瓶颈。 通过 RAD 测序方式将会对基因组的特定区域进 行富集测序, 从而降低了数据测序量, 并且由于分析方法的不同 和数据量的减少, 降低了分析所需的计算资源和测序成本。  In one embodiment of the present invention, obtaining the repeat sequence content of the individual genome according to the determined repeat sequence comprises: counting the number of clustering results higher than the threshold, multiplying the length of the sequencing sequence, to obtain the total length of the single copy of the repeated sequence; The average sequencing depth of the repeated sequence is divided by the average sequencing depth of the non-repetitive sequence to obtain the copy number of the repeated sequence; and the total length of the single copy of the repeated sequence is multiplied by the copy number of the repeated sequence to obtain the total length of the repeated sequence of the RAD sequencing position; The total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat sequence and the non-repetitive sequence is the repeat sequence content of the RAD sequencing position, that is, the repeat sequence content of the individual genome is approximately obtained. In the above embodiment, by directly processing the RAD sequencing sequence data, searching for the repetitive sequence on the RAD fragment, further obtaining the information of the repetitive sequence content, and not relying on the data information of the known genome, overcoming some technical bottlenecks of the traditional method for obtaining the repetitive sequence content. . By RAD sequencing, specific regions of the genome will be enriched and sequenced, which will reduce the amount of data sequencing, and reduce the computational resources and sequencing costs required for analysis due to different analytical methods and reduced data volume.
根据本发明的方法, 本发明的一个实施例提出一种新的确定 重复序列方法, 该方法的基本思路为: 在每种测序序列之间进行 不容许空隙的两两比对, 使用的比对软件可以是任何一款序列比 对软件, 如 blast、 blat等; 将所有满足容许错配的比对条件的测 序序列进行聚类; 根据每个聚类结果的深度值选取阈值, 高于阈 值的测序序列确定为重复序列。 所述根据每个聚类结果的深度值 选取阈值包括: 统计每一个聚类结果的深度值, 进而统计每一个 测序深度下的测序序列个数, 根据深度值和测序序列个数选取某 一深度值作为阈值。 所述作为阈值的深度值需要满足的条件是: 小于该深度值的测序序列个数占所有聚类结果中测序序列个数的According to the method of the present invention, an embodiment of the present invention proposes a new method for determining a repeat sequence, the basic idea of which is: performing a pairwise alignment of unallowable gaps between each sequencing sequence, using an alignment The software can be any sequence alignment software, such as blast, blat, etc.; all the sequencing sequences satisfying the matching condition of the mismatch can be clustered; the threshold is selected according to the depth value of each clustering result, above the threshold The sequencing sequence was determined to be a repeat sequence. The depth value according to each clustering result The selection threshold includes: counting the depth value of each clustering result, and then counting the number of sequencing sequences at each sequencing depth, and selecting a certain depth value as the threshold according to the depth value and the number of sequencing sequences. The condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for the number of sequencing sequences in all clustering results.
94%-96%; 在本发明的一个实施方案中, 小于该深度值的测序序 列个数占所有聚类结果中测序序列个数的 95%。 具体过程如图 8 所示: 94%-96%; In one embodiment of the invention, the number of sequencing sequences less than the depth value accounts for 95% of the number of sequencing sequences in all clustering results. The specific process is shown in Figure 8:
步骤 802 , 在每种测序序列之间进行不容许空隙的两两比 对, 将所有满足比对条件的测序序列进行聚类。  In step 802, a pairwise alignment of the unallowable gaps is performed between each of the sequencing sequences, and all the sequencing sequences satisfying the alignment conditions are clustered.
步骤 804, 统计每一个聚类结果的深度值。  Step 804, calculating a depth value of each clustering result.
步骤 806, 统计每一个测序深度下的测序序列个数。  Step 806, counting the number of sequencing sequences at each sequencing depth.
步骤 808, 根据深度值和测序序列个数选取某一深度值作为 阈值。 具体地, 所述作为阈值的深度值需要满足的条件是: 小于 该深度值的测序序列个数占所有聚类结果中测序序列个数的 94%-96%; 在本发明的一个实施方案中, 小于该深度值的测序序 列个数占所有聚类结果中测序序列个数的 95%。  Step 808: Select a depth value as the threshold according to the depth value and the number of sequencing sequences. Specifically, the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results; in an embodiment of the present invention The number of sequencing sequences smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.
步骤 810, 高于阈值的测序序列确定为重复序列。  At step 810, the sequencing sequence above the threshold is determined to be a repeat sequence.
通过上述实施例的比对和统计方法, 运算量小, 速度快、 效 率高, 简化了传统方式中的处理步驟。  Through the comparison and statistical methods of the above embodiments, the amount of calculation is small, the speed is fast, and the efficiency is high, which simplifies the processing steps in the conventional manner.
由于在基因组序列上, 重复序列的分布是比较均匀的, RAD 测序方法相当于随机抽取了基因组 DNA序列上的某些片段, 并 通 RAD测序片段的分析, 得到 RAD测序片段位置的重复序 列含量。 由于 RAD 测序方法能够测到基因组百分之三到百分之 六的序列信息, 因此, 抽样的样本容量大。 这样就可以用测序位 置的重复序列含量来近似估计整个基因组的重复序列含量。 图 10示出本发明的估计基因组重 列含量的方法的一个应 用例的示意图。 该实施例数据采用野生茭白, 开花茭白, 普通茭 白的 RAD测序序列数据(即 reads数据) 。 其中 RAD测序方法为 本领域公知的方法, 例如可参考以下文献: Since the distribution of the repeat sequences is relatively uniform in the genome sequence, the RAD sequencing method is equivalent to randomly extracting some fragments of the genomic DNA sequence, and analyzing the RAD sequencing fragments to obtain the repeat sequence content of the RAD sequencing fragment. Since the RAD sequencing method is capable of measuring sequence information from three to six percent of the genome, the sample size of the sample is large. This allows the repeat sequence content at the sequencing position to approximate the repeat content of the entire genome. Figure 10 shows a method for estimating the genomic re-listing content of the present invention. A schematic of the use case. The data of this example used wild 茭 white, flowering 茭 white, common 茭 white RAD sequencing sequence data (ie, reads data). Among them, the RAD sequencing method is a method well known in the art, for example, the following documents can be referred to:
(1) Michael R Miller, Tressa S Atwood, B Frank Eames, et al, RAD marker microarrays enable rapid mapping of  (1) Michael R Miller, Tressa S Atwood, B Frank Eames, et al, RAD marker microarrays enable rapid mapping of
zebrafishmutations, Genome Biology , 2007, 8(6):R105.1-R105.10; Zebrafishmutations, Genome Biology, 2007, 8(6): R105.1-R105.10;
(2) Michael R, Miller, Joseph P. Dnnhaiii, An gel Aniores5et al, Rnpicl and cost-effective polyniorphtsiii identificatloniiec! (2) Michael R, Miller, Joseph P. Dnnhaiii, An gel Aniores 5 et al, Rnpicl and cost-effective polyniorphtsiii identificatloniiec!
genotypieg iising restriction site nssoelat d DNA(RAD) markers, Genome Research, 2007? 17? 240-248? Genothypieg iising restriction site nssoelat d DNA(RAD) markers, Genome Research, 2007? 17? 240-248?
(3) Nathan A. Bairdl, Paul D. Etter, Tressa S. Atwood, et al, Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers, PLoS ONE, 2008,3(10), e3376, clot: 10,137! /joiiraaLpone.0003376,  (3) Nathan A. Bairdl, Paul D. Etter, Tressa S. Atwood, et al, Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers, PLoS ONE, 2008,3(10), e3376, clot: 10,137! /joiiraaLpone .0003376,
利用传统方法得知三种茭白的重复序列含量相差不大。  Using conventional methods, it is known that the content of the three white repeat sequences is not much different.
实施例具体操作流程如图 10所示, 步骤 1002, 将三种茭白 的测序 reads数据, 根据测序质量值, N的含量, 以及是否含有 酶切末端序列进行过滤, 去除不合格的测序序列, 得到的有效数 据统计如表 1所示。 表 1三种茭白 RAD测序有效数据统计  The specific operation flow of the embodiment is shown in FIG. 10, step 1002, the three kinds of white sequencing read data are filtered according to the sequencing quality value, the N content, and whether the end-cut sequence is included, and the unqualified sequencing sequence is removed. The valid data statistics are shown in Table 1. Table 1 Three kinds of white RAD sequencing effective data statistics
Figure imgf000012_0001
步骤 1004, 将序列相同的测序序列进行统计计数获得各个测 序序列的深度, 过滤掉测序深度为 1的测序序列。 结果如表 2所 示。 表 2、 三种茭白 reads数据统计
Figure imgf000012_0001
In step 1004, the sequencing sequences with the same sequence are statistically counted to obtain the depth of each sequencing sequence, and the sequencing sequence with a sequencing depth of 1 is filtered out. The results are shown in Table 2. Table 2, three kinds of white reading data statistics
Figure imgf000013_0001
步骤 1006, 将序列相同的测序序列数据进行两两比对, 将所 有满足比对条件的测序序列进行聚类。 比对容许的错配数例如为 1, 即比对条件为两条序列之间只有一个碱基不相同, 则这两条 序列归为一类。 如果 A序列与 B序列之间只有一个威基不相同, 而 B与 C之间只有另外一个碱基不相同, 则三条序列归为一类, 以此类推, 通过所有测序序列之间的比对, 可以将所有满足比对 条件的测序序列进行聚类。
Figure imgf000013_0001
In step 1006, the sequencing sequence data with the same sequence are subjected to pairwise alignment, and all the sequencing sequences satisfying the alignment conditions are clustered. The mismatch number allowed for the alignment is, for example, 1, that is, the alignment condition is that only one base is different between the two sequences, and the two sequences are classified into one class. If there is only one radii between the A sequence and the B sequence, and there is only one other base between B and C, then the three sequences are grouped into one class, and so on, through the alignment between all sequencing sequences. , all sequencing sequences that satisfy the alignment conditions can be clustered.
步骤 1008, 统计每一个聚类结果的深度, 并统计每一个测序 深度下的测序序列个数。 做出测序序列的测序深度分布图。 如图 9所示。  In step 1008, the depth of each clustering result is counted, and the number of sequencing sequences at each sequencing depth is counted. A sequencing depth profile of the sequencing sequence is made. As shown in Figure 9.
步骤 1010, 根据深度值和测序序列个数选取某一深度值作为 阈值。 具体地, 所述作为阈值的深度值需要满足的条件是: 小于 该深度值的测序序列个数占所有聚类结果中测序序列个数的 95%。 高于阈值的聚类结果中的测序序列确定为重复序列。  Step 1010: Select a depth value as a threshold according to the depth value and the number of sequencing sequences. Specifically, the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results. The sequencing sequence in the clustering result above the threshold is determined to be a repeating sequence.
本实施例中选取的阈值为深度值 100。 相应地, 低于阈值的 测序序列确定为非重复序列。 步骤 1012 , 根据确定的重复序列得到基因组的重复序列含 量。 具体包括: 统计高于阈值的聚类结果数, 乘以测序序列的长 度, 得到重复序列单拷贝的总长度; 用重复序列的平均测序深度 除以非重复序列的平均测序深度, 得到重复序列的拷贝数; 用重 复序列单拷贝的总长度乘以重复序列的拷贝数, 得到 RAD 测序 位置的重复序列总长度; 测序序列中重复序列的总长度除以重复 序列和非重复序列总长度的和所得到的百分比, 为 RAD 测序位 置的重复序列含量, 即近似地得到基因组的重 f列含量。 The threshold selected in this embodiment is a depth value of 100. Accordingly, sequencing sequences below the threshold are determined to be non-repetitive sequences. Step 1012: Obtain a sequence repeat content of the genome according to the determined repeat sequence. Specifically, the method includes: counting the number of clustering results above the threshold, multiplying the length of the sequencing sequence, and obtaining the total length of the single copy of the repeated sequence; dividing the average sequencing depth of the repeated sequence by the average sequencing depth of the non-repetitive sequence to obtain the repeated sequence Copy number; multiply the total length of the single copy of the repeat sequence by the copy number of the repeat sequence to obtain the total length of the repeat sequence of the RAD sequencing position; the total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat and non-repetitive sequences The percentage obtained is the repeat sequence content of the RAD sequencing position, i.e., the heavy f-column content of the genome is approximately obtained.
其中重复序列的平均测序深度是指重复序列对应的所有测序 序列数除以重复序列的聚类结果数; 非重复序列的平均测序深度 是指非重复序列对应的所有测序序列数除以非重复序列的聚类结 果数。 统计低于阈值的聚类结果数, 乘以测序序列的长度, 得到 非重复序列总长度.  The average sequencing depth of the repeated sequence refers to the number of all sequencing sequences corresponding to the repeated sequence divided by the number of clustering results of the repeated sequence; the average sequencing depth of the non-repetitive sequence refers to the number of all sequencing sequences corresponding to the non-repetitive sequence divided by the non-repetitive sequence The number of clustering results. The number of clustering results below the threshold is counted, multiplied by the length of the sequencing sequence to obtain the total length of the non-repetitive sequence.
综上, 通过以上步骤的处理并计算重复序列含量, 得到的结 果如表 3所示。  In summary, the results of the above steps are calculated and the content of the repeat sequence is calculated, and the results obtained are shown in Table 3.
表 3 重复区域聚类结果信息统计  Table 3 Repeating area clustering results information statistics
Figure imgf000014_0001
可以看出, 利用 RAD 测序技术对基因组进行抽样测序并估 计基因组的重复序列含量方法的结果与传统分析方法的结果一 致。 图 11 示出本发明的估计基因组重复序列含量的装置的一个 实施例的结构图。 如图 11 所示, 该装置包括: 测序序列获取设 备 111 , 获得某个体基因组的 RAD单端测序序列。 测序序列过滤 设备 112, 对获得的 RAD单端测序序列进行过滤以去除不合格的 测序序列。 不合格的测序序列例如包括: 测序质量低于预定的低 质量阈值的碱基个数超过整条测序序列碱基个数的 50%的测序序 列; 和 /或测序序列中测序结果不确定的碱基个数超过整条测序序 列碱基个数的 10%的测序序列; 和 /或存在外源序列的测序序 列; 和 /或起始的几个碱基不是酶切末端序列的测序序列。 测序深 度确定设备 113, 对序列相同的测序序列进行统计, 得到每种测 序序列的深度信息。 序列深度过滤设备 114, 用于过滤掉测序深 度为 1 的测序序列。 聚类设备 115, 用于在得到的每种测序序列 之间进行不容许空隙的两两比对, 将所有满足比对条件的测序序 列进行聚类。 重复序列确定设备 116, 用于根据每个聚类结果的 深度值选取阈值, 高于阈值的测序序列确定为重复序列; 重复序 列含量获取设备 117, 用于根据确定的重复序列得到该个体基因 组的重复序列含量。
Figure imgf000014_0001
It can be seen that the results of the method of sampling and sequencing the genome using the RAD sequencing technology and estimating the repeat content of the genome are consistent with the results of the conventional analysis method. Figure 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention. As shown in FIG. 11, the apparatus comprises: a sequencing sequence acquisition device 111 for obtaining a RAD single-end sequencing sequence of a certain body genome. The sequencing sequence filtering device 112 filters the obtained RAD single-ended sequencing sequence to remove unqualified sequencing sequences. The unqualified sequencing sequence includes, for example, a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the base in which the sequencing result is indeterminate in the sequencing sequence. a sequencing sequence having a number exceeding 10% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which the exogenous sequence is present; and/or a starting sequence of several bases that is not a restriction endonuclease sequence. The sequencing depth determining device 113 performs statistics on the sequencing sequences having the same sequence to obtain depth information of each sequencing sequence. A sequence depth filtering device 114 is used to filter out sequencing sequences with a sequencing depth of one. The clustering device 115 is configured to perform a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and cluster all the sequencing sequences satisfying the alignment conditions. a repeating sequence determining device 116, configured to select a threshold according to a depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeating sequence; a repeating sequence content obtaining device 117, configured to obtain the individual genome according to the determined repeating sequence Repeat sequence content.
图 12示出本发明的估计基因组重复序列含量的装置的另一 个实施例的结构图。 根据本发明的一个实施例, 重复序列确定设 备 116包括: 聚类深度统计单元 1161 , 用于统计每一个聚类结果 的深度; 测序深度分布统计单元 1162, 用于统计每一个测序深度 下的测序序列个数; 阈值选取单元 1163, 用于才艮据深度值和测序 序列个数选取某一深度值作为阈值, 高于阈值的测序序列确定为 重复序列。 所述作为阈值的深度值需要满足的条件是: 小于该深 度值的测序序列个数占所有聚类结果中测序序列个数的 94%- 96%。 在本发明的一个实施方案中, 小于该深度值的测序序列个 数占所有聚类结果中测序序列个数的 95%。 对于图 11、 12 中各个装置或单元的功能, 可以参考上文中 关于本发明方法的实施例中对应部分的说明, 为简洁起见, 在此 不再详述。 Figure 12 is a block diagram showing another embodiment of the apparatus for estimating the content of a genomic repeat of the present invention. According to an embodiment of the present invention, the repeated sequence determining device 116 includes: a cluster depth statistic unit 1161 for counting the depth of each clustering result; a sequencing depth distribution statistic unit 1162 for counting the sequencing at each sequencing depth The threshold number selecting unit 1163 is configured to select a certain depth value as a threshold according to the depth value and the number of sequencing sequences, and the sequencing sequence higher than the threshold is determined as a repeated sequence. The condition that the depth value as the threshold needs to be satisfied is: The number of sequencing sequences smaller than the depth value accounts for 94% - 96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences less than the depth value is 95% of the number of sequencing sequences in all clustering results. For the functions of the various devices or units in Figures 11 and 12, reference may be made to the above description of the corresponding portions of the embodiment of the method of the present invention, which will not be described in detail herein for the sake of brevity.
本领域的技术人员应当理解, 对于图 11、 12 中的各个装 置, 可以通过单独的计算处理设备实现, 或者将其集成为一个独 立的设备实现。 在图 11、 12 中用框示出以说明它们的功能。 这 些功能块可以用硬件、 软件、 固件、 中间件、 微代码、 硬件描述 语音或者它们的任意组合来实现。 举例来说, 一个或者两个功能 块都可以利用运行在微处理器、 数字信号处理器(DSP )或任何 其他适当计算设备上的代码实现。 代码可以表示过程、 功能、 子 程序、 程序、 例行程序、 子例行程序、 模块或者指令、 数据结构 或程序语句的任意组合。 代码可以位于计算机可读介质中。 计算 机可读介质可以包括一个或者多个存储设备, 例如, 包括 RAM 存储器、 闪存存储器、 ROM 存储器、 EPROM 存储器、 EEPROM存储器、 寄存器、 硬盘、 移动硬盘、 CD-ROM或本领 域公知的其他任何形式的存储介质。 计算机可读介质还可以包括 编码数据信号的载波。  Those skilled in the art will appreciate that the various devices of Figures 11 and 12 can be implemented by separate computing processing devices or integrated into a single device implementation. They are shown in boxes in Figures 11 and 12 to illustrate their function. These functional blocks can be implemented in hardware, software, firmware, middleware, microcode, hardware description speech, or any combination thereof. For example, one or both of the functional blocks can be implemented with code running on a microprocessor, digital signal processor (DSP), or any other suitable computing device. A code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements. The code can be located on a computer readable medium. The computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, mobile hard disk, CD-ROM, or any other form known in the art. Storage medium. The computer readable medium can also include a carrier wave that encodes the data signal.
本公开提供的估计基因组重复序列含量的方法和装置, 利用 RAD测序方法对部分基因组进行测序, 通过聚类分析确定重复序 列, 达到了基因组勘测的目的, 而且不需要其它基因组数据进行 模拟, 简化了基因组分析处理的复杂度, 同时, 节约了计算资 源, 减少了测序成本。  The method and device for estimating the content of genomic repeat sequences provided by the present disclosure use RAD sequencing method to sequence part of the genome, determine the repeated sequence by cluster analysis, achieve the purpose of genome survey, and do not need other genomic data to simulate, simplifying The complexity of genome analysis processing, while saving computing resources and reducing sequencing costs.
至此, 已经详细描述了根据本发明的估计基因组重复序列含 量的方法和装置。 为了避免遮蔽本发明的构思, 没有描述本领域 所公知的一些细节。 本领域技术人员根据上面的描述, 完全可以 明白如何实施这里公开的技术方案。  Heretofore, a method and apparatus for estimating the content of a genomic repeat sequence according to the present invention have been described in detail. In order to avoid obscuring the concepts of the present invention, some details known in the art are not described. Those skilled in the art will fully understand how to implement the technical solutions disclosed herein based on the above description.
虽然已经通过示例对本发明的一些特定实施例进行了详细 说明, 但是本领域的技术人员应该理解, 以上示例仅是为了进 行说明, 而不是为了限制本发明的范围。 本领域的技术人员应 该理解, 可在不脱离本发明的范围和精神的情况下, 对以上实 施例进行修改。 本发明的范围由所附权利要求来限定。 Although specific embodiments of the invention have been described in detail by way of example It is to be understood that the above examples are for illustrative purposes only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

权 利 要 求 Rights request
1. 一种估计基因组重复序列含量的方法, 其特征在于, 包 括 ··  A method for estimating the content of a genomic repeat sequence, characterized by comprising:
获得某个体基因组的 RAD单端测序序列;  Obtaining a RAD single-end sequencing sequence of a somatic genome;
对 RAD单端测序序列进行过滤以去除不合格的测序序列; 对序列相同的测序序列进行统计, 得到每种测序序列的深度 The RAD single-end sequencing sequence is filtered to remove unqualified sequencing sequences; the sequencing sequences with the same sequence are counted to obtain the depth of each sequencing sequence.
>f^息 > f^息
过滤掉测序深度为 1的测序序列;  Filtering out the sequencing sequence with a sequencing depth of 1;
在得到的每种测序序列之间进行不容许空隙的两两比对, 将 所有满足比对条件的测序序列进行聚类;  Performing a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and clustering all the sequencing sequences satisfying the alignment conditions;
根据每个聚类结果的深度值选取阈值, 高于阈值的测序序列 确定为重复序列;  A threshold is selected according to a depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeated sequence;
根据确定的重复序列得到该个体基因组的重复序列含量。  The repeat sequence content of the individual genome is obtained based on the determined repeat sequence.
2. 根据权利要求 1 所述的方法, 其特征在于, 根据测序序 列的长度确定所述不容许空隙的两两比对的容许的错配数。 2. The method according to claim 1, characterized in that the allowable mismatch number of the pairwise alignment of the unallowable gaps is determined according to the length of the sequencing sequence.
3. 根据权利要求 1 所述的方法, 其特征在于, 根据每个聚 类结果的深度值选取阈值包括: 3. The method according to claim 1, wherein selecting a threshold according to a depth value of each clustering result comprises:
统计每一个聚类结果的深度值;  Count the depth value of each clustering result;
进而统计每一个测序深度下的测序序列个数;  Further counting the number of sequencing sequences at each sequencing depth;
根据深度值和测序序列个数选取某一深度值作为阈值, 所述 作为阈值的深度值需要满足的条件是: 小于该深度值的测序序列 个数占所有聚类结果中测序序列个数的 94%-96%。  A certain depth value is selected as a threshold according to the depth value and the number of sequencing sequences, and the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 94 of the number of sequencing sequences in all clustering results. %-96%.
4. 根据权利要求 1 所述的方法, 其特征在于, 所述不合格 的测序序列包括: 测序质量低于预定的低质量阈值的碱基个数超过整条测序序 列减基个数的 50%的测序序列; 和 /或 4. The method according to claim 1, wherein the unqualified sequencing sequence comprises: a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or
测序序列中测序结果不确定的碱基个数超过整条测序序列碱 基个数的 10%的测序序列; 和 /或  The number of bases in the sequencing sequence that are not determined by the sequencing result exceeds 10% of the number of bases of the entire sequencing sequence; and/or
存在外源序列的测序序列; 和 /或  a sequencing sequence in which a foreign sequence is present; and/or
起始的几个碱基不是酶切末端序列的测序序列。  The first few bases are not the sequencing sequence of the restriction endonuclease sequence.
5. 根据权利要求 1 所述的方法, 其特征在于, 所述根据确 定的重复序列得到该个体基因组的重 列含量包括: 5. The method according to claim 1, wherein the obtaining the content of the individual genome according to the determined repeat sequence comprises:
统计高于阈值的聚类结果数, 乘以测序序列的长度, 得到重 复序列单拷贝的总长度;  Counting the number of clustering results above the threshold, multiplying by the length of the sequencing sequence, to obtain the total length of the single copy of the repeated sequence;
用重复序列的平均测序深度除以非重复序列的平均测序深 度, 得到重 列的拷贝数;  The average sequencing depth of the repeated sequences is divided by the average sequencing depth of the non-repetitive sequences to obtain a renumbered copy number;
用重复序列单拷贝的总长度乘以重复序列的拷贝数, 得到 RAD测序位置的重复序列总长度;  Multiplying the total length of a single copy of the repeat sequence by the copy number of the repeat sequence to obtain the total length of the repeat sequence of the RAD sequencing position;
测序序列中重复序列的总长度除以重复序列和非重复序列总 长度的和所得到的百分比, 为 RAD 测序位置的重复序列含量, 即近似地得到该个体基因组的重复序列含量。  The total length of the repeat sequence in the sequencing sequence divided by the sum of the total length of the repeat and non-repetitive sequences is the repeat sequence content of the RAD sequencing position, i.e., the repeat sequence content of the individual genome is approximated.
6. 一种估计基因组重复序列含量的装置, 其特征在于, 包 括 ·· 6. A device for estimating the content of a genomic repeat sequence, characterized by comprising:
测序序列获取设备, 用于获得某个体基因组的 RAD单端测 序序列;  a sequencing sequence acquisition device for obtaining a RAD single-ended sequence of a certain genome;
测序序列过滤设备, 用于对获得的 RAD单端测序序列进行 过滤以去除不合格的测序序列;  a sequencing sequence filtering device for filtering the obtained RAD single-end sequencing sequence to remove unqualified sequencing sequences;
序列相同的测序序列统计设备, 用于对序列相同的测序序列 进行统计, 得到每种测序序列的深度信息;  a sequencing sequence statistical device with the same sequence for counting the sequencing sequences with the same sequence to obtain depth information of each sequencing sequence;
序列深度过滤设备, 用于过滤掉测序深度为 1的测序序列; 聚类设备, 用于在得到的每种测序序列之间进行不容许空隙 的两两比对, 将所有满足比对条件的测序序列进行聚类; a sequence depth filtering device for filtering out a sequencing sequence having a sequencing depth of 1; a clustering device for performing a pairwise alignment of unacceptable gaps between each of the obtained sequencing sequences, and clustering all the sequencing sequences satisfying the alignment conditions;
重复序列确定设备, 用于根据每个聚类结果的深度值选取阈 值, 高于阈值的测序序列确定为重复序列;  a repeating sequence determining device, configured to select a threshold according to a depth value of each clustering result, and a sequencing sequence higher than the threshold is determined as a repeating sequence;
重复序列含量获取设备, 用于根据确定的重复序列得到该个 体基因组的重复序列含量。  A repeat sequence content acquisition device for obtaining a repeat sequence content of the individual genome based on the determined repeat sequence.
7. 根据权利要求 6 所述的装置, 其特征在于, 所述不容许 空隙的两两比对的容许的错配数根据测序序列的长度确定。 7. Apparatus according to claim 6 wherein the number of allowable mismatches for the pairwise mismatch of the gaps is determined based on the length of the sequencing sequence.
8. 根据权利要求 6 所述的装置, 其特征在于, 所述重复序 列确定设备包括: 聚类结果深度统计单元, 用于统计每一个聚类 结果的深度值; 测序深度分布统计单元, 用于统计每一个测序深 度下的测序序列个数; 阈值选取单元, 用于根据深度值和测序序 列个数选取某一深度值作为阈值, 高于阈值的测序序列确定为重 复序列; 所述作为阈值的深度值需要满足的条件是: 小于该深度 值的测序序列个数占所有聚类结果中测序序列个数的 94 % - 96%。 The device according to claim 6, wherein the repeating sequence determining device comprises: a clustering result depth statistic unit, configured to calculate a depth value of each clustering result; a sequencing depth distribution statistic unit, Counting the number of sequencing sequences at each sequencing depth; the threshold selecting unit is configured to select a certain depth value as a threshold according to the depth value and the number of sequencing sequences, and the sequencing sequence higher than the threshold is determined as a repeated sequence; The conditions for the depth value to be satisfied are: The number of sequencing sequences smaller than the depth value accounts for 94% - 96% of the number of sequencing sequences in all clustering results.
9. 根据权利要求 6 所述的装置, 其特征在于, 所述不合格 的测序序列包括: 9. The apparatus according to claim 6, wherein the unqualified sequencing sequence comprises:
测序质量低于预定的低质量阈值的碱基个数超过整条测序序 列碱基个数的 50%的测序序列; 和 /或  a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or
测序序列中测序结果不确定的碱基个数超过整条测序序列碱 基个数的 10%的测序序列; 和 /或  The number of bases in the sequencing sequence that are not determined by the sequencing result exceeds 10% of the number of bases of the entire sequencing sequence; and/or
存在外源序列的测序序列; 和 /或起始的几个碱基不是酶切末 端序列的测序序列。  A sequencing sequence in which a foreign sequence is present; and/or a starting sequence of several bases is not a sequencing sequence of the restriction end-end sequence.
PCT/CN2011/084928 2011-12-29 2011-12-29 Method and device for estimating repeating sequence content of genome WO2013097149A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/084928 WO2013097149A1 (en) 2011-12-29 2011-12-29 Method and device for estimating repeating sequence content of genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/084928 WO2013097149A1 (en) 2011-12-29 2011-12-29 Method and device for estimating repeating sequence content of genome

Publications (1)

Publication Number Publication Date
WO2013097149A1 true WO2013097149A1 (en) 2013-07-04

Family

ID=48696227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/084928 WO2013097149A1 (en) 2011-12-29 2011-12-29 Method and device for estimating repeating sequence content of genome

Country Status (1)

Country Link
WO (1) WO2013097149A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937475A (en) * 2022-04-12 2022-08-23 桂林电子科技大学 Automatic evaluation method for error correction result of PacBio sequencing data
CN117106875A (en) * 2023-10-23 2023-11-24 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BAIRD, N.A. ET AL.: "Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD markers", PLOS ONE, vol. 3, no. 10, October 2008 (2008-10-01), pages E3376 *
BAXTER, S.W. ET AL.: "Linkage Mapping and comparative Genomics Using Next-Generation RAD Sequencing of a Non-Model Organism", PLOS ONE, vol. 6, no. 4, April 2011 (2011-04-01), pages E19315 *
WF, P. ET AL.: "Mapping with RAD (restriction-site associated DNA) markers to rapidly identify QTL for stem rust resistance in Lolium perenne", TAG THEORETICAL AND APPLIED GENETICS, vol. 122, no. 8, 23 February 2011 (2011-02-23), pages 1467 - 1480 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937475A (en) * 2022-04-12 2022-08-23 桂林电子科技大学 Automatic evaluation method for error correction result of PacBio sequencing data
CN117106875A (en) * 2023-10-23 2023-11-24 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing
CN117106875B (en) * 2023-10-23 2024-02-06 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Similar Documents

Publication Publication Date Title
Pertea et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise
Clark et al. Performance comparison of exome DNA sequencing technologies
Smith et al. Widespread purifying selection on RNA structure in mammals
Martin et al. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads
Sun et al. Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study
KR101795124B1 (en) Method and system for detecting copy number variation
Mangul et al. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues
Gogol-Döring et al. An overview of the analysis of next generation sequencing data
Bansal A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments
Zhou et al. Prevention, diagnosis and treatment of high‐throughput sequencing data pathologies
Zhang et al. Isoform evolution in primates through independent combination of alternative RNA processing events
Corney RNA-seq using next generation sequencing
WO2012116658A2 (en) Method and device for assembling genome sequence
Li et al. CircMarker: a fast and accurate algorithm for circular RNA detection
Ma et al. The analysis of ChIP-Seq data
WO2013097048A1 (en) Method and device for labelling single nucleotide polymorphism sites in genome
Delhomme et al. Guidelines for RNA-Seq data analysis
Kremer et al. Approaches for in silico finishing of microbial genome sequences
Akogwu et al. A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis
Guo et al. Single-nucleotide variants in human RNA: RNA editing and beyond
CN107832584B (en) Gene analysis method, device, equipment and storage medium of metagenome
CN115101128A (en) Method for evaluating off-target risk of hybridization capture probe
WO2013097149A1 (en) Method and device for estimating repeating sequence content of genome
WO2013097143A1 (en) Method and device for estimating genome heterozygosity rate
WO2018019138A1 (en) Data processing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11878963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11878963

Country of ref document: EP

Kind code of ref document: A1