WO2013097149A1

WO2013097149A1 - Method and device for estimating repeating sequence content of genome

Info

Publication number: WO2013097149A1
Application number: PCT/CN2011/084928
Authority: WO
Inventors: 郑泽群; 陶晔; 冯子浩; 汪健; 杨焕明; 王俊
Original assignee: 深圳华大基因科技服务有限公司
Priority date: 2011-12-29
Filing date: 2011-12-29
Publication date: 2013-07-04

Abstract

Disclosed is a method for estimating the repeating sequence content of a genome. The method is carried out by filtration, statistical analysis, alignment, clustering and threshold selection of single-end RAD sequences for sequencing of an individual genome. Further disclosed is a device for estimating the repeating sequence content of a genome.

Description

Method and apparatus for estimating genomic repeat content

The present invention relates to the field of bioinformatics technology, and more particularly to a method and apparatus for estimating the repeat sequence content of a genome.

Background technique

A repeat sequence refers to a segment of DNA that has multiple copies on an individual's genome.

Second-generation DNA sequencing is a high-throughput, low-cost sequencing technology whose basic principle is sequencing by synthesis. Taking the Solexa sequencing method as an example, the DNA strand is randomly sheared by physical means, a specific linker is then added to both ends of each fragment, and the linker carries an amplification primer sequence. During sequencing, DNA polymerase synthesizes the complementary strand of the fragment to be sequenced, and the sequence of the fragment is read by detecting the fluorescent signal carried by each newly incorporated base. These sequences are called sequencing fragments or reads (http://www.illumina.com).

Sequencing and assembling the DNA of a species generally requires a general understanding of the sequence of that species beforehand. Sequence assembly restores the sequence information of the genome by joining overlapping fragments, so if the repeat sequence content is too high, the sequencing data obtained by the whole-genome shotgun method will not be ideal for de novo assembly. Therefore, it is often necessary to perform a genome survey before de novo assembly to understand the repeat content of the genome.

The traditional approach to surveying a genome requires whole-genome sequencing at a sequencing depth of approximately 20x to 30x. After the sequencing data are obtained, a kmer frequency distribution map is built from the reads and used to estimate the genomic heterozygosity rate. Specifically, assume a complete continuous sequence and take from it an arbitrary segment of length K; such a segment is called a kmer. When the read length is L and the kmer length is K, L-K+1 kmers can be obtained from one read. Counting the frequency of occurrence of each distinct kmer over all reads then gives the kmer frequency distribution map. The specific process is shown in Figure 1.
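The kmer counting described above can be illustrated with a minimal Python sketch. The function name `kmer_frequency_distribution` and the toy reads are assumptions made for illustration; they do not come from the original disclosure.

```python
from collections import Counter


def kmer_frequency_distribution(reads, k):
    """Return {occurrence count: number of distinct kmers with that count}."""
    kmer_counts = Counter()
    for read in reads:
        # A read of length L yields L - K + 1 kmers.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram over counts: how many distinct kmers occur once, twice, ...
    return Counter(kmer_counts.values())


# Toy reads, made up for illustration only.
reads = ["AATTCATCGA", "ATTCATCGAC"]
distribution = kmer_frequency_distribution(reads, k=4)
```

Plotting the keys of `distribution` against the share of distinct kmers at each count would give a curve of the kind the text calls the kmer frequency distribution map.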

According to Lander-Waterman statistics, the kmer frequency distribution of a genome can be approximated as following a Poisson distribution. According to Poisson distribution theory, the sequencing depth corresponding to the peak is the average sequencing depth of the genome. For diploids, if the repeat sequence content of the genome is relatively high, repeat peaks or tailing will appear behind the main peak of the kmer distribution. To estimate the repeat sequence content of the genome, data from another genome, for example the Arabidopsis genome, must be used for simulation. Repeat sequence contents are set artificially on the Arabidopsis genome to generate simulated reads at a sequencing depth consistent with that of the target genome, and a kmer frequency distribution map is then obtained from the simulated reads. By setting different repeat sequence contents and comparing the simulated kmer frequency distribution with the kmer distribution of the target genome, the repeat sequence content of the target genome is estimated, as shown in Figure 2.

This traditional genome survey method has several drawbacks. It requires whole-genome sequencing at a depth of 20x to 30x, so the cost is relatively high. Because of the large amount of sequencing data, more computing resources are needed to process the data. Data from known genomes are also required for simulation, which further increases the processing steps and the amount of data to be processed. Therefore, there is a need for a new genome survey method that can easily estimate the repeat content of a genome from less sequencing data, reducing the high sequencing cost and computational resource cost of traditional methods.

Summary of the invention

The present invention has been made in view of the above problems.

A first aspect of the invention provides a method of estimating the repeat sequence content of a genome, comprising: obtaining RAD single-end sequencing sequences (reads) of an individual genome; filtering the RAD single-end sequencing sequences to remove unqualified sequencing sequences; counting sequencing sequences that have the same sequence to obtain depth information for each sequencing sequence; filtering out sequencing sequences with a sequencing depth of 1; performing gap-free pairwise alignment between the remaining sequencing sequences and clustering all sequencing sequences that satisfy the alignment condition; selecting a threshold according to the depth value of each clustering result, and determining sequencing sequences above the threshold to be repeat sequences; and obtaining the repeat sequence content of the individual genome based on the determined repeat sequences.

Preferably, the number of mismatches allowed in the gap-free pairwise alignment is determined according to the length of the sequencing sequences; that is, the alignment condition of the gap-free pairwise alignment is determined based on the length of the sequencing sequences.

Preferably, selecting a threshold according to the depth value of each clustering result comprises: counting the depth value of each clustering result; further counting the number of sequencing sequences at each sequencing depth; and selecting a certain depth value as the threshold according to the depth values and the numbers of sequencing sequences.

The condition that the depth value serving as the threshold must satisfy is: the number of sequencing sequences with depth smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results. In one embodiment of the present invention, the number of sequencing sequences with depth smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.

Preferably, the unqualified sequencing sequences comprise: a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which the number of bases with undetermined sequencing results exceeds 10% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which an exogenous sequence is present; and/or a sequencing sequence whose first few bases are not the restriction endonuclease cut end.

Preferably, obtaining the repeat sequence content of the individual genome according to the determined repeat sequences comprises: counting the number of clustering results above the threshold and multiplying it by the length of the sequencing sequence to obtain the total single-copy length of the repeat sequences; dividing the average sequencing depth of the repeat sequences by the average sequencing depth of the non-repetitive sequences to obtain the copy number of the repeat sequences; multiplying the total single-copy length by the copy number to obtain the total length of the repeat sequences at the RAD sequencing positions; and dividing the total length of the repeat sequences in the sequencing sequences by the sum of the total lengths of the repeat and non-repetitive sequences to obtain the repeat sequence content at the RAD sequencing positions, which approximates the repeat sequence content of the individual genome.

Another aspect of the present invention provides an apparatus for estimating the repeat sequence content of a genome, comprising: a sequencing sequence acquisition device for obtaining RAD single-end sequencing sequences of an individual genome; a sequencing sequence filtering device for filtering the obtained RAD single-end sequencing sequences to remove unqualified sequencing sequences; a sequencing depth determining device for counting sequencing sequences that have the same sequence to obtain depth information for each sequencing sequence; a sequence depth filtering device for filtering out sequencing sequences with a sequencing depth of 1; a clustering device for performing gap-free pairwise alignment between the obtained sequencing sequences and clustering all sequencing sequences that satisfy the alignment condition; a repeat sequence determining device for selecting a threshold according to the depth value of each clustering result, sequencing sequences above the threshold being determined to be repeat sequences; and a repeat sequence content obtaining device for obtaining the repeat sequence content of the individual genome according to the determined repeat sequences.

Preferably, the repeat sequence determining device comprises: a clustering result depth statistics unit for counting the depth value of each clustering result; a sequencing depth distribution statistics unit for counting the number of sequencing sequences at each sequencing depth; and a threshold selection unit for selecting a depth value as the threshold according to the depth values and the numbers of sequencing sequences, sequencing sequences above the threshold being determined to be repeat sequences. The condition that the depth value serving as the threshold must satisfy is: the number of sequencing sequences with depth smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences with depth smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.

Preferably, the unqualified sequencing sequences comprise: a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which the number of bases with undetermined sequencing results exceeds 10% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which an exogenous sequence is present; and/or a sequencing sequence whose first few bases are not the restriction endonuclease cut-end sequence.

An advantage of the present invention is that the repeat sequence content of a genome can be easily estimated from the sequence of only part of the genome, which reduces sequencing costs and computational resource costs; no known genomic data are required for simulation, which simplifies the processing steps.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the drawings.

Description of the drawings

Figure 1 is a schematic flow chart showing how a kmer frequency distribution map is obtained from sequencing reads in the prior art;

The abscissa in the figure represents the sequencing depth of the kmers, and the ordinate represents the number of kmer species at a given sequencing depth as a percentage of the total number of kmer species;

Figure 2 is a schematic diagram showing estimation of the heterozygosity rate of a target genome using the Arabidopsis genome in the prior art;

Figure 3 shows a schematic diagram of the various steps of the RAD sequencing technique;

Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention;

Figure 5 is a schematic diagram showing an example of RAD single-end sequencing of a genome;

Figure 6 is a schematic diagram showing the depth information of a sequencing sequence;

Figure 7 is a schematic diagram showing the depth information storage of a sequencing sequence;

Figure 8 is a flow chart showing an example of the repeat sequence determination of the present invention;

Figure 9 is a view showing a sequencing depth profile in an application example of the method for estimating the genomic repeat sequence content of the present invention;

Figure 10 is a schematic diagram showing an application example of the method of estimating the genomic repeat sequence content of the present invention;

Figure 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention;

Figure 12 is a block diagram showing another embodiment of the apparatus for estimating the genomic repeat sequence content of the present invention.

Detailed description

Various exemplary embodiments of the present invention will now be described in detail with reference to the drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in the embodiments are not intended to limit the scope of the invention unless otherwise specified.

At the same time, it should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application, or its use.

Techniques, methods and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be considered part of the granted specification.

In all of the examples shown and discussed herein, any specific values are to be construed as illustrative only and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters indicate similar items in the following figures, and therefore, once an item is defined in a drawing, it is not required to be further discussed in the subsequent drawings.

In response to the problems of the prior art, the present disclosure provides a new bioinformatics analysis procedure that processes RAD (Restriction-site Associated DNA) data to find repeat sequence information on the RAD sequencing fragments and thereby calculate the repeat sequence content. This simplifies the processing steps of the prior art and also reduces the sequencing cost and the computational resource cost.

Several concepts related to the technical solution of the present invention are described below.

RAD sequencing technology adopts a new library construction method; the specific process is shown in Figure 3. Specific sites of the DNA are cleaved by a restriction endonuclease, the digested DNA molecules are randomly sheared by physical methods, DNA molecules of a specific length are selected by agarose gel separation, and specific amplification linkers and sequencing linkers are then added to the ends of the selected DNA to construct a library for high-throughput sequencing.

The repeat sequence content refers to the total length of the repeat sequences in the sequencing sequences divided by the sum of the total lengths of the repeat and non-repetitive sequences.

A gap-free pairwise alignment is one in which no gaps may be introduced during alignment; that is, gapped alignments are not considered. For example, the following alignment of two sequences does not satisfy the gap-free pairwise alignment condition:

Sequence 1: AATTCATCGAC

Sequence 2: AA--CATCGTC

Reads refers to the sequencing fragments produced by the sequencer.

Figure 4 is a flow chart showing one embodiment of a method of estimating the genomic repeat content of the present invention.

As shown in Figure 4, in step 402, RAD single-end sequencing sequences of an individual genome are obtained. Figure 5 shows a schematic of an example of RAD single-end sequencing. In Figure 5, the palindromic sequence "G^AATTC" on the DNA molecule is recognized by the restriction endonuclease EcoRI, and the DNA molecule is cleaved between G and A. The digested DNA molecules are sheared into short fragments by physical means, a linker is added to the enzyme-cut end, and the DNA fragments are subjected to single-end sequencing. The sequencing read length is generally 50 nt or 100 nt.

In step 404, the RAD single-end sequencing sequences are filtered to remove unqualified sequencing sequences. For example, after high-throughput RAD single-end sequencing sequences are received, the sequencing sequences are filtered to remove the unqualified ones. The high-throughput sequencing technology can be Illumina GA sequencing or any other available high-throughput sequencing technology.

The unqualified sequencing sequences include, for example, the following. A sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequence is considered unqualified; the low quality threshold is determined by the specific sequencing technology and sequencing environment, for example a single-base sequencing quality lower than 20. A sequencing sequence in which the number of bases with undetermined sequencing results (such as N in Illumina GA sequencing results) exceeds 10% of the number of bases of the entire sequence is considered unqualified. The sequencing sequence is also aligned against exogenous sequences introduced by the experiment other than the sample linker sequence, such as various linker sequences; if an exogenous sequence is present in the sequencing sequence, it is considered unqualified. Finally, a sequencing sequence whose first few bases are not the enzyme-cut end sequence is filtered out; for example, for the restriction endonuclease EcoRI, if the sequencing sequence does not begin with "AATTC", the entire sequencing sequence is filtered out.
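The four filtering criteria can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function name `is_qualified`, its parameters, and the use of exact substring search in place of alignment against exogenous sequences are all hypothetical simplifications.

```python
def is_qualified(seq, quals, cut_site="AATTC", low_quality=20, exogenous=()):
    """Return True if a read passes the four filters described in the text.

    seq: read bases; quals: per-base quality scores (same length as seq).
    cut_site, low_quality and exogenous are illustrative defaults.
    """
    n = len(seq)
    # 1. More than 50% of bases fall below the low-quality threshold.
    if sum(q < low_quality for q in quals) > 0.5 * n:
        return False
    # 2. More than 10% of bases are undetermined (N).
    if seq.count("N") > 0.1 * n:
        return False
    # 3. An exogenous sequence is present (assumption: exact substring
    #    search stands in for the alignment mentioned in the text).
    if any(x in seq for x in exogenous):
        return False
    # 4. The read must begin with the enzyme-cut end sequence.
    return seq.startswith(cut_site)
```

For example, `is_qualified("AATTCGGA", [30] * 8)` passes all four checks, while a read that starts with any other bases is rejected by the last one.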

In step 406, sequencing sequences with the same sequence are counted to obtain depth information for each sequencing sequence. For example, identical sequencing sequences are counted and each distinct sequencing sequence is collapsed into one stack, so that the sequencing depth of each sequencing sequence is obtained. The specific process is shown in Figure 6. The stack information can be stored in the manner shown in Figure 7, in which the first column is the RAD sequencing sequence, the second column is the number of times the sequence was sequenced (that is, the depth information), and the third column is the ID of the sequence.

In step 408, sequencing sequences with a sequencing depth of 1 are filtered out. Sequences with a depth of 1 are usually caused by sequencing errors; filtering them out reduces false SNPs caused by sequencing errors.
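Steps 406 and 408 amount to collapsing identical reads and discarding singletons. A minimal Python sketch follows; the function name `depth_table`, the sequential integer IDs, and the toy reads are assumptions for illustration, while the three-column row layout follows the description of Figure 7.

```python
from collections import Counter


def depth_table(reads):
    """Collapse identical reads into (sequence, depth, id) rows, depth > 1 only."""
    depths = Counter(reads)  # sequence -> number of times it was sequenced
    rows = []
    for seq_id, (seq, depth) in enumerate(depths.items(), start=1):
        if depth > 1:  # depth-1 sequences are treated as likely sequencing errors
            rows.append((seq, depth, seq_id))
    return rows


# Toy reads, made up for illustration only.
reads = ["AATTCAT", "AATTCAT", "AATTCGT", "AATTCAT", "AATTCCC", "AATTCCC"]
rows = depth_table(reads)
```

In this toy input the read `AATTCGT` occurs once and is dropped, leaving two rows with depths 3 and 2.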

In step 410, gap-free pairwise alignment is performed between the obtained sequencing sequences, and all sequencing sequences satisfying the alignment condition are clustered. The number of mismatches allowed in the alignment is determined by the length of the sequencing sequences. For example, if the sequencing length is less than 50 bases (nt), the allowed number of mismatches is 1; if the sequencing length is 100 nt, the allowed number of mismatches is 2. Specifically, when the sequencing length is less than 50 nt, for example, only two sequences that differ at a single base are grouped into one class.
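Because gaps are not allowed, the alignment condition for two reads of equal length reduces to counting mismatched positions. The sketch below is illustrative only: the function names are hypothetical, and the cutoff in `allowed_mismatches` (1 mismatch up to 50 nt, 2 mismatches for longer reads) is an assumption that interpolates the two examples given in the text.

```python
def allowed_mismatches(read_length):
    # Assumption: reads up to 50 nt allow 1 mismatch, longer reads allow 2.
    return 1 if read_length <= 50 else 2


def satisfies_alignment(seq_a, seq_b, max_mismatches):
    """Gap-free comparison: equal length and few enough mismatched positions."""
    if len(seq_a) != len(seq_b):
        return False  # aligning sequences of unequal length would need a gap
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches <= max_mismatches
```

The disclosure itself leaves the choice of alignment software open; a direct position-by-position comparison like this is just the simplest way to express the condition.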

Step 412 selects a threshold based on the depth value of each clustering result, and the sequencing sequence above the threshold is determined as a repeating sequence.

Step 414 obtains the repeat sequence content of the individual genome based on the determined repeat sequence.

In one embodiment of the present invention, obtaining the repeat sequence content of the individual genome according to the determined repeat sequences comprises: counting the number of clustering results above the threshold and multiplying it by the length of the sequencing sequence to obtain the total single-copy length of the repeat sequences; dividing the average sequencing depth of the repeat sequences by the average sequencing depth of the non-repetitive sequences to obtain the copy number of the repeat sequences; multiplying the total single-copy length by the copy number to obtain the total length of the repeat sequences at the RAD sequencing positions; and dividing the total length of the repeat sequences in the sequencing sequences by the sum of the total lengths of the repeat and non-repetitive sequences to obtain the repeat sequence content at the RAD sequencing positions, which approximates the repeat sequence content of the individual genome.

In the above embodiment, the RAD sequencing data are processed directly to find the repeat sequences on the RAD fragments and from them the repeat sequence content, without relying on data from known genomes, which overcomes some technical bottlenecks of the traditional method of obtaining the repeat sequence content. With RAD sequencing, specific regions of the genome are enriched and sequenced, which reduces the amount of sequencing data; the different analysis method and the reduced data volume in turn reduce the computational resources and sequencing costs required for analysis.
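The calculation in this embodiment can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the original text: N_r and N_n are the numbers of clustering results above and below the threshold, R_r and R_n are the numbers of sequencing sequences belonging to those clustering results, and ℓ is the length of a sequencing sequence. The definitions of the average depths and of the non-repetitive total length follow the ones given later in this description.

```latex
\begin{align*}
\bar{D}_r &= \frac{R_r}{N_r}, \qquad \bar{D}_n = \frac{R_n}{N_n}
  && \text{(average sequencing depths)} \\
c &= \frac{\bar{D}_r}{\bar{D}_n}
  && \text{(copy number of the repeat sequences)} \\
L_r &= N_r \,\ell\, c
  && \text{(total repeat length at the RAD positions)} \\
L_n &= N_n \,\ell
  && \text{(total non-repetitive length)} \\
P &= \frac{L_r}{L_r + L_n} = \frac{N_r\, c}{N_r\, c + N_n}
  && \text{(repeat sequence content)}
\end{align*}
```

Under this notation the read length cancels out of the final ratio, so the estimate depends only on the cluster counts and the ratio of average depths.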

According to the method of the present invention, an embodiment proposes a new method for determining repeat sequences. Its basic idea is as follows: gap-free pairwise alignment is performed between the sequencing sequences, using any sequence alignment software, such as BLAST or BLAT; all sequencing sequences that satisfy the mismatch condition of the alignment are clustered; and a threshold is selected according to the depth value of each clustering result, with sequencing sequences above the threshold determined to be repeat sequences. Selecting the threshold according to the depth value of each clustering result includes: counting the depth value of each clustering result, then counting the number of sequencing sequences at each sequencing depth, and selecting a certain depth value as the threshold according to the depth values and the numbers of sequencing sequences. The condition that the depth value serving as the threshold must satisfy is: the number of sequencing sequences with depth smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences with depth smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results. The specific process is shown in Figure 8:

Step 802: Perform gap-free pairwise alignment between the sequencing sequences, and cluster all sequencing sequences satisfying the alignment condition.

Step 804: Count the depth value of each clustering result.

Step 806: Count the number of sequencing sequences at each sequencing depth.

Step 808: Select a depth value as the threshold according to the depth values and the numbers of sequencing sequences. Specifically, the condition that the depth value serving as the threshold must satisfy is: the number of sequencing sequences with depth smaller than the depth value accounts for 94%-96% of the number of sequencing sequences in all clustering results. In an embodiment of the present invention, the number of sequencing sequences with depth smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results.

Step 810: Determine the sequencing sequences above the threshold to be repeat sequences.
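Steps 804 to 808 can be sketched as a percentile-style scan over the depth histogram. The following is a minimal Python illustration under two assumptions that the text does not settle: each clustering result contributes one entry (its depth value) to the histogram, and the threshold is taken to be the smallest observed depth for which at least the target fraction of entries lies strictly below it. The function name `select_depth_threshold` is hypothetical.

```python
from collections import Counter


def select_depth_threshold(cluster_depths, fraction=0.95):
    """Smallest observed depth d with at least `fraction` of clusters below d."""
    histogram = Counter(cluster_depths)  # depth value -> number of clusters
    total = len(cluster_depths)
    below = 0  # number of clusters with depth strictly below the candidate
    for depth in sorted(histogram):
        if below >= fraction * total:
            return depth
        below += histogram[depth]
    return None  # no observed depth satisfies the condition


# Toy depths: 96 low-depth clusters and 4 high-depth clusters.
threshold = select_depth_threshold([10] * 96 + [200] * 4)
```

With the toy input, 96% of the clusters lie below depth 200, so 200 is returned; a target anywhere in the 94%-96% range named in the text can be passed through `fraction`.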

Through the comparison and statistical methods of the above embodiments, the amount of calculation is small, the speed is fast, and the efficiency is high, which simplifies the processing steps in the conventional manner.

Since repeat sequences are distributed relatively uniformly in the genome sequence, the RAD sequencing method is equivalent to randomly sampling fragments of the genomic DNA sequence, and analyzing the RAD sequencing fragments gives the repeat sequence content of those fragments. Because the RAD sequencing method can measure sequence information covering three to six percent of the genome, the sample size is large, which allows the repeat sequence content at the sequenced positions to approximate the repeat content of the entire genome.

Figure 10 shows a schematic of an application example of the method for estimating genomic repeat sequence content of the present invention. This example used RAD sequencing sequence data (that is, reads data) from wild jiaobai (茭白, Zizania latifolia), flowering jiaobai, and common jiaobai. The RAD sequencing method is well known in the art; see, for example, the following documents:

(1) Michael R. Miller, Tressa S. Atwood, B. Frank Eames, et al., RAD marker microarrays enable rapid mapping of zebrafish mutations, Genome Biology, 2007, 8(6): R105.1-R105.10;

(2) Michael R. Miller, Joseph P. Dunham, Angel Amores, et al., Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers, Genome Research, 2007, 17: 240-248;

(3) Nathan A. Baird, Paul D. Etter, Tressa S. Atwood, et al., Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers, PLoS ONE, 2008, 3(10): e3376, doi: 10.1371/journal.pone.0003376.

Using conventional methods, it is known that the repeat sequence contents of the three kinds of jiaobai (茭白) do not differ much.

The specific operation flow of this embodiment is shown in Figure 10. In step 1002, the sequencing reads of the three kinds of jiaobai are filtered according to sequencing quality value, N content, and whether the enzyme-cut end sequence is present, and unqualified sequencing sequences are removed. The valid data statistics are shown in Table 1.

Table 1 Valid RAD sequencing data statistics for the three kinds of jiaobai

In step 1004, sequencing sequences with the same sequence are counted to obtain the depth of each sequencing sequence, and sequencing sequences with a sequencing depth of 1 are filtered out. The results are shown in Table 2.

Table 2 Reads data statistics for the three kinds of jiaobai

In step 1006, pairwise alignment is performed between the sequencing sequences, and all sequencing sequences satisfying the alignment condition are clustered. The number of mismatches allowed in the alignment is, for example, 1; that is, the alignment condition is that the two sequences differ at only one base, in which case they are grouped into one class. If sequence A and sequence B differ at only one base, and sequence B and sequence C also differ at only one base, then the three sequences are grouped into one class, and so on. Through the alignment between all sequencing sequences, all sequencing sequences that satisfy the alignment condition are clustered.
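The transitive grouping described here (A with B, B with C, hence A, B and C together) is single-linkage clustering, which a union-find structure expresses directly. The sketch below is an illustration under assumptions, not the method of the disclosure: `cluster_sequences` is a hypothetical name, and the all-pairs comparison is quadratic in the number of sequences, so a real data set would need an indexing strategy or alignment software as the text suggests.

```python
from itertools import combinations


def cluster_sequences(sequences, max_mismatches=1):
    """Single-linkage clustering of sequences by gap-free mismatch count."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        a, b = sequences[i], sequences[j]
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Toy sequences: the first three chain together, the fourth stands alone.
clusters = cluster_sequences(["AATTCAAA", "AATTCAAT", "AATTCATT", "AATTCGGG"])
```

Note that the first and third toy sequences differ at two bases, yet they end up in the same cluster because each is within one mismatch of the second.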

In step 1008, the depth of each clustering result is counted, the number of sequencing sequences at each sequencing depth is counted, and a sequencing depth profile of the sequencing sequences is drawn, as shown in Figure 9.

Step 1010: Select a depth value as a threshold according to the depth value and the number of sequencing sequences. Specifically, the condition that the depth value as the threshold needs to be satisfied is: the number of sequencing sequences smaller than the depth value accounts for 95% of the number of sequencing sequences in all clustering results. The sequencing sequence in the clustering result above the threshold is determined to be a repeating sequence.

The threshold selected in this embodiment is a depth value of 100. Accordingly, sequencing sequences below the threshold are determined to be non-repetitive sequences.

Step 1012: Obtain the repeat sequence content of the genome according to the determined repeat sequences. Specifically, this includes: counting the number of clustering results above the threshold and multiplying it by the length of the sequencing sequence to obtain the total single-copy length of the repeat sequences; dividing the average sequencing depth of the repeat sequences by the average sequencing depth of the non-repetitive sequences to obtain the copy number of the repeat sequences; multiplying the total single-copy length by the copy number to obtain the total length of the repeat sequences at the RAD sequencing positions; and dividing the total length of the repeat sequences in the sequencing sequences by the sum of the total lengths of the repeat and non-repetitive sequences. The resulting percentage is the repeat sequence content at the RAD sequencing positions, which approximates the repeat sequence content of the genome.

The average sequencing depth of the repeated sequence refers to the number of all sequencing sequences corresponding to the repeated sequence divided by the number of clustering results of the repeated sequence; the average sequencing depth of the non-repetitive sequence refers to the number of all sequencing sequences corresponding to the non-repetitive sequence divided by the non-repetitive sequence The number of clustering results. The number of clustering results below the threshold is counted, multiplied by the length of the sequencing sequence to obtain the total length of the non-repetitive sequence.
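The arithmetic of step 1012, together with the definitions of average depth just given, can be sketched in Python. This is a hedged illustration: the function name `repeat_content` is hypothetical, the depth of a clustering result is assumed to equal the number of sequencing sequences it contains, and clusters exactly at the threshold are assumed to count as non-repetitive, a case the text does not specify.

```python
def repeat_content(cluster_depths, threshold, read_length):
    """Estimate repeat sequence content from per-cluster depth values."""
    repeat = [d for d in cluster_depths if d > threshold]
    unique = [d for d in cluster_depths if d <= threshold]  # assumption: ties are non-repetitive
    if not repeat or not unique:
        raise ValueError("need at least one repeat and one non-repeat cluster")
    # Average depth = number of sequencing sequences / number of clusters.
    repeat_depth = sum(repeat) / len(repeat)
    unique_depth = sum(unique) / len(unique)
    copy_number = repeat_depth / unique_depth
    single_copy_length = len(repeat) * read_length
    repeat_length = single_copy_length * copy_number
    unique_length = len(unique) * read_length
    return repeat_length / (repeat_length + unique_length)


# Toy data: 8 clusters at depth 10 and 2 clusters at depth 50, 50 nt reads.
content = repeat_content([10] * 8 + [50] * 2, threshold=20, read_length=50)
```

In the toy data the copy number is 50 / 10 = 5, the total repeat length is 2 × 50 × 5 = 500, the non-repetitive length is 8 × 50 = 400, and the estimated content is 500 / 900.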

In summary, the repeat sequence content is calculated from the results of the above steps; the results obtained are shown in Table 3.

Table 3 Statistics of repeat region clustering results

It can be seen that the results of the method of sampling and sequencing the genome using the RAD sequencing technology and estimating the repeat content of the genome are consistent with the results of the conventional analysis method. Figure 11 is a block diagram showing an embodiment of an apparatus for estimating the content of a genomic repeat of the present invention. As shown in FIG. 11, the apparatus comprises: a sequencing sequence acquisition device 111 for obtaining a RAD single-end sequencing sequence of a certain body genome. The sequencing sequence filtering device 112 filters the obtained RAD single-ended sequencing sequence to remove unqualified sequencing sequences. The unqualified sequencing sequence includes, for example, a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or the base in which the sequencing result is indeterminate in the sequencing sequence. a sequencing sequence having a number exceeding 10% of the number of bases of the entire sequencing sequence; and/or a sequencing sequence in which the exogenous sequence is present; and/or a starting sequence of several bases that is not a restriction endonuclease sequence. The sequencing depth determining device 113 performs statistics on the sequencing sequences having the same sequence to obtain depth information of each sequencing sequence. A sequence depth filtering device 114 is used to filter out sequencing sequences with a sequencing depth of one. The clustering device 115 is configured to perform a pairwise alignment of the unacceptable gaps between each of the obtained sequencing sequences, and cluster all the sequencing sequences satisfying the alignment conditions. 
A repeat sequence determining device 116 selects a threshold according to the depth value of each clustering result, and the sequencing sequences whose depth is higher than the threshold are determined to be repeat sequences. A repeat sequence content obtaining device 117 obtains the repeat sequence content of the individual genome according to the determined repeat sequences.
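The filtering, depth counting and clustering stages (devices 112 to 115) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the invention: the quality cutoff `low_q=20`, the `adapter` string standing in for an exogenous sequence, the single-linkage grouping, and the choice to sum member depths into a cluster depth are all assumptions, and every function name is hypothetical.

```python
from collections import Counter


def is_qualified(seq, quals, site_prefix, adapter=None, low_q=20):
    """Return True if a read passes the four example filters."""
    n = len(seq)
    if n == 0:
        return False
    if sum(q < low_q for q in quals) > 0.5 * n:   # too many low-quality bases
        return False
    if seq.count("N") > 0.1 * n:                  # too many undetermined bases
        return False
    if adapter and adapter in seq:                # exogenous sequence present
        return False
    return seq.startswith(site_prefix)            # must begin with the enzyme site


def depth_table(reads):
    """Count identical reads and drop those seen only once (depth 1)."""
    return {s: d for s, d in Counter(reads).items() if d > 1}


def mismatches(a, b):
    """Number of differing positions between two equal-length reads (no gaps)."""
    return sum(x != y for x, y in zip(a, b))


def cluster_depths(depths, max_mismatch):
    """Single-linkage clustering; returns the sorted summed depth of each cluster."""
    seqs = list(depths)
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            same_len = len(seqs[i]) == len(seqs[j])
            if same_len and mismatches(seqs[i], seqs[j]) <= max_mismatch:
                parent[find(i)] = find(j)

    totals = Counter()
    for i, s in enumerate(seqs):
        totals[find(i)] += depths[s]
    return sorted(totals.values())
```

The all-against-all comparison is quadratic in the number of distinct reads and is kept only for clarity; a practical tool would index the reads before comparing them.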

Figure 12 is a block diagram showing another embodiment of the apparatus for estimating the genomic repeat sequence content of the present invention. According to an embodiment of the present invention, the repeat sequence determining device 116 includes: a clustering result depth statistic unit 1161 for counting the depth of each clustering result; a sequencing depth distribution statistic unit 1162 for counting the number of sequencing sequences at each sequencing depth; and a threshold selecting unit 1163 for selecting a depth value as the threshold according to the depth values and the numbers of sequencing sequences, the sequencing sequences higher than the threshold being determined to be repeat sequences. The depth value chosen as the threshold must satisfy the following condition: the number of sequencing sequences whose depth is smaller than the depth value accounts for 94% to 96% of the number of sequencing sequences in all clustering results. In one embodiment of the invention, the number of sequencing sequences whose depth is smaller than the depth value is 95% of the number of sequencing sequences in all clustering results. For the functions of the various devices or units in Figures 11 and 12, reference may be made to the above description of the corresponding portions of the method embodiment of the present invention, which is not repeated here for the sake of brevity.
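The threshold condition can be illustrated with a short Python sketch. It builds the depth distribution (as unit 1162 does) and scans integer depth values for one at which the fraction of clustering results with a smaller depth falls within 94% to 96%, preferring the value closest to 95%. The function name, the tie-breaking rule (smallest qualifying depth) and the return of `None` when no depth qualifies are assumptions made for illustration.

```python
from collections import Counter


def select_threshold(cluster_depths, low=0.94, high=0.96, target=0.95):
    """Pick a depth value t such that the share of clusters with depth < t
    lies in [low, high], choosing the share closest to `target`."""
    hist = Counter(cluster_depths)      # number of clusters at each depth
    if not hist:
        raise ValueError("no clustering results")
    total = sum(hist.values())

    best = None                         # (depth value, fraction below it)
    below = 0                           # clusters with depth < t
    for t in range(min(hist), max(hist) + 2):
        frac = below / total
        if low <= frac <= high:
            if best is None or abs(frac - target) < abs(best[1] - target):
                best = (t, frac)
        below += hist[t]                # Counter returns 0 for absent depths
    return None if best is None else best[0]
```

With 95 clusters of depth 5 and 5 clusters of depth 50, the sketch returns 6: 95% of the clusters lie below that depth, and the five deep clusters lie above it.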

Those skilled in the art will appreciate that the various devices of Figures 11 and 12 can be implemented by separate computing devices or integrated into a single device. They are shown as boxes in Figures 11 and 12 to illustrate their functions. These functional blocks can be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, one or more of the functional blocks can be implemented with code running on a microprocessor, a digital signal processor (DSP), or any other suitable computing device. The code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements. The code can be located on a computer readable medium. The computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. The computer readable medium can also include a carrier wave that encodes a data signal.

The method and device for estimating the genomic repeat sequence content provided by the present disclosure use the RAD sequencing method to sequence part of the genome and determine the repeat sequences by cluster analysis, thereby achieving the purpose of a genome survey without requiring other genomic data for simulation. This reduces the complexity of genome analysis while saving computing resources and reducing sequencing costs.

Thus far, a method and apparatus for estimating the genomic repeat sequence content according to the present invention have been described in detail. In order to avoid obscuring the concepts of the present invention, some details known in the art have not been described. Those skilled in the art will fully understand how to implement the technical solutions disclosed herein based on the above description.

Although specific embodiments of the invention have been described in detail by way of example, it is to be understood that the above examples are for illustrative purposes only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A method for estimating the genomic repeat sequence content, characterized by comprising:

obtaining RAD single-end sequencing sequences of an individual genome;

filtering the RAD single-end sequencing sequences to remove unqualified sequencing sequences;

counting the sequencing sequences having identical sequences to obtain depth information of each sequencing sequence;

filtering out the sequencing sequences with a sequencing depth of 1;

performing a pairwise alignment that does not allow gaps among the remaining sequencing sequences, and clustering all the sequencing sequences satisfying the alignment conditions;

selecting a threshold according to the depth value of each clustering result, the sequencing sequences higher than the threshold being determined to be repeat sequences; and

obtaining the repeat sequence content of the individual genome based on the determined repeat sequences.

2. The method according to claim 1, characterized in that the number of mismatches allowed in the pairwise alignment that does not allow gaps is determined according to the length of the sequencing sequences.

3. The method according to claim 1, wherein selecting a threshold according to the depth value of each clustering result comprises:

counting the depth value of each clustering result;

further counting the number of sequencing sequences at each sequencing depth; and

selecting a depth value as the threshold according to the depth values and the numbers of sequencing sequences, wherein the depth value chosen as the threshold must satisfy the following condition: the number of sequencing sequences whose depth is smaller than the depth value accounts for 94% to 96% of the number of sequencing sequences in all clustering results.

4. The method according to claim 1, wherein the unqualified sequencing sequences comprise:

a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low-quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or

a sequencing sequence in which the number of bases with an undetermined sequencing result exceeds 10% of the number of bases of the entire sequencing sequence; and/or

a sequencing sequence in which an exogenous sequence is present; and/or

a sequencing sequence whose first several bases are not the restriction endonuclease sequence.

5. The method according to claim 1, wherein obtaining the repeat sequence content of the individual genome according to the determined repeat sequences comprises:

counting the number of clustering results above the threshold and multiplying it by the length of the sequencing sequences to obtain the total length of a single copy of the repeat sequences;

dividing the average sequencing depth of the repeat sequences by the average sequencing depth of the non-repeat sequences to obtain the copy number of the repeat sequences;

multiplying the total length of a single copy of the repeat sequences by the copy number of the repeat sequences to obtain the total length of the repeat sequences at the RAD sequencing positions; and

dividing the total length of the repeat sequences by the sum of the total lengths of the repeat and non-repeat sequences to obtain the repeat sequence content at the RAD sequencing positions, which approximates the repeat sequence content of the individual genome.

6. A device for estimating the genomic repeat sequence content, characterized by comprising:

a sequencing sequence acquisition device for obtaining RAD single-end sequencing sequences of an individual genome;

a sequencing sequence filtering device for filtering the obtained RAD single-end sequencing sequences to remove unqualified sequencing sequences;

a statistics device for identical sequencing sequences, for counting the sequencing sequences having identical sequences to obtain depth information of each sequencing sequence;

a sequence depth filtering device for filtering out the sequencing sequences with a sequencing depth of 1;

a clustering device for performing a pairwise alignment that does not allow gaps among the remaining sequencing sequences, and clustering all the sequencing sequences satisfying the alignment conditions;

a repeat sequence determining device for selecting a threshold according to the depth value of each clustering result, the sequencing sequences higher than the threshold being determined to be repeat sequences; and

a repeat sequence content obtaining device for obtaining the repeat sequence content of the individual genome based on the determined repeat sequences.

7. The device according to claim 6, wherein the number of mismatches allowed in the pairwise alignment that does not allow gaps is determined according to the length of the sequencing sequences.

8. The device according to claim 6, wherein the repeat sequence determining device comprises: a clustering result depth statistic unit for counting the depth value of each clustering result; a sequencing depth distribution statistic unit for counting the number of sequencing sequences at each sequencing depth; and a threshold selecting unit for selecting a depth value as the threshold according to the depth values and the numbers of sequencing sequences, the sequencing sequences higher than the threshold being determined to be repeat sequences; wherein the depth value chosen as the threshold must satisfy the following condition: the number of sequencing sequences whose depth is smaller than the depth value accounts for 94% to 96% of the number of sequencing sequences in all clustering results.

9. The device according to claim 6, wherein the unqualified sequencing sequences comprise:

a sequencing sequence in which the number of bases whose sequencing quality is lower than a predetermined low-quality threshold exceeds 50% of the number of bases of the entire sequencing sequence; and/or

a sequencing sequence in which an exogenous sequence is present; and/or

a sequencing sequence whose first several bases are not the restriction endonuclease sequence.