CN110827920B

CN110827920B - Sequencing data analysis method and equipment and high-throughput sequencing method

Info

Publication number: CN110827920B
Application number: CN201810921895.0A
Authority: CN
Inventors: 刘舒; 刘晨; 刘莉玲; 黄金
Original assignee: Wuhan BGI Medical Laboratory Co., Ltd.
Current assignee: Wuhan BGI Medical Laboratory Co., Ltd.
Priority date: 2018-08-14
Filing date: 2018-08-14
Publication date: 2022-11-22
Anticipated expiration: 2038-08-14
Also published as: CN110827920A

Abstract

The invention relates to the field of gene sequencing, and in particular to a sequencing data analysis method and apparatus and a high-throughput sequencing method. The sequencing data comprise suspected contaminating sequencing reads that contain a linker matching region. The method comprises: determining an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising a linker matching region and a linker adjacent region; determining a linker adjacent region corresponding sequence based on the sequence of the corresponding sequencing read of the suspected contaminating sequencing read; and determining whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region. With the method and apparatus, linker-contaminated sequencing reads can be removed effectively and comprehensively, the base balance of the data after linker contamination filtering is ensured, and the accuracy of the data can be improved.

Description

Sequencing data analysis method and equipment and high-throughput sequencing method

Technical Field

The invention relates to the field of gene sequencing, in particular to a sequencing data analysis method and equipment and a high-throughput sequencing method.

Background

After raw second-generation sequencing data come off the sequencer, they are usually filtered before use. The filtering includes removing linker-contaminated reads, low-quality reads, reads containing N bases, and the like.

Linker contamination arises when some of the inserts in the constructed library are shorter than the sequencing read length, so that linker sequence is read at the end of the read; reads containing linker sequence are linker-contaminated reads. Because the linker (adaptor) sequence is not part of the actual insert in the sample, it must be removed after sequencing is completed so that it does not affect the base randomness of the sample or the accuracy of information analysis.

However, methods for filtering linker-contaminated reads still need improvement.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

In the course of their research, the inventors found the following:

In the existing approach, the base sequence at the tail end of a raw read is compared with the linker sequence. If, tolerating one base mismatch, the match covers more than 50% of the linker, the read is regarded as linker-contaminated and is removed entirely. This way of filtering linker-contaminated reads has several problems and disadvantages, as follows:

First, this filtering method cannot remove all linker-contaminated reads. For example, when the linker is 34 bp long, only reads whose match to the linker is longer than 16 bp (one base mismatch tolerated, more than 50% matched) are filtered; reads contaminated with less than 15 bp of linker cannot be removed.
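The conventional rule described above can be sketched as follows. This is a minimal illustration only: it assumes the linker appears as a suffix-to-prefix overlap at the read tail, and the function names and the 34 bp linker string are hypothetical, not taken from the patent.

```python
def tail_linker_overlap(read: str, linker: str, max_mismatches: int = 1) -> int:
    """Longest k such that the last k bases of `read` match the first k bases
    of `linker` with at most `max_mismatches` mismatches."""
    for k in range(min(len(read), len(linker)), 0, -1):
        mismatches = sum(a != b for a, b in zip(read[-k:], linker[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def conventional_filter_flags(read: str, linker: str) -> bool:
    """Conventional rule: flag the read when the overlap exceeds 50% of the linker."""
    return tail_linker_overlap(read, linker) / len(linker) > 0.5


# Hypothetical 34 bp linker, used only for illustration.
LINKER = "ACGTTGCAAGCTTCGGATCCGTAACTGGCATTAG"
read_with_20bp = "T" * 30 + LINKER[:20]  # 20 of 34 bp matched: above the threshold
read_with_10bp = "T" * 40 + LINKER[:10]  # 10 of 34 bp matched: below the threshold
print(conventional_filter_flags(read_with_20bp, LINKER))  # True
print(conventional_filter_flags(read_with_10bp, LINKER))  # False
```

The second read still carries 10 bp of linker but passes the filter, which is the gap the first problem describes.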

Moreover, simply lowering the required matching level (e.g., to more than 25%, i.e., flagging reads that match 8 bp or more of the linker) may cause the sample's own sequences to be filtered out, because an 8 bp sequence has poor specificity (an 8 bp sequence matches a large number of genomic sequences). Normal reads (i.e., reads without linker contamination) may thus be removed by mistake. Filtering therefore becomes inaccurate, and reads that match fewer than 8 bp of the linker still cannot be filtered out.
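The poor specificity of an 8 bp match can be made concrete with a rough estimate. The sketch below assumes uniformly random, independent bases and a genome of about 3 × 10^9 bp; both are simplifying assumptions introduced here, not figures from the patent.

```python
def expected_random_hits(kmer_length: int, genome_size: int) -> float:
    """Expected number of positions at which a fixed k-mer occurs by chance,
    assuming each base is A, C, G or T with equal probability."""
    return genome_size / 4 ** kmer_length


GENOME_SIZE = 3_000_000_000  # assumed genome size, for illustration only

print(4 ** 8)                                  # 65536 possible 8-mers
print(expected_random_hits(8, GENOME_SIZE))    # tens of thousands of chance hits
print(expected_random_hits(17, GENOME_SIZE))   # well below one chance hit
```

Under these assumptions an 8 bp linker fragment is expected to occur by chance tens of thousands of times in the genome, whereas a 17 bp fragment is expected to occur less than once.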

Second, this filtering method can cause base separation in the filtered sequencing data. Because some linker-contaminated reads remain in the filtered data, and the linker is an exogenous fixed sequence, the base balance of the sample genome is broken, so that the A content differs from the T content and the C content differs from the G content.
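One way to observe base separation is to tabulate the base fractions at each sequencing cycle. The following is a minimal sketch with toy reads; the function name and the data are hypothetical and serve only to show how a fixed exogenous tail skews the per-cycle composition.

```python
from collections import Counter


def per_cycle_base_fractions(reads):
    """Fraction of A, C, G and T at each position, over the shortest read length."""
    length = min(len(read) for read in reads)
    fractions = []
    for i in range(length):
        counts = Counter(read[i] for read in reads)
        total = sum(counts.values())
        fractions.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return fractions


balanced = ["ACGT", "TGCA", "CATG", "GTAC"]
# Same reads, but the last cycle is replaced by a fixed base, as a linker tail would do.
with_fixed_tail = ["ACGG", "TGCG", "CATG", "GTAG"]

print(per_cycle_base_fractions(balanced)[3])
print(per_cycle_base_fractions(with_fixed_tail)[3])
```

In the balanced set every base has a fraction of 0.25 at the last cycle; with the fixed tail, G reaches 1.0 and C drops to 0.0, so the C and G curves separate.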

Third, the accuracy and alignment rate of the sequencing data are affected. Because some linker-contaminated reads remain in the filtered data, and the exogenous linker sequence they introduce cannot be matched to the reference genome, the accuracy and alignment rate of the sequencing data are reduced.

Therefore, the inventors designed a method that uses the insert sequence and a sliding-match principle to remove linker-contaminated reads effectively and comprehensively, ensuring the base balance of the sequencing data and improving the accuracy and alignment rate of the data. The method no longer relies solely on the linker sequence to identify linker-contaminated reads, and the sequencing data finally obtained with it are more accurate and have a higher alignment rate.

To this end, according to a first aspect of the invention, there is provided a sequencing data analysis method, the sequencing data comprising suspected contaminating sequencing reads, the suspected contaminating sequencing reads comprising a linker matching region, the method comprising: determining an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region; determining a linker adjacent region corresponding sequence based on the sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the linker adjacent region corresponding sequence being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived respectively from the two complementary strands of the same insert; and determining whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.

By determining with the method of the invention whether sequencing reads are contaminated with a linker, linker-contaminated reads can be removed during high-throughput sequencing. The method removes linker-contaminated reads effectively and comprehensively, improves the accuracy of the sequencing data finally obtained, and ensures the base balance of those data.

According to embodiments of the invention, the above sequencing data analysis method may further have the following additional technical features:

According to an embodiment of the present invention, in the above sequencing data analysis method, the length of the linker matching region is less than 50% of the length of the linker. The method of the invention is particularly useful for identifying and removing linker-contaminated reads when the length of the linker matching region is less than 50% of the linker length, and all linker-contaminated reads can be identified in a very short time when the data are analyzed with the method.

According to an embodiment of the present invention, in the above sequencing data analysis method, the level of matching of the linker matching region to the linker is less than 50%.

According to an embodiment of the present invention, in the above sequencing data analysis method, the level of matching of the linker matching region to the linker is less than 50% when a single base mismatch is tolerated.

According to an embodiment of the present invention, in the above sequencing data analysis method, the total length of the linker matching region and the linker adjacent region is 19 to 24 bp.

According to an embodiment of the present invention, in the above sequencing data analysis method, the total length of the linker matching region and the linker adjacent region is 20 bp.

According to an embodiment of the present invention, in the above sequencing data analysis method, the length of the sequencing read is 150 to 250bp.

According to an embodiment of the invention, in the above sequencing data analysis method, the sequencing reads are DNA.

According to an embodiment of the present invention, in the above sequencing data analysis method, before determining the sequence corresponding to the linker-adjacent region, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if one base mismatch is tolerated and the matching level is above 50%, it is determined that the suspected contaminating sequencing reads are contaminated by the linker.

According to an embodiment of the present invention, in the above sequencing data analysis method, if the linker adjacent region corresponding sequence completely matches the linker adjacent region, it is determined that the suspected contaminating sequencing read is contaminated with the linker.

According to a second aspect of the present invention, there is provided a sequencing data analysis apparatus, the sequencing data including suspected contaminating sequencing reads, the suspected contaminating sequencing reads containing a linker matching region, the sequencing data analysis apparatus comprising:

a window sequence determination module that determines an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region;

a corresponding sequence determination module connected to the window sequence determination module, the corresponding sequence determination module determining a linker adjacent region corresponding sequence based on the sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the linker adjacent region corresponding sequence being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived respectively from the two complementary strands of the same insert; and

a contaminated read determination module connected to the corresponding sequence determination module, the contaminated read determination module determining whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.

According to embodiments of the present invention, the above sequencing data analysis apparatus may further have the following additional technical features:

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the length of the linker matching region is less than 50% of the length of the linker.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the level of matching of the linker-matching region to the linker is less than 50%.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the level of matching of the linker matching region to the linker is less than 50% when a single base mismatch is tolerated.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the total length of the linker matching region and the linker adjacent region is 19 to 24 bp.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the total length of the linker matching region and the linker adjacent region is 20 bp.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the length of the sequencing read is 150 to 250bp.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the sequencing reads are DNA.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, before determining the sequence corresponding to the linker-adjacent region, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if one base mismatch is tolerated and the matching level is 50% or more, it is determined that the suspected contaminating sequencing reads are contaminated with a linker.

According to an embodiment of the present invention, in the above sequencing data analysis apparatus, if the linker adjacent region corresponding sequence completely matches the linker adjacent region, it is determined that the suspected contaminating sequencing read is contaminated with the linker.

According to a third aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the method according to any of the embodiments of the first aspect of the present invention.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to any of the embodiments of the first aspect of the present invention.

According to a fifth aspect of the present invention, there is provided a high-throughput sequencing method comprising: constructing a sequencing library of a sample to be tested, thereby obtaining raw data of the sample to be tested; and determining, based on the sequencing data analysis method according to any embodiment of the first aspect of the invention, whether the sequencing reads are contaminated with a linker, so as to filter the raw data of the sample to be tested and obtain filtered sequencing data.

According to embodiments of the present invention, the high-throughput sequencing method described above may further have the following additional technical features:

According to an embodiment of the present invention, the high-throughput sequencing method described above further includes:

comparing the linker matching region of the suspected contaminating sequencing read with the linker sequence and, if the matching level is more than 50% with one base mismatch tolerated, determining that the suspected contaminating sequencing read is contaminated with the linker and directly filtering out the sequencing read.

According to an embodiment of the present invention, the above high throughput sequencing method further comprises removing low quality sequencing reads.

According to an embodiment of the present invention, the above high throughput sequencing method further comprises: aligning the filtered sequencing data to a reference genome.

The beneficial effects obtained by the invention are as follows: with the method and apparatus, linker-contaminated sequencing reads can be removed effectively and comprehensively, the base balance of the data after linker contamination filtering is ensured, and the accuracy of the data can be improved.

Drawings

FIG. 1 is a schematic diagram of a sequencing data analysis apparatus according to an embodiment of the present invention.

FIG. 2 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.

FIG. 3 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.

FIG. 4 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.

FIG. 5 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.

FIG. 6 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.

FIG. 7 is a flow diagram of a method for filtering linker-contaminated reads, according to one embodiment of the invention.

FIG. 8 is a base distribution plot of raw off-machine data provided in accordance with an embodiment of the present invention.

FIG. 9 is a base distribution plot obtained after filtering out sequencing reads whose match to the linker exceeds 50% with one base mismatch tolerated, according to an embodiment of the present invention.

FIG. 10 is a base distribution plot obtained after filtering by the method of the present invention, according to an embodiment of the present invention.

FIG. 11 is a base distribution plot of raw off-machine data provided in accordance with an embodiment of the present invention.

FIG. 12 is a base distribution plot obtained after filtering out sequencing reads whose match to the linker exceeds 50% with one base mismatch tolerated, according to an embodiment of the present invention.

FIG. 13 is a base distribution plot obtained after filtering by the method of the present invention, according to an embodiment of the present invention.

FIG. 14 is a base distribution plot of raw off-machine data provided in accordance with an embodiment of the present invention.

FIG. 15 is a base distribution plot obtained after filtering out sequencing reads whose match to the linker exceeds 50% with one base mismatch tolerated, according to an embodiment of the present invention.

FIG. 16 is a base distribution plot obtained after filtering by the method of the present invention, according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary; they are intended to illustrate the invention and are not to be construed as limiting it.

Certain terms are described and explained herein in order to facilitate understanding by those skilled in the art, and it is to be understood that such descriptions and illustrations are not to be construed as limiting the invention.

Reads, also known as sequencing reads, are used herein to refer to nucleic acid fragments obtained during the pooling process, as is commonly understood by those skilled in the art.

Herein, linker-contaminated reads, also referred to as adaptor-contaminated reads, arise when the insert obtained during library construction is shorter than the sequencing read length, so that linker sequence is read at the end of the read; a read containing linker sequence is a linker-contaminated read.

Herein, the "linker matching region" (also called the adaptor matching region) refers to the fragment at the end of a read that is derived from the linker sequence. The linker matching region corresponds to a portion of the linker sequence.

Herein, the "linker adjacent region" refers to the partial fragment of the read that is contiguous with the 5' end of the linker matching region. The linker adjacent region is derived from the nucleic acid fragment of the sample to be tested.

As used herein, the term "analysis window sequence" refers to a sequence used to analyze sequencing reads for contamination, including linker matching regions and linker adjacent regions.

Sequencing data analysis method

According to one aspect of the invention, there is provided a sequencing data analysis method, the sequencing data comprising suspected contaminating sequencing reads, the suspected contaminating sequencing reads comprising a linker matching region, the method comprising: determining an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region; determining a linker adjacent region corresponding sequence based on the sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the linker adjacent region corresponding sequence being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived respectively from the two complementary strands of the same insert; and determining whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.

To facilitate understanding, the sequencing data analysis method of the present invention is illustrated with reference to FIG. 5, in which read1 is the suspected contaminating sequencing read to be analyzed and the analysis window sequence is determined from read1. The 3' end of read1 is the linker matching region (A), and the linker adjacent region (I1) is contiguous with the 5' end of the linker matching region. Read2 is the corresponding sequencing read of the suspected contaminating sequencing read; read2 and read1 are derived from the two complementary strands of the same insert, i.e., insert 1 and insert 2 are the two complementary strands of the same double-stranded DNA fragment. The linker adjacent region corresponding sequence (I2) is determined according to the length of the linker adjacent region; it is located at the 5' end of the corresponding sequencing read and has the same length as the linker adjacent region. If I1 and I2 are fully reverse complementary, read1 is contaminated with the linker.

The sequencing data analysis method provided by the invention departs from the principle of setting linker contamination filtering parameters according to the number of conserved bases of the linker sequence, and instead analyzes the sequencing data by means of the random insert sequences in the actual sequencing data.

The inventors classified and analyzed linker-contaminated reads and divided them into the following cases:

In the first case, the insert is very short, and the linker-contaminated read contains almost the full length of the linker sequence, as shown in FIG. 2.

In the second case, the insert is short, and the linker-contaminated read contains a partial fragment of the linker sequence that accounts for more than 50% of the linker sequence, as shown in FIG. 3.

In the third case, the insert is only slightly shorter than the sequencing read length, and the linker-contaminated read contains only a small fragment of the linker sequence, as shown in FIG. 4.

In the first and second cases, if the length of the linker matching region is found to be more than 50% of the length of the linker with a single base mismatch tolerated, the suspected contaminating sequencing read can be determined directly to be linker-contaminated and can be removed. Therefore, in a preferred embodiment of the present invention, before the linker adjacent region corresponding sequence is determined, the linker matching region of the suspected contaminating sequencing read is aligned directly with the linker sequence, and if the matching level is above 50% with one base mismatch tolerated, the read is determined to be contaminated with the linker. The subsequent steps, namely determining the linker adjacent region, determining the linker adjacent region corresponding sequence, and determining whether the read is contaminated based on the level of matching between the two, then need not be performed. In this way the sequencing accuracy is guaranteed while computing resources are saved to the greatest extent and operating efficiency is improved.

In the third case, the length of the linker matching region is less than 50% of the length of the linker; few bases are contaminated by the linker, and the above approach cannot remove such linker-contaminated reads. For this case, the present invention determines whether the suspected contaminating sequencing read is contaminated with the linker from the level of matching between the linker adjacent region and the linker adjacent region corresponding sequence.

That is, as shown in FIG. 5, when the length of the linker sequence contained in a read is less than 50% of the length of the linker, the present invention sets the filtering condition using the reverse-complementary relationship between the insert sequences of read1 and read2:

(1) In read1, starting from the position (A) that matches the linker sequence, select the linker matching region and then continue into the insert sequence, stopping when the total length reaches 20 bp, to obtain the I1 sequence (the I1 sequence is the part of the insert sequence at the 3' end of read1);

(2) Obtain an I2 sequence of the same length as the I1 sequence (read, starting from the 5' end of read2, an insert sequence of the same length as I1);

(3) Screen for linker-contaminated reads:

1) If the I1 and I2 sequences are completely reverse complementary, the read is linker-contaminated and is filtered out; that is, such linker-contaminated reads are removed directly.

2) If the I1 and I2 sequences are not reverse complementary, the data are retained and returned to the original sequencing data for subsequent data analysis.
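Steps (1) to (3) can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the length of the linker matching region is assumed to be already known from the first comparison with the linker, and the example sequences are made up.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (N is kept as N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_linker_contaminated(read1: str, read2: str,
                           linker_match_len: int, window_len: int = 20) -> bool:
    """Compare I1 (insert bases just 5' of the linker matching region in read1)
    with I2 (the same number of bases at the 5' end of read2)."""
    adjacent_len = window_len - linker_match_len
    insert_end = len(read1) - linker_match_len  # where region A starts in read1
    if adjacent_len <= 0 or insert_end < adjacent_len or len(read2) < adjacent_len:
        raise ValueError("window does not fit the reads")
    i1 = read1[insert_end - adjacent_len:insert_end]  # step (1)
    i2 = read2[:adjacent_len]                         # step (2)
    return i1 == reverse_complement(i2)               # step (3)


# Made-up 30 bp insert sequenced with a 36 bp read length: 6 bp of linker are read.
insert = "ACGTTGCAAGCTTCGGATCCGTAACTGGCA"
read1 = insert + "AAGTCG"                      # hypothetical linker prefix at the tail
read2 = reverse_complement(insert) + "CCTAGG"  # mate read from the other strand
print(is_linker_contaminated(read1, read2, linker_match_len=6))  # True

# A 50 bp fragment whose read1 merely ends in the same 6 bases: no contamination.
fragment = insert + "AAGTCG" + "T" * 14
print(is_linker_contaminated(fragment[:36], reverse_complement(fragment)[:36], 6))  # False
```

Both example pairs have the same read1, so a linker-only test cannot tell them apart; only the comparison with read2 separates the contaminated pair from the normal one.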

If the I1 and I2 sequences are not reverse complementary, as shown in FIG. 6, the tail sequence A of read1 merely happens to be identical to a few bases of the linker sequence. The first analysis flags the read as suspected linker contamination, but the second analysis shows that the read is not linker-contaminated: the insert is longer than the sequencing read length, so no linker is sequenced, and I1 and I2 are not reverse complementary. Such data are retained; that is, this filtering does not mistakenly remove normal reads.

Therefore, the sequencing data analysis method provided by the invention is particularly suitable for analyzing suspected contaminating sequencing reads in which the matched linker length is less than 50% of the length of the linker.

In one embodiment of the invention, the suspected contaminating sequencing read is determined to be contaminated with the linker if the linker adjacent region corresponding sequence completely matches the linker adjacent region.

According to an embodiment of the invention, the sequencing reads are 150-250 bp in length. According to an embodiment of the invention, the sequencing reads are DNA.

In addition, when a suspected contaminating sequencing read is being judged, selecting too long a region from the linker matching region toward the 5' end of the insert sequence (i.e., too long a linker adjacent region) consumes computing resources without any practical benefit for filtering, whereas too short a region may cause too many false matches. Therefore, to find an optimal value, the total length of the linker matching region and the linker adjacent region was tested over the range of 19 to 25 bp:

In the experiment, the off-machine data of the same library were used as the experimental data sample, and total lengths of 19 bp, 20 bp, 21 bp, 22 bp, 23 bp, 24 bp and 25 bp were each tested for filtering. The results are shown in the following table:

TABLE 1 results for different sliding intervals

In Table 1, the fraction of reads with a linker refers to the ratio of detected linker-containing reads to total sequencing reads; the low-quality read fraction refers to the ratio of detected low-quality reads to total sequencing reads; the fraction of completely duplicated reads refers to the ratio of reads arising from PCR duplicates to total sequencing reads; and the fraction of reads containing N refers to the ratio of reads in which N accounts for more than 5% of the bases to total sequencing reads. The alignment rate on the genome, also called the genome mapping rate, refers to the proportion of the data that map to the genome. CPU core time (min) refers to the total time required for contamination analysis of these sequences.

As can be seen from the results in Table 1, good results were obtained throughout the range of 19 bp to 24 bp. Taking both the computing resources and the final filtering effect into account, 20 bp can be selected as the optimal total length of the linker matching region and the linker adjacent region.

Thus, in one embodiment of the invention, the total length of the linker matching region and the linker adjacent region is 19 to 24 bp. In a preferred embodiment of the present invention, the total length of the linker matching region and the linker adjacent region is 20 bp.
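The trade-off behind the window length can be illustrated with a rough estimate of how often two unrelated sequences would pass the reverse-complement test by chance. The sketch assumes uniformly random, independent bases, which is a simplification introduced here and not a model stated in the patent; the function name is hypothetical.

```python
def chance_match_probability(window_len: int, linker_match_len: int) -> float:
    """Probability that two unrelated sequences of the adjacent-region length
    are exactly reverse complementary, assuming uniformly random bases."""
    adjacent_len = window_len - linker_match_len
    if adjacent_len <= 0:
        raise ValueError("the window must be longer than the linker matching region")
    return 0.25 ** adjacent_len


# With a 20 bp window and an 8 bp linker matching region, 12 bp of insert are compared.
print(chance_match_probability(20, 8))
# A shorter window leaves fewer insert bases, so chance matches become more likely.
print(chance_match_probability(19, 8) > chance_match_probability(25, 8))  # True
```

Under this assumption the chance of a false positive falls by a factor of four for every additional insert base compared, while each extra base adds comparison work, which is consistent with choosing a moderate window length.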

Sequencing data analysis equipment

According to an aspect of the present invention, there is provided a sequencing data analysis apparatus, wherein the sequencing data include suspected contaminating sequencing reads and the suspected contaminating sequencing reads contain a linker matching region. As shown in FIG. 1, the sequencing data analysis apparatus includes a window sequence determination module 201, a corresponding sequence determination module 202, and a contaminated read determination module 203, which are connected in sequence. The window sequence determination module determines an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region. The corresponding sequence determination module determines a linker adjacent region corresponding sequence based on the sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the linker adjacent region corresponding sequence being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived respectively from the two complementary strands of the same insert. The contaminated read determination module determines whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.
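The three modules connected in sequence can be pictured as three functions whose outputs feed one another. The following is a minimal sketch only: the class and function names are hypothetical, the 20 bp default follows the preferred embodiment, and the length of the linker matching region is assumed to be supplied by an earlier comparison with the linker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisWindow:
    linker_adjacent_region: str  # I1, taken from the insert
    linker_matching_region: str  # A, the tail of the read


def window_sequence_module(read: str, linker_match_len: int,
                           window_len: int = 20) -> AnalysisWindow:
    """Module 201: cut the analysis window from the 3' end of the read."""
    if not 0 < linker_match_len < window_len or len(read) < window_len:
        raise ValueError("window does not fit the read")
    split = len(read) - linker_match_len
    return AnalysisWindow(read[len(read) - window_len:split], read[split:])


def corresponding_sequence_module(window: AnalysisWindow, mate_read: str) -> str:
    """Module 202: take the same number of bases from the 5' end of the mate read."""
    return mate_read[:len(window.linker_adjacent_region)]


def contaminated_read_module(window: AnalysisWindow, corresponding: str) -> bool:
    """Module 203: contaminated if the two sequences are exactly reverse complementary."""
    complement = str.maketrans("ACGT", "TGCA")
    return window.linker_adjacent_region == corresponding.translate(complement)[::-1]


def analyze(read: str, mate_read: str, linker_match_len: int) -> bool:
    window = window_sequence_module(read, linker_match_len)
    corresponding = corresponding_sequence_module(window, mate_read)
    return contaminated_read_module(window, corresponding)
```

Keeping the modules separate mirrors the apparatus in FIG. 1 and lets the matching criterion in the last module be changed without touching how the window is cut.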

In one embodiment of the invention, the sequencing data analysis apparatus of the invention is adapted for analysis of suspected contaminating sequencing reads having a linker matching region length of less than 50% of the linker length.

It should be noted that, the above description of the advantages and technical features of the sequencing data analysis method in any embodiment of the present invention is also applicable to the sequencing data analysis apparatus in this embodiment of the present invention, and will not be repeated herein.

High throughput sequencing method

The sequencing data analysis method or the sequencing data analysis apparatus can be used to filter out linker contamination sequences in a high-throughput sequencing process. According to an embodiment of the invention, the high-throughput sequencing method comprises: constructing a sequencing library and thereby obtaining sequencing data; and determining, based on the sequencing data analysis method, whether a sequencing read is contaminated with a linker, so that the sequencing data are filtered and filtered sequencing data are obtained.

According to an embodiment of the invention, the linker matching region of a suspected contaminating sequencing read is compared with the linker sequence; if there is one base mismatch and the matching level is more than 50%, the suspected contaminating sequencing read is determined to be contaminated with the linker and is directly filtered out.

The sequencing data are obtained by preparing a sequencing library from the nucleic acid of a sample to be tested and sequencing it on a sequencer. According to an embodiment of the invention, obtaining the sequencing data comprises: obtaining nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the sequencing library. The sequencing library is prepared according to the requirements of the selected sequencing method. Depending on the sequencing platform, the sequencing method can be selected from, but is not limited to, the HiSeq 2000/2500 platform of Illumina, the Ion Torrent platform of Life Technologies, the BGISEQ platform of BGI, and single-molecule sequencing platforms. The sequencing mode can be single-end or paired-end sequencing, and the resulting off-line data are sequenced fragments, called reads.

When matching or aligning sequences, known alignment software such as SOAP, BWA, or TeraMap can be used, but the present embodiment is not limited thereto. During alignment, depending on the alignment parameter settings, at most n base mismatches are allowed for a pair of reads or for a single read, where n is set to, for example, 1 or 2. If more than n bases in a read are mismatched, the pair of reads is considered unable to align to the reference sequence; alternatively, if all n mismatched bases are located in one read of the pair, that read is considered unable to align to the reference sequence.

The scheme of the invention will be explained with reference to the following examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention. Where specific techniques or conditions are not indicated in the examples, they are carried out according to the techniques or conditions described in the literature in the art or according to the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.

Example 1

Example 1 provides a method of filtering linker-contaminated reads, as shown in fig. 7, comprising the following steps:

1) Screening the raw off-line sequencing data according to whether the tail of a read can be matched to the linker sequence;

2) For reads that can be matched to the linker, performing a first filtering step according to the criterion of one base mismatch and a matching degree of more than 50%;

3) For reads whose match to the linker sequence is at least 1 bp long and whose matching degree is less than 50%, performing a second filtering step: sliding from the linker sequence to a window 20 bp in length, and checking whether I1 (the insert of corresponding length at the tail of the slid Read 1) and I2 (the insert of corresponding length read from the 5' end of Read 2) are reverse complementary;

4) The filtration is complete.
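The steps above can be sketched, purely for illustration, as a per-pair classification routine. All names are hypothetical. The sketch assumes that the matching degree is the matched tail length divided by the linker length, that one mismatch is tolerated only for matches longer than half the linker (shorter matches must be exact, so that a single mismatched base is not counted as a match), and that a match of exactly 50% is passed to the second step; none of these details is fixed by the example itself.

```python
# Illustrative two-step filter sketch; thresholds and names are assumptions.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def tail_match_length(read, linker, max_mismatch=1):
    """Longest k such that the last k bases of `read` match the first k
    bases of `linker`; returns 0 when the tail does not match at all."""
    half = len(linker) / 2
    for k in range(min(len(read), len(linker)), 0, -1):
        allowed = max_mismatch if k > half else 0  # assumption, see lead-in
        if mismatches(read[-k:], linker[:k]) <= allowed:
            return k
    return 0


def classify_pair(read1, read2, linker, window_len=20):
    """Return 'keep', 'filtered_step1' or 'filtered_step2' for Read 1."""
    k = tail_match_length(read1, linker)           # step 1): screen the tail
    if k == 0:
        return "keep"
    if k / len(linker) > 0.5:                      # step 2): first filter
        return "filtered_step1"
    adjacent_len = window_len - k                  # step 3): 20 bp window
    if adjacent_len <= 0 or len(read1) < window_len:
        return "keep"                              # window cannot be formed
    split = len(read1) - k
    i1 = read1[split - adjacent_len:split]         # insert at tail of Read 1
    i2 = read2[:adjacent_len]                      # insert at 5' end of Read 2
    if reverse_complement(i1) == i2:
        return "filtered_step2"
    return "keep"
```

A read whose last base happens to equal the first base of the linker is therefore examined in the second step, but it is removed only if the adjacent insert bases are confirmed by the paired read.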

An exon library (library A) was sequenced using the method of the invention. The sequencing strategy was: HiSeq X Ten, PE150.

The base composition plot of the raw off-line data is shown in FIG. 8, which shows the percentage of each base at each position along the read. The abscissa is the position along the read (position along reads), i.e., the ordinal position of the base within the read; the ordinate is the percentage of the corresponding base (A, T, C, G) at each position of the read. It can be seen that, because some inserts are shorter than the sequencing read length, the fixed linker sequence is read, and this linker contamination causes base separation at the tail of the plot.
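For illustration only, the per-position base percentages plotted in such a figure could be computed along the following lines. The function name is hypothetical, and the sketch simply assumes the reads are available as plain strings; it is not the plotting procedure used to produce the figures.

```python
from collections import Counter


def base_composition(reads):
    """Return one dict per read position giving the percentage of each
    base (A, C, G, T, N) observed at that position across `reads`."""
    if not reads:
        return []
    counts = [Counter() for _ in range(max(len(r) for r in reads))]
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    composition = []
    for counter in counts:
        total = sum(counter.values())  # reads that reach this position
        composition.append({b: 100.0 * counter[b] / total for b in "ACGTN"})
    return composition
```

With random inserts the four percentages stay roughly constant along the read, whereas read-through into a fixed linker sequence makes one base dominate at each tail position, which is the base separation described above.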

Exon library A was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads with one base mismatch and a matching degree of more than 50%); the resulting base composition plot is shown in fig. 9, and the alignment rate (Mapping rate) was 94.15%. Library A was also filtered by the linker contamination filtering method of the invention; the resulting base composition plot is shown in fig. 10, and the alignment rate was 99.52%. Here, the alignment rate is obtained by aligning the sequencing data, after the exon library has been filtered by each method, to the hg19 reference genome. The higher the alignment rate, the better the removal of linker contamination sequences.

The above experimental data show that, after library A is filtered by the conventional linker contamination filtering method, base separation is reduced but still present in approximately the last 15 bp of the base composition plot. This again shows that reads whose linker match is below 50% cannot be effectively removed by that method (the linker of this library is 34 bp, so a match of less than 50% with one base mismatch is shorter than 16 bp), which affects base randomness. At the same time, because of the residual exogenous linker sequence, the Mapping rate of the library against the reference genome is only 94.15%.

After library A is filtered by the linker contamination filtering method of the invention, the base composition plot clearly returns to a balanced state, i.e., there is no bifurcation at the tail; at the same time, the alignment rate increases to 99.52%. The data show a clear advantage in both base balance and alignment rate.

Example 2

A transcriptome library (library B) was sequenced using the method of the invention. Sequencing strategy: HiSeq 4000, PE150.

The base composition plot of the raw off-line data is shown in FIG. 11. It can be seen that, because some inserts are shorter than the sequencing read length, the fixed linker sequence is read, and this linker contamination causes base separation at the tail of the plot.

Library B was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads with one base mismatch and a matching degree of more than 50%); the resulting base composition plot is shown in FIG. 12, and the alignment rate was 90.36%.

Library B was also filtered by the linker contamination filtering method of the invention; the resulting base composition plot is shown in FIG. 13, and the alignment rate was 93.59%.

The above data show that, after library B is filtered by the conventional linker contamination filtering method, base separation is still present at the tail of the base composition plot, and the Mapping rate of the library against the reference genome is only 90.36%.

After library B is filtered by the linker contamination filtering method of the invention, the base composition plot clearly returns to a balanced state, and the alignment rate increases to 93.59%. The data show a clear advantage in both base balance and alignment rate.

Example 3

A high-GC library (library C) was sequenced using the method of the invention. Sequencing strategy: NovaSeq, PE150.

The base composition plot of the raw off-line data is shown in FIG. 14. It can be seen that, because some inserts are shorter than the sequencing read length, the fixed linker sequence is read, and this linker contamination causes base separation beginning at about 50 bp from the tail.

Library C was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads with one base mismatch and a matching degree of more than 50%); the resulting base composition plot is shown in FIG. 15, and the alignment rate was 92.43%.

Library C was also filtered by the linker contamination filtering method of the invention; the resulting base composition plot is shown in FIG. 16, and the alignment rate was 99.87%.

The above data show that, after library C is filtered by the conventional linker contamination filtering method, base separation is still present at the tail of the base composition plot, whereas after library C is filtered by the linker contamination filtering method of the invention, the base composition plot clearly returns to a balanced state. The data show that the linker contamination filtering of the invention is clearly superior in terms of base balance.

As can be seen from Examples 1 to 3, the method and device for filtering linker contamination provided by the present invention are suitable for linker contamination data from different sequencing platforms and different library types. The method can effectively and comprehensively filter out linker-contaminated reads, ensure the base balance of the data after linker contamination filtering, and improve the accuracy of the sequencing data; compared with the conventional linker contamination filtering method, it has obvious advantages.

In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through intervening media. Also, a first feature being "on," "over," or "above" a second feature may mean that the first feature is directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "beneath," or "below" a second feature may mean that the first feature is directly under or obliquely under the second feature, or may simply mean that the first feature is at a lower level than the second feature.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples, and features of various embodiments or examples, described in this specification can be combined by one skilled in the art provided they are not mutually inconsistent.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method of sequencing data analysis, wherein the sequencing data comprises suspected contaminating sequencing reads that contain a linker matching region, the method comprising:

determining an analysis window sequence based on the sequence of the suspected contaminating sequencing reads, the analysis window sequence comprising:

a linker matching region; and

a linker adjacent region that is contiguous with the 5' end of the linker matching region;

determining a linker-adjacent region corresponding sequence based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing reads, the linker-adjacent region corresponding sequence being located at a 5' end of the corresponding sequencing read and having a same length as the linker-adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived from two complementary strands of a same insert, respectively;

determining whether the suspected contaminating sequencing reads are contaminated with a linker based on a level of matching of the linker adjacent region corresponding sequence to the linker adjacent region.

2. The sequencing data analysis method of claim 1, wherein the linker matching region is less than 50% of the length of the linker.

3. The method for sequencing data analysis of claim 1, wherein the level of matching of the linker-matching region to the linker is less than 50%.

4. The method for sequencing data analysis according to claim 1, wherein the level of matching of the linker matching region to the linker is less than 50% with one base mismatch.

5. The sequencing data analysis method of claim 1 or 2, wherein the total length of the linker matching region and the linker adjacent region is 19 to 24bp.

6. The sequencing data analysis method of claim 1, wherein the total length of the linker-matching region and the linker-adjacent region is 20bp.

7. The sequencing data analysis method of claim 1, wherein the sequencing reads are 150 to 250bp in length.

8. The method for sequencing data analysis of claim 1, wherein the sequencing reads are DNA.

9. The method of claim 1, wherein the linker matching region of the suspected contaminating sequencing reads is aligned with the linker before determining the sequence corresponding to the linker-adjacent region, and if there is one base mismatch and the matching level is above 50%, it is determined that the suspected contaminating sequencing reads are contaminated with the linker and the following operations are not performed:

determining a sequence corresponding to a linker-adjacent region based on a sequence of a corresponding sequencing read of the suspected-contaminated sequencing read and determining whether the suspected-contaminated sequencing read is contaminated with a linker based on a level of matching of the sequence corresponding to the linker-adjacent region with the linker-adjacent region.

10. The method for sequencing data analysis of claim 1, wherein the suspected contaminating sequencing read is determined to be contaminated with the linker when the linker adjacent region corresponding sequence completely matches the linker adjacent region.

11. A sequencing data analysis device, wherein the sequencing data comprises suspected contaminating sequencing reads that contain a linker matching region, the sequencing data analysis device comprising:

a window sequence determination module that determines an analysis window sequence based on the sequence of the suspected contaminating sequencing reads, the analysis window sequence comprising:

a linker matching region; and

a linker adjacent region that is contiguous with the 5' end of the linker matching region;

a corresponding sequence determination module, connected to the window sequence determination module, that determines, based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing reads, a linker adjacent region corresponding sequence that is located at the 5' end of the corresponding sequencing read and has the same length as the linker adjacent region, wherein the corresponding sequencing read and the suspected contaminating sequencing read are derived from two complementary strands of the same insert, respectively;

a contamination read determination module, connected to the corresponding sequence determination module, that determines whether the suspected contaminating sequencing read is contaminated with a linker based on a level of matching of the linker adjacent region corresponding sequence to the linker adjacent region.

12. The sequencing data analysis apparatus of claim 11, wherein the linker matching region has a length that is less than 50% of the length of the linker.

13. The sequencing data analysis device of claim 11, wherein the linker matching region matches the linker at a level of less than 50%.

14. The sequencing data analysis device of claim 11, wherein the linker matching region matches the linker at a level of less than 50% with one base mismatch.

15. The sequencing data analysis apparatus of claim 11 or 12, wherein the total length of the linker matching region and the linker adjacent region is 19 to 24 bp.

16. The sequencing data analysis apparatus of claim 11, wherein the total length of the linker matching region and the linker adjacent region is 20 bp.

17. The sequencing data analysis apparatus of claim 11, wherein the sequencing reads are 150 to 250bp in length.

18. The sequencing data analysis device of claim 11, wherein the sequencing reads are DNA.

19. The sequencing data analysis device of claim 11, wherein the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence before determining the sequence corresponding to the linker-adjacent region, and if there is one base mismatch and the matching level is above 50%, it is determined that the suspected contaminating sequencing reads are contaminated with the linker, and the following operations are not performed:

determining a sequence corresponding to a linker-adjacent region based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing read and determining whether the suspected contaminating sequencing read is contaminated with a linker based on a level of matching of the sequence corresponding to the linker-adjacent region with the linker-adjacent region.

20. The sequencing data analysis apparatus of claim 11, wherein the suspected contaminating sequencing read is determined to be contaminated with the linker when the linker adjacent region corresponding sequence completely matches the linker adjacent region.

21. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 10 when executing the program.

22. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 10.

23. A high throughput sequencing method, comprising:

constructing a sequencing library, thereby obtaining sequencing data;

determining whether a sequencing read is contaminated with a linker based on the sequencing data analysis method of any one of claims 1 to 10, and filtering the sequencing data accordingly to obtain filtered sequencing data.

24. The method of claim 23, further comprising:

25. The method of claim 23, further comprising removing low quality sequencing reads.

26. The method of claim 23, further comprising:

aligning the filtered sequencing data to a reference genome.