CN110827920A - Sequencing data analysis method and equipment and high-throughput sequencing method - Google Patents

Publication number
CN110827920A
Authority
CN
China
Prior art keywords
linker
sequencing
sequence
matching
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810921895.0A
Other languages
Chinese (zh)
Other versions
CN110827920B (en
Inventor
刘舒
刘晨
刘莉玲
黄金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Bgi Medical Laboratory Co Ltd
Original Assignee
Wuhan Bgi Medical Laboratory Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Bgi Medical Laboratory Co Ltd filed Critical Wuhan Bgi Medical Laboratory Co Ltd
Priority to CN201810921895.0A priority Critical patent/CN110827920B/en
Publication of CN110827920A publication Critical patent/CN110827920A/en
Application granted granted Critical
Publication of CN110827920B publication Critical patent/CN110827920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of gene sequencing, and in particular to a sequencing data analysis method and equipment and a high-throughput sequencing method. The sequencing data comprise suspected contaminated sequencing reads that contain a linker matching region, and the method comprises: determining an analysis window sequence based on the sequence of the suspected contaminated sequencing read, the analysis window sequence comprising a linker matching region and a linker adjacent region; determining a sequence corresponding to the linker adjacent region based on the sequence of the corresponding sequencing read of the suspected contaminated sequencing read; and determining whether the suspected contaminated sequencing read is contaminated with a linker based on the level of matching between the sequence corresponding to the linker adjacent region and the linker adjacent region. With the method and equipment, linker-contaminated sequencing reads can be removed effectively and comprehensively, the base balance of the data after linker-contamination filtering is ensured, and the accuracy of the data can be improved.

Description

Sequencing data analysis method and equipment and high-throughput sequencing method
Technical Field
The invention relates to the field of gene sequencing, in particular to a sequencing data analysis method and equipment and a high-throughput sequencing method.
Background
After raw second-generation sequencing data come off the sequencer, the data are usually filtered before use; this includes removing linker-contaminated reads, low-quality reads, reads containing N bases, and the like.
Linker-contaminated reads arise when some inserts in the constructed library are shorter than the sequencing read length, so that linker sequence is detected at the end of the read; a read that contains linker sequence is a linker-contaminated read. Because the linker (adaptor) sequence is not part of the actual insert from the sample, it must be removed after sequencing so that it does not affect the base randomness of the sample or the accuracy of information analysis.
However, methods for filtering linker-contaminated reads still need further improvement.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
In the course of their research, the inventors found the following:
In the existing approach, the base sequence at the end of a raw read is compared with the linker sequence; if, with one base mismatch tolerated, the match reaches more than 50% of the linker, the read is considered linker-contaminated and is removed entirely. This method of filtering linker-contaminated reads has several problems and disadvantages, as follows:
First, this filtering method does not remove all linker-contaminated reads. For example, when the linker is 34 bp long, only reads that match more than 16 bp of the linker (with one base mismatch tolerated and a match of more than 50%) are filtered out, while reads containing fewer than 15 bp of linker cannot be removed.
Moreover, if the required matching level is simply lowered (e.g., to more than 25%), so that reads matching 8 bp or more of the linker are filtered, sequences from the sample itself may be filtered out because an 8 bp sequence has poor specificity (an 8 bp sequence matches a large number of genomic sequences), and normal reads (i.e., reads without linker contamination) may be removed by mistake. Filtering then becomes inaccurate, and reads that match fewer than 8 bp of the linker still cannot be filtered out.
Second, this filtering method can lead to base separation (unbalanced base composition) in the filtered sequencing data. Because some linker-contaminated reads remain in the filtered data, and the linker is an exogenous fixed sequence, the base balance of the sample genome is disturbed, so that the A content differs from the T content and the C content differs from the G content.
Third, the accuracy and alignment rate of the sequencing data are affected. Because some linker-contaminated reads remain in the filtered data, the exogenous linker sequence they introduce cannot be matched to the reference genome, which lowers the accuracy and the alignment rate of the sequencing data.
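The base imbalance described in the second point can be made concrete with a small calculation. The following Python sketch is an illustration only and is not part of the disclosed method: the function names `base_composition_by_cycle` and `imbalance` are hypothetical, and reads are assumed to be plain uppercase strings over A, C, G, T and N.

```python
from collections import Counter


def base_composition_by_cycle(reads):
    """Return, for each read position, the fraction of A, C, G, T and N.

    Assumption: reads are uppercase strings; shorter reads simply do not
    contribute to positions beyond their length.
    """
    if not reads:
        return []
    n_cycles = max(len(r) for r in reads)
    table = []
    for i in range(n_cycles):
        counts = Counter(r[i] for r in reads if i < len(r))
        total = sum(counts.values())
        table.append({b: counts.get(b, 0) / total for b in "ACGTN"})
    return table


def imbalance(table):
    """Per-position gaps |A - T| and |C - G|; values near 0 mean balanced."""
    return [(abs(row["A"] - row["T"]), abs(row["C"] - row["G"])) for row in table]
```

If many reads end in the same fixed linker bases, the gaps at the last read positions move away from zero, which is the kind of separation between the A and T contents and between the C and G contents that the text describes.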
Therefore, the inventors designed a method that uses the insert sequence and a sliding matching principle to remove linker-contaminated reads effectively and comprehensively, ensure the base balance of the sequencing data, and improve the accuracy and alignment rate of the data. The method overcomes the limitation of relying entirely on the linker sequence to identify linker-contaminated reads, so the sequencing data finally obtained are more accurate and have a higher alignment rate.
To this end, according to a first aspect of the invention, there is provided a method of sequencing data analysis, the sequencing data comprising suspected contaminated sequencing reads, the suspected contaminated sequencing reads comprising a linker matching region, the method comprising: determining an analysis window sequence based on the sequence of the suspected contaminated sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region; determining a sequence corresponding to the linker adjacent region based on the sequence of the corresponding sequencing read of the suspected contaminated sequencing read, the sequence corresponding to the linker adjacent region being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminated sequencing read being derived from the two complementary strands of the same insert, respectively; and determining whether the suspected contaminated sequencing read is contaminated with a linker based on the level of matching between the sequence corresponding to the linker adjacent region and the linker adjacent region.
The method of the invention is used to determine whether sequencing reads are contaminated by the linker, so that linker-contaminated reads can be removed during high-throughput sequencing. Linker-contaminated reads can thus be removed effectively and comprehensively, the accuracy of the sequencing data finally obtained can be improved, and the base balance of the sequencing data finally obtained is ensured.
According to the embodiment of the invention, the above sequencing data analysis method can be further added with the following technical characteristics:
according to an embodiment of the present invention, in the above sequencing data analysis method, the length of the linker matching region is less than 50% of the length of the linker. The analysis of sequencing data using the method of the invention is particularly useful for the determination and removal of linker-contaminated reads when the length of the linker-matching region is less than 50% of the linker length, and all linker-contaminated reads can be determined in a very short time when analyzed using the method of the invention.
According to an embodiment of the present invention, in the above sequencing data analysis method, the level of matching of the linker-matching region to the linker is less than 50%.
According to an embodiment of the present invention, in the above sequencing data analysis method, the matching level of the linker matching region to the linker is less than 50% when a single base mismatch is tolerated.
According to an embodiment of the present invention, in the above sequencing data analysis method, the total length of the linker matching region and the linker adjacent region is 19-24 bp.
According to an embodiment of the present invention, in the above sequencing data analysis method, the total length of the linker matching region and the linker adjacent region is 20 bp.
According to an embodiment of the present invention, in the above sequencing data analysis method, the length of the sequencing read is 150-250 bp.
According to an embodiment of the invention, in the above sequencing data analysis method, the sequencing reads are DNA.
According to an embodiment of the present invention, in the above sequencing data analysis method, before determining the sequence corresponding to the linker-adjacent region, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if one base mismatch is tolerated and the matching level is above 50%, it is determined that the suspected contaminating sequencing reads are contaminated by the linker.
According to an embodiment of the present invention, in the above sequencing data analysis method, if the sequence corresponding to the linker-adjacent region completely matches the linker-adjacent region, it is determined that the suspected contaminated sequencing read is contaminated by the linker.
According to a second aspect of the present invention, there is provided a sequencing data analysis apparatus, the sequencing data comprising suspected contaminating sequencing reads, the suspected contaminating sequencing reads containing a linker matching region, the sequencing data analysis apparatus comprising:
a window sequence determination module that determines an analysis window sequence based on the sequence of the suspected contaminating sequencing reads, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region;
a corresponding sequence determination module connected to the window sequence determination module, the corresponding sequence determination module determining a corresponding sequence of a linker adjacent region based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the corresponding sequence of the linker adjacent region being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived from two complementary strands of the same insert, respectively;
a contaminated read determination module connected to the corresponding sequence determination module, the contaminated read determination module determining whether the suspected contaminated sequencing read is contaminated with a linker based on the level of matching between the sequence corresponding to the linker adjacent region and the linker adjacent region.
According to an embodiment of the present invention, the above sequencing data analysis apparatus may further have the following technical features:
according to an embodiment of the present invention, in the above sequencing data analysis apparatus, the length of the linker matching region is less than 50% of the length of the linker.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the level of matching of the linker-matching region to the linker is less than 50%.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the matching level of the linker matching region to the linker is less than 50% when a single base mismatch is tolerated.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the total length of the linker matching region and the linker adjacent region is 19-24 bp.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the total length of the linker matching region and the linker adjacent region is 20 bp.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the length of the sequencing read is 150-250 bp.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, the sequencing reads are DNA.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, before determining the sequence corresponding to the linker-adjacent region, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if one base mismatch is tolerated and the matching level is 50% or more, it is determined that the suspected contaminating sequencing reads are contaminated with a linker.
According to an embodiment of the present invention, in the above sequencing data analysis apparatus, if the sequence corresponding to the linker adjacent region completely matches the linker adjacent region, it is determined that the suspected contaminated sequencing read is contaminated by the linker.
According to a third aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor, when executing the program, implementing the method according to any of the embodiments of the first aspect of the present invention.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to any of the embodiments of the first aspect of the present invention.
According to a fifth aspect of the present invention, there is provided a high-throughput sequencing method comprising: constructing a sequencing library of a sample to be tested, thereby obtaining raw data of the sample to be tested; and determining, based on the sequencing data analysis method of any embodiment of the first aspect of the invention, whether the sequencing reads are contaminated by the linker, so that the raw data of the sample to be tested are filtered and filtered sequencing data are obtained.
According to the embodiment of the present invention, the high throughput sequencing method described above may further have the following technical features:
according to an embodiment of the present invention, the high throughput sequencing method described above further includes:
comparing the linker matching region of the suspected contaminated sequencing read with the linker sequence and, if the matching level is more than 50% with one base mismatch tolerated, determining that the suspected contaminated sequencing read is contaminated by the linker and removing the sequencing read directly by filtering.
According to an embodiment of the present invention, the above high throughput sequencing method further comprises removing low quality sequencing reads.
According to an embodiment of the present invention, the above high throughput sequencing method further comprises: aligning the filtered sequencing data to a reference genome.
The beneficial effects obtained by the invention are as follows: with the method and equipment, linker-contaminated sequencing reads can be removed effectively and comprehensively, the base balance of the data after linker-contamination filtering is ensured, and the accuracy of the data can be improved.
Drawings
Fig. 1 is a schematic diagram of a sequencing data analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.
FIG. 3 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.
FIG. 4 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.
FIG. 5 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.
FIG. 6 is a schematic illustration of a linker-contaminated read provided in accordance with an embodiment of the present invention.
FIG. 7 is a flow diagram of a method for filtering linker-contaminated reads, according to one embodiment of the invention.
FIG. 8 is a base distribution plot of the raw off-machine data provided in accordance with an embodiment of the present invention.
FIG. 9 is a base distribution plot obtained after filtering out sequencing reads that match the linker by more than 50% with one base mismatch tolerated, according to an embodiment of the present invention.
FIG. 10 is a base distribution plot obtained after filtering by the method of the present invention, according to one embodiment of the present invention.
FIG. 11 is a base distribution plot of the raw off-machine data provided in accordance with an embodiment of the present invention.
FIG. 12 is a base distribution plot obtained after filtering out sequencing reads that match the linker by more than 50% with one base mismatch tolerated, according to an embodiment of the present invention.
FIG. 13 is a base distribution plot obtained after filtering by the method of the present invention, according to one embodiment of the present invention.
FIG. 14 is a base distribution plot of the raw off-machine data provided in accordance with one embodiment of the present invention.
FIG. 15 is a base distribution plot obtained after filtering out sequencing reads that match the linker by more than 50% with one base mismatch tolerated, according to an embodiment of the present invention.
FIG. 16 is a base distribution plot obtained after filtering by the method of the present invention, according to one embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.
To facilitate understanding of those skilled in the art, certain terms are described and explained herein, and it is to be understood by those skilled in the art that such explanations and descriptions are not to be considered as limiting.
Reads, also known as sequencing reads, are used herein to refer to nucleic acid fragments obtained during the pooling process, as is commonly understood by those skilled in the art.
In this context, linker-contaminated reads are also referred to as adaptor-contaminated reads. Linker contamination refers to the situation in which linker sequence is detected at the end of a read because the insert obtained during library construction is shorter than the sequencing read length; a read that contains linker sequence is a linker-contaminated read.
As used herein, the "linker matching region" (also called the adaptor-matching region) refers to the fragment at the end of the read that derives from the linker sequence. The linker matching region corresponds to a portion of the linker sequence.
As used herein, the "linker adjacent region" refers to the portion of the read that is contiguous with the 5' end of the linker matching region. The linker adjacent region is derived from a nucleic acid fragment of the test sample itself.
As used herein, the term "analysis window sequence" refers to a sequence used to analyze sequencing reads for contamination, including linker matching regions and linker adjacent regions.
Sequencing data analysis method
According to one aspect of the invention, there is provided a method of sequencing data analysis, the sequencing data comprising suspected contaminated sequencing reads, the suspected contaminated sequencing reads comprising a linker matching region, the method comprising: determining an analysis window sequence based on the sequence of the suspected contaminated sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region; determining a sequence corresponding to the linker adjacent region based on the sequence of the corresponding sequencing read of the suspected contaminated sequencing read, the sequence corresponding to the linker adjacent region being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminated sequencing read being derived from the two complementary strands of the same insert, respectively; and determining whether the suspected contaminated sequencing read is contaminated with a linker based on the level of matching between the sequence corresponding to the linker adjacent region and the linker adjacent region.
To facilitate understanding, the sequencing data analysis method of the present invention is illustrated using FIG. 5 as an example. Read1 is the suspected contaminated sequencing read to be analyzed, and the analysis window sequence is determined from read1. The 3' end of read1 is the linker matching region (A), and the region of read1 contiguous with it is the linker adjacent region (I1). Read2 is the corresponding sequencing read of the suspected contaminated sequencing read; read2 and read1 are derived from the two complementary strands of the same insert, i.e., insert 1 and insert 2 are the two complementary strands of the same double-stranded DNA fragment. The sequence corresponding to the linker adjacent region (I2) is determined based on the length of the linker adjacent region; it is located at the 5' end of the corresponding sequencing read and has the same length as the linker adjacent region. If I1 and I2 are fully reverse complementary, read1 is contaminated with the linker.
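The reason why I1 and I2 are reverse complementary for a short insert can be seen in a small simulation. The Python sketch below is illustrative only: the insert and linker sequences are made up, `simulate_pair` is a hypothetical helper, and it assumes an error-free read pair in which each read runs from one end of the insert into the linker on that side.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(insert, linker1, linker2, read_len):
    """Simulate read1 and read2 for one insert flanked by two linkers.

    Assumption: read1 reads the insert from its 5' end and continues into
    linker1; read2 reads the complementary strand and continues into linker2.
    """
    read1 = (insert + linker1)[:read_len]
    read2 = (reverse_complement(insert) + linker2)[:read_len]
    return read1, read2


# Hypothetical sequences: a 15-base insert sequenced with 20-base reads,
# so 5 linker bases appear at the 3' end of read1 (region A).
insert = "ATGCGTACCGTTAGC"
read1, read2 = simulate_pair(insert, "AAGTCGGAGG", "AAGTCGGATC", read_len=20)

a_len = 20 - len(insert)            # length of the linker matching region A
k = 8                               # length chosen for I1 in this example
i1 = read1[-(a_len + k):-a_len]     # insert bases just 5' of region A
i2 = read2[:k]                      # same number of bases from the 5' end of read2
print(i2 == reverse_complement(i1))  # True for this simulated pair
```

When the insert is longer than the read length, read1 ends inside the insert and read2 starts at the far end of the insert, so the 5' end of read2 does not correspond to the bases next to the 3' end of read1 and the relationship is not expected to hold.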
The sequencing data analysis method departs from the principle of setting linker-contamination filtering parameters according to the number of conserved bases of the linker sequence, and instead analyzes the sequencing data by means of the random insert sequences in the actual sequencing data.
The inventors classified linker-contaminated reads into the following cases:
in the first case, if the insert is very short, the linker-contaminated read comprises almost the full length of the linker sequence, as shown in FIG. 2.
In the second case, if the insert is short, the linker-contaminated read contains a partial fragment of the linker sequence that accounts for more than 50% of the linker sequence, as shown in FIG. 3.
In the third case, if the insert is only slightly shorter than the sequencing read length, the linker-contaminated read contains only a small fragment of the linker sequence, as shown in FIG. 4.
In the first and second cases, if, with a single base mismatch tolerated, the length of the linker matching region is found to be more than 50% of the length of the linker, the suspected contaminated sequencing read can be directly determined to be contaminated with the linker and can be removed directly. Therefore, in a preferred embodiment of the present invention, after the linker matching region of the suspected contaminated sequencing read is obtained and before the sequence corresponding to the linker adjacent region is determined, the linker matching region is compared directly with the linker sequence; if, with one base mismatch tolerated, the matching level is above 50%, the suspected contaminated sequencing read is directly determined to be contaminated by the linker. The following steps then need not be performed: determining the sequence corresponding to the linker adjacent region, and determining whether the suspected contaminated sequencing read is contaminated by the linker based on the level of matching between that sequence and the linker adjacent region. In this way, sequencing accuracy is ensured while computing resources are saved to the greatest extent and operating efficiency is improved.
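The direct comparison used for the first and second cases can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the function name `direct_linker_hit` and its parameters are hypothetical, the linker is assumed to begin at some position in the read and run toward the 3' end, and the 50% boundary is treated as inclusive (the text uses both "more than 50%" and "50% or more").

```python
import math


def direct_linker_hit(read, linker, max_mismatches=1, min_fraction=0.5):
    """Return the start index in `read` of a linker match, or None.

    A match is a stretch beginning at `start` that agrees with the beginning
    of `linker` with at most `max_mismatches` mismatches and covers at least
    `min_fraction` of the linker length (inclusive boundary is an assumption).
    """
    min_len = math.ceil(len(linker) * min_fraction)
    for start in range(len(read) - min_len + 1):
        # The comparison stops at the end of the read or of the linker.
        overlap = min(len(linker), len(read) - start)
        segment = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(segment, linker))
        if mismatches <= max_mismatches:
            return start
    return None
```

A read for which this check returns an index would be removed without further analysis; a read for which it returns `None` but whose 3' end still resembles the start of the linker is the third case, which the direct comparison alone cannot resolve.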
In the third case, the length of the linker matching region is less than 50% of the length of the linker; few bases are contaminated by the linker, and the above approach cannot remove such linker-contaminated reads. For this case, the present invention determines whether the suspected contaminated sequencing read is contaminated by the linker from the level of matching between the linker adjacent region and the sequence corresponding to the linker adjacent region.
That is, as shown in FIG. 5, when the length of the linker sequence in a read is less than 50% of the length of the linker, the present invention sets the filtering condition using the reverse complementarity of the insert sequences of read1 and read2:
(1) in read1, starting from the position (A) that matches the linker sequence, take the linker portion (the linker matching region) and then continue into the insert sequence, stopping when the total length reaches 20 bp, to obtain the I1 sequence (the I1 sequence is the part of the insert at the 3' end of read1);
(2) obtain an I2 sequence of the same length as the I1 sequence (read from the 5' end of the insert sequence of read2, over a length equal to that of I1);
(3) screen for linker-contaminated reads:
1) if the I1 and I2 sequences are completely reverse complementary, the read is a linker-contaminated read and is filtered out; that is, such linker-contaminated reads are removed directly.
2) if the I1 and I2 sequences are not reverse complementary, the data are retained and returned to the raw sequencing data for subsequent data analysis.
If the I1 and I2 sequences are not reverse complementary, as shown in FIG. 6, the tail sequence A of read1 merely happens to be identical to a few bases of the linker sequence: the read is flagged as a suspected linker-contaminated read in the first step, but is found not to be linker-contaminated in the second step (the insert is longer than the sequencing read length, so the read does not run through into the linker, and I1 and I2 are not reverse complementary). Such data are retained; that is, this filtering does not mistakenly remove normal reads.
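Steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: `classify_pair` is a hypothetical name, the length of region A (`linker_match_len`) is assumed to have been found by an earlier comparison with the linker, and reads are assumed to be uppercase strings at least as long as the window.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_pair(read1, read2, linker_match_len, window=20):
    """Classify one read pair as "contaminated" or "keep".

    linker_match_len: number of bases at the 3' end of read1 that match the
    linker (region A); assumed to come from an earlier step.
    window: total length of region A plus I1 (20 bp in the text).
    """
    i1_len = window - linker_match_len
    if linker_match_len <= 0 or i1_len <= 0:
        raise ValueError("linker_match_len must be between 1 and window - 1")
    # Step (1): the insert bases of read1 immediately 5' of region A.
    i1 = read1[-window:-linker_match_len]
    # Step (2): the same number of bases from the 5' end of read2.
    i2 = read2[:i1_len]
    # Step (3): a full reverse-complement match marks linker contamination.
    return "contaminated" if i2 == reverse_complement(i1) else "keep"
```

The exact-match test in step (3) follows the "completely reverse complementary" condition in the text; tolerating sequencing errors in I1 or I2 would be a change to the described condition and is not attempted here.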
Therefore, the sequencing data analysis method provided by the invention is particularly suitable for analyzing suspected contaminated sequencing reads in which the length matching the linker is less than 50% of the length of the linker.
In one embodiment of the invention, the suspected contaminated sequencing read is determined to be contaminated with the linker if the sequence corresponding to the linker adjacent region completely matches the linker adjacent region.
According to an embodiment of the invention, the length of the sequencing reads is 150-250 bp. According to an embodiment of the invention, the sequencing reads are DNA.
In addition, when judging a suspected contaminated sequencing read, selecting too long a region from the linker matching region toward the 5' end of the insert sequence (i.e., the linker adjacent region) consumes computing resources without any practical benefit for filtering, whereas too short a region may cause too many false matches. Therefore, to select an optimal value, the combined length of the linker matching region and the linker adjacent region was varied from 19 to 25 bp for testing:
in the experiment, the off-machine data of the same library were used as the experimental data sample, and lengths of 19 bp, 20 bp, 21 bp, 22 bp, 23 bp, 24 bp and 25 bp were each tested for filtering. The results are shown in the following table:
TABLE 1 results for different sliding intervals
Figure BDA0001764444990000091
In Table 1, the proportion of reads with a linker is the ratio of detected linker-containing reads to total sequencing reads; the proportion of low-quality reads is the ratio of detected low-quality reads to total sequencing reads; the proportion of completely duplicated reads is the ratio of reads arising from PCR duplicates to total sequencing reads; the proportion of N reads is the ratio of reads whose N content is greater than 5% to total sequencing reads; the genome alignment rate, also called the genome mapping rate, is the proportion of data that maps to the genome; and CPU core time (min) is the total time required for the contamination analysis of these sequences.
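As one worked example of these ratios, the proportion of N reads could be computed as in the Python sketch below. The function name `n_read_proportion` is hypothetical, and reading "N content greater than 5%" as the fraction of N bases within each read is an assumption.

```python
def n_read_proportion(reads, max_n_fraction=0.05):
    """Fraction of reads whose share of N bases exceeds max_n_fraction.

    Assumption: the 5% threshold applies per read and is exclusive, so a
    read with exactly 5% N bases is not counted.
    """
    if not reads:
        return 0.0
    flagged = sum(
        1 for r in reads if r and r.count("N") / len(r) > max_n_fraction
    )
    return flagged / len(reads)
```

The other proportions in the table follow the same pattern, with a different per-read test in place of the N-content check.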
As can be seen from Table 1, good results were obtained for all lengths from 19 bp to 24 bp. Weighing computing resources against the final filtering effect, 20 bp can be selected as the optimal total length of the linker matching region and the linker adjacent region.
Thus, in one embodiment of the present invention, the total length of the linker matching region and the linker adjacent region is 19 to 24 bp. In a preferred embodiment of the present invention, the total length of the linker matching region and the linker adjacent region is 20 bp.
Sequencing data analysis equipment
According to an aspect of the present invention, a sequencing data analysis apparatus is provided, wherein the sequencing data include a suspected contaminating sequencing read that contains a linker matching region. As shown in Fig. 1, the sequencing data analysis apparatus includes a window sequence determination module 201, a corresponding sequence determination module 202, and a contaminated read determination module 203, which are connected in sequence. The window sequence determination module determines an analysis window sequence based on the sequence of the suspected contaminating sequencing read, the analysis window sequence comprising: a linker matching region; and a linker adjacent region contiguous with the 5' end of the linker matching region. The corresponding sequence determination module determines a linker adjacent region corresponding sequence based on the sequence of the corresponding sequencing read of the suspected contaminating sequencing read, the linker adjacent region corresponding sequence being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, and the corresponding sequencing read and the suspected contaminating sequencing read being derived from the two complementary strands of the same insert, respectively. The contaminated read determination module determines whether the suspected contaminating sequencing read is contaminated with a linker based on the level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.
In one embodiment of the invention, the sequencing data analysis apparatus of the invention is adapted for analysis of suspected contaminating sequencing reads having a linker matching region length of less than 50% of the linker length.
It should be noted that, the above description of the advantages and technical features of the sequencing data analysis method in any embodiment of the present invention is also applicable to the sequencing data analysis apparatus in this embodiment of the present invention, and will not be repeated herein.
High throughput sequencing method
The sequencing data analysis method or the sequencing data analysis apparatus can be used to filter out linker-contaminated sequences in a high-throughput sequencing process. According to an embodiment of the invention, the high-throughput sequencing method comprises: constructing a sequencing library and thereby obtaining sequencing data; and determining, based on the sequencing data analysis method, whether a sequencing read is contaminated by the linker, so as to filter the sequencing data and obtain filtered sequencing data.
According to an embodiment of the invention, the linker matching region of the suspected contaminating sequencing read is compared with the linker sequence; if the matching level is above 50% with one base mismatch tolerated, the suspected contaminating sequencing read is determined to be contaminated by the linker and is directly filtered out.
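The direct filtering criterion described above (one base mismatch tolerated, matching level above 50%) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the "matching level" is taken to be the overlap length divided by the linker length, the linker is assumed to appear only as a prefix overlapping the 3' tail of the read, and the function name, the linker sequence, and the toy reads are all illustrative.

```python
def first_pass_contaminated(read, linker, max_mismatches=1, min_fraction=0.5):
    """Return True if the read tail matches the start of the linker over more
    than min_fraction of the linker length with at most max_mismatches."""
    longest = min(len(read), len(linker))
    for overlap in range(longest, 0, -1):
        # Assumption: matching level = overlap length / linker length.
        if overlap / len(linker) <= min_fraction:
            break  # shorter overlaps are not handled by this criterion
        tail = read[-overlap:]
        mismatches = sum(1 for a, b in zip(tail, linker[:overlap]) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


# Illustrative 34 bp linker and insert (hypothetical sequences).
linker = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
insert = "TTGACCGTAGGCTAACGTTA"
tail = linker[:20]

read_a = insert + tail                        # 20 bp of linker, exact
read_b = insert + tail[:5] + "T" + tail[6:]   # 20 bp of linker, one mismatch
read_c = insert + linker[:10]                 # only 10 bp of linker

for name, read in [("read_a", read_a), ("read_b", read_b), ("read_c", read_c)]:
    print(name, first_pass_contaminated(read, linker))
```

In this sketch `read_c` is not flagged because its linker tail covers less than half of the linker, which is the case the analysis window check of the invention is meant to handle.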
The sequencing data are obtained by preparing a sequencing library from the nucleic acid of the sample to be tested and sequencing it on a sequencer. According to an embodiment of the invention, obtaining the sequencing data comprises: obtaining nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the library. The sequencing library is prepared according to the requirements of the selected sequencing method. Depending on the sequencing platform chosen, the sequencing method may use, without limitation, the HiSeq 2000/2500 platform of Illumina, the Ion Torrent platform of Life Technologies, the BGISEQ platform of BGI, or a single-molecule sequencing platform; the sequencing mode may be single-end or paired-end sequencing; and the resulting raw sequencer output consists of sequenced fragments called reads.
When matching or aligning sequences, known alignment software such as SOAP, BWA, or TeraMap can be used, but the present embodiment is not limited thereto. During alignment, depending on the alignment parameter settings, at most n base mismatches are allowed for a pair of reads or for a single read, for example with n set to 1 or 2. If more than n bases are mismatched, the pair of reads is considered unable to align to the reference sequence; or, if all n mismatched bases are located in one read of the pair, that read is considered unable to align to the reference sequence.
The scheme of the invention will be explained with reference to the examples. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the art or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Example 1
Example 1 provides a method of filtering linker-contaminated reads, as shown in Fig. 7, comprising the following steps:
1) screening the raw sequencer output data according to whether the tail of each read can be matched with the linker sequence;
2) performing first-step filtering on the reads that can be matched with the linker, according to the criterion of tolerating one base mismatch with a match above 50%;
3) performing second-step filtering on reads whose match length is at least 1 bp but whose matching degree is below 50%: sliding forward from the linker sequence to a total length of 20 bp, and checking whether I1 (the insert segment of corresponding length at the tail of Read 1 covered by the sliding) and I2 (the insert segment of corresponding length read from the front (5') end of Read 2) are reverse complementary;
4) the filtration is completed.
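Step 3) above can be sketched in Python as follows. This is an illustrative sketch only: the linker tail is found by exact suffix matching (mismatch tolerance is omitted for brevity), the matching degree is taken to be the match length divided by the linker length, the same linker is assumed at both ends of the insert, and all function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def linker_tail_length(read, linker):
    """Length of the longest read suffix that exactly equals a linker prefix."""
    for length in range(min(len(read), len(linker)), 0, -1):
        if read[-length:] == linker[:length]:
            return length
    return 0


def second_step_contaminated(read1, read2, linker, window=20):
    """Check whether I1 (Read 1) and I2 (Read 2) are reverse complementary."""
    match_len = linker_tail_length(read1, linker)
    # Only reads with a short linker match (>= 1 bp, below 50%) reach this step.
    if match_len < 1 or match_len / len(linker) >= 0.5:
        return False
    adjacent_len = window - match_len  # window = matching + adjacent region
    end = len(read1) - match_len
    if adjacent_len <= 0 or adjacent_len > end or adjacent_len > len(read2):
        return False
    i1 = read1[end - adjacent_len:end]  # insert tail next to the linker match
    i2 = read2[:adjacent_len]           # same-length segment at the 5' end of Read 2
    return i2 == reverse_complement(i1)


linker = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # illustrative 34 bp linker

# Short insert: Read 1 runs 5 bp into the linker, Read 2 starts at the other end.
insert = "TTGACCGTAGGCTAACGTTACCGATTGCAC"
read1 = insert + linker[:5]
read2 = reverse_complement(insert) + linker[:5]

# Long insert: the tail "AGA" only resembles the linker start by chance.
read1_long = "CCTGAAGTCCGATTAGGCTTACCGTAGA"
read2_long = "TTACGGCATTCAGGACCTAGTCAAGCTA"

print(second_step_contaminated(read1, read2, linker))
print(second_step_contaminated(read1_long, read2_long, linker))
```

The design idea is that a true linker tail implies the insert ended just before it, so Read 2 must begin with the reverse complement of the insert tail seen in Read 1; a chance match at the end of a long insert does not satisfy this.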
The exon library was sequenced and analyzed using the method of the invention. The sequencing strategy was: HiSeq X Ten, PE150.
The base composition plot of the raw sequencer output data is shown in Fig. 8, which shows the percentage of each base at each position along the read. The abscissa is the position along the read, i.e., which base of the read is concerned; the ordinate is the percentage of the corresponding base (A, T, C, G) at that position. It can be seen from the raw base composition plot of the library that, because some inserts are shorter than the sequencing read length, fixed linker sequence appears in the reads, and this linker contamination causes base separation at the tail.
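A per-position base percentage of the kind plotted in such a figure could be computed along the following lines. This is a hedged sketch with a hypothetical function name and toy reads; it is not the plotting code used for the figures of the patent.

```python
from collections import Counter


def base_composition(reads):
    """Percentage of each base (A, C, G, T, N) at every position along the read."""
    if not reads:
        return []
    max_len = max(len(r) for r in reads)
    profile = []
    for pos in range(max_len):
        # Count the bases observed at this position across all reads long enough.
        counts = Counter(r[pos].upper() for r in reads if pos < len(r))
        total = sum(counts.values())
        profile.append({b: 100.0 * counts.get(b, 0) / total for b in "ACGTN"})
    return profile


# Toy reads: the last position is always "A", mimicking a fixed tail sequence.
reads = ["ACGA", "ACTA", "AGGA", "TCGA"]
for pos, percentages in enumerate(base_composition(reads), start=1):
    print(pos, percentages)
```

Positions where one base dominates across reads, as at the last position of the toy data, correspond to the tail base separation described in the text.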
Exon library A was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads whose match is above 50% with one base mismatch tolerated), and the resulting base composition plot is shown in Fig. 9, with an alignment rate (mapping rate) of 94.15%. Library A was also filtered by the linker contamination filtering method of the present invention, and the resulting base composition plot is shown in Fig. 10, with an alignment rate of 99.52%. Here the alignment rate is obtained by aligning the processed sequencing data to the hg19 reference genome after the exon library has been filtered by each method. The higher the alignment rate, the better the removal of linker-contaminated sequences.
The above experimental data show that, after library A is filtered by the conventional linker contamination filtering method, base separation is reduced but still remains over roughly the last 15 bp of the base composition plot. This again indicates that the conventional method cannot effectively remove reads whose linker match is below 50% (the linker of this library is 34 bp, so a match below 50% with one base mismatch tolerated is shorter than 16 bp), which impairs base randomness; at the same time, because exogenous linker sequence remains, the mapping rate of the library against the reference genome is only 94.15%.
After library A is filtered by the linker contamination filtering method of the invention, the base composition plot clearly returns to base balance, i.e., there is no divergence at the tail; meanwhile, the alignment rate rises to 99.52%. The data thus show clear advantages in both base balance and alignment rate.
Example 2
The transcriptome library was sequenced and analyzed using the method of the invention. Sequencing strategy: HiSeq 4000, PE150.
The base composition plot of the raw sequencer output data is shown in Fig. 11. It shows that, because some inserts are shorter than the sequencing read length, fixed linker sequence appears in the reads, and this linker contamination causes base separation at the tail.
Library B was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads whose match is above 50% with one base mismatch tolerated), giving the base composition plot shown in Fig. 12, with an alignment rate of 90.36%.
Library B was filtered by the linker contamination filtering method of the present invention, giving the base composition plot shown in Fig. 13, with an alignment rate of 93.59%.
These data show that, after library B is filtered by the conventional linker contamination filtering method, base separation still remains at the tail of the base composition plot, and the mapping rate of the library against the reference genome is only 90.36%.
After library B is filtered by the linker contamination filtering method of the invention, the base composition plot clearly returns to base balance, and the alignment rate rises to 93.59%. The data thus show clear advantages in both base balance and alignment rate.
Example 3
The high-GC library was sequenced and analyzed using the method of the invention. Sequencing strategy: NovaSeq, PE150.
The base composition plot of the raw sequencer output data is shown in Fig. 14. It shows that, because some inserts are shorter than the sequencing read length, fixed linker sequence appears in the reads, and this linker contamination causes base separation beginning about 50 bp from the tail.
Library C was filtered by the conventional linker contamination filtering method (i.e., filtering out only sequencing reads whose match is above 50% with one base mismatch tolerated), giving the base composition plot shown in Fig. 15, with an alignment rate of 92.43%.
Library C was filtered by the linker contamination filtering method of the present invention, giving the base composition plot shown in Fig. 16, with an alignment rate of 99.87%.
These data show that base separation still remains at the tail of the base composition plot after library C is filtered by the conventional linker contamination filtering method, whereas the plot clearly returns to base balance after library C is filtered by the linker contamination filtering method of the invention. The data show that the filtering method of the invention is clearly superior in terms of base balance.
As can be seen from Examples 1 to 3, the linker contamination filtering method and apparatus provided by the invention are suitable for linker-contaminated data from different sequencing platforms and different library types; they can filter linker-contaminated reads effectively and comprehensively, ensure base balance of the data after linker contamination filtering, and improve the accuracy of the sequencing data; compared with the conventional linker contamination filtering method, they have clear advantages.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the second feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be joined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method of sequencing data analysis, wherein the sequencing data comprises suspected contaminating sequencing reads that contain a linker matching region, the method comprising:
determining an analysis window sequence based on the sequence of the suspected contaminating sequencing reads, the analysis window sequence comprising:
a linker matching region; and
a linker adjacent region that is contiguous with the 5' end of the linker matching region;
determining a linker-adjacent region corresponding sequence based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing reads, the linker-adjacent region corresponding sequence being located at a 5' end of the corresponding sequencing read and having a same length as the linker-adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived from two complementary strands of a same insert, respectively;
determining whether the suspected contaminating sequencing reads are contaminated with a linker based on a level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.
2. The sequencing data analysis method of claim 1, wherein the length of the linker matching region is less than 50% of the length of the linker;
optionally, the linker matching region matches the linker at a level of less than 50%;
optionally, the linker matching region matches the linker at a level of less than 50% when a single-base mismatch is tolerated.
3. The method for sequencing data analysis according to claim 1 or 2, wherein the total length of the linker-matching region and the linker-adjacent region is 19 to 24 bp;
optionally, the total length of the linker-matching region and the linker-adjacent region is 20 bp;
optionally, the length of the sequencing read is 150-250 bp;
optionally, the sequencing reads are DNA.
4. The method for analyzing sequencing data of any one of claims 1 to 3, wherein, before the linker adjacent region corresponding sequence is determined, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if the matching level is above 50% with one base mismatch tolerated, the suspected contaminating sequencing reads are determined to be contaminated with a linker;
optionally, if the linker adjacent region corresponding sequence completely matches the linker adjacent region, the suspected contaminating sequencing reads are determined to be contaminated with the linker.
5. A sequencing data analysis device, wherein the sequencing data comprises suspected contaminating sequencing reads that contain a linker matching region, the sequencing data analysis device comprising:
a window sequence determination module that determines an analysis window sequence based on the sequence of the suspected contaminating sequencing reads, the analysis window sequence comprising:
a linker matching region; and
a linker adjacent region that is contiguous with the 5' end of the linker matching region;
a corresponding sequence determination module connected to the window sequence determination module, the corresponding sequence determination module determining a corresponding sequence of a linker adjacent region based on a sequence of a corresponding sequencing read of the suspected contaminating sequencing read, the corresponding sequence of the linker adjacent region being located at the 5' end of the corresponding sequencing read and having the same length as the linker adjacent region, the corresponding sequencing read and the suspected contaminating sequencing read being derived from two complementary strands of the same insert, respectively;
a contaminated read determination module connected to the corresponding sequence determination module, the contaminated read determination module determining whether the suspected contaminating sequencing read is contaminated with a linker based on a level of matching between the linker adjacent region corresponding sequence and the linker adjacent region.
6. The sequencing data analysis apparatus of claim 5, wherein the length of the linker matching region is less than 50% of the length of the linker;
optionally, the linker matching region matches the linker at a level of less than 50%;
optionally, the linker matching region matches the linker at a level of less than 50% when a single-base mismatch is tolerated.
7. The sequencing data analysis device of claim 5 or 6, wherein the total length of the linker matching region and the linker adjacent region is 19 to 24 bp;
optionally, the total length of the linker matching region and the linker adjacent region is 20 bp;
optionally, the length of the sequencing read is 150-250 bp;
optionally, the sequencing reads are DNA;
optionally, before the linker adjacent region corresponding sequence is determined, the linker matching region of the suspected contaminating sequencing reads is compared with the linker sequence, and if the matching level is above 50% with one base mismatch tolerated, the suspected contaminating sequencing reads are determined to be contaminated by the linker;
optionally, if the linker adjacent region corresponding sequence completely matches the linker adjacent region, the suspected contaminating sequencing read is determined to be contaminated with the linker.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 4.
9. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any of claims 1-4.
10. A high throughput sequencing method, comprising:
constructing a sequencing library, thereby obtaining sequencing data;
determining whether a sequencing read is contaminated by a linker based on the sequencing data analysis method of any one of claims 1 to 4, thereby filtering the sequencing data to obtain filtered sequencing data;
optionally, further comprising:
comparing the linker matching region of the suspected contaminating sequencing read with the linker sequence, and if the matching level is above 50% with one base mismatch tolerated, determining that the suspected contaminating sequencing read is contaminated by the linker and directly filtering out the sequencing read;
optionally, further comprising removing low quality sequencing reads;
optionally, further comprising:
aligning the filtered sequencing data to a reference genome.
CN201810921895.0A 2018-08-14 2018-08-14 Sequencing data analysis method and equipment and high-throughput sequencing method Active CN110827920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810921895.0A CN110827920B (en) 2018-08-14 2018-08-14 Sequencing data analysis method and equipment and high-throughput sequencing method

Publications (2)

Publication Number Publication Date
CN110827920A true CN110827920A (en) 2020-02-21
CN110827920B CN110827920B (en) 2022-11-22

Family

ID=69547173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810921895.0A Active CN110827920B (en) 2018-08-14 2018-08-14 Sequencing data analysis method and equipment and high-throughput sequencing method

Country Status (1)

Country Link
CN (1) CN110827920B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000039333A1 (en) * 1998-12-23 2000-07-06 Jones Elizabeth Louise Sequencing method using magnifying tags
CN102021153A (en) * 1996-10-01 2011-04-20 杰龙公司 Human telomerase catalytic subunit
CN105653899A (en) * 2014-09-30 2016-06-08 深圳华大基因研究院 Method and system for determining mitochondria genome sequence information of various samples at the same time
CN106156536A (en) * 2015-04-15 2016-11-23 深圳华大基因科技有限公司 The method and system that sample immune group storehouse sequencing data is processed
JP6262922B1 (en) * 2017-02-16 2018-01-17 花王株式会社 Methods for evaluating the genotoxicity of substances

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
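MARTIN KIRCHER et al.: "Analysis of high-throughput ancient DNA sequencing data", 《ANCIENT DNA》 *
CHEN SHIFU: "Data analysis methods for circulating tumor DNA sequencing", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》 *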

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111471754A (en) * 2020-05-14 2020-07-31 北京安智因生物技术有限公司 Universal high-throughput sequencing joint and application thereof
CN111471754B (en) * 2020-05-14 2021-01-29 北京安智因生物技术有限公司 Universal high-throughput sequencing joint and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant