CN114530199A

CN114530199A - Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Info

Publication number: CN114530199A
Application number: CN202210061903.5A
Authority: CN
Inventors: 浦丹; 陈慧敏; 向旭东; 李�杰; 张扬; 舒坤贤
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2022-05-24

Abstract

The invention provides a method, a device and a storage medium for detecting low-frequency mutations based on double (duplex) sequencing data. The threshold on the number of reads contained in a read family is lowered to 1, and read families are screened by making full use of read complementarity. Three types of read family are retained: read families with 2 or more reads; read families with only 1 read, where that read is complementary to an SSCS generated from a read family with 2 or more reads; and pairs of read families that each contain only one read but whose reads are complementary. For these 3 types of read family, Bayes' theorem is used to determine the consensus base and its quality score at each position, a single-strand consensus sequence (SSCS) is generated from the consensus bases, and two complementary SSCSs are combined into a double-strand consensus sequence (DCS). Finally, the DCSs are aligned to the reference genome again to identify low-frequency mutations and sequencing errors on the reads. The invention can effectively suppress high-throughput sequencing errors and improve the accuracy of low-frequency mutation detection.

Description

Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Technical Field

The invention relates to the field of bioinformatics, in particular to a method and a device for detecting low-frequency mutation based on double sequencing data.

Background

In DNA samples such as tumor biopsies and circulating cell-free nucleic acids, somatic mutations may be present at very low frequency (less than 0.01%) among the DNA molecules measured. Detection of such extremely low-frequency somatic mutations has broad application prospects in early diagnosis, monitoring and prognosis of tumors, forensic identification, prenatal diagnosis and the like. The development of next-generation sequencing (NGS) technology has changed the scale and depth of research in biology and medicine. Because NGS is large-scale, high-throughput and low-cost, it not only enables analysis of large genomes but can also effectively identify somatic variation. However, the high error rate of NGS (about 10^-3 to 10^-2) masks true mutations whose frequency is lower than the sequencing error rate, so the detection of low-frequency mutations remains a challenge for the following reasons. First, detection of low-frequency somatic mutations requires deep sequencing, but increasing the sequencing depth increases the sequencing cost. Second, even with a sufficient amount of sequencing template and sufficient sequencing depth, mutations at very low frequencies are still difficult to detect because of artifacts accumulated in the NGS workflow. These artifacts may result from DNA base damage during sample preparation, erroneous base incorporation by DNA polymerases during enrichment and library amplification, and errors in the final sequencing reads. To improve the ability to identify low-frequency variation, a series of NGS error correction methods have been proposed. Double sequencing based on a molecular tag (UMI) can effectively suppress high-throughput sequencing errors and is a method capable of detecting and quantifying extremely low-frequency mutations.

When the library is prepared, this method adds a special tag sequence at both ends of the original DNA template; the library is then PCR-amplified and sequenced by NGS, and the sequencing data are analyzed. During sequencing data analysis, the reads that carry the same tag sequence and were amplified from the same DNA template are grouped, separately for the sense strand and the antisense strand, to form a single-strand consensus sequence (SSCS) for each strand. The sense-strand SSCS is then compared with the complementary antisense-strand SSCS to generate a double-strand consensus sequence (DCS), and the DCS is aligned to the reference genome again to identify mutations or sequencing errors. Because the UMI-based double sequencing method uses the pairing of the sense strand and the antisense strand to correct errors further, its suppression of NGS sequencing errors is greatly improved. However, since only read families containing 3 or more reads are retained for generating SSCSs, DCSs are generated from SSCSs with low efficiency, which results in low utilization of the sequencing data, and the method requires a higher sequencing depth than conventional NGS.

Disclosure of Invention

The invention aims to solve the problem that the high error rate of high-throughput sequencing makes low-frequency mutations difficult to detect, and provides a method, a device and a storage medium for detecting low-frequency mutations based on double sequencing data. The method obtains read families for the sense strand and the antisense strand separately by alignment to a reference genome. To improve data utilization, it retains not only read families with 2 or more reads on the sense or antisense strand, but also read families with only 1 read, where that read is complementary to an SSCS generated from a read family with 2 or more reads, as well as pairs of read families that each contain only one read but whose reads are complementary. In addition, Bayes' theorem is used to calculate the probability that each base at each position is the true base; the base with the highest probability is selected as the consensus base to generate a single-strand consensus sequence (SSCS), and the quality of that base is recalculated from the probability to further improve the accuracy of sequencing data analysis. The method can effectively improve the utilization of raw sequencing data, reduce data waste, improve sequencing accuracy and reduce the required sequencing depth. It provides a reliable bioinformatics analysis workflow and tool for the detection of low-frequency mutations, and is expected to provide an effective detection means for medical testing and clinical diagnosis, thereby promoting the development of personalized medicine.

In view of the above, the technical solution adopted by the present invention is a method for detecting low-frequency mutation based on dual sequencing data, comprising the steps of:

(1) Cleaning the raw sequencing data, including removing low-quality reads and bases, duplicate reads and contaminating adapter sequences, to obtain cleaned data.

(2) Aligning the cleaned data to a reference genome, establishing read families of the sense strand and the antisense strand separately according to the alignment result, lowering the threshold on the number of reads contained in a read family to 1, screening the read families by using read complementarity, and constructing effective read families.

(3) For the constructed effective read families, calculating with a Bayesian algorithm the probability that each base at each position is the true base, selecting the base with the highest probability as the consensus base, generating a single-strand consensus sequence (SSCS), and recalculating the quality of that base from the probability.

(4) Forming DCSs from the SSCSs generated in step (3).

(5) Aligning the DCSs to the reference genome again, and identifying low-frequency mutations and sequencing errors on the reads according to the alignment result.

Specifically, the effective read families in step (2) include the following three types: 1) read families with 2 or more reads; 2) read families containing only 1 read, where that read is complementary to an SSCS generated from a read family with 2 or more reads; 3) pairs of read families that each contain only one read but whose reads are complementary.

Specifically, the Bayesian algorithm in the step (3) first accurately calculates the probability that each base at each position is a real base, selects the base with the highest probability as a consistent base and generates a single-stranded consistent sequence SSCS, and then recalculates the quality corresponding to the base according to the probability.

More preferably, the Bayesian algorithm specifically comprises the following steps. The prior probability is determined as follows: if the aligned base is identical to the candidate true base, the prior probability is 1 - 10^(-q/10); if the aligned base differs from the candidate true base, the prior probability is 10^(-q/10)/3, where q is the base quality value, and this distribution is denoted p(b, b_i, q_i). For the 4 possible bases A, G, C or T, the posterior probability is calculated according to formula (1):

For the base at each position in the SSCS, the probability that the consensus base I equals b (b ∈ {A, G, C, T}) is calculated, and the base with the maximum probability is taken as the true base. At the same time, according to formula (2), the base quality is recalculated from the posterior probability obtained above to give an error-corrected consensus read:

q_c = -10 log₁₀(1 - P[I = b_c | {(b_i, q_i)}])    Formula (2)
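The per-position calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it treats the four candidate bases as equally likely a priori and multiplies the per-read probabilities p(b, b_i, q_i), which is one standard way to apply Bayes' theorem and is consistent with formula (2), but it is not confirmed by the text to be the exact formula (1). The function names, the clamp on the error probability and the floor on the residual error are hypothetical choices of this sketch.

```python
import math

BASES = "ACGT"


def likelihood(observed, candidate, q):
    """p(b, b_i, q_i): chance of reading `observed` if `candidate` is the true base."""
    # Clamp the error probability so a quality of 0 stays finite (sketch choice).
    err = min(10 ** (-q / 10), 0.75)
    return 1 - err if observed == candidate else err / 3


def consensus_base(observations):
    """Consensus call for one position of a read family.

    observations: list of (base, phred_quality) pairs, one per read.
    Returns (consensus base, posterior probability, recalculated quality).
    Assumes a uniform prior over A, C, G, T (assumption of this sketch).
    """
    # Sum logs instead of multiplying probabilities to avoid underflow.
    log_post = {
        b: sum(math.log(likelihood(obs, b, q)) for obs, q in observations)
        for b in BASES
    }
    peak = max(log_post.values())
    weights = {b: math.exp(lp - peak) for b, lp in log_post.items()}
    total = sum(weights.values())
    best = max(BASES, key=lambda b: weights[b])
    # Residual error = posterior mass on the other three bases.
    p_err = sum(w for b, w in weights.items() if b != best) / total
    # Formula (2), with a floor on p_err so the quality stays finite (sketch choice).
    q_c = -10 * math.log10(max(p_err, 1e-9))
    return best, 1 - p_err, q_c
```

Under these assumptions a family with a single read gets back its original base and quality, for example `consensus_base([("G", 30)])` yields base `G` with a quality of about 30, while agreeing reads raise the recalculated quality above that of any individual read.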

Specifically, generating the single-strand consensus sequence SSCS comprises:

(I) If the number of sequencing reads in the read family is 2 or more, the read family is retained, the probability that each base at each position is the true base is calculated with the Bayesian algorithm, the base with the highest probability is selected as the consensus base to generate a single-strand consensus sequence SSCS, and the quality of that base is recalculated from the probability.

(II) If the number of sequencing reads in the read family is equal to 1 and the read is complementary to an SSCS generated from a read family with 2 or more reads, the read family is retained, the consensus base at each position on the read is determined with the Bayesian algorithm, a single-strand consensus sequence SSCS is generated, and the base quality is recalculated from the probability.

(III) If two read families each contain only one read but the read sequences are complementary, the read families are retained, the consensus base at each position on each read is determined with the Bayesian algorithm, a single-strand consensus sequence SSCS is generated, and the base quality is recalculated from the probability.

Further, generating the DCS comprises:

For (I), the sequences of the sense-strand SSCS and the antisense-strand SSCS generated in (I) are compared. If the bases at the same position match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed. If a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained.

For (II), the DCS is formed as follows. If the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed.

For (III), the SSCS sequences of the two read families are compared. If the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (e.g., 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed.
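The pairing rule shared by the three cases can be illustrated with a short Python sketch. It assumes the two SSCSs have already been brought into the same orientation and coordinate system, so that position i of one corresponds to position i of the other; the function name `make_dcs` and its parameter names are hypothetical, while the defaults (quality 10 for mismatches, removal above 50% 'N') follow the values given in the text.

```python
def make_dcs(sense, antisense, low_quality=10, max_n_fraction=0.5):
    """Combine two complementary SSCSs into one DCS.

    sense, antisense: lists of (base, quality) pairs of equal length,
    already expressed in the same orientation (assumption of this sketch).
    Returns the DCS as a list of (base, quality), or None if it is removed.
    """
    if len(sense) != len(antisense):
        raise ValueError("SSCS sequences must cover the same positions")
    dcs = []
    for (b1, q1), (b2, q2) in zip(sense, antisense):
        if b1 == b2:
            # Matching bases: keep the base, average the two qualities.
            dcs.append((b1, (q1 + q2) / 2))
        else:
            # Mismatch: mask the position and assign a low quality.
            dcs.append(("N", low_quality))
    if not dcs:
        return None
    n_fraction = sum(base == "N" for base, _ in dcs) / len(dcs)
    # Remove the DCS only when 'N' accounts for MORE than the limit.
    return None if n_fraction > max_n_fraction else dcs
```

With these defaults a DCS in which exactly half of the positions are 'N' is still kept, because the text removes a DCS only when the proportion is greater than 50%.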

The present invention also provides a computer-readable storage medium storing a computer program enabling the above-described detection of low-frequency mutations based on double sequencing data.

The main advantages of the invention include:

1. The invention retains not only read families with 2 or more reads, but also read families containing 1 read, where that read is complementary to an SSCS generated from a read family with 2 or more reads, as well as pairs of read families that each contain only one read but whose reads are complementary. By lowering the threshold on the number of reads contained in a read family to 1 and screening read families by making full use of read complementarity, the utilization of raw sequencing data is greatly improved.

2. The Bayesian algorithm recalculates the probability that each base at each position is the true base, selects the base with the highest probability as the true base at that position, and recalculates the quality score of that base from the probability.

3. The invention provides a high-throughput sequencing data analysis method and tool with higher sensitivity and better accuracy for trace templates and low-abundance gene mutation.

Drawings

FIG. 1 is a schematic diagram of the method for detecting low-frequency mutations based on double sequencing data according to the present invention: (A) aligning the raw sequencing data to a reference genome; (B) generating read families from the alignment result and filtering them, retaining read families with 2 or more reads, read families with exactly 1 read that is complementary to an SSCS generated from a read family with 2 or more reads, and pairs of read families that each contain only one read but whose reads are complementary; (C) generating SSCSs from the read families with the Bayesian algorithm; (D) generating DCSs from the SSCSs; (E) mutation identification, obtaining the mutation and sequencing error information in the double sequencing data.

FIG. 2 is a graph comparing error rates using the method of the present invention and a conventional double sequencing data analysis method.

Detailed Description

In order to clearly describe the technical contents of the present invention, the following description is further provided in conjunction with the embodiment examples.

The double sequencing data (accession number: SRP140497) and the human genome reference sequence hg19 were downloaded from the website of the National Center for Biotechnology Information, NCBI (https://www.ncbi.nlm.nih.gov/). In this embodiment, the data were analyzed separately by (A) the method of the present invention for detecting low-frequency mutations based on double sequencing data, in which the 3 types of effective read family are retained and a Bayesian algorithm is used to determine the consensus base and its quality at each position to generate SSCSs and DCSs, and (B) the conventional analysis method for double sequencing data, in which only read families with 3 or more reads are retained and Q30 is used as the quality-control threshold. Finally, the results of the two methods were compared. The specific procedure is as follows:

A. Analysis of the data by the method of the invention

(1) Data cleansing

Low-quality bases, contaminating adapter sequences and low-quality reads were first removed from the raw sequencing data using SOAPnuke or fastp. This step includes removing the first 3 bp of each sequence in the original dataset (3 bp were added during library preparation, comprising a 2 bp tag sequence and a single fixed base T) and writing a Python program to add the tag sequence to the read ID for subsequent sorting.
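The tag-handling step could look roughly like the following Python sketch, which processes one FASTQ record at a time. The text does not specify how the tag is written into the read ID, so the `|TAG` suffix used here, the function name `move_tag_to_read_id` and its parameters are all hypothetical.

```python
def move_tag_to_read_id(header, sequence, quality, tag_len=2, spacer_len=1):
    """Strip the leading tag and fixed base from a read and record the tag in its ID.

    header:   FASTQ header line, e.g. "@read1 1:N:0"
    sequence: read bases; the first `tag_len` bases are the tag, followed by
              `spacer_len` fixed bases (the single T described in the text)
    quality:  quality string of the same length as `sequence`
    Returns (new_header, trimmed_sequence, trimmed_quality).
    """
    trim = tag_len + spacer_len
    if len(sequence) < trim or len(sequence) != len(quality):
        raise ValueError("record too short or sequence/quality length mismatch")
    tag = sequence[:tag_len]
    # Append the tag to the read name only, leaving any comment field intact.
    name, sep, comment = header.partition(" ")
    new_header = f"{name}|{tag}{sep}{comment}"
    return new_header, sequence[trim:], quality[trim:]
```

Keeping the tag in the read name means it survives alignment, so reads can later be grouped by tag directly from the BAM records.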

(2) Generating read families

The sequencing data were aligned to the reference genome sequence hg19 using BWA (v0.7.15), and the alignments were sorted using SAMtools (v1.3) to generate a sorted BAM file containing the alignment data. Unaligned reads and reads aligned to multiple positions were removed according to the BWA alignment result. Read families were generated according to alignment position, molecular tag sequence, CIGAR string and alignment direction. The effective read families were then generated by lowering the threshold on the number of reads contained in a read family to 1 and making use of read complementarity. The effective read families generated by the invention comprise the following 3 types: (1) read families with 2 or more sequencing reads; (2) read families with exactly 1 sequencing read, where the read is complementary to an SSCS generated from a read family with 2 or more sequencing reads; (3) pairs of read families that each contain only one read and whose read sequences are complementary.
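The grouping and screening described above can be sketched in Python as follows. The grouping key follows the text (alignment position, tag, CIGAR, direction). How the complementary family is located is not spelled out, so this sketch assumes, purely for illustration, that the partner of a family has the same chromosome, position, tag and CIGAR but the opposite strand; it also does not check sequence complementarity itself. All function and field names are hypothetical.

```python
from collections import defaultdict


def build_families(reads):
    """Group aligned reads into read families.

    reads: iterable of dicts with keys chrom, pos, tag, cigar, strand ('+' or '-').
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["tag"], read["cigar"], read["strand"])
        families[key].append(read)
    return dict(families)


def partner_key(key):
    """Key of the assumed complementary family (same locus and tag, opposite strand)."""
    chrom, pos, tag, cigar, strand = key
    return (chrom, pos, tag, cigar, "-" if strand == "+" else "+")


def classify_families(families):
    """Return {key: type} for effective families; other families are left out.

    Type 1: 2 or more reads.
    Type 2: 1 read whose partner family has 2 or more reads.
    Type 3: 1 read whose partner family also has exactly 1 read.
    """
    kinds = {}
    for key, members in families.items():
        partner = families.get(partner_key(key), [])
        if len(members) >= 2:
            kinds[key] = 1
        elif len(partner) >= 2:
            kinds[key] = 2
        elif len(partner) == 1:
            kinds[key] = 3
    return kinds
```

A single-read family with no partner family receives no type and is therefore discarded, which matches the three retained categories.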

(3) Generating SSCS

When an SSCS is generated from a read family, the base at each position is usually determined by the majority rule: the proportion of each base at the position is calculated, a base accounting for more than 70% is retained and taken as the true base at that position, and the higher base quality value is used as the final base quality value. This method is simple and easy to implement, but the accuracy of the base quality values still needs to be improved in the presence of random errors from the sequencing instrument and errors introduced by PCR amplification. To correct the sequencing data further, the invention determines the consensus base with Bayes' theorem so as to reduce the sequencing error rate. The implementation is as follows: the probability that each base at each position is the true base is calculated according to Bayes' theorem, the base with the highest probability is selected as the consensus base, and the quality of each base on the consensus read is calculated from the probability, so that each base of the consensus read is more accurate and reliable. Since this embodiment retains 3 types of read family for SSCS generation, three cases are described.

In the first case, if the number of sequencing reads in the read family is 2 or more, the read family is retained, the probability that each base at each position is the true base is calculated with the Bayesian algorithm, and the base with the highest probability is designated as the true base at that position to obtain a single-strand consensus sequence SSCS; the base quality is then recalculated. In the second case, if the number of sequencing reads in the read family is equal to 1 and the read is complementary to an SSCS generated from a read family with 2 or more reads, the read family is retained, the consensus base at each position on the read is determined with the Bayesian algorithm, a single-strand consensus sequence SSCS is generated, and the base quality is recalculated from the probability. In the third case, if two read families each contain only one read but the read sequences are complementary, the read families are retained, the consensus base at each position on each read is determined with the Bayesian algorithm, a single-strand consensus sequence SSCS is generated, and the base quality is recalculated from the probability. The SSCSs generated in these three cases are written to a BAM1 file.

(4) Generating a DCS

For read families with 2 or more sequencing reads, the sense-strand SSCS and the antisense-strand SSCS generated in step (3) are compared. If the bases at the same position match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed. If a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained.

For a read family in step (3) that contains 1 sequencing read complementary to an SSCS generated from a read family with 2 or more reads, the SSCS complementary to the generated SSCS is located. If the bases at the same position of the two sequences match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (e.g., 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed.

For two read families in step (3) that each contain only one sequencing read but are complementary to each other, the SSCS sequences of the two read families are compared. If the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is unchanged, and the average of the base qualities at that position on the sense and antisense strands is taken as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (e.g., 10). A double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed. The DCSs generated as described above are written to the BAM1 file mentioned above.

(5) Mutation identification

The generated DCS pool was further aligned with the reference genome using software BWA (v0.7.15) to obtain mutation and sequencing error information in the double sequencing data.

B. Analysis of the data by the conventional double sequencing method

(1) Data cleansing

(2) Generating read family

The sequencing data were aligned to the reference genome sequence hg19 using BWA (v0.7.15), and the alignments were sorted using SAMtools (v1.3) to generate a sorted BAM file containing the alignment data. Unaligned reads and reads aligned to multiple positions were removed according to the BWA alignment result. Read families were generated according to alignment position, molecular tag sequence, CIGAR string and alignment direction. The threshold on the number of reads contained in a read family is 3, so the effective read families generated in this method are the read families with 3 or more sequencing reads.

(3) Generating SSCS

When an SSCS is generated from each read family, the base at each position is determined by the majority rule: the proportion of each base at the position is calculated, a base accounting for more than 70% is retained and taken as the true base at that position, and the higher base quality value is used as the final base quality value. All reads in a read family are processed in this way to generate a consensus read. The procedure is as follows: each read family must contain at least 3 reads; the bases of the reads in each read family are compared position by position; a position at which 70% or more of the bases agree is accepted as true, and otherwise the base is replaced with N. Reads in which N accounts for 30% or more are removed. The SSCSs of the sense strand and the antisense strand are thus formed separately.
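The conventional majority rule can be sketched in Python as below. The thresholds are the ones quoted in the text (at least 3 reads, 70% agreement, removal at 30% or more N); taking "the higher base quality value" to mean the highest quality among the agreeing reads, giving N positions a quality of 0, and the function name `majority_consensus` are assumptions of this sketch.

```python
from collections import Counter


def majority_consensus(reads, min_family_size=3, min_agreement=0.7, max_n_fraction=0.3):
    """Majority-rule SSCS for one read family.

    reads: list of (sequence, qualities) with equal-length sequences, where
           qualities is a list of Phred values.
    Returns (consensus sequence, consensus qualities), or None if the family
    is too small or the consensus contains too many N.
    """
    if len(reads) < min_family_size:
        return None
    length = len(reads[0][0])
    if length == 0:
        return None
    bases, quals = [], []
    for i in range(length):
        column = [(seq[i], qual[i]) for seq, qual in reads]
        base, count = Counter(b for b, _ in column).most_common(1)[0]
        if count / len(column) >= min_agreement:
            bases.append(base)
            # Keep the highest quality among reads supporting the consensus base.
            quals.append(max(q for b, q in column if b == base))
        else:
            bases.append("N")
            quals.append(0)
    if bases.count("N") / length >= max_n_fraction:
        return None
    return "".join(bases), quals
```

Note that with a family of exactly 3 reads, a 2-to-1 split falls below the 70% threshold and yields N, so small families effectively require unanimous agreement at every retained position.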

(4) Generating a DCS

The SSCS sequences generated by the sense strand and the antisense strand are further compared and analyzed to form DCS.

(5) Mutation identification

The generated DCS pool was further aligned with the reference genome using software BWA (v0.7.15) to obtain information on mutations and sequencing errors in the double sequencing data.

The error rates of the two analysis methods were compared, and the result is shown in FIG. 2. The percentage of sequencing errors obtained after analyzing the sequencing data with the method of the invention was 0.000497183%, whereas the percentage obtained after analyzing the same data with the conventional double sequencing method was 0.000512317%. This indicates that the present invention achieves high-quality suppression of sequencing errors.

Claims

1. A method for detecting low-frequency mutations based on double sequencing data, characterized by comprising the following steps:

(1) cleaning the sequencing original data to obtain cleaned data;

(2) comparing the cleaned data to a reference genome, respectively establishing read family of a sense strand and an antisense strand according to a comparison result, reducing the threshold value of the number of reads contained in the read family to 1, screening the read family by utilizing the characteristics of read complementation, and constructing effective read families;

(3) accurately calculating the probability that each base at each position is a real base by adopting a Bayesian algorithm for the constructed effective read families, selecting the base with the highest probability as a consistent base, generating a single-stranded consistent sequence SSCS, and recalculating the quality corresponding to the base according to the probability;

(4) further forming double-stranded consistent sequence DCS by the SSCS generated in the step (3);

2. The method for detecting low-frequency mutations based on double sequencing data according to claim 1, wherein the effective read families in step (2) comprise the following three types: 1) read families with 2 or more reads; 2) read families containing only 1 read, where that read is complementary to an SSCS generated from a read family with 2 or more reads; 3) pairs of read families that each contain only one read but whose reads are complementary.

3. The method for detecting low-frequency mutations based on double sequencing data according to claim 1 or 2, wherein generating the single-strand consensus sequence SSCS comprises: (I) if the number of sequencing reads in the read family is 2 or more, retaining the read family, calculating with a Bayesian algorithm the probability that each base at each position is the true base, selecting the base with the highest probability as the consensus base to generate a single-strand consensus sequence SSCS, and recalculating the quality of that base from the probability;

(II) if the number of sequencing reads of the read family is equal to 1 and the reads can be complementary with SSCS sequences generated by read families with the number of reads being more than or equal to 2, reserving the read family, determining the consistent base of each position on the read by using a Bayesian algorithm, generating a single-stranded consistent sequence SSCS, and recalculating the base quality according to the probability;

(III) if the two groups of read family only contain one read, but the read family sequences are complementary, the read family is reserved, the consistent base of each position on the read is determined by a Bayesian algorithm, a single-stranded consistent sequence SSCS is generated, and the base quality is recalculated according to the probability.

4. The method for detecting low frequency mutations based on dual sequencing data of claim 3, wherein: the generating the DCS comprises:

comparing the sense-strand SSCS and the antisense-strand SSCS generated in (I): if the bases at the same position match, the base at that position is unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed; if a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained;

forming the DCS in (II) as follows: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed;

comparing the SSCS sequences of the two read families in (III): if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-strand consensus sequence DCS is finally formed, but if 'N' accounts for more than 50% of the DCS, the DCS is removed.

5. A device for detecting low-frequency mutations based on dual sequencing data, characterized by comprising:

(1) a dual sequencing data cleaning unit, configured to clean the raw sequencing data to obtain cleaned data;

(2) a read family generation unit, configured to align the cleaned data to a reference genome, build read families for the sense strand and the antisense strand respectively according to the alignment result, lower the threshold on the number of reads contained in a read family to 1, screen the read families by exploiting the complementarity of reads, and build valid read families;

(3) a single-stranded consensus sequence SSCS generation unit, configured to calculate, for the valid read families built, the probability that each base at each position is the true base by a Bayesian algorithm, select the base with the highest probability as the consensus base to generate the single-stranded consensus sequence SSCS, and then recalculate the quality of that base from the probability;

(4) a double-stranded consensus sequence DCS generation unit for further forming DCS from the SSCS;

(5) a mutation identification unit, configured to align the DCS to the reference genome again and identify low-frequency mutations and sequencing errors on the reads according to the alignment result.

6. The apparatus for detecting low-frequency mutations based on dual sequencing data of claim 5, wherein the read family generation unit generates three types of valid read family: 1) read families containing 2 or more reads; 2) read families containing only 1 read, where that read is complementary to an SSCS generated from a read family containing 2 or more reads; 3) pairs of read families that each contain only one read, where the two reads are complementary.
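The three types of valid read family in claim 6 can be sketched as a small classification routine. The data model below is an assumption made for illustration only: each family is keyed by a hypothetical molecule identifier plus a strand label, and two families are treated as complementary when they share the molecule identifier and lie on opposite strands. The claims do not specify how complementarity is established.

```python
from typing import Dict, Optional, Tuple

FamilyKey = Tuple[str, str]  # (hypothetical molecule id, "sense" or "antisense")

OPPOSITE = {"sense": "antisense", "antisense": "sense"}


def classify_families(read_counts: Dict[FamilyKey, int]) -> Dict[FamilyKey, Optional[int]]:
    """Assign each read family to type 1, 2 or 3, or None if it is discarded.

    read_counts maps a family key to the number of reads in that family.
    """
    result: Dict[FamilyKey, Optional[int]] = {}
    for (molecule, strand), n_reads in read_counts.items():
        # Number of reads in the complementary family (0 if there is none).
        partner = read_counts.get((molecule, OPPOSITE[strand]), 0)
        if n_reads >= 2:
            result[(molecule, strand)] = 1      # type 1: two or more reads
        elif n_reads == 1 and partner >= 2:
            result[(molecule, strand)] = 2      # type 2: single read, partner has >= 2
        elif n_reads == 1 and partner == 1:
            result[(molecule, strand)] = 3      # type 3: two complementary single reads
        else:
            result[(molecule, strand)] = None   # single read with no complement
    return result
```

For example, a single-read family whose opposite-strand family holds three reads is labelled type 2, while a single-read family with no opposite-strand family is discarded.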

7. The apparatus for detecting low-frequency mutations based on dual sequencing data of claim 5, wherein the single-stranded consensus sequence SSCS generation unit specifically handles the following three cases:

in the first case, if the read family contains 2 or more sequencing reads, the read family is retained, the probability that each base at each position is the true base is calculated by a Bayesian algorithm, the base with the highest probability is designated as the consensus base at that position, a single-stranded consensus sequence SSCS is generated, and the quality of that base is then recalculated;

in the second case, if the read family contains exactly 1 sequencing read and that read is complementary to an SSCS generated from a read family containing 2 or more reads, the read family is retained, the consensus base at each position of the read is determined by the Bayesian algorithm, the SSCS is generated, and the base quality is then recalculated from the probability;

in the third case, if two read families each contain only one read but the two reads are complementary, both read families are retained, the consensus base at each position of each read is determined by the Bayesian algorithm, a single-stranded consensus sequence SSCS is generated, and the base quality is recalculated from the probability.
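The claims state that a Bayesian algorithm yields the consensus base and a recalculated quality, but they do not recite the likelihood model. The following Python sketch therefore rests on stated assumptions rather than on the patent's own formula: a uniform prior over A, C, G and T, a per-read error probability of 10^(-Q/10) derived from the Phred quality Q, errors spread equally over the three other bases, and a hypothetical quality cap `MAX_QUAL`.

```python
import math
from typing import List, Tuple

BASES = "ACGT"
MAX_QUAL = 60  # assumed cap on the recalculated quality


def consensus_base(observed: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return (consensus base, Phred quality) for one position of a read family.

    observed holds one (base, Phred quality) pair per read at this position.
    """
    usable = [(b, q) for b, q in observed if b in BASES]
    if not usable:
        return "N", 0

    # Log-likelihood of the observations for each candidate true base.
    # The uniform prior is the same for every candidate and cancels out.
    log_post = {t: 0.0 for t in BASES}
    for base, qual in usable:
        err = 10 ** (-qual / 10)
        err = min(max(err, 1e-10), 0.75)  # keep both log() arguments positive
        for t in BASES:
            log_post[t] += math.log(1 - err if t == base else err / 3)

    # Normalise in a numerically stable way to obtain posterior weights.
    top = max(log_post.values())
    weights = {t: math.exp(v - top) for t, v in log_post.items()}
    total = sum(weights.values())

    best = max(BASES, key=lambda t: weights[t])
    # Posterior probability that the chosen consensus base is wrong.
    p_err = sum(w for t, w in weights.items() if t != best) / total
    if p_err <= 0:
        return best, MAX_QUAL
    return best, min(MAX_QUAL, round(-10 * math.log10(p_err)))
```

Under these assumptions a family with a single read keeps that read's base and quality, which corresponds to the second and third cases, while agreeing reads in a larger family raise the consensus quality above that of any single read.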

8. The apparatus for detecting low-frequency mutations based on dual sequencing data of claim 7, wherein the double-stranded consensus sequence DCS generation unit generates the DCS in the following three ways:

for the first case, the sense-strand SSCS and the antisense-strand SSCS generated are compared position by position: if the bases at the same position match, the base at that position is kept unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence DCS is thereby formed, but if the proportion of 'N' in the DCS exceeds 50%, the DCS is removed; if a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained;

for the second case, the DCS is generated as follows: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is kept unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence DCS is thereby formed, but if the proportion of 'N' in the DCS exceeds 50%, the DCS is removed;

for the third case, the SSCS sequences of the two read families are compared: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is kept unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of that base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence DCS is thereby formed, but if the proportion of 'N' in the DCS exceeds 50%, the DCS is removed.

9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed, implements the method for detecting low-frequency mutations based on dual sequencing data of any of claims 1-5.