CN114530199A - Method and device for detecting low-frequency mutation based on double sequencing data and storage medium - Google Patents

Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Info

Publication number
CN114530199A
Authority
CN
China
Prior art keywords
read
base
sscs
dcs
family
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061903.5A
Other languages
Chinese (zh)
Inventor
浦丹
陈慧敏
向旭东
李�杰
张扬
舒坤贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202210061903.5A
Publication of CN114530199A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Abstract

The invention claims a method, a device and a storage medium for detecting low-frequency mutations based on duplex sequencing data. The invention lowers the threshold on the number of reads that a read family must contain to 1 and exploits read complementarity to screen read families. Three types of read family are retained: read families with 2 or more reads; read families with only 1 read, where that read is complementary to an SSCS generated by a read family with 2 or more reads; and pairs of read families that each contain a single read and whose reads are complementary to each other. For these 3 types of read family, Bayes' theorem is used to determine the consensus base and its quality score at each position, a single-stranded consensus sequence (SSCS) is generated from the consensus bases, and two complementary SSCSs are then combined into a double-stranded consensus sequence (DCS). Finally, the DCSs are realigned to the reference genome to identify low-frequency mutations and sequencing errors on the reads. The invention can effectively suppress errors in high-throughput sequencing data and improve the accuracy of low-frequency mutation detection.

Description

Method and device for detecting low-frequency mutation based on double sequencing data and storage medium
Technical Field
The invention relates to the field of bioinformatics, and in particular to a method and a device for detecting low-frequency mutations based on duplex sequencing data.
Background
In DNA samples such as tumor biopsies and circulating cell-free nucleic acids, mutations may be present at very low frequency (less than 0.01%) among the somatic DNA molecules measured. Detecting such extremely low-frequency somatic mutations has broad application prospects in early diagnosis, monitoring and prognosis of tumors, forensic identification, prenatal diagnosis and similar areas. The development of next-generation sequencing (NGS) technology has changed the scale and depth of research in biology and medicine. Because NGS is large-scale, high-throughput and low-cost, it can be used both to analyze large genomes and to identify somatic variation. However, the high error rate of NGS (about $10^{-3}$ to $10^{-2}$) masks true mutations whose frequency is below the sequencing error rate, so detecting low-frequency mutations remains a challenge for two reasons. First, detecting low-frequency somatic mutations requires deep sequencing, and increasing the sequencing depth increases the sequencing cost. Second, even with a sufficient amount of sequencing template and sufficient sequencing depth, mutations at very low frequency are still difficult to detect because of artifacts that accumulate in the NGS workflow. These artifacts may result from DNA base damage during sample preparation, erroneous base incorporation by DNA polymerases during enrichment and library amplification, and errors in the final sequencing reads. To improve the ability to identify low-frequency variation, researchers have proposed a series of NGS error-correction methods. Duplex sequencing based on unique molecular identifiers (UMIs) can effectively suppress high-throughput sequencing errors and is a method capable of detecting and quantifying extremely low-frequency mutations.
When the library is prepared, the method attaches a special tag sequence to both ends of the original DNA template; the library is then PCR-amplified and sequenced by NGS, and the sequencing data are analyzed. During data analysis, the reads that carry the same tag sequence and were amplified from the same DNA template are grouped separately for the sense strand and the antisense strand, and each group is merged into a single-stranded consensus sequence (SSCS) for that strand. The sense-strand SSCS is then compared with the complementary antisense-strand SSCS to generate a double-stranded consensus sequence (DCS), and the DCS is realigned to the reference genome to identify mutations or sequencing errors. Because the UMI-based duplex sequencing method uses the pairing of the sense and antisense strands as a further error-correction step, its suppression of NGS sequencing errors is greatly improved. However, since only read families containing 3 or more reads are retained for generating SSCSs, few SSCSs go on to form a DCS, which results in low utilization of the sequencing data, and the method requires a higher sequencing depth than conventional NGS.
Disclosure of Invention
The invention aims to solve the problem that the high error rate of high-throughput sequencing makes low-frequency mutations difficult to detect, and provides a method, a device and a storage medium for detecting low-frequency mutations based on duplex sequencing data. The method obtains the read families of the sense strand and the antisense strand separately by alignment to a reference genome. To improve data utilization, it retains not only read families with 2 or more reads, but also read families with only 1 read where that read is complementary to an SSCS generated by a read family with 2 or more reads, as well as pairs of read families that each contain only one read and whose reads are complementary. In addition, Bayes' theorem is used to calculate the probability that each base at each position is the true base; the base with the highest probability is selected as the consensus base and a single-stranded consensus sequence (SSCS) is generated, and the quality of that base is recalculated from the probability to further improve the accuracy of the sequencing data analysis. The method improves the utilization of raw sequencing data, reduces data waste, improves sequencing accuracy and reduces the required sequencing depth. It provides a reliable bioinformatics analysis workflow and tool for low-frequency mutation detection, and is expected to provide an effective means of detection for medical testing and clinical diagnosis, thereby promoting the development of personalized medicine.
In view of the above, the technical solution adopted by the present invention is a method for detecting low-frequency mutations based on duplex sequencing data, comprising the following steps:
(1) Cleaning the raw sequencing data, including removing low-quality reads and bases, duplicate reads and contaminating adapter sequences, to obtain cleaned data.
(2) Aligning the cleaned data to a reference genome, establishing the read families of the sense strand and the antisense strand separately according to the alignment result, lowering the threshold on the number of reads contained in a read family to 1, screening the read families using read complementarity, and constructing the valid read families.
(3) For the constructed valid read families, calculating with a Bayesian algorithm the probability that each base at each position is the true base, selecting the base with the highest probability as the consensus base, generating a single-stranded consensus sequence (SSCS), and recalculating the quality of that base from the probability.
(4) Forming a DCS from the SSCSs generated in step (3).
(5) Realigning the DCS to the reference genome, and identifying low-frequency mutations and sequencing errors on the reads according to the alignment result.
Specifically, the valid read families in step (2) are of the following three types: 1) read families with 2 or more reads; 2) read families containing only 1 read, where that read is complementary to the SSCS generated by a read family with 2 or more reads; 3) two read families that each contain only one read and whose reads are complementary.
Specifically, the Bayesian algorithm in step (3) first calculates the probability that each base at each position is the true base, selects the base with the highest probability as the consensus base and generates a single-stranded consensus sequence (SSCS), and then recalculates the quality of that base from the probability.
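The three retained family types can be illustrated with a minimal sketch. It assumes, purely for illustration, that each family is keyed by a hypothetical `(molecule_id, strand)` pair and that the complementary family of a molecule is the one with the same `molecule_id` on the opposite strand; the text does not specify how complementarity is tested, and the function name is not from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (molecule_id, strand), with strand "+" (sense) or "-" (antisense).
FamilyKey = Tuple[str, str]


def select_valid_families(families: Dict[FamilyKey, List[str]]) -> Dict[FamilyKey, int]:
    """Map each retained family to its type (1, 2 or 3); other families are dropped."""
    valid: Dict[FamilyKey, int] = {}
    for (molecule_id, strand), reads in families.items():
        opposite = "-" if strand == "+" else "+"
        partner = families.get((molecule_id, opposite))
        if len(reads) >= 2:
            # Type 1: two or more reads, kept regardless of the partner strand.
            valid[(molecule_id, strand)] = 1
        elif len(reads) == 1 and partner:
            # Type 2: single read whose partner family has >= 2 reads.
            # Type 3: single read whose partner family is also a single read.
            valid[(molecule_id, strand)] = 2 if len(partner) >= 2 else 3
    return valid
```

A single-read family with no family on the opposite strand is the only case this sketch discards.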
More preferably, the Bayesian algorithm is as follows. The prior probability is determined first: if the aligned base is identical to the possible true base, the prior probability is $1 - 10^{-q/10}$; if the aligned base differs from the possible true base, the prior probability is $10^{-q/10}/3$, where $q$ is the base quality value. This distribution is denoted $p(b, b_i, q_i)$. For the 4 possible bases A, G, C or T, the posterior probability is calculated according to formula (1):
[Formula (1): equation image BDA0003478703480000031]
For the base at each position in the SSCS, the probability that the consensus base $I$ is $b$ is calculated for each $b \in \{A, G, C, T\}$, and the base with the largest probability is taken as the true base. At the same time, the base quality is recalculated from the resulting posterior probability according to formula (2), giving an error-corrected consensus read:
$$q_c = -10\log_{10}\left(1 - P\left[I = b_c \mid \{(b_i, q_i)\}\right]\right) \qquad \text{Formula (2)}$$
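The consensus call and formula (2) can be sketched as follows. Formula (1) is only available as an image, so the posterior used here is an assumption: reads are treated as independent and the four bases as equally likely a priori, which gives $P[I = b \mid \{(b_i, q_i)\}] = \prod_i p(b, b_i, q_i) / \sum_{b'} \prod_i p(b', b_i, q_i)$. The function names, the cap `max_q` and the clamping of the error term are illustrative choices, not part of the source.

```python
import math
from typing import List, Tuple

BASES = "ACGT"


def base_term(candidate: str, observed: str, q: float) -> float:
    """p(b, b_i, q_i): 1 - e if the read base equals the candidate, else e / 3."""
    # Assumption: clamp e at 0.75 so that q = 0 is uninformative instead of impossible.
    e = min(10 ** (-q / 10), 0.75)
    return 1 - e if observed == candidate else e / 3


def consensus_base(observations: List[Tuple[str, float]], max_q: float = 60.0) -> Tuple[str, float]:
    """Return (consensus base, recalculated quality) for one position of a read family."""
    # Sum logs instead of multiplying probabilities to avoid underflow in deep families.
    log_score = {
        b: sum(math.log(base_term(b, obs, q)) for obs, q in observations)
        for b in BASES
    }
    peak = max(log_score.values())
    weight = {b: math.exp(s - peak) for b, s in log_score.items()}
    best = max(BASES, key=weight.get)
    total = sum(weight.values())
    # 1 - P[I = best | data], computed as the normalized weight of the other bases.
    p_error = sum(w for b, w in weight.items() if b != best) / total
    # Formula (2), capped at max_q (an assumed ceiling) for very confident calls.
    q_c = max_q if p_error <= 0 else min(max_q, -10 * math.log10(p_error))
    return best, q_c
```

Under these assumptions a family with a single read keeps its original quality, for example `consensus_base([("G", 30)])` yields a quality of about 30, while agreeing reads raise the quality above that of any individual read.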
Specifically, the single-stranded consensus sequence (SSCS) is generated as follows:
(I) If the number of sequencing reads in a read family is 2 or more, the read family is retained, the probability that each base at each position is the true base is calculated with the Bayesian algorithm, the base with the highest probability is selected as the consensus base, an SSCS is generated, and the quality of that base is recalculated from the probability.
(II) If the number of sequencing reads in a read family is equal to 1 and the read is complementary to the SSCS generated by a read family with 2 or more reads, the read family is retained, the consensus base at each position of the read is determined with the Bayesian algorithm, an SSCS is generated, and the base quality is recalculated from the probability.
(III) If two read families each contain only one read but the read sequences are complementary, both read families are retained, the consensus base at each position of the read is determined with the Bayesian algorithm, an SSCS is generated, and the base quality is recalculated from the probability.
Further, the generating the DCS includes:
For (I), the sense-strand SSCS and the antisense-strand SSCS are compared. If the bases at the same position match, the base at that position is left unchanged and the average of the base qualities at that position on the sense and antisense strands is used as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a double-stranded consensus sequence (DCS), but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed. If an SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained.
For (II), the DCS is formed in the same way: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged and the average of the base qualities at that position on the two strands is used as the quality of that base; if they do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a DCS, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed.
For (III), the SSCS sequences of the two read families are compared: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged and the average of the base qualities at that position on the two strands is used as the quality of that base; if they do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a DCS, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed.
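The pairing rule shared by cases (I) to (III) can be sketched as follows. This is an illustration only: it assumes both SSCSs have already been brought into the same orientation and have equal length, and the function and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple

Consensus = Tuple[str, List[float]]  # (sequence, per-base qualities)


def make_dcs(
    sense: Consensus,
    antisense: Consensus,
    n_quality: float = 10,
    max_n_fraction: float = 0.5,
) -> Optional[Consensus]:
    """Merge two complementary SSCSs into a DCS, or return None if it is discarded."""
    seq_s, qual_s = sense
    seq_a, qual_a = antisense
    if len(seq_s) != len(seq_a) or not seq_s:
        raise ValueError("SSCS sequences must be non-empty and of equal length")
    bases: List[str] = []
    quals: List[float] = []
    for b_s, b_a, q_s, q_a in zip(seq_s, seq_a, qual_s, qual_a):
        if b_s == b_a:
            # Matching bases are kept; quality is the mean of the two strands.
            bases.append(b_s)
            quals.append((q_s + q_a) / 2)
        else:
            # Mismatches become 'N' with a low quality (10 in the text's example).
            bases.append("N")
            quals.append(n_quality)
    # Discard the DCS when more than 50% of its positions are 'N'.
    if bases.count("N") / len(bases) > max_n_fraction:
        return None
    return "".join(bases), quals
```

For example, merging `("ACGT", [30, 30, 30, 30])` with `("ACTT", [40, 20, 30, 30])` yields `("ACNT", [35.0, 25.0, 10, 30.0])`.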
The present invention also provides a computer-readable storage medium storing a computer program which, when executed, implements the above method for detecting low-frequency mutations based on duplex sequencing data.
The main advantages of the invention include:
1. The invention retains not only read families with 2 or more reads, but also read families containing 1 read where that read is complementary to the SSCS generated by a read family with 2 or more reads, as well as pairs of read families that each contain only one read and whose reads are complementary. Lowering the threshold on the number of reads in a read family to 1 and screening read families by read complementarity greatly improves the utilization of the raw sequencing data.
2. The Bayesian algorithm recalculates the probability that each base at each position is the true base, selects the base with the highest probability as the true base at that position, and recalculates the quality score of that base from the probability.
3. The invention provides a high-throughput sequencing data analysis method and tool with higher sensitivity and better accuracy for trace templates and low-abundance gene mutations.
Drawings
FIG. 1 is a schematic diagram of the method for detecting low-frequency mutations based on duplex sequencing data according to the present invention: (A) the raw sequencing data are aligned to a reference genome; (B) read families are generated from the alignment result and filtered, retaining the valid read families, namely those with 2 or more reads, those with exactly 1 read that is complementary to the SSCS generated by a read family with 2 or more reads, and pairs of read families that each contain only one read and whose reads are complementary; (C) an SSCS is generated from each read family with the Bayesian algorithm; (D) the SSCSs are combined into DCSs; (E) mutation identification yields the mutation and sequencing error information in the duplex sequencing data.
FIG. 2 compares the error rates obtained with the method of the present invention and with a conventional duplex sequencing data analysis method.
Detailed Description
In order to clearly describe the technical contents of the present invention, the following description is further provided in conjunction with the embodiment examples.
The duplex sequencing data (accession number: SRP140497) and the human genome reference sequence hg19 were downloaded from the website of the National Center for Biotechnology Information, NCBI (https://www.ncbi.nlm.nih.gov/). In this embodiment, the data were analyzed with (A) the method of the present invention for detecting low-frequency mutations based on duplex sequencing data, in which the 3 types of valid read family are retained and a Bayesian algorithm determines the consensus base and its base quality at each position to generate SSCSs and DCSs, and (B) the conventional duplex sequencing analysis method, in which only read families with 3 or more reads are retained and Q30 is used as the quality-control threshold. The results of the two methods were then compared. The specific procedure is as follows:
A. Analysis of the data by the method of the invention
(1) Data cleansing
Low-quality bases, contaminating adapter sequences and low-quality reads were first removed from the raw sequencing data using SOAPnuke or fastp. This step also removes the first 3 bp of each sequence in the original dataset (3 bp were added during library preparation, consisting of a 2 bp tag sequence and a single fixed base T); a Python program was written to append the tag sequence to the read ID for subsequent sorting.
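The tag-handling part of this step might look like the sketch below, which is not the program referred to in the text. It operates on one FASTQ record, and the `:` separator used when appending the tag to the read ID, as well as the function name, are assumptions.

```python
from typing import Tuple


def move_tag_to_id(
    header: str, seq: str, qual: str, tag_len: int = 2, spacer_len: int = 1
) -> Tuple[str, str, str]:
    """Trim the leading tag and fixed base from a read and record the tag in its ID."""
    trim = tag_len + spacer_len            # 2 bp tag + 1 fixed base T = 3 bp
    tag = seq[:tag_len]
    # Keep any FASTQ comment (text after the first space) behind the modified ID.
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}:{tag}" + (f" {comment}" if comment else "")
    # Sequence and quality strings are trimmed by the same amount to stay in sync.
    return new_header, seq[trim:], qual[trim:]
```

For example, `move_tag_to_id("@r1 1:N:0", "ACTGGGA", "IIIIIII")` returns `("@r1:AC 1:N:0", "GGGA", "IIII")`.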
(2) Generating read families
The sequencing data were aligned to the reference genome sequence hg19 using BWA (v0.7.15), and the alignments were sorted using SAMtools (v1.3) to generate a sorted BAM file containing the alignment data. Based on the BWA results, unaligned reads and reads aligned to multiple positions were removed. Read families were then generated according to alignment position, molecular tag sequence, CIGAR string and alignment direction. Valid read families were then selected by lowering the threshold on the number of reads in a read family to 1 and exploiting read complementarity. The valid read families generated by the invention are of the following 3 types: (1) read families with 2 or more sequencing reads; (2) read families with exactly 1 sequencing read, where the read is complementary to the SSCS generated by a read family with 2 or more reads; (3) two read families that each contain only one read and whose read sequences are complementary.
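The grouping by alignment position, tag, CIGAR string and direction can be sketched as below. To stay self-contained the example uses a hypothetical `AlignedRead` record rather than a BAM parser; the field names are illustrative and not taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    """Minimal stand-in for one uniquely aligned read taken from the sorted BAM file."""
    name: str
    chrom: str
    pos: int
    cigar: str
    is_reverse: bool
    tag: str  # molecular tag previously appended to the read ID


FamilyKey = Tuple[str, int, str, str, bool]


def build_read_families(reads: Iterable[AlignedRead]) -> Dict[FamilyKey, List[str]]:
    """Group read names by (chromosome, position, tag, CIGAR, direction)."""
    families: Dict[FamilyKey, List[str]] = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.pos, read.tag, read.cigar, read.is_reverse)
        families[key].append(read.name)
    return dict(families)
```

Families of size 1 are kept at this stage; whether they are retained depends on the complementarity screening described above.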
(3) Generating SSCS
When an SSCS is generated from a read family, the base at each position is conventionally determined by majority rule: the proportion of each base at the position is calculated, a base whose proportion exceeds 70% is taken as the true base at that position, and the higher base quality value is used as the final base quality value. This approach is simple and easy to implement, but in the presence of random errors from the sequencing instrument and errors introduced by PCR amplification, the accuracy of the base quality values still needs to be improved. To further correct the sequencing data, the invention determines the consensus bases using Bayes' theorem so as to reduce the sequencing error rate. The implementation is as follows: the probability that each base at each position is the true base is calculated according to Bayes' theorem, the base with the highest probability is selected as the consensus base, and the quality of each base of the consensus read is calculated from the probability, so that each base of the consensus read is more accurate and reliable. Since this embodiment retains 3 types of read family for SSCS generation, the three cases are described separately.
In the first case, if the number of sequencing reads in a read family is 2 or more, the read family is retained, the probability that each candidate base at each position is the true base is calculated with the Bayesian algorithm, and the base with the highest probability is designated as the true base at that position to obtain a single-stranded consensus sequence (SSCS); the base quality is then recalculated. In the second case, if the number of sequencing reads in a read family is equal to 1 and the read is complementary to the SSCS generated by a read family with 2 or more reads, the read family is retained, the consensus base at each position of the read is determined with the Bayesian algorithm, an SSCS is generated, and the base quality is recalculated from the probability. In the third case, if two read families each contain only one read but the read sequences are complementary, both read families are retained, the consensus base at each position of the read is determined with the Bayesian algorithm, an SSCS is generated, and the base quality is recalculated from the probability. The SSCSs generated in these three cases are written to a BAM1 file.
(4) Generating a DCS
For read families with 2 or more sequencing reads, the sense-strand SSCS and the antisense-strand SSCS generated in step (3) are compared. If the bases at the same position match, the base at that position is left unchanged and the average of the base qualities at that position on the sense and antisense strands is used as the quality of that base. If the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a double-stranded consensus sequence (DCS), but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed. If an SSCS has no complementary strand but its read family contains two or more reads, the SSCS is also retained.
For a read family from step (3) that contains 1 sequencing read and is complementary to the SSCS generated by a read family with 2 or more reads, the SSCS complementary to its own SSCS is located. If the bases at the same position of the two sequences match, the base at that position is left unchanged and the average of the base qualities at that position on the sense and antisense strands is used as the quality of that base; if they do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a DCS, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed.
For two read families from step (3) that each contain only one sequencing read and are complementary to each other, the SSCS sequences of the two families are compared. If the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged and the average of the base qualities at that position on the sense and antisense strands is used as the quality of that base; if they do not match, the base at that position is represented by 'N' and assigned a low base quality (such as 10). This finally forms a DCS, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed. The DCSs generated as described above are written to the BAM1 file described above.
(5) Mutation identification
The generated DCS pool was further aligned with the reference genome using software BWA (v0.7.15) to obtain mutation and sequencing error information in the double sequencing data.
B. Analysis of the data by the conventional duplex sequencing method
(1) Data cleansing
Low-quality bases, contaminating adapter sequences and low-quality reads were first removed from the raw sequencing data using SOAPnuke or fastp. This step also removes the first 3 bp of each sequence in the original dataset (3 bp were added during library preparation, consisting of a 2 bp tag sequence and a single fixed base T); a Python program was written to append the tag sequence to the read ID for subsequent sorting.
(2) Generating read family
The sequencing data were aligned to the reference genome sequence hg19 using BWA (v0.7.15), and the alignments were sorted using SAMtools (v1.3) to generate a sorted BAM file containing the alignment data. Based on the BWA results, unaligned reads and reads aligned to multiple positions were removed. Read families were then generated according to alignment position, molecular tag sequence, CIGAR string and alignment direction. The threshold on the number of reads in a read family is 3, so the valid read families in this method are those with 3 or more sequencing reads.
(3) Generating SSCS
When an SSCS is generated from a read family, the base at each position is determined by majority rule: the proportion of each base at the position is calculated, a base whose proportion exceeds 70% is taken as the true base at that position, and the higher base quality value is used as the final base quality value. All reads in a read family are processed in this way to generate a consensus read, as follows: each read family must contain at least 3 reads; the bases of the reads in the family are compared position by position; a position where 70% or more of the reads agree is accepted as true, and otherwise the base is replaced by N. Consensus reads in which N accounts for 30% or more of the bases are removed. This finally yields the SSCSs of the sense strand and of the antisense strand.
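A minimal sketch of this majority-rule consensus is given below. It assumes all reads in a family have equal length, uses the "70% or more" form of the threshold, and assigns quality 0 to 'N' positions; the last point and the function name are assumptions, since the text does not specify them.

```python
from collections import Counter
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base qualities)


def majority_consensus(
    reads: List[Read],
    min_family_size: int = 3,
    min_fraction: float = 0.7,
    max_n_fraction: float = 0.3,
) -> Optional[Read]:
    """Majority-rule SSCS for one read family, or None if the family is rejected."""
    if len(reads) < min_family_size:
        return None
    length = len(reads[0][0])
    if length == 0 or any(len(seq) != length for seq, _ in reads):
        raise ValueError("reads must be non-empty and of equal length")
    bases: List[str] = []
    quals: List[int] = []
    for i in range(length):
        column = [(seq[i], qual[i]) for seq, qual in reads]
        top_base, count = Counter(b for b, _ in column).most_common(1)[0]
        if count / len(column) >= min_fraction:
            bases.append(top_base)
            # Use the highest quality among the reads supporting the consensus base.
            quals.append(max(q for b, q in column if b == top_base))
        else:
            bases.append("N")
            quals.append(0)  # assumed placeholder quality for 'N'
    # Reject consensus reads in which 'N' makes up 30% or more of the bases.
    if bases.count("N") / length >= max_n_fraction:
        return None
    return "".join(bases), quals
```

With three reads, a position supported by only two of them (about 67%) falls below the threshold and becomes 'N', which shows how strict this rule is for small families.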
(4) Generating a DCS
The SSCS sequences generated by the sense strand and the antisense strand are further compared and analyzed to form DCS.
(5) Mutation identification
The generated DCS pool was further aligned with the reference genome using software BWA (v0.7.15) to obtain information on mutations and sequencing errors in the double sequencing data.
The error rates of the two analysis methods were compared, and the result is shown in FIG. 2. The percentage of sequencing errors obtained after analyzing the sequencing data with the method of the invention was 0.000497183%, whereas the percentage obtained after analyzing the same data with the conventional duplex sequencing method was 0.000512317%. This indicates that the present invention achieves high-quality suppression of sequencing errors.
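The two reported percentages can be put side by side with a short calculation. Only the two error rates come from the text; the helper name is hypothetical.

```python
def relative_reduction(conventional: float, proposed: float) -> float:
    """Fraction by which the proposed error rate is lower than the conventional one."""
    return (conventional - proposed) / conventional


# Error rates in percent, as reported for the two analysis methods.
conventional_pct = 0.000512317
proposed_pct = 0.000497183

absolute_pct = conventional_pct - proposed_pct
relative = relative_reduction(conventional_pct, proposed_pct)
print(f"absolute difference: {absolute_pct:.9f} percentage points")
print(f"relative reduction:  {relative:.2%}")
```

With these values the relative reduction in error rate is about 2.95%.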

Claims (9)

1. The method for detecting the low-frequency mutation based on the double sequencing data is characterized by comprising the following steps of:
(1) cleaning the sequencing original data to obtain cleaned data;
(2) comparing the cleaned data to a reference genome, respectively establishing read family of a sense strand and an antisense strand according to a comparison result, reducing the threshold value of the number of reads contained in the read family to 1, screening the read family by utilizing the characteristics of read complementation, and constructing effective read families;
(3) accurately calculating the probability that each base at each position is a real base by adopting a Bayesian algorithm for the constructed effective read families, selecting the base with the highest probability as a consistent base, generating a single-stranded consistent sequence SSCS, and recalculating the quality corresponding to the base according to the probability;
(4) further forming double-stranded consistent sequence DCS by the SSCS generated in the step (3);
(5) and comparing the DCS with the reference genome again, and identifying low-frequency mutation and sequencing errors on the reads according to the comparison result.
2. The method for detecting low frequency mutations based on dual sequencing data according to claim 1, wherein: the effective read families in the step (2) comprise the following three types: 1) reading family with the number of reading stages more than or equal to 2; 2) the read family only contains 1 read family, and the read family can be complemented with SSCS sequences generated by read families with the number of read families more than or equal to 2; 3) each containing only one read, but two sets of read families with complementary reads.
3. The method for detecting low-frequency mutations based on duplex sequencing data according to claim 1 or 2, wherein generating the single-strand consensus sequence (SSCS) comprises:
(I) if the number of sequencing reads in a read family is greater than or equal to 2, retaining the read family, accurately calculating, by a Bayesian algorithm, the probability that each base at each position is the true base, selecting the base with the highest probability as the consensus base, generating a single-strand consensus sequence (SSCS), and recalculating the quality of the base according to the probability;
(II) if the number of sequencing reads in a read family is equal to 1 and the read is complementary to an SSCS generated from a read family containing 2 or more reads, retaining the read family, determining the consensus base at each position of the read by the Bayesian algorithm, generating a single-strand consensus sequence (SSCS), and recalculating the base quality according to the probability;
(III) if two read families each contain only one read but the two reads are complementary, retaining the read families, determining the consensus base at each position of the read by the Bayesian algorithm, generating a single-strand consensus sequence (SSCS), and recalculating the base quality according to the probability.
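The claims do not spell out the Bayesian model, so the following is only one plausible sketch of a per-position consensus call. It assumes a uniform prior over A, C, G and T, treats each read's Phred quality q as an error probability 10^(-q/10) split evenly over the three other bases, and caps the recalculated quality at a hypothetical `max_q` of 60. The function names `consensus_base` and `build_sscs` are illustrative.

```python
import math
from typing import List, Tuple

BASES = "ACGT"


def consensus_base(bases: str, quals: List[int], max_q: int = 60) -> Tuple[str, int]:
    """Return (consensus base, recalculated Phred quality) for one alignment column."""
    log_like = {}
    for cand in BASES:
        total = 0.0
        for b, q in zip(bases, quals):
            # Error probability from the Phred score, clamped to keep the logs finite.
            e = min(max(10 ** (-q / 10), 1e-10), 0.75)
            total += math.log(1 - e) if b == cand else math.log(e / 3)
        log_like[cand] = total
    # Uniform prior: the posterior is the normalised likelihood.
    peak = max(log_like.values())
    weights = {c: math.exp(v - peak) for c, v in log_like.items()}
    best = max(BASES, key=lambda c: weights[c])
    p_err = sum(w for c, w in weights.items() if c != best) / sum(weights.values())
    if p_err <= 0:
        return best, max_q
    return best, min(max_q, round(-10 * math.log10(p_err)))


def build_sscs(reads: List[str], quals: List[List[int]]) -> Tuple[str, List[int]]:
    """Call every column of an equal-length read family to form an SSCS."""
    columns = zip(zip(*reads), zip(*quals))
    called = [consensus_base("".join(bs), list(qs)) for bs, qs in columns]
    return "".join(b for b, _ in called), [q for _, q in called]


seq, q = build_sscs(["ACGT", "ACGT", "ACTT"], [[30] * 4] * 3)
print(seq, q)
```

For a family with a single read, this model returns the read's own bases with qualities equal to the input qualities, which matches the idea that a singleton carries no extra evidence until it is paired with its complementary strand.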
4. The method for detecting low-frequency mutations based on duplex sequencing data according to claim 3, wherein generating the DCS comprises:
for (I), comparing the sense-strand SSCS and the antisense-strand SSCS generated in (I): if the bases at the same position match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed; if a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is still retained;
for (II), the DCS is generated as follows: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed;
for (III), the SSCS sequences of the two read families are compared: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed.
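The position-by-position merge described in claim 4 can be sketched as follows. The sketch assumes that both SSCSs have already been brought into the same orientation and have equal length, that the "low base quality" is a hypothetical value of 2, and that the average quality is rounded down; `build_dcs` and its parameters are illustrative names.

```python
from typing import List, Optional, Tuple


def build_dcs(
    sense: str,
    sense_q: List[int],
    anti: str,
    anti_q: List[int],
    low_q: int = 2,                 # assumed value for the "low base quality"
    max_n_fraction: float = 0.5,    # DCS is removed when more than 50% of bases are 'N'
) -> Optional[Tuple[str, List[int]]]:
    """Merge two SSCSs into a DCS, or return None if the DCS must be removed."""
    if len(sense) != len(anti):
        raise ValueError("SSCS sequences must have the same length")
    seq, quals = [], []
    for b1, q1, b2, q2 in zip(sense, sense_q, anti, anti_q):
        if b1 == b2:
            seq.append(b1)
            quals.append((q1 + q2) // 2)  # average quality of the two strands
        else:
            seq.append("N")               # strands disagree: mask the position
            quals.append(low_q)
    if seq and seq.count("N") / len(seq) > max_n_fraction:
        return None
    return "".join(seq), quals


print(build_dcs("ACGT", [30] * 4, "ACGA", [40] * 4))
print(build_dcs("AAAA", [30] * 4, "CCCA", [30] * 4))
```

The first call keeps the DCS with one masked position; the second returns `None` because three of four positions disagree. An SSCS without a complementary strand but with two or more reads is retained without passing through this merge.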
5. A device for detecting low-frequency mutations based on duplex sequencing data, characterized by comprising:
(1) a duplex sequencing sequence cleaning unit for cleaning the raw sequencing data to obtain cleaned data;
(2) a read family generation unit for aligning the cleaned data to a reference genome, establishing read families for the sense strand and the antisense strand separately according to the alignment result, lowering the threshold for the number of reads contained in a read family to 1, screening the read families by using the complementarity of reads, and constructing effective read families;
(3) a single-strand consensus sequence (SSCS) generation unit for accurately calculating, by a Bayesian algorithm, the probability that each base at each position is the true base for the constructed effective read families, selecting the base with the highest probability as the consensus base, generating the single-strand consensus sequence (SSCS), and then recalculating the quality of the base according to the probability;
(4) a double-stranded consensus sequence (DCS) generation unit for forming the DCS from the SSCS;
(5) a mutation identification unit for aligning the DCS to the reference genome again and identifying low-frequency mutations and sequencing errors on the reads according to the alignment result.
6. The device for detecting low-frequency mutations based on duplex sequencing data according to claim 5, wherein the read family generation unit generates three types of effective read families: 1) read families containing 2 or more reads; 2) read families containing only 1 read, where that read is complementary to an SSCS generated from a read family containing 2 or more reads; 3) two read families that each contain only one read, where the two reads are complementary to each other.
7. The device for detecting low-frequency mutations based on duplex sequencing data according to claim 5, wherein the single-strand consensus sequence (SSCS) generation unit specifically handles the following three cases:
in the first case, if the number of sequencing reads in a read family is greater than or equal to 2, the read family is retained, the probability that each base at each position is the true base is accurately calculated by a Bayesian algorithm, the base with the highest probability is designated as the consensus base at that position, a single-strand consensus sequence (SSCS) is generated, and the quality of the base is then recalculated;
in the second case, if the number of sequencing reads in a read family is equal to 1 and the read is complementary to an SSCS generated from a read family containing 2 or more reads, the read family is retained, the consensus base at each position of the read is determined by the Bayesian algorithm, the SSCS is generated, and the base quality is then recalculated according to the probability;
in the third case, if two read families each contain only one read but the two reads are complementary, the read families are retained, the consensus base at each position of the read is determined by the Bayesian algorithm, a single-strand consensus sequence (SSCS) is generated, and the base quality is recalculated according to the probability.
8. The device for detecting low-frequency mutations based on duplex sequencing data according to claim 7, wherein the double-stranded consensus sequence (DCS) generation unit generates the DCS in the following three ways:
for the first case, the sense-strand SSCS and the antisense-strand SSCS are compared: if the bases at the same position match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed; if a generated SSCS has no complementary strand but its read family contains two or more reads, the SSCS is still retained;
for the second case, the DCS is generated as follows: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed;
for the third case, the SSCS sequences of the two read families are compared: if the bases at the same position of the sense-strand SSCS and the antisense-strand SSCS match, the base at that position is left unchanged, and the average of the base qualities at that position on the sense strand and the antisense strand is taken as the quality of the base; if the bases at the same position do not match, the base at that position is represented by 'N' and assigned a low base quality; a double-stranded consensus sequence (DCS) is finally formed, but if the proportion of 'N' in the DCS is greater than 50%, the DCS is removed.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed, implements the method for detecting low-frequency mutations based on duplex sequencing data according to any one of claims 1-5.
CN202210061903.5A 2022-01-19 2022-01-19 Method and device for detecting low-frequency mutation based on double sequencing data and storage medium Pending CN114530199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061903.5A CN114530199A (en) 2022-01-19 2022-01-19 Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061903.5A CN114530199A (en) 2022-01-19 2022-01-19 Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Publications (1)

Publication Number Publication Date
CN114530199A true CN114530199A (en) 2022-05-24

Family

ID=81621013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061903.5A Pending CN114530199A (en) 2022-01-19 2022-01-19 Method and device for detecting low-frequency mutation based on double sequencing data and storage medium

Country Status (1)

Country Link
CN (1) CN114530199A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831233A (en) * 2023-02-07 2023-03-21 杭州联川基因诊断技术有限公司 mTag-based targeted sequencing data preprocessing method, equipment and medium
CN117437978A (en) * 2023-12-12 2024-01-23 北京旌准医疗科技有限公司 Low-frequency gene mutation analysis method and device for second-generation sequencing data and application of low-frequency gene mutation analysis method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014041380A1 (en) * 2012-09-11 2014-03-20 Kps Zrt. Method and computer program product for detecting mutation in a nucleotide sequence
CN106599616A (en) * 2017-01-03 2017-04-26 上海派森诺医学检验所有限公司 duplex-seq-based ultralow-frequency mutation site detection analysis method
CN109439729A (en) * 2018-12-27 2019-03-08 上海鲸舟基因科技有限公司 Detect connector, connector mixture and the correlation method of low frequency variation
CN113373524A (en) * 2020-05-11 2021-09-10 南京世和基因生物技术股份有限公司 ctDNA sequencing tag joint, library, detection method and kit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination