CN116469462A - Ultra-low frequency DNA mutation identification method and device based on double sequencing - Google Patents

Publication number
CN116469462A
Authority
CN
China
Prior art keywords
sequence
sequencing
base
double
sscs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310271366.1A
Other languages
Chinese (zh)
Inventor
浦丹
于飞
向旭东
秦森彪
邱鑫煜
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202310271366.1A
Publication of CN116469462A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method and a device for identifying ultra-low-frequency DNA mutations based on duplex sequencing, comprising the following steps: (1) evaluating the quality of the raw sequencing data and reducing data noise to provide valid data for subsequent analysis; (2) grouping the reads by barcode to form read families, extracting the barcode of each read in a read family for error correction, and putting the corrected barcode back into the read; (4) performing multiple sequence alignment within each read family, calculating the consensus quality score of each position from the frequency of the main base at that position, and generating a single-strand consensus sequence (SSCS) to exclude mutations introduced during library preparation or PCR; (5) constructing a double-stranded consensus sequence (DCS) from the single-strand consensus sequences to further eliminate asymmetric mutation sites. The invention effectively improves data utilization, suppresses sequencing errors, and improves the accuracy of low-frequency and even ultra-low-frequency mutation detection.

Description

Ultra-low frequency DNA mutation identification method and device based on double sequencing
Technical Field
The invention relates to the field of bioinformatics, in particular to an ultralow frequency mutation identification method and device based on double sequencing.
Background
High-throughput sequencing (next-generation sequencing, NGS) has transformed sequencing: compared with traditional methods it is inexpensive and can sequence hundreds of thousands to millions of DNA molecules in parallel in a single run. By detecting important genetic variants, NGS has reshaped biological research and clinical practice. In particular, the study of rare somatic variants can provide important clues to the precise biological state. For example, detecting rare variants in cancer biology can inform strategies for tumor treatment, and detecting rare variants associated with drug resistance or organ transplant rejection can provide an important basis for early diagnosis. However, these rare variants, which carry important disease information, typically occur at frequencies below 1% and therefore require highly sensitive and highly accurate NGS analysis. NGS is widely used for mutation detection, but its high error rate (0.1-1%) makes detecting rare variants at frequencies below 1% a great challenge. To overcome this problem, duplex sequencing (double sequencing) was proposed. This technique attaches random barcode tag sequences to double-stranded DNA so that each DNA strand carries a unique molecular identifier (UMI). During analysis of the sequencing data, sense-strand and antisense-strand reads carrying the same barcode tag are clustered, a single-strand consensus sequence (SSCS) is generated for each strand, and the sense and antisense SSCS are then compared to generate a double-stranded consensus sequence (DCS). Finally, the filtered DCS are aligned to a reference genome to distinguish true mutations from errors introduced during sequencing.
Because of its high accuracy, duplex sequencing has been widely used to identify low-frequency mutations. However, low-frequency mutation identification based on duplex sequencing still has the following problems:
1. Low data utilization: in existing duplex-sequencing methods for identifying DNA mutations, errors occur in both the sequences and the barcode tags. Reads carrying an erroneous barcode cannot be clustered correctly and form a large number of singletons, which cannot be used for mutation identification, so data are wasted.
2. Alignment to a reference genome is required: after quality control, conventional duplex-sequencing methods must align the raw data to a reference genome to obtain genomic positions so that the genes can be annotated. However, using a reference genome biases the results toward the reference, which affects de novo assembly and the identification of indels or other alleles that differ substantially from the reference, and thus makes alignment difficult.
Disclosure of Invention
To address these shortcomings, the invention provides an ultra-low-frequency mutation identification method and device based on duplex sequencing, aiming to solve the problem that NGS cannot accurately identify low-frequency mutations because of the high error rate of high-throughput sequencing and low data utilization. The method corrects barcodes, which solves the problem that reads with erroneous barcodes cannot be grouped into a read family and instead form singletons. To build read families, the invention does not need alignment to a reference genome; instead it uses multiple sequence alignment to obtain the read families of the sense and antisense strands, which reduces the difficulty of alignment. In addition, a core sequence region (30 bases, with a center offset of ±5) is extracted from each read, the frequency of the four bases A/T/C/G is counted at each position, the most frequent base is taken as the main (consensus) base, the consensus quality score of the position is calculated from the frequency of the main base, and a single-strand consensus sequence (SSCS) is generated, further improving the accuracy of mutation identification. The method effectively improves the utilization of duplex sequencing data, improves sequencing accuracy, and reduces the complexity of data processing and the required sequencing depth, providing a reliable bioinformatics workflow and tool for detecting low-frequency mutations.
To this end, the invention adopts the following technical scheme: an ultra-low-frequency DNA mutation identification method based on duplex sequencing, comprising the following steps:
(1) Quality control. Perform quality control on the raw duplex sequencing data, checking sequencing results such as base quality values and GC distribution, and clean the raw data to remove low-quality and contaminated sequences, obtaining cleaned sequencing data.
(2) UMI clustering. Correct the barcodes at both ends of the sequences: group the cleaned sequences by barcode tag, extract the barcodes at both ends of each sequence, and build an FM index over all barcodes, which converts the raw barcode data into a more compact representation while retaining features of the original data. Compare every barcode against the index with the maximum edit distance threshold set to 1, visualize the comparison result as a networkx network, correct the barcodes according to the edit-distance clusters, and put the corrected barcodes back into the sequences.
(3) Multiple sequence alignment, which improves alignment accuracy. Perform multiple sequence alignment on the sequences within each read family corrected in step (2) to determine the segments they share, obtain the arrangement of bases at each position from the alignment result, build the read families of the sense and antisense strands separately, and screen the read families using the complementarity of the reads.
(4) SSCS generation. If the family size from step (3) is greater than or equal to 3, retain the read family; otherwise discard it. For each retained read family, extract a "core" sequence region from each read, count the frequency of the four bases A/T/C/G at each position, take the most frequent base as the main base, calculate the consensus quality score (CQS) of the position from the frequency of the main base, and generate the single-strand consensus sequence (SSCS). The antisense strand is processed in the same way to form its SSCS.
(5) DCS generation. Generate a double-stranded consensus sequence (DCS) from the SSCS generated in step (4) and its complementary SSCS.
(6) Mutation identification. Clean the DCS generated in step (5) by filtering out DCS that contain a large number of N, then align the remaining DCS to a reference genome to identify single nucleotide polymorphisms, DNA insertion and deletion errors, and sequencing errors on the sequence fragments.
Specifically, in the networkx network graph of step (2), each vertex corresponds to one barcode tag. Barcodes may contain base-calling errors and PCR errors introduced during sequencing, so each edge connects two barcode tags that differ by a single base, and the connected nodes form a network of UMI clusters.
Specifically, barcode correction means extracting the barcodes at both ends of the reads from the FASTQ file, building an index over all barcodes, and comparing each barcode against the index. Barcode pairs whose edit distance is less than or equal to 1 are retained, pairs whose edit distance equals 1 are corrected according to the index, and the result is put back into the sequence.
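The barcode network described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: plain dictionaries stand in for the networkx graph, a brute-force pairwise comparison stands in for the FM index, only single-base substitutions (Hamming distance 1) are treated as edges, and the function names `hamming` and `correct_barcodes` are hypothetical. Each connected cluster is corrected to its most abundant barcode, following the rule that the node with the largest number of reads is taken as correct.

```python
from collections import Counter, deque
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcodes(barcodes: list[str]) -> dict[str, str]:
    """Map every observed barcode to the most abundant barcode in its cluster."""
    counts = Counter(barcodes)
    # Graph: one node per distinct barcode, one edge per pair that differs
    # at exactly one base (substitutions only in this sketch).
    neighbours = {bc: set() for bc in counts}
    for a, b in combinations(counts, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            neighbours[a].add(b)
            neighbours[b].add(a)

    corrected = {}
    for start in counts:
        if start in corrected:
            continue
        # Breadth-first search collects one connected component (a UMI cluster).
        cluster, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node] - cluster:
                cluster.add(nxt)
                queue.append(nxt)
        # The barcode with the most reads is taken as the true barcode;
        # sorting first makes ties deterministic.
        best = max(sorted(cluster), key=lambda bc: counts[bc])
        for bc in cluster:
            corrected[bc] = best
    return corrected
```

Note that a connected component can chain barcodes that are more than one base apart (A-B-C); whether such chains should be split is a design choice the text leaves open.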
Specifically, in step (4), an SSCS is created from reads that share at least 90% base identity at each position. The consensus quality score (CQS) is obtained as follows: after multiple sequence alignment, the frequency of the four bases A/T/C/G is counted at each position of the extracted core region (30 bases, with a center offset of ±5), the most frequent base is taken as the main (consensus) base, and the consensus quality score of the position is calculated from the frequency of the main base to generate the single-strand consensus sequence (SSCS).
Further, calculating the consensus quality score specifically includes the following steps: after multiple sequence alignment, the most frequent bases at each position are combined to form a consensus sequence. During this combination, a "core" sequence region is extracted from each read, an offset is selected for each read using the most frequently occurring core region, and the quality score of a given position is calculated through the consensus quality score. When calculating the consensus quality score, only bases whose base quality score is above the threshold Phred 20 and whose mutation frequency is above 10^(-Q/10) are considered. The consensus quality score is calculated by the following formula:
[formula not reproduced in the source text]
where f is the maximum base frequency at the current site.
More specifically, the "core" sequence region refers to a base fragment having a length of 30.+ -.5 bp at the central position of the read.
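One possible reading of the core-region extraction is sketched below. It assumes, hypothetically, that the core is a 30-base window around the read center, that the window may be shifted by up to 5 bases in either direction, that every read is at least 40 bases long, and that each read's offset is chosen to best match the family's most common centered core. The names `core_at` and `pick_offsets` are illustrative only.

```python
from collections import Counter

CORE_LEN = 30      # core length stated in the text
MAX_OFFSET = 5     # allowed shift of the core centre, in bases


def core_at(read: str, offset: int = 0) -> str:
    """Return the CORE_LEN-base window whose centre is shifted by `offset`."""
    start = (len(read) - CORE_LEN) // 2 + offset
    if start < 0 or start + CORE_LEN > len(read):
        return ""  # window falls outside the read
    return read[start:start + CORE_LEN]


def pick_offsets(reads: list[str]) -> list[int]:
    """For each read, choose the offset whose core best matches the anchor core."""
    # The most frequent centred core across the family serves as the anchor.
    anchor = Counter(core_at(r) for r in reads if core_at(r)).most_common(1)[0][0]
    # Try small shifts first so that ties prefer the smallest offset.
    candidates = sorted(range(-MAX_OFFSET, MAX_OFFSET + 1), key=abs)
    offsets = []
    for read in reads:
        def mismatches(off: int) -> int:
            core = core_at(read, off)
            if not core:
                return CORE_LEN + 1
            return sum(a != b for a, b in zip(core, anchor))
        offsets.append(min(candidates, key=mismatches))
    return offsets
```

For a read that is shifted by two bases relative to the rest of its family, `pick_offsets` returns an offset of 2 for that read and 0 for the others.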
Specifically, the consensus sequence is formed by combining the most frequent base at each position: the consensus quality score is determined for the most frequent base at each position, and that base is used if it meets the criteria; otherwise the base at that position is replaced with "N". A position with a gap in the alignment is also treated as a base.
More specifically, the Phred quality score of a gap position is the weighted average of the quality scores of the 8 adjacent bases.
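The per-position consensus call can be sketched as follows. This is a minimal sketch under assumptions: the input is an already aligned read family with "-" marking gaps, the 8 neighbours of a gap are taken as 4 on each side with equal weights (the text does not give the weights), the Phred 20 threshold is treated as inclusive, and the function returns the main-base frequency f rather than the CQS itself, because the CQS formula is not available in the text. The 10^(-Q/10) frequency filter is not modelled. `gap_quality` and `call_consensus` are hypothetical names.

```python
from collections import Counter

MIN_PHRED = 20        # bases below this quality do not vote (inclusive here)
MIN_MAIN_FREQ = 0.9   # minimum main-base frequency to call a base
FLANK = 4             # 4 bases on each side = 8 neighbours (assumed split)


def gap_quality(seq: str, quals: list[float], pos: int) -> float:
    """Equal-weight mean quality of up to 8 non-gap bases around a gap."""
    lo, hi = max(0, pos - FLANK), min(len(seq), pos + FLANK + 1)
    near = [quals[i] for i in range(lo, hi) if i != pos and seq[i] != "-"]
    return sum(near) / len(near) if near else 0.0


def call_consensus(seqs: list[str], quals: list[list[float]]) -> tuple[str, list[float]]:
    """Return the consensus string and the main-base frequency f per column."""
    consensus, freqs = [], []
    for col in range(len(seqs[0])):
        votes = Counter()
        for seq, q in zip(seqs, quals):
            base = seq[col]
            # A gap counts as a base; its quality comes from its neighbours.
            quality = gap_quality(seq, q, col) if base == "-" else q[col]
            if quality >= MIN_PHRED:
                votes[base] += 1
        total = sum(votes.values())
        if total == 0:
            consensus.append("N")
            freqs.append(0.0)
            continue
        main_base, n = votes.most_common(1)[0]
        f = n / total
        # Positions without a sufficiently dominant base become N.
        consensus.append(main_base if f >= MIN_MAIN_FREQ else "N")
        freqs.append(f)
    return "".join(consensus), freqs
```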
Further, DCS filtering includes the following: to generate the double-stranded consensus sequence (DCS), the sense-strand SSCS and the antisense-strand SSCS are compared position by position. If the bases on the two strands at the same position are complementary, the base is retained and the mean of the two strands' base qualities is taken as the base quality at that position. If the bases are not complementary, the base at that position is replaced with N. If one strand has a gap and the other does not, the base at that position is also replaced with N. The result is the double-stranded consensus sequence (DCS). An SSCS without a matching consensus sequence from the opposite strand is filtered out.
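The position-by-position rules above translate directly into code. The sketch below assumes the two SSCS are already position-matched (position i of the antisense SSCS pairs with position i of the sense SSCS), keeps a gap when both strands have one (a case the text does not specify), and uses a hypothetical `max_n_fraction` threshold of 0.3 for "a large number of N", since the text gives no number for the invention's DCS filter.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def build_dcs(sense: str, antisense: str,
              sense_q: list[float], antisense_q: list[float]) -> tuple[str, list[float]]:
    """Merge two position-matched SSCS into one DCS in sense orientation."""
    if len(sense) != len(antisense):
        raise ValueError("SSCS sequences must be position-matched")
    dcs, quals = [], []
    for s, a, qs, qa in zip(sense, antisense, sense_q, antisense_q):
        if s == "-" and a == "-":
            # Gap on both strands: kept as a gap (assumption).
            dcs.append("-")
            quals.append((qs + qa) / 2)
        elif COMPLEMENT.get(s) == a:
            # Complementary bases: keep the base, average the qualities.
            dcs.append(s)
            quals.append((qs + qa) / 2)
        else:
            # Non-complementary bases, or a gap on one strand only.
            dcs.append("N")
            quals.append(0.0)
    return "".join(dcs), quals


def passes_n_filter(dcs: str, max_n_fraction: float = 0.3) -> bool:
    """Reject a DCS whose share of N exceeds the (hypothetical) threshold."""
    return bool(dcs) and dcs.count("N") / len(dcs) <= max_n_fraction
```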
The invention also provides an ultra-low-frequency DNA mutation identification device based on duplex sequencing, which can execute the above method and comprises:
A data cleaning unit, configured to perform quality control on the raw duplex sequencing data and remove low-quality and contaminated sequences to obtain cleaned sequencing data.
A UMI clustering unit, configured to group the cleaned sequencing data by barcode tag, extract the barcodes, build an index over all barcodes, compare the barcodes against the index, visualize the comparison result as a networkx network graph, correct the barcodes according to the edit-distance clusters, and put the corrected barcodes back into the sequences.
A multiple sequence alignment unit, configured to perform multiple sequence alignment on the sequences within each corrected read family, determine the segments they share, obtain the arrangement of bases at each position from the alignment result, build the read families of the sense and antisense strands separately, and screen the read families using the complementarity of the reads.
An SSCS generation unit, configured to retain a sense-strand read family if its family size is greater than or equal to 3 and otherwise discard it; for each retained read family, to extract a "core" sequence region from each read, count the frequency of the four bases A/T/C/G at each position, take the most frequent base as the main base, calculate the consensus quality score of the position from the frequency of the main base, and generate the single-strand consensus sequence (SSCS); the antisense strand is processed in the same way to form its SSCS.
A DCS generation unit, configured to generate a DCS from a generated SSCS and its complementary SSCS.
A mutation identification unit, configured to filter the generated DCS and then align the filtered DCS to a reference genome to identify single nucleotide polymorphisms, DNA insertion and deletion errors, and sequencing errors on the sequence fragments.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor realizes the steps of the ultralow frequency DNA mutation identification method based on double sequencing when executing the computer program.
The invention finally provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the above-described ultra-low frequency DNA mutation identification method based on double sequencing.
The main advantages of the invention include:
1. By adopting UMI clustering, the invention corrects barcodes while tolerating one erroneous base per barcode, so that reads that would otherwise fail to be assigned to their read family because of barcode errors are retained to a certain extent, which greatly improves the utilization of the raw duplex sequencing data.
2. The invention performs multiple sequence alignment on the sequences within each grouped read family instead of aligning them to a reference genome as conventional methods do, which effectively avoids the alignment difficulties caused by aligning to a reference genome.
3. The consensus quality score of the invention is obtained by counting base frequencies at each position of the core region, selecting the most frequent base as the main (consensus) base at that position, and then calculating the consensus quality score for that base. Because a consensus quality score is needed only for the core region, the time complexity of the calculation is reduced.
Drawings
FIG. 1 is a schematic diagram of the ultra-low-frequency DNA mutation identification method and device based on duplex sequencing of the invention;
FIG. 2 is the networkx network graph generated during barcode correction in the invention;
Table 1 compares the method of the invention with the conventional duplex sequencing analysis method.
Detailed Description
To achieve the purpose of the invention, as shown in FIG. 1, the ultra-low-frequency DNA mutation identification method based on duplex sequencing comprises the following steps:
(1) Download the duplex sequencing data and the human reference genome sequence data.
(2) Perform quality control on the raw duplex sequencing data to provide a reliable basis for subsequent analysis.
(3) UMI cluster analysis: extract the barcodes at both ends of all sequences and correct them so that a higher proportion of the data is retained.
(4) Perform multiple sequence alignment within each barcode family.
(5) Extract the "core" sequence region from each read, count the frequency of the four bases A/T/C/G at each position, take the most frequent base as the main (consensus) base, and calculate the consensus quality score of the main base to create a single-strand consensus sequence.
(6) Create a double-stranded consensus sequence from the single-strand consensus sequences according to base complementarity, and filter the double-stranded consensus sequences.
(7) Align to the reference genome.
(8) Identify low-frequency mutations.
(9) Count the SSCS and DCS, report the mutation site information, and output the network graph of the barcode correction.
The specific steps are as follows:
(1) The duplex sequencing data (accession number: SRR3749606) and the human reference genome sequence hg38 (human genome 38) were downloaded from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/).
(2) Quality control.
2.1 The raw duplex sequencing data were checked with FastQC, including base quality values, sequence quality scores, GC distribution, base balance, and adapter content.
2.2 Low-quality bases, contaminating adapter sequences, and low-quality reads were removed to obtain cleaned sequencing data.
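The read-level part of step 2.2 can be illustrated with a small filter on FASTQ records. This is only a sketch: the text does not name the cleaning tool or its thresholds, so the mean-quality cutoff of 20, the Phred+33 encoding, and the function names `mean_phred` and `filter_fastq` are all assumptions, and adapter trimming is not shown.

```python
def mean_phred(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of one FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def filter_fastq(lines: list[str], min_mean_q: float = 20.0) -> list[str]:
    """Keep 4-line FASTQ records whose mean quality reaches min_mean_q."""
    kept = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Skip empty reads and reads whose average quality is too low.
        if seq and mean_phred(qual) >= min_mean_q:
            kept.extend([header, seq, plus, qual])
    return kept
```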
(3) UMI cluster analysis.
3.1 In this embodiment the barcode is 12 bases long. The sequencing data are grouped by barcode tag, and the barcodes at both ends of each sequence are then extracted.
3.2 An FM index is built over all barcodes, and each barcode is compared against the index.
3.3 The comparison result is visualized as a network using networkx, a Python network modeling tool. Each node corresponds to one barcode tag, and each edge connects barcode tags that differ by a single base. The correct barcode is chosen as the node whose barcode tag has the largest number of reads. The corrected barcode is put back into the sequence (FIG. 2).
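Extracting the 12-base barcodes in step 3.1 and putting the corrected barcode back in step 3.3 might look like the sketch below. It assumes paired-end reads in which the barcode occupies the first 12 bases of each mate, which the text does not state explicitly; `split_barcodes` and `restore_barcodes` are hypothetical helper names.

```python
BARCODE_LEN = 12  # barcode length stated for this embodiment


def split_barcodes(read1: str, read2: str) -> tuple[str, str, str]:
    """Return (combined tag, read 1 without barcode, read 2 without barcode)."""
    if len(read1) <= BARCODE_LEN or len(read2) <= BARCODE_LEN:
        raise ValueError("read is not longer than the barcode")
    tag = read1[:BARCODE_LEN] + read2[:BARCODE_LEN]
    return tag, read1[BARCODE_LEN:], read2[BARCODE_LEN:]


def restore_barcodes(tag: str, body1: str, body2: str) -> tuple[str, str]:
    """Put a (possibly corrected) combined tag back in front of both reads."""
    return tag[:BARCODE_LEN] + body1, tag[BARCODE_LEN:] + body2
```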
(4) Multiple sequence alignment: the reads are grouped by barcode, reads with the same barcode forming one family, and multiple sequence alignment is performed on the reads within each family.
4.1 Perform multiple sequence alignment on the sequences within each corrected read family to determine the segments they share.
4.2 Obtain the arrangement of bases at each position from the alignment result.
4.3 Build the read families of the sense and antisense strands separately, and screen the read families using the complementarity of the reads.
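Grouping reads into families and matching sense families with antisense families can be sketched as follows. The grouping follows the text directly. The pairing rule is an assumption borrowed from the usual duplex sequencing convention, in which the two strands of one molecule carry the same two tag halves in swapped order; the text itself only says that families are screened by read complementarity. The multiple sequence alignment is not shown, and `group_families` and `pair_strands` are hypothetical names.

```python
from collections import defaultdict


def group_families(reads: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (barcode, sequence) pairs into read families keyed by barcode."""
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return dict(families)


def pair_strands(families: dict[str, list[str]], half: int = 12) -> list[tuple[str, str]]:
    """Pair each family with the family whose tag halves are swapped."""
    pairs = []
    for tag in sorted(families):
        partner = tag[half:] + tag[:half]
        # `tag < partner` reports each pair once and skips self-pairs.
        if partner in families and tag < partner:
            pairs.append((tag, partner))
    return pairs
```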
(5) Single-strand consensus sequence generation.
5.1 For the sense strand, compute the family size from the multiple sequence alignment result; retain only read families with a family size of 3 or more and discard the rest.
5.2 Extract the "core" sequence region from each read of the retained read families.
5.3 Calculate the consensus quality score (CQS) of the main base at each position in the core sequence, and select the most frequent base as the consensus base.
5.4 Generate the single-strand consensus sequence (SSCS) from the quality values of the consensus bases at each position.
5.5 Apply the same operations to the antisense strand.
5.6 Write the sense-strand SSCS and the antisense-strand SSCS to the SSCS1 and SSCS2 files, respectively.
(6) Double-stranded consensus sequence generation.
6.1 Remove the barcodes at both ends of each generated SSCS.
6.2 Generate the double-stranded consensus sequence (DCS) from complementary SSCS.
6.3 Filter the generated DCS: discard a DCS that contains a large number of N; qualified sequences proceed to the next step.
6.4 Write the DCS file.
(7) Alignment to the reference genome.
7.1 Align the sequences obtained in step (6) to the hg38 reference genome with BWA to obtain a SAM file.
7.2 Convert the SAM file to BAM format, then sort and index it with SAMtools.
7.3 Compute alignment statistics.
(8) Mutation identification. Use BCFtools to identify mutations: single nucleotide polymorphisms (SNPs), DNA insertion and deletion (indel) errors, and sequencing errors on the sequence fragments.
(9) Statistical report. Compute statistics on the quality of the duplex sequencing data, base balance, the constructed DCS/SSCS, and so on, and output visual charts.
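Steps (7) and (8) name BWA, SAMtools, and BCFtools but not the exact invocations. The helper below only assembles one plausible command sequence as strings and does not run anything; the choice of `bwa mem`, the use of `samtools sort` to convert and sort in one step, the `mpileup`/`call -mv` pair, and all file names are assumptions made for illustration.

```python
def alignment_commands(reference: str, dcs_fastq: str, prefix: str) -> list[str]:
    """Shell command lines for align -> sort/index -> call (not executed here)."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    return [
        f"bwa index {reference}",                         # build the BWA index once
        f"bwa mem {reference} {dcs_fastq} > {sam}",       # align DCS reads to hg38
        f"samtools sort -o {bam} {sam}",                  # SAM -> sorted BAM
        f"samtools index {bam}",                          # BAM index
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -o {vcf}",
    ]


for command in alignment_commands("hg38.fa", "dcs.fastq", "sample"):
    print(command)
```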
For comparison, the same data were analyzed with the conventional method. The specific steps are as follows:
(1) The duplex sequencing data (accession number: SRR3749606) and the human reference genome sequence hg38 (human genome 38) were downloaded from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/).
(2) Quality control.
2.1 The raw duplex sequencing data were checked with FastQC, including base quality values, sequence quality scores, GC distribution, base balance, and adapter content.
2.2 Low-quality bases, contaminating adapter sequences, and low-quality reads were removed.
(3) Barcode extraction.
The barcodes are extracted from the paired-end sequencing reads, the fixed bases are deleted, and the paired barcode tags are combined and added to the header of each read in the FASTQ file.
(4) Read family generation.
4.1 Align the reads to the reference genome with BWA to obtain information such as each read's position on the reference sequence and its mapping quality.
4.2 Pass the alignment result to SAMtools for sorting and write the output to a BAM file.
4.3 Generate read families based on the aligned position on the reference genome, the barcode tag sequence, and the alignment orientation.
(5) Single-strand consensus sequence (SSCS) generation.
5.1 For the sense strand, calculate the frequency of each base at each position of the read family. A base whose frequency is higher than 70% is taken as the true base at that position, and its base quality value is taken as the base quality value of the read family. Positions where no base exceeds 70% are replaced with N. Sequences in which N accounts for 30% or more of the bases are discarded.
5.2 A read family is retained if it contains at least 3 reads and meets the base frequency requirement in 5.1, and a single-strand consensus sequence (SSCS) is finally generated.
5.3 Apply the same operations to the antisense strand.
5.4 Write the sense-strand SSCS and the antisense-strand SSCS to the SSCS1 and SSCS2 files, respectively.
(6) Double-stranded consensus sequence (DCS) generation.
6.1 Remove the barcodes at both ends of each generated SSCS.
6.2 Generate the double-stranded consensus sequence (DCS) from complementary SSCS.
6.3 Filter the generated DCS: discard a DCS that contains a large number of N; qualified sequences proceed to the next step.
6.4 Write the DCS file.
(7) Alignment to the reference genome.
7.1 Align the sequences obtained in step (6) to the hg38 reference genome with BWA to obtain a SAM file.
7.2 Convert the SAM file to BAM format, then sort and index it with SAMtools.
7.3 Compute alignment statistics.
(8) Mutation identification. Use BCFtools to identify mutations and report the numbers of indels and SNPs and the related site information.
The single-strand and double-stranded consensus sequences generated by the two methods were compared. As shown in Table 1, the DCS/SSCS ratio obtained with the method of the invention was 26.27% (115875/441011), versus 7.11% (73616/1035543) with the conventional method. This shows that the invention greatly improves the utilization of the raw duplex sequencing data.
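The two percentages follow directly from the quoted counts, as the short check below shows; `dcs_sscs_ratio` is a hypothetical helper name.

```python
def dcs_sscs_ratio(dcs: int, sscs: int) -> float:
    """DCS count as a percentage of the SSCS count."""
    return 100.0 * dcs / sscs


proposed = dcs_sscs_ratio(115875, 441011)       # method of the invention
conventional = dcs_sscs_ratio(73616, 1035543)   # conventional method
print(f"{proposed:.2f}% vs {conventional:.2f}%")  # prints "26.27% vs 7.11%"
```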
TABLE 1 comparison of conventional Dual sequencing analysis methods with the methods of the present invention

Claims (10)

1. An ultra-low frequency DNA mutation identification method based on double sequencing, characterized by comprising the following steps:
(1) Quality control: performing quality control on the raw double sequencing data and removing low-quality and contaminated sequences to obtain cleaned sequencing data;
(2) UMI clustering: grouping the cleaned sequencing data by barcode tag, extracting the barcodes, building a barcode index, comparing the barcodes against the index, visualizing the comparison result with networkx, correcting the barcodes according to the edit-distance clusters, and putting the corrected barcodes back into the sequences;
(3) Multiple sequence alignment: performing multiple sequence alignment on the sequences within each read family group corrected in step (2), determining the common segments of the sequences, obtaining the arrangement of bases at each position from the alignment result, establishing read families for the sense strand and the antisense strand respectively, and screening the read families using the complementarity of the reads;
(4) Generating single-strand consensus sequences (SSCS): retaining a read family if its family size in step (3) is greater than or equal to 3 and discarding it otherwise; for each retained read family, extracting a 'core' sequence region from each read, counting the frequency of the four bases A/T/C/G at each position of the sequence, taking the base with the highest frequency as the main base, calculating the consensus quality score of the current position from the frequency of the main base, and generating the single-strand consensus sequence SSCS; for the antisense strand, forming a single-strand consensus sequence SSCS in the same way;
(5) Generating double-strand consensus sequences (DCS): generating a DCS sequence from the SSCS sequence generated in step (4) and the SSCS sequence complementary to it;
(6) Mutation identification: filtering the DCS sequences generated in step (5) and aligning the filtered DCS sequences to a reference genome to identify single nucleotide polymorphisms, DNA insertion and deletion errors, and sequencing errors in the sequence fragments.
2. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 1, wherein in step (2), each vertex of the networkx network graph corresponds to one barcode tag, and each edge connects two barcode tags that differ by a single base.
3. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 1 or 2, wherein correcting the barcodes refers to correcting the additional artificial errors introduced into the barcodes by base substitutions during PCR, base-calling errors during sequencing, and insertions or deletions (indels).
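The barcode graph of claims 2 and 3 can be sketched in Python as follows. This is a hedged illustration, not the claimed implementation: it uses plain dictionaries and a breadth-first search in place of networkx so that it has no dependencies (a `networkx.Graph` with `networkx.connected_components` would give the same clusters). It handles only single-base substitutions, not indels, and the rule of rewriting every barcode in a cluster to the most abundant member is an assumption, since the claims do not say how the corrected barcode is chosen. The function names are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def differ_by_one_base(a, b):
    """True when two equal-length barcodes differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_barcodes(barcodes):
    """Map each observed barcode to the representative of its cluster."""
    counts = Counter(barcodes)

    # Vertices are distinct barcodes; edges join barcodes one base apart.
    neighbours = {bc: set() for bc in counts}
    for a, b in combinations(counts, 2):
        if differ_by_one_base(a, b):
            neighbours[a].add(b)
            neighbours[b].add(a)

    correction = {}
    for start in counts:
        if start in correction:
            continue
        # Breadth-first search collects one connected component (cluster).
        cluster, queue = {start}, deque([start])
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in cluster:
                    cluster.add(nxt)
                    queue.append(nxt)
        # Assumed rule: the most abundant barcode represents the cluster;
        # sorting first makes ties resolve to the alphabetically first one.
        representative = max(sorted(cluster), key=counts.__getitem__)
        for bc in cluster:
            correction[bc] = representative
    return correction
```

The pairwise comparison is quadratic in the number of distinct barcodes, which is acceptable for a sketch; the barcode index mentioned in step (2) of claim 1 would presumably serve to avoid that cost on real data.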
4. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 1, wherein in step (4), within a read family, only sequences with at least 90% base identity at each given position of the reads are used to create the SSCS sequence.
5. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 1 or 4, wherein calculating the consensus quality score comprises: extracting the 'core' sequence region from the sequences of each read family, selecting an offset for each read using the most frequently occurring core region, and calculating the quality score at a given position via the consensus quality score; when calculating the consensus quality score, only bases with a base quality score above the Phred 20 threshold and a mutation frequency above 10^(-Q/10) are considered; the consensus quality score is calculated by the formula: [formula omitted in the source text]
where f is the maximum base frequency at the current site.
6. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 5, wherein the core sequence region refers to a base fragment 30 ± 5 bp long at the center of a read; when calculating the consensus quality score, the highest-frequency bases at each position are combined to form the consensus sequence; the highest-frequency base at a position is kept on the basis of its consensus quality score, and otherwise the base at that position is replaced with 'N'; a position with a gap in the sequence is treated as a base.
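The per-position voting described in claims 4 to 6 can be sketched in Python as below. It is a simplified illustration under several assumptions: the reads of one family are already aligned to equal length, gaps are written as '-' and carry a placeholder quality supplied by the caller, and the 90% agreement is measured among the bases that pass the quality threshold. The consensus quality score itself is not computed, because its formula does not appear in the text. The function name `build_sscs` and the constant names are hypothetical.

```python
from collections import Counter

MIN_FAMILY_SIZE = 3     # family size threshold from claim 1, step (4)
MIN_BASE_QUALITY = 20   # Phred threshold from claim 5 (bases must be above it)
MIN_AGREEMENT = 0.9     # 90% base identity from claim 4


def build_sscs(reads, quals):
    """Return the consensus string for one read family, or None if too small.

    reads: aligned, equal-length strings; quals: one list of Phred scores
    per read, same length as the read.
    """
    if len(reads) < MIN_FAMILY_SIZE:
        return None
    length = len(reads[0])
    if len(quals) != len(reads) or any(
        len(r) != length or len(q) != length for r, q in zip(reads, quals)
    ):
        raise ValueError("reads and quality lists must be aligned to one length")

    consensus = []
    for pos in range(length):
        # Count only bases (gaps included) whose quality is above the threshold.
        votes = Counter(
            r[pos] for r, q in zip(reads, quals) if q[pos] > MIN_BASE_QUALITY
        )
        if not votes:
            consensus.append("N")
            continue
        base, count = votes.most_common(1)[0]
        frequency = count / sum(votes.values())
        consensus.append(base if frequency >= MIN_AGREEMENT else "N")
    return "".join(consensus)
```

For a family of three reads in which one read disagrees at a position with good quality, the agreement is 2/3 and the position becomes N; if the disagreeing base has low quality it is ignored and the consensus base is kept.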
7. The ultra-low frequency DNA mutation identification method based on double sequencing according to claim 1, wherein the DCS filtering comprises: to generate the double-strand consensus sequence DCS, comparing the sense-strand SSCS and the antisense-strand SSCS position by position; if the bases on the sense strand and the antisense strand at the same position are complementary, retaining the base at that position and taking the mean of the sense-strand and antisense-strand base qualities as the base quality at that position; if the bases on the sense strand and the antisense strand at the same position are not complementary, replacing the base at that position with N; if at the same position one strand has a gap and the other does not, replacing the base at that position with N; the double-strand consensus sequence DCS is thereby obtained; and if an SSCS has no matching reverse-strand consensus sequence, filtering that sequence out.
8. An ultra-low frequency DNA mutation identification device based on double sequencing, characterized in that it is capable of performing the ultra-low frequency DNA mutation identification method based on double sequencing of any one of claims 1 to 7, and comprises:
a data cleaning unit for performing quality control on the raw double sequencing data and removing low-quality and contaminated sequences to obtain cleaned sequencing data;
a UMI clustering unit for grouping the cleaned sequencing data by barcode tag, extracting the barcodes, building a barcode index, comparing the barcodes against the index, visualizing the comparison result with networkx to generate a networkx network graph, correcting the barcodes according to the edit-distance clusters, and putting the corrected barcodes back into the sequences;
a multiple sequence alignment unit for performing multiple sequence alignment on the sequences within each corrected read family group, determining the common segments of the sequences, obtaining the arrangement of bases at each position from the alignment result, establishing read families for the sense strand and the antisense strand respectively, and screening the read families using the complementarity of the reads;
a single-strand consensus sequence (SSCS) generating unit for, on the sense strand, retaining a read family if its family size is greater than or equal to 3 and discarding it otherwise; for each retained read family, extracting a 'core' sequence region from each read, counting the frequency of the four bases A/T/C/G at each position of the sequence, taking the base with the highest frequency as the main base, calculating the consensus quality score of the current position from the frequency of the main base, and generating the single-strand consensus sequence SSCS; and, for the antisense strand, forming a single-strand consensus sequence SSCS in the same way;
a double-strand consensus sequence (DCS) generating unit for generating a DCS sequence from the generated SSCS sequence and the SSCS sequence complementary to it;
and a mutation identification unit for filtering the generated DCS sequences and aligning the filtered DCS sequences to a reference genome to identify single nucleotide polymorphisms, DNA insertion and deletion errors, and sequencing errors in the sequence fragments.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the ultra-low frequency DNA mutation identification method based on double sequencing of any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the ultra-low frequency DNA mutation identification method based on double sequencing of any one of claims 1 to 8.
CN202310271366.1A 2023-03-20 2023-03-20 Ultra-low frequency DNA mutation identification method and device based on double sequencing Pending CN116469462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310271366.1A CN116469462A (en) 2023-03-20 2023-03-20 Ultra-low frequency DNA mutation identification method and device based on double sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310271366.1A CN116469462A (en) 2023-03-20 2023-03-20 Ultra-low frequency DNA mutation identification method and device based on double sequencing

Publications (1)

Publication Number Publication Date
CN116469462A true CN116469462A (en) 2023-07-21

Family

ID=87172584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310271366.1A Pending CN116469462A (en) 2023-03-20 2023-03-20 Ultra-low frequency DNA mutation identification method and device based on double sequencing

Country Status (1)

Country Link
CN (1) CN116469462A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437978A (en) * 2023-12-12 2024-01-23 北京旌准医疗科技有限公司 Low-frequency gene mutation analysis method and device for second-generation sequencing data and application of low-frequency gene mutation analysis method and device

Similar Documents

Publication Publication Date Title
US10127351B2 (en) Accurate and fast mapping of reads to genome
US8271206B2 (en) DNA sequence assembly methods of short reads
CN109767810B (en) High-throughput sequencing data analysis method and device
CN111009286A (en) Method and apparatus for microbiological analysis of host samples
KR101828052B1 (en) Method and apparatus for analyzing copy-number variation (cnv) of gene
JP6066924B2 (en) DNA sequence data analysis method
CN110189796A (en) A kind of sheep full-length genome resurveys sequence analysis method
IL258999A (en) Methods for detecting copy-number variations in next-generation sequencing
CN112289376B (en) Method and device for detecting somatic cell mutation
CN112687344B (en) Human adenovirus molecule typing and tracing method and system based on metagenome
CN111081315A (en) Method for detecting homologous pseudogene variation
CN112599198A (en) Microorganism species and functional composition analysis method for metagenome sequencing data
CN115631789B (en) Group joint variation detection method based on pan genome
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
CN116469462A (en) Ultra-low frequency DNA mutation identification method and device based on double sequencing
CN107463797B (en) Biological information analysis method and device for high-throughput sequencing, equipment and storage medium
CN114530199A (en) Method and device for detecting low-frequency mutation based on double sequencing data and storage medium
CN109920480B (en) Method and device for correcting high-throughput sequencing data
CN109461473B (en) Method and device for acquiring concentration of free DNA of fetus
WO2019242445A1 (en) Detection method, device, computer equipment and storage medium of pathogen operation group
CN111696622B (en) Method for correcting and evaluating detection result of mutation detection software
CN109712671B (en) Gene detection device based on ctDNA, storage medium and computer system
WO2019132010A1 (en) Method, apparatus and program for estimating base type in base sequence
Roy et al. NGS-μsat: Bioinformatics framework supporting high throughput microsatellite genotyping from next generation sequencing platforms
CN114974432A (en) Screening method of biomarker and related application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination