CN107229839B - Indel detection method based on next generation sequencing data - Google Patents

Indel detection method based on next generation sequencing data Download PDF

Info

Publication number
CN107229839B
CN107229839B CN201710377194.0A CN201710377194A CN107229839B CN 107229839 B CN107229839 B CN 107229839B CN 201710377194 A CN201710377194 A CN 201710377194A CN 107229839 B CN107229839 B CN 107229839B
Authority
CN
China
Prior art keywords
variation
reads
sequence
range
mutation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710377194.0A
Other languages
Chinese (zh)
Other versions
CN107229839A (en
Inventor
袁细国
许向彦
杨利英
张军英
白俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201710377194.0A priority Critical patent/CN107229839B/en
Publication of CN107229839A publication Critical patent/CN107229839A/en
Application granted granted Critical
Publication of CN107229839B publication Critical patent/CN107229839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention belongs to the technical field of genetic engineering, and discloses a method for detecting Indel based on next generation sequencing data, which comprises the following steps: comparing the original fastq data by using bwa comparison software to generate a sam file; performing hierarchical clustering on the two-dimensional points according to a set threshold value; for each pair of reads in the hang.sam file, taking out reads which are not normally compared; comparing the read with the intercepted reference sequence to determine the type, position and size of the variation; storing the variations using a hash structure; for a certain variation, a threshold is set according to the coverage of sequencing. According to the method, a variation range is determined through clustering, and Split read is extracted and compared with a reference sequence in the variation range, so that the comparison process is simple and the range is more accurate; hierarchical clustering is used, the limitation that the number of clusters is set in advance is broken through, and the operation is simple.

Description

Indel detection method based on next generation sequencing data
Technical Field
The invention belongs to the technical field of genetic engineering, and particularly relates to an Indel detection method based on next generation sequencing data.
Background
Next generation sequencing is a technique for sequencing DNA sequences. In the sequencing process, the complete sample DNA sequence is broken, fragments satisfying a specific length (usually hundreds of bp) are selected from the broken sample DNA sequence, and a sequence with a length of tens of to hundreds of bp is read at one end or two ends of each fragment. The length of the read sequence is usually far less than that of the tested sample DNA sequence, but the next generation of sequencing technology can simultaneously read a large number of short sequences, so that the total length of all the short sequences reaches several times to several tens times of the length of the sample DNA, thereby obtaining the sample DNA sequence. Indel (mutation and mutation) mutation is an important mutation phenomenon in the genome. Mainly expressed in both insertion and deletion states and associated with the occurrence of disease in humans. Currently, there are 4 main strategies for detecting INDEL mutations on the genome, which are: (1) readpair (also known as Pair-end Mapping, PEM for short, double-ended Mapping); (2) split read (SR for short): the reads are separated. Splitread is a special class of reads, the occurrence of which is usually caused by structural variations in the genome. The read does not keep the form of a continuous sequence in the mapping, but contains a gap with a certain length, so that the mapping difficulty is high; (3) read Depth (abbreviated RD, Read overlay Depth) and (4) de novo assembly-based methods. (PEM) comparing Pair-end reads to the reference sequence, if the insertion length of a Pair of reads is less than the mapping length, the Pair of reads can determine a deletion (deletion); on the contrary, if the length of a certain pair of reads insertion is greater than the mapping length, an insertion (insertion) can be determined; for the detection of sequence deletion, the length of the detected fragment is affected by the Standard Deviation (SD) of the length of the inserted fragment (the length of the inserted fragment refers to the length of the DNA fragment broken by ultrasonic waves selected at the stage of constructing a DNA sequencing library before sequencing, the fragments are also called sequencing fragments which are operations in the experimental process and do not refer to the variation of genome), and the larger the sequence deletion is, the more easily the sequence deletion is detected; for detection of sequence insertions, the length can only be in the range of the insert length, and the maximum length is also limited by the standard deviation of the insert length sequenced; the defect of the detection method is that the detected mutation position is not accurate enough and can not reach the bp level. SR firstly extracts pair-endreads with the following characteristics, one pair is normally compared to a reference sequence, the other pair cannot be compared, then a search range is determined by using the read position and the insertion length of the normal comparison, the best match between the unaligned read and the reference sequence is searched in the search range, the unaligned read is divided into two or three sections through the best matching point, and thus the positions of deletion and insert are determined; pindel is a software for mutation detection using the SR method. It is widely used by thousands of human genome projects and bioinformatics analysts. Pindel is theoretically able to detect deletion over all length ranges, and insertion of small fragments. One advantage of the Pindel approaches is that they can be accurate to a single base, but Pindel may miss these variations if repeated sequences are present within the variant region. The RD can measure the coverage of each site through samtools, sequencing reads are compared to a reference sequence, and if the coverage depth of a certain section is much lower than the average coverage depth, the section can be determined to be a deletion; the disadvantages are that only deletion can be detected, but not insertion, and the detection position is not accurate enough. The de novo assembly method can provide the best detection method for longinsert, but the assembly is still a troublesome matter, the repetitive sequences existing on the genome can seriously affect the quality of the assembly, and the application of the assembly method in the detection of genome variation is also greatly hindered.
In summary, the problems of the prior art are as follows: the existing detection method has the defects that the detection position is not accurate; the variant region is easy to miss when encountering a repetitive sequence; the omission of mutation detection is also easily caused by only depending on the read and insertion length of one alignment to determine a detection range.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an Indel detection method based on next generation sequencing data.
The invention is realized in such a way that the Indel detection method based on the next generation sequencing data comprises the following steps:
comparing original fastq data by using bwa comparison software to generate a sam file;
extracting a comparison position from each pair of reads in the discordant.sam file to serve as a two-dimensional point coordinate, and performing hierarchical clustering on the two-dimensional points according to a set threshold (the threshold is set as an insertion length), wherein the detailed clustering process is as follows:
1. firstly, initializing each two-dimensional point as a class, calculating the distance between every two classes, namely the similarity between every two classes, wherein the distance is the Euclidean distance, and two points A [ a ] are set1,a2],B[b1,b2]The Euclidean distance between A and B is
Figure BDA0001304304170000031
2. Setting a threshold value as an insertion length, finding two classes with the closest distance, classifying the two classes into one class if the distance between the two classes with the closest distance is less than the threshold value, and turning to 3; if the distance between the two classes closest to each other is greater than or equal to the threshold, terminating the iteration and turning to 4;
3. recalculating the similarity between the newly generated class and each old class, and turning to 2;
4. finishing;
thirdly, taking out the reads which are not normally compared for each pair of reads in the hang.sam file, inserting the reads into a clustering unit if the compared part of the reads is in the range represented by a certain clustering unit, and extracting the reads containing the variation information from the hang.sam file;
step four, each clustering unit determines a variation range (after hierarchical clustering is carried out on two-dimensional point data, each clustering unit has a plurality of two-dimensional points, the average value of the two-dimensional points is calculated, finally, a point [ a, b ] can be determined, namely the variation occurrence range from a to b), reads carrying variation information and contained in the variation range are extracted, a section of sequence on a reference sequence is intercepted according to the position on each read comparison and the variation range, and the read and the intercepted reference sequence are compared to determine the variation type, the variation position and the variation size;
step five, recording the mutation type deletion as 1, the insertion as 2, and a certain mutation can be expressed as "1 _ mutation position _ mutation size"; then store the variation using a prior art hash structure (hash structure is one of the computer data structures); for a variation, a threshold is set based on the coverage of sequencing, and if the number of reads supporting a variation is greater than the threshold, a variation can be identified.
Further, the sam file in the first step includes pair-end data in normal comparison and pair-end data in abnormal comparison.
Further, the step two is through hierarchical clustering and can accomplish clustering automatically by setting threshold, calculate its mean value to all points in each clustering unit, the point that contains in the clustering unit is as follows: a [ a1, B1], B [ a2, B2], wherein each clustering unit contains a range [ a, B ], a ═ a1+ a2)/2, and B ═ B1+ B2)/2, namely, the range [ a, B ] contains variation.
Further, the fourth step specifically includes: the read sequence is a sequence A, and the intercepted reference sequence is a sequence B; comparing the left ends of A and B, determining the position where the first base is different as a variation position, marking as q, then cutting the A sequence from the different position, taking the rest part of the A sequence as a window to start sliding, taking the initial position as the variation position, sliding a distance to the right each time, taking the score function of the window as the number of the base in comparison in the window, stopping sliding if the score of the window at a certain position is greater than the total number of the base in the window multiplied by 0.95, determining the variation type as deletion, recording the position at the moment, marking as w, and taking w-q as the size of the variation.
The invention has the advantages and positive effects that: determining a variation range through clustering, extracting Split read and comparing with a reference sequence in the variation range, so that the comparison process becomes simple, the comparison range is more accurate, and omission is prevented; hierarchical clustering is used, the limitation that the number of clusters is set in advance is broken through, and the requirement of the problem can be met by using hierarchical clustering; the operation is simple.
The invention can complete detection only by comparing the sam file with the reference sequence, and meanwhile, the test result of the invention is more accurate compared with other existing methods through the test of simulation data (the simulation result is shown in table 1). The invention can solve the problem that the detection of the INDEL variation position by the PEM technology is inaccurate; the problem of omission caused by detecting INDEL variation by an SR method is solved; the problem that the variation points may be missed by the repeated sequences in the prior art is solved.
Table 1: simulation data experiment results
Figure BDA0001304304170000041
Figure BDA0001304304170000051
Wherein, deletion is 5 added mutations, insertion is 1 added mutation, position is a mutation position, size is a mutation size, c _ position is a mutation position detected by the method, c _ size is a mutation size detected by the method, p _ position is a mutation position detected by the pindel method, and p _ size is a mutation size detected by the pindel method.
Drawings
FIG. 1 is a flow chart of an Indel detection method based on next generation sequencing data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
As shown in fig. 1, the Indel detection method based on next-generation sequencing data provided by the embodiment of the present invention includes the following steps:
s101: comparing the original fastq data by using bwa comparison software to generate a sam file;
s102: extracting a comparison position of each pair of reads in a disc file to serve as a two-dimensional point coordinate, and carrying out hierarchical clustering on the two-dimensional points according to a threshold set by the two-dimensional points;
s103: for each pair of reads in the hang.sam file, taking out the reads which are not normally compared, if the compared part of the reads is in the range represented by a certain clustering unit, inserting the reads into the clustering unit, and extracting the reads containing the variation information from the hang.sam file;
s104: each clustering unit can determine a variation range, extracts reads carrying variation information contained in the variation range, intercepts a section of sequence on a reference sequence according to the position of each read in comparison and the variation range, and compares the read with the intercepted reference sequence to determine the variation type, the variation position and the variation size;
s105: a certain mutation can be expressed as "1 _ mutation position _ mutation size" by taking mutation type deletion as 1 and insertion as 2; then utilizing a hash structure to store the variation; for a certain variation, a threshold is set according to the coverage of sequencing, and the variation can be determined if the support number of the certain variation reads is larger than the threshold.
The Indel detection method based on the next generation sequencing data provided by the embodiment of the invention specifically comprises the following steps:
(1) preprocessing of sam data
Comparing original fastq data by using bwa comparison software to generate an sam file, wherein the sam file contains both pair-end data in normal comparison and pair-end data in abnormal comparison, extracting the comparison position of each pair of normal comparison reads, obtaining a mapping length by calculating a difference value, extracting the reads of which the mapping length does not meet the condition (namely the mapping length is greater than the insertion length plus the variance or the mapping length is less than the insertion length minus the variance) to obtain discordant.
(2) Hierarchical clustering
Sam files in discordant in (1) are compared with each other to extract the comparison position of each pair of reads to be used as a two-dimensional point coordinate, the two-dimensional points are subjected to hierarchical clustering according to a self-set threshold (the threshold is set as the square of the insertion length by default), because how many variations are contained in the samples are not known before detection, the clustering can be automatically completed through the hierarchical clustering and the self-set threshold, then the average value of all the points in each clustering unit is calculated, and the points contained in a certain clustering unit are set as follows: a [ a1, B1], B [ a2, B2], each clustering unit finally contains a range [ a, B ], a is (a1+ a2)/2, and B is (B1+ B2)/2, namely, the range [ a, B ] contains variation.
(3) Inserting Split read into corresponding clustering unit
And (3) taking out the reads which are not normally compared for each pair of reads in the hang.sam file in the step (1), inserting the read into a certain clustering unit if the compared part of the reads is in the range represented by the clustering unit, and extracting the reads containing the variation information from the hang.sam file through the processing of the step.
(4) Setting a sliding window for comparison
(2) Step (3) can extract reads carrying variation information contained in the variation range, a section of sequence on the reference sequence is intercepted according to the position of each read in comparison and the variation range, and then the read and the intercepted reference sequence are compared to determine the variation type, the variation position and the variation size. The specific method comprises the following steps: and (2) setting the read sequence as a sequence A, the intercepted reference sequence as a sequence B, comparing the left ends of the A and the B, taking the position where the first base is different as a variation position and marking as q, then intercepting the A sequence from the different position, sliding by taking the rest part of the A sequence as a window, taking the start position as the variation position, sliding a distance to the right each time, taking a score function of the window as the number of the bases in comparison in the window, stopping sliding if the score of a certain position window is greater than the total number of the bases in the window and multiplied by 0.95, recording the position at the moment and marking as w, and taking w-q as the size of the variation.
(5) Screening and determination
The mutation type deletion is marked as 1, the insertion is marked as 2, such that a certain mutation can be represented as "1 _ mutation position _ mutation size", then the mutation is stored by using a hash structure, a hash key is a character string, a hash value is an integer type (representing the number of reads supporting the mutation), and finally, for a certain mutation, a threshold value is set according to the coverage of sequencing (the default value is set to be 0.02), namely, if the number of reads supporting the mutation is greater than two percent of the total number of reads, the mutation is output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (3)

1. An Indel detection method based on next generation sequencing data is characterized by comprising the following steps:
comparing original fastq data by using bwa comparison software to generate a sam file;
extracting a comparison position from each pair of reads in the discordant.sam file to serve as a two-dimensional point coordinate, and carrying out hierarchical clustering on the two-dimensional points according to a set threshold;
thirdly, taking out the reads which are not normally compared for each pair of reads in the hang.sam file, inserting the reads into a clustering unit if the compared part of the reads is in the range represented by a certain clustering unit, and extracting the reads containing the variation information from the hang.sam file;
step four, each clustering unit determines a variation range, extracts reads carrying variation information contained in the variation range, intercepts a section of sequence on the reference sequence according to the position of each read in comparison and the variation range, and compares the read with the intercepted reference sequence to determine the variation type, the variation position and the variation size;
the fourth step specifically comprises: the read sequence is a sequence A, and the intercepted reference sequence is a sequence B; comparing the left ends of A and B, determining the position where the first base is different as a variation position, marking as q, then intercepting the A sequence from the different position, taking the rest part of the A sequence as a window to start sliding, taking the initial position as the variation position, sliding a distance to the right each time, taking the score function of the window as the number of the base in comparison in the window, stopping sliding if the score of the window at a certain position is greater than the total number of the base in the window and multiplied by 0.95, determining the variation type as deletion, recording the position at the moment, marking as w, and taking w-q as the size of the variation, similarly, sliding on the reference sequence B, and determining the variation type as insertion if a certain position is matched;
step five, recording the mutation type deletion as 1, the insertion as 2, and a certain mutation can be expressed as "1 _ mutation position _ mutation size"; then utilizing a hash structure to store the variation; and setting a threshold value according to the coverage of sequencing for a certain variation, and outputting the variation when the number of reads supporting the variation is larger than the threshold value.
2. The Indel detection method based on next-generation sequencing data of claim 1, wherein the sam file in the first step comprises pair-end data in normal alignment and pair-end data in abnormal alignment.
3. The Indel detection method based on next-generation sequencing data according to claim 1, wherein the clustering is automatically performed through hierarchical clustering and setting a threshold value, and the average value of all points in each clustering unit is obtained, and the points contained in the clustering unit are as follows: a [ a1, B1], B [ a2, B2], wherein each clustering unit contains a range [ a, B ], a ═ a1+ a2)/2, and B ═ B1+ B2)/2, namely, the range [ a, B ] contains variation.
CN201710377194.0A 2017-05-25 2017-05-25 Indel detection method based on next generation sequencing data Active CN107229839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710377194.0A CN107229839B (en) 2017-05-25 2017-05-25 Indel detection method based on next generation sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710377194.0A CN107229839B (en) 2017-05-25 2017-05-25 Indel detection method based on next generation sequencing data

Publications (2)

Publication Number Publication Date
CN107229839A CN107229839A (en) 2017-10-03
CN107229839B true CN107229839B (en) 2020-05-22

Family

ID=59933495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710377194.0A Active CN107229839B (en) 2017-05-25 2017-05-25 Indel detection method based on next generation sequencing data

Country Status (1)

Country Link
CN (1) CN107229839B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992721B (en) * 2017-11-10 2020-03-31 深圳裕策生物科技有限公司 Method, apparatus and storage medium for detecting target region gene fusion
CN109266729B (en) * 2018-09-29 2020-11-27 中国科学院遗传与发育生物学研究所 Large fragment deletion detection method based on genome second-generation sequencing
CN110310704A (en) * 2019-05-08 2019-10-08 西安电子科技大学 A kind of copy number mutation detection method based on local outlier factor
CN110993023B (en) * 2019-11-29 2023-08-15 北京优迅医学检验实验室有限公司 Detection method and detection device for complex mutation
CN111261225B (en) * 2020-02-06 2022-08-16 西安交通大学 Reverse correlation complex variation detection method based on second-generation sequencing data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011105667A1 (en) * 2010-02-26 2011-09-01 숭실대학교 산학협력단 Query sequence genotype or subtype classification method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10777301B2 (en) * 2012-07-13 2020-09-15 Pacific Biosciences For California, Inc. Hierarchical genome assembly method using single long insert library
US9670530B2 (en) * 2014-01-30 2017-06-06 Illumina, Inc. Haplotype resolved genome sequencing
US9817944B2 (en) * 2014-02-11 2017-11-14 Seven Bridges Genomics Inc. Systems and methods for analyzing sequence data
CN105574361B (en) * 2015-11-05 2018-11-02 上海序康医疗科技有限公司 A method of detection genome copies number variation
CN105760712B (en) * 2016-03-01 2019-03-26 西安电子科技大学 A kind of copy number mutation detection method based on new-generation sequencing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011105667A1 (en) * 2010-02-26 2011-09-01 숭실대학교 산학협력단 Query sequence genotype or subtype classification method

Also Published As

Publication number Publication date
CN107229839A (en) 2017-10-03

Similar Documents

Publication Publication Date Title
CN107229839B (en) Indel detection method based on next generation sequencing data
CN110010193B (en) Complex structure variation detection method based on hybrid strategy
Eskin et al. Finding composite regulatory patterns in DNA sequences
US8335786B2 (en) Multi-media content identification using multi-level content signature correlation and fast similarity search
US8010502B2 (en) Methods and systems for data recovery
CN110808084B (en) Copy number variation detection method based on single-sample second-generation sequencing data
CN110299185B (en) Insertion variation detection method and system based on new generation sequencing data
CN114743594B (en) Method, device and storage medium for detecting structural variation
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
US10810239B2 (en) Sequence data analyzer, DNA analysis system and sequence data analysis method
CN108660200B (en) Method for detecting expansion of short tandem repeat sequence
CN114502744B (en) Copy number variation detection method and device based on blood circulation tumor DNA
KR20130071617A (en) System and method for detecting variety malicious code
CN111696622A (en) Method for correcting and evaluating detection result of mutation detection software
KR20210082390A (en) Systems and methods for grouping and collapsing sequencing reads
CN111767546A (en) Deep learning-based input structure inference method and device
CN113628683A (en) High-throughput sequencing mutation detection method, equipment, device and readable storage medium
Chen et al. Seqoptics: A protein sequence clustering method
US10319464B2 (en) Method and apparatus for identifying tandem repeats in a nucleotide sequence
JP7095805B2 (en) Method of detecting outliers of theoretical mass
KR100698466B1 (en) Method of Bottom-Up Protein Modifications Detection using Mass Shift List Table and Program Storage Device
CN115762633B (en) Genome structure variation genotype correction method based on three-generation sequencing
Lysiak et al. Interpreting Mass Spectra Differing from Their Peptide Models by Several Modifications
CN116386719A (en) Gene fusion detection method, device, equipment and storage medium
Pedersen et al. Fingerprinting Malware using Bioinformatics Tools Building a Classifier for the Zeus Virus (Computer Security track, Virus Detection)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant