CN107229839B

CN107229839B - Indel detection method based on next generation sequencing data

Info

Publication number: CN107229839B
Application number: CN201710377194.0A
Authority: CN
Inventors: 袁细国; 许向彦; 杨利英; 张军英; 白俊
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2017-05-25
Filing date: 2017-05-25
Publication date: 2020-05-22
Anticipated expiration: 2037-05-25
Also published as: CN107229839A

Abstract

The invention belongs to the technical field of genetic engineering, and discloses a method for detecting Indel based on next generation sequencing data, which comprises the following steps: comparing the original fastq data by using bwa comparison software to generate a sam file; performing hierarchical clustering on the two-dimensional points according to a set threshold value; for each pair of reads in the hang.sam file, taking out reads which are not normally compared; comparing the read with the intercepted reference sequence to determine the type, position and size of the variation; storing the variations using a hash structure; for a certain variation, a threshold is set according to the coverage of sequencing. According to the method, a variation range is determined through clustering, and Split read is extracted and compared with a reference sequence in the variation range, so that the comparison process is simple and the range is more accurate; hierarchical clustering is used, the limitation that the number of clusters is set in advance is broken through, and the operation is simple.

Description

Indel detection method based on next generation sequencing data

Technical Field

The invention belongs to the technical field of genetic engineering, and particularly relates to an Indel detection method based on next generation sequencing data.

Background

Next generation sequencing is a technique for sequencing DNA sequences. In the sequencing process, the complete sample DNA sequence is broken, fragments satisfying a specific length (usually hundreds of bp) are selected from the broken sample DNA sequence, and a sequence with a length of tens of to hundreds of bp is read at one end or two ends of each fragment. The length of the read sequence is usually far less than that of the tested sample DNA sequence, but the next generation of sequencing technology can simultaneously read a large number of short sequences, so that the total length of all the short sequences reaches several times to several tens times of the length of the sample DNA, thereby obtaining the sample DNA sequence. Indel (mutation and mutation) mutation is an important mutation phenomenon in the genome. Mainly expressed in both insertion and deletion states and associated with the occurrence of disease in humans. Currently, there are 4 main strategies for detecting INDEL mutations on the genome, which are: (1) readpair (also known as Pair-end Mapping, PEM for short, double-ended Mapping); (2) split read (SR for short): the reads are separated. Splitread is a special class of reads, the occurrence of which is usually caused by structural variations in the genome. The read does not keep the form of a continuous sequence in the mapping, but contains a gap with a certain length, so that the mapping difficulty is high; (3) read Depth (abbreviated RD, Read overlay Depth) and (4) de novo assembly-based methods. (PEM) comparing Pair-end reads to the reference sequence, if the insertion length of a Pair of reads is less than the mapping length, the Pair of reads can determine a deletion (deletion); on the contrary, if the length of a certain pair of reads insertion is greater than the mapping length, an insertion (insertion) can be determined; for the detection of sequence deletion, the length of the detected fragment is affected by the Standard Deviation (SD) of the length of the inserted fragment (the length of the inserted fragment refers to the length of the DNA fragment broken by ultrasonic waves selected at the stage of constructing a DNA sequencing library before sequencing, the fragments are also called sequencing fragments which are operations in the experimental process and do not refer to the variation of genome), and the larger the sequence deletion is, the more easily the sequence deletion is detected; for detection of sequence insertions, the length can only be in the range of the insert length, and the maximum length is also limited by the standard deviation of the insert length sequenced; the defect of the detection method is that the detected mutation position is not accurate enough and can not reach the bp level. SR firstly extracts pair-endreads with the following characteristics, one pair is normally compared to a reference sequence, the other pair cannot be compared, then a search range is determined by using the read position and the insertion length of the normal comparison, the best match between the unaligned read and the reference sequence is searched in the search range, the unaligned read is divided into two or three sections through the best matching point, and thus the positions of deletion and insert are determined; pindel is a software for mutation detection using the SR method. It is widely used by thousands of human genome projects and bioinformatics analysts. Pindel is theoretically able to detect deletion over all length ranges, and insertion of small fragments. One advantage of the Pindel approaches is that they can be accurate to a single base, but Pindel may miss these variations if repeated sequences are present within the variant region. The RD can measure the coverage of each site through samtools, sequencing reads are compared to a reference sequence, and if the coverage depth of a certain section is much lower than the average coverage depth, the section can be determined to be a deletion; the disadvantages are that only deletion can be detected, but not insertion, and the detection position is not accurate enough. The de novo assembly method can provide the best detection method for longinsert, but the assembly is still a troublesome matter, the repetitive sequences existing on the genome can seriously affect the quality of the assembly, and the application of the assembly method in the detection of genome variation is also greatly hindered.

In summary, the problems of the prior art are as follows: the existing detection method has the defects that the detection position is not accurate; the variant region is easy to miss when encountering a repetitive sequence; the omission of mutation detection is also easily caused by only depending on the read and insertion length of one alignment to determine a detection range.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an Indel detection method based on next generation sequencing data.

The invention is realized in such a way that the Indel detection method based on the next generation sequencing data comprises the following steps:

comparing original fastq data by using bwa comparison software to generate a sam file;

extracting a comparison position from each pair of reads in the discordant.sam file to serve as a two-dimensional point coordinate, and performing hierarchical clustering on the two-dimensional points according to a set threshold (the threshold is set as an insertion length), wherein the detailed clustering process is as follows:

1. firstly, initializing each two-dimensional point as a class, calculating the distance between every two classes, namely the similarity between every two classes, wherein the distance is the Euclidean distance, and two points A [ a ] are set₁,a₂]，B[b₁,b₂]The Euclidean distance between A and B is

2. Setting a threshold value as an insertion length, finding two classes with the closest distance, classifying the two classes into one class if the distance between the two classes with the closest distance is less than the threshold value, and turning to 3; if the distance between the two classes closest to each other is greater than or equal to the threshold, terminating the iteration and turning to 4;

3. recalculating the similarity between the newly generated class and each old class, and turning to 2;

4. finishing;

thirdly, taking out the reads which are not normally compared for each pair of reads in the hang.sam file, inserting the reads into a clustering unit if the compared part of the reads is in the range represented by a certain clustering unit, and extracting the reads containing the variation information from the hang.sam file;

step four, each clustering unit determines a variation range (after hierarchical clustering is carried out on two-dimensional point data, each clustering unit has a plurality of two-dimensional points, the average value of the two-dimensional points is calculated, finally, a point [ a, b ] can be determined, namely the variation occurrence range from a to b), reads carrying variation information and contained in the variation range are extracted, a section of sequence on a reference sequence is intercepted according to the position on each read comparison and the variation range, and the read and the intercepted reference sequence are compared to determine the variation type, the variation position and the variation size;

step five, recording the mutation type deletion as 1, the insertion as 2, and a certain mutation can be expressed as "1 _ mutation position _ mutation size"; then store the variation using a prior art hash structure (hash structure is one of the computer data structures); for a variation, a threshold is set based on the coverage of sequencing, and if the number of reads supporting a variation is greater than the threshold, a variation can be identified.

Further, the sam file in the first step includes pair-end data in normal comparison and pair-end data in abnormal comparison.

Further, the step two is through hierarchical clustering and can accomplish clustering automatically by setting threshold, calculate its mean value to all points in each clustering unit, the point that contains in the clustering unit is as follows: a [ a1, B1], B [ a2, B2], wherein each clustering unit contains a range [ a, B ], a ═ a1+ a2)/2, and B ═ B1+ B2)/2, namely, the range [ a, B ] contains variation.

Further, the fourth step specifically includes: the read sequence is a sequence A, and the intercepted reference sequence is a sequence B; comparing the left ends of A and B, determining the position where the first base is different as a variation position, marking as q, then cutting the A sequence from the different position, taking the rest part of the A sequence as a window to start sliding, taking the initial position as the variation position, sliding a distance to the right each time, taking the score function of the window as the number of the base in comparison in the window, stopping sliding if the score of the window at a certain position is greater than the total number of the base in the window multiplied by 0.95, determining the variation type as deletion, recording the position at the moment, marking as w, and taking w-q as the size of the variation.

The invention has the advantages and positive effects that: determining a variation range through clustering, extracting Split read and comparing with a reference sequence in the variation range, so that the comparison process becomes simple, the comparison range is more accurate, and omission is prevented; hierarchical clustering is used, the limitation that the number of clusters is set in advance is broken through, and the requirement of the problem can be met by using hierarchical clustering; the operation is simple.

The invention can complete detection only by comparing the sam file with the reference sequence, and meanwhile, the test result of the invention is more accurate compared with other existing methods through the test of simulation data (the simulation result is shown in table 1). The invention can solve the problem that the detection of the INDEL variation position by the PEM technology is inaccurate; the problem of omission caused by detecting INDEL variation by an SR method is solved; the problem that the variation points may be missed by the repeated sequences in the prior art is solved.

Table 1: simulation data experiment results

Wherein, deletion is 5 added mutations, insertion is 1 added mutation, position is a mutation position, size is a mutation size, c _ position is a mutation position detected by the method, c _ size is a mutation size detected by the method, p _ position is a mutation position detected by the pindel method, and p _ size is a mutation size detected by the pindel method.

Drawings

FIG. 1 is a flow chart of an Indel detection method based on next generation sequencing data according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.

As shown in fig. 1, the Indel detection method based on next-generation sequencing data provided by the embodiment of the present invention includes the following steps:

s101: comparing the original fastq data by using bwa comparison software to generate a sam file;

s102: extracting a comparison position of each pair of reads in a disc file to serve as a two-dimensional point coordinate, and carrying out hierarchical clustering on the two-dimensional points according to a threshold set by the two-dimensional points;

s103: for each pair of reads in the hang.sam file, taking out the reads which are not normally compared, if the compared part of the reads is in the range represented by a certain clustering unit, inserting the reads into the clustering unit, and extracting the reads containing the variation information from the hang.sam file;

s104: each clustering unit can determine a variation range, extracts reads carrying variation information contained in the variation range, intercepts a section of sequence on a reference sequence according to the position of each read in comparison and the variation range, and compares the read with the intercepted reference sequence to determine the variation type, the variation position and the variation size;

s105: a certain mutation can be expressed as "1 _ mutation position _ mutation size" by taking mutation type deletion as 1 and insertion as 2; then utilizing a hash structure to store the variation; for a certain variation, a threshold is set according to the coverage of sequencing, and the variation can be determined if the support number of the certain variation reads is larger than the threshold.

The Indel detection method based on the next generation sequencing data provided by the embodiment of the invention specifically comprises the following steps:

(1) preprocessing of sam data

Comparing original fastq data by using bwa comparison software to generate an sam file, wherein the sam file contains both pair-end data in normal comparison and pair-end data in abnormal comparison, extracting the comparison position of each pair of normal comparison reads, obtaining a mapping length by calculating a difference value, extracting the reads of which the mapping length does not meet the condition (namely the mapping length is greater than the insertion length plus the variance or the mapping length is less than the insertion length minus the variance) to obtain discordant.

(2) Hierarchical clustering

Sam files in discordant in (1) are compared with each other to extract the comparison position of each pair of reads to be used as a two-dimensional point coordinate, the two-dimensional points are subjected to hierarchical clustering according to a self-set threshold (the threshold is set as the square of the insertion length by default), because how many variations are contained in the samples are not known before detection, the clustering can be automatically completed through the hierarchical clustering and the self-set threshold, then the average value of all the points in each clustering unit is calculated, and the points contained in a certain clustering unit are set as follows: a [ a1, B1], B [ a2, B2], each clustering unit finally contains a range [ a, B ], a is (a1+ a2)/2, and B is (B1+ B2)/2, namely, the range [ a, B ] contains variation.

(3) Inserting Split read into corresponding clustering unit

And (3) taking out the reads which are not normally compared for each pair of reads in the hang.sam file in the step (1), inserting the read into a certain clustering unit if the compared part of the reads is in the range represented by the clustering unit, and extracting the reads containing the variation information from the hang.sam file through the processing of the step.

(4) Setting a sliding window for comparison

(2) Step (3) can extract reads carrying variation information contained in the variation range, a section of sequence on the reference sequence is intercepted according to the position of each read in comparison and the variation range, and then the read and the intercepted reference sequence are compared to determine the variation type, the variation position and the variation size. The specific method comprises the following steps: and (2) setting the read sequence as a sequence A, the intercepted reference sequence as a sequence B, comparing the left ends of the A and the B, taking the position where the first base is different as a variation position and marking as q, then intercepting the A sequence from the different position, sliding by taking the rest part of the A sequence as a window, taking the start position as the variation position, sliding a distance to the right each time, taking a score function of the window as the number of the bases in comparison in the window, stopping sliding if the score of a certain position window is greater than the total number of the bases in the window and multiplied by 0.95, recording the position at the moment and marking as w, and taking w-q as the size of the variation.

(5) Screening and determination

The mutation type deletion is marked as 1, the insertion is marked as 2, such that a certain mutation can be represented as "1 _ mutation position _ mutation size", then the mutation is stored by using a hash structure, a hash key is a character string, a hash value is an integer type (representing the number of reads supporting the mutation), and finally, for a certain mutation, a threshold value is set according to the coverage of sequencing (the default value is set to be 0.02), namely, if the number of reads supporting the mutation is greater than two percent of the total number of reads, the mutation is output.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An Indel detection method based on next generation sequencing data is characterized by comprising the following steps:

extracting a comparison position from each pair of reads in the discordant.sam file to serve as a two-dimensional point coordinate, and carrying out hierarchical clustering on the two-dimensional points according to a set threshold;

step four, each clustering unit determines a variation range, extracts reads carrying variation information contained in the variation range, intercepts a section of sequence on the reference sequence according to the position of each read in comparison and the variation range, and compares the read with the intercepted reference sequence to determine the variation type, the variation position and the variation size;

the fourth step specifically comprises: the read sequence is a sequence A, and the intercepted reference sequence is a sequence B; comparing the left ends of A and B, determining the position where the first base is different as a variation position, marking as q, then intercepting the A sequence from the different position, taking the rest part of the A sequence as a window to start sliding, taking the initial position as the variation position, sliding a distance to the right each time, taking the score function of the window as the number of the base in comparison in the window, stopping sliding if the score of the window at a certain position is greater than the total number of the base in the window and multiplied by 0.95, determining the variation type as deletion, recording the position at the moment, marking as w, and taking w-q as the size of the variation, similarly, sliding on the reference sequence B, and determining the variation type as insertion if a certain position is matched;

step five, recording the mutation type deletion as 1, the insertion as 2, and a certain mutation can be expressed as "1 _ mutation position _ mutation size"; then utilizing a hash structure to store the variation; and setting a threshold value according to the coverage of sequencing for a certain variation, and outputting the variation when the number of reads supporting the variation is larger than the threshold value.

2. The Indel detection method based on next-generation sequencing data of claim 1, wherein the sam file in the first step comprises pair-end data in normal alignment and pair-end data in abnormal alignment.

3. The Indel detection method based on next-generation sequencing data according to claim 1, wherein the clustering is automatically performed through hierarchical clustering and setting a threshold value, and the average value of all points in each clustering unit is obtained, and the points contained in the clustering unit are as follows: a [ a1, B1], B [ a2, B2], wherein each clustering unit contains a range [ a, B ], a ═ a1+ a2)/2, and B ═ B1+ B2)/2, namely, the range [ a, B ] contains variation.