CN113362892A

CN113362892A - Method for detecting and typing repetition number of short tandem repeat sequence

Info

Publication number: CN113362892A
Application number: CN202110669050.9A
Authority: CN
Inventors: 李梦; 郭茂平; 胡欢; 陈初光
Original assignee: Beijing Yuewei Gene Technology Co ltd
Current assignee: Beijing Yuewei Gene Technology Co ltd
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2021-09-07
Anticipated expiration: 2041-06-16
Also published as: CN113724783A; CN113724783B; CN113362892B

Abstract

The invention relates to the field of sequencing data generation analysis, and particularly provides a short tandem repeat number detection and typing method based on next generation sequencing. The detection method can cross the middle repetitive region, simultaneously align two flanking sequences and simultaneously consider the mismatching, the insertion and the deletion of the flanking sequences; the typing method is an STR typing method which is based on effective peak detection and combines the difference of the number of repeats to set a dynamic threshold value, and can be applied to the detection of the number of repeats of different next generation sequencing platforms.

Description

Method for detecting and typing repetition number of short tandem repeat sequence

Technical Field

The invention belongs to the field of letter generation analysis, and particularly relates to a method for detecting and typing a repeat number of a short tandem repeat sequence.

Background

Short Tandem Repeat (STR), also called microsatellite DNA, is a DNA sequence synthesized by taking 2-6 bases as a core in tandem, and the STR has the characteristics of high mutation rate, polymorphism, easiness in detection and the like, so that the short tandem repeat is widely applied to detection related to forensics. Since 1985, STR detection has been applied to the forensic field, usually by capillary electrophoresis coupled with fluorescent labeling. Specific primer design is carried out aiming at different STR loci, amplification products with different lengths and different fluorescent labels are obtained through amplification, and the different STR loci are distinguished through capillary electrophoresis.

However, this method based on capillary electrophoresis detection has limited throughput and cannot effectively distinguish fragments of more than 1,000bp in length. In addition, the color of the fluorescence channel used is generally not more than 6, which would otherwise cause leakage problems of fluorescence of different wavelengths. Therefore, the number of STRs detected based on capillary electrophoresis generally does not exceed 60. With the development of high throughput sequencing, more and more research is beginning to be directed to the use of high throughput sequencing to detect and identify molecular markers such as STRs. The second-generation sequencing has the advantages of large flux, multiple detection sites and the like, is very suitable for identity identification and genetic inference by using STR detection, and is also more and more widely applied to the field of forensic science. For example, the kingdom killer, which was recently earned in the united states, is to identify a criminal suspect by applying genetic inference based on second-generation sequencing to find the relative of the criminal suspect.

The STR is an important object for forensic detection, on one hand, the polymorphism of the STR enables the contained information to be richer, and the information content of one STR site is about 6 SNP sites on average; on the other hand, the judicial community has accumulated STR information for tens of millions of individuals over the past few decades, and many of them are criminal suspects. The detection of STR sites using second generation sequencing has many advantages over capillary electrophoresis. The second generation sequencing can obtain the information of the length of the STR repetitive region and the information of the DNA sequence of the repetitive region, and the two kinds of information are added together to further strengthen the function of identifying the STR individuals. Research has shown that there are types with the same length and different sequences in the population, and STR detection based on capillary electrophoresis cannot distinguish different types caused by the sequence difference; the amplicon fragment required by the second-generation sequencing is smaller than that of the first-generation sequencing, and is more suitable for STR detection of some degraded DNA; the second-generation sequencing detection typing does not depend on bin values, so that new and rare typing can be detected, and researches show that the STR repetitive region of some samples can be divided into the same bin by capillary electrophoresis when insertion or deletion of a few bp occurs, so that STR typing errors are brought; by adding different sequence labels to different samples, the second-generation sequencing can detect hundreds or even thousands of samples at the same time, and the detection flux is large; the second generation sequencing technology can fuse target molecule markers such as STR, SNP, mitochondrial DNA and the like into one panel, and can obtain a large amount of useful information in one panel. Tests that have begun to use second generation sequencing instead of capillary electrophoresis to detect STRs in forensic systems in some states and cities in the united states are currently expected to begin to be practiced in some states and cities in recent years.

Although China also conducts active research and investigation in the related field, and some companies have also introduced their own panel products, the existing methods for detecting STR based on second-generation sequencing still have shortcomings. The currently mainstream sequencer is the sequencer of the U.S. illumina company, and the sequencing length of the sequencer does not exceed 250bp generally, which provides a challenge for accurate typing of the long repetitive sequence STR. On the other hand, the second generation sequencing detection method for STR relies on the matching of flanking sequences on both sides, and if the flanking sequences are mismatched, inserted or deleted, the STR region cannot be accurately located, which affects the detection of the number of repeats. In addition, the current confidence-generation analysis method for mainstream STR typing is to check whether the number of reads of the second typing is 40% or more of that of the first typing, if so, the site is considered to be heterozygous, and two different types are provided (namely, the STR site has two repeat numbers). Since the data size of the second-generation sequencing is deviated along with the length of the fragment, and generally, the longer the fragment is, the smaller the data size is, when detecting the heterozygous STR locus, if the difference between the two typing repetition numbers is large, the number of reads on the two allels may be much different (sometimes, the ratio of the two is over 1: 10), and the STR typing with the large repetition number is missed according to the 40% threshold.

Therefore, the invention is provided in view of the problem that the current next generation sequencing still has a plurality of inaccurate typing for STR typing detection.

Disclosure of Invention

The invention aims to solve the problems in the prior art, and at least the following technical innovations are made in order to overcome the problems:

1. aiming at the possible mismatches, insertions and deletions of flanking sequences, the invention provides an improved Needleman-Wunsch algorithm which can realize the simultaneous alignment of two flanking sequences within O (MN) time, simultaneously considers the mismatches, insertions and deletions and crosses the middle repeated region.

2. Aiming at the problem of inaccurate STR typing of second-generation sequencing detection, the invention provides a novel STR typing method for setting a dynamic threshold value based on the detection of an effective peak value (the time complexity is O (T ^2)) and by combining the difference of the repetition number.

Specifically, the invention provides the following technical scheme:

the invention firstly provides a searching and comparing method of STR flanking sequences based on next generation sequencing, which comprises the following steps:

step 1) raw data processing step: comparing the fastq file with an STR reference sequence library;

step 2) construction of comparison algorithm:

a. knowing the left flanking sequence flank _ left, the right flanking sequence flank _ right and the sequence to be analyzed S1, splicing the flank _ left and the flank _ right into a sequence S2 in sequence;

b. initializing a score matrix: constructing a score matrix by taking S1 as a column and S2 as a row, and setting the penalty scores of the 1 st column as 0; setting punishment scores of two positions at the splicing position of the left flanking sequence and the right flanking sequence in the 1 st row S2 as-10 respectively, and setting punishment scores of the rest positions as-1;

c. filling a score matrix: setting a matching score match _ score equal to 1, a mismatch score mismatch _ score equal to-1, and a gap score gap _ score equal to-2; for each cell (i, j) in the matrix, where i represents the index of the matrix row and j represents the index of the matrix column, the score is calculated as:

if j ≠ x and j ≠ y: F _ ij ═ max (F _ (i-1, j-1) + S (a _ i, B _ j), F _ (i, j-1) + d, F _ (i-1, j) + d);

if j ═ x, F _ ij ═ max (F _ (i-1, j-1) +2S (a _ i, B _ j), F _ (i, j-1) +0, F _ (i-1, j) +2 xd);

if j is y, F ij max (F (i-1, j-1) +2S (a _ i, B _ j), F (i, j-1) +2 xd, F (i-1, j) + 0);

where x is the length of the sweep _ left, y is x +1, d is gap _ score, S (a _ i, B _ j) is match _ score or mismatch _ score;

step 3) STR site repeat region anchoring: reading each read aligned in the step 1), and dividing into two cases according to whether the STR reference sequence can be aligned: if the STR locus is compared, searching an STR repetitive region by using the flanking sequence of the STR locus and the comparison algorithm; if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and searching an STR repeated region by using the comparison algorithm; if the result of the flanking sequence comparison does not exceed 2 mismatches, insertions and deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.

Further, the method further comprises:

step 4): and calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.

Further, the step 2) comprises:

d. acquiring a repetitive region sequence: selecting the first maximum score S of the last column in the matrix from bottom to top as the comparison score of the flanking sequences, and if the length of the sequence S2 is n, if n-S is less than or equal to 4, comparing S1 with S2 under the condition of allowing 2 mismatches, insertions or deletions; and (5) tracing back to the left by taking the coordinate of S as a source to obtain coordinates i1 and i2 of a time line when j is x and j is y, and taking a sequence in the middle of positions i1 and i2 on S1 as an STR locus repetitive region sequence.

The invention also provides an STR parting method for setting a dynamic threshold value based on detecting an effective peak value and combining the difference of the number of repetitions, which is characterized in that:

1) counting the number of Reads (RC) detected by each repeat number on each STR locus of the sample based on the repeat number of the STR loci compared by a single read of the second-generation sequencing;

2) selecting a first type: selecting typing according to the peak value position, regarding each STR locus, taking the maximum repeat number A of RC as a peak, judging the repeat number as a first typing, otherwise, judging that the STR locus does not detect any typing;

3) traversing and selecting a second subtype: after one peak is found, the RC value of the peak is assigned to 0, and the RC of the first parting is assigned to 0, namely RC (A) is assigned to 0; assuming that RC of the repetition number A, B is RC (a) and RC (B), respectively, the repetition number B is determined to be the peak if the following 3 conditions are satisfied:

rc (B)/rc (a) ≧ gradient threshold T of to-be-detected typing B, which is initial gradient +0.25 × (left side gradient or right side gradient) × (a-B);

b.RC(B)≥3；

RC (B) higher than RC of adjacent repeat number;

4) outputting a typing result, wherein if a first typing and a second typing which meet the requirements are found at the same time, the typing result is (A, B); if the first classification meeting the requirement is found but the second classification meeting the requirement is not found, the classification result is (A); if the first and second typing satisfying the requirement are not found, no typing is detected.

Further, in the step 2), RC of the maximum repetition number A is more than or equal to 20.

Further, the gradient threshold T of the to-be-detected type B is calculated as follows:

t is 0.28+0.25 × step x (A-B), wherein,

if a-B >0, step ═ 0.19;

if a-B <0, step ═ 0.33;

preferably, T is 0.02.

Further, the number of the STR loci repeated in the single read alignment in step 1) based on the second generation sequencing can be specifically determined by any one of the methods described above.

The invention also provides an STR typing system for performing dynamic threshold setting based on the detection of the effective peak value and the combination of the difference of the number of repetitions, and the system comprises the following modules for executing the steps of any one of the methods.

The invention also provides a computer-readable medium, in which a computer program is stored, which, when executed by a processor, implements any of the methods described above.

The invention also provides an electronic device, which comprises a processor and a memory, wherein one or more readable instructions are stored on the memory, and when the one or more readable instructions are executed by the processor, any one of the methods is realized.

Compared with the prior art, the invention has at least the following advantages:

1) the invention has the advantages of high accuracy, high speed and the like. The temporal complexity of the search for flanking sequences of the present invention is O (MN), while the temporal complexity of the search for typing is O (T)²) Where M is the sum of the lengths of the two flanking sequences, N is the length of read, and T is the number of total typing detected for a particular STR locus. When 1-2 mismatches, insertions or deletions occur in the flanking sequence, the improved Needleman-Wunsch algorithm can still detect the STR repetitive region according to the flanking sequence, thereby effectively avoiding the situation that the read containing the mismatched, inserted or deleted flanking sequence is missed by the conventional detection method, having higher data utilization rate and more accurate detection result.

2) The invention provides an algorithm for detecting peaks, wherein different reads quantity threshold values are set for a second section of an STR locus, the threshold values are reduced along with the increase of the number of the second section and increased along with the reduction of the number of the second section. By setting a dynamic threshold, the influence of the sequencing length on the typing detection can be effectively corrected, and the probability of detecting two sites of the heterozygous STR at the same time is improved.

3) The method provided by the invention obviously improves the application of the second-generation sequencing technology to the typing detection of STR, and has the advantages of strong objectivity and high accuracy of detection results, and popularization and use values.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of the second generation sequencing detection STR of the present invention;

FIG. 2 is a schematic diagram of a method for identifying flanking sequences by the modified NW algorithm;

FIG. 3 details of the modified NW algorithm and an example of a scoring matrix;

FIG. 4 effect of conventional sequencing length on the number of reads for longer typing;

FIG. 5 is a schematic diagram of the algorithm for second generation sequencing by peak detection STR typing according to the present invention;

FIG. 6 is an adjustment of the read quantity threshold parameter of the present invention;

FIG. 7 parameter adjustment of peak and gradient thresholds according to the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following terms or definitions are provided only to aid in understanding the present invention. These definitions should not be construed to have a scope less than understood by those skilled in the art.

Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present invention are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.

As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.

Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.

The terms "about" and "substantially" in the present invention denote an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The general idea of the method comprises the following two aspects:

the invention firstly provides a searching and comparing method for STR flanking sequences. This method can consider mismatch, insertion and deletion, and search the algorithm of STR site two flanking sequences at the same time, including the following steps:

1. raw data processing

1) And (3) data collection, storing an original file in a BCL format output by the second-generation sequencer in a target path, splitting the BCL file into files in a fastq format by using BCL2fastq, and storing the files in a specified path.

2) And (3) comparing sequences, arranging all STR loci, extending 300bp upwards and downwards respectively to serve as reference sequence libraries, and comparing the fastq file with the reference sequence libraries by using BWA to obtain an output file in an SAM format.

2. Search for flanking sequences in the read:

3) an alignment algorithm is constructed (in the following 5 steps):

a) given the left flank sequence flank _ left, the right flank sequence flank _ right, and the read sequence to be analyzed S1, let x be the left flank sequence length (i.e., x be the length of the left flank sequence), let y be x +1, and splice flank _ left, flank _ right into the sequence S2 in order. The x and y values obtained here represent the positions at which the flanking sequences are joined.

b) Initializing a score matrix: constructing a score matrix by taking S1 as a column and S2 as a row, and setting the penalty scores of the 1 st column to be 0 (namely, the values of the first column of the matrix are 0); the penalty score is set to-10 for the flanking sequence junction in row 1 (i.e., the x and y positions) and to-1 for the remaining positions (i.e., -1 for the values in the first row of the matrix, except for the flanking sequence junction at-10).

c) Filling a score matrix: assuming that the match score match _ score is 1, the mismatch score mismatch _ score is-1, and the gap score gap _ score is-2; for each cell (i, j) in the matrix, where i represents the index of the matrix row and j represents the index of the matrix column, the score is calculated as:

if j is not equal to x and j is not equal to y, F_ij＝max(F_i-1,j-1+S(A_i,B_j),F_i,j-1+d,F_i-1,j+d)；

If j is x, F_ij＝max(F_i-1，j-1+2S(A_i,B_j),F_i,j-1+0,F_i-1,j+2×d)；

If j is y, F_ij＝max(F_i-1,j-1+2S(A_i,B_j),F_i,j-1+2×d,F_i-1,j+0)；

Wherein d is gap _ score, S (a)_i,B_j) Match _ score or mismatch _ score.

d) Acquiring a repetitive region sequence: selecting the first maximum score S of the last column from bottom to top as an alignment score, assuming that the length of the sequence S2 is n, if n-S is less than or equal to 4, under the condition of allowing 2 mismatches, insertions or deletions, the sequence S1 and the sequence S2 can be aligned; and (3) going back to the left by taking the coordinate of S as a source, obtaining coordinates i1 and i2 of a time line when j is x and j is y, and finally, taking a sequence in the middle of positions i1 and i2 on S1 as an STR locus repetitive region sequence.

4) STR site repeat region anchoring: reading each read of the SAM file, and classifying into two cases according to whether it can compare the upper STR reference sequence: 1. if the STR locus is compared, finding out an STR repetitive region by using the flanking sequence of the STR locus and the algorithm; 2. if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and finding an STR repeated region by using the algorithm; if the comparison result does not exceed 2 mismatches, insertions or deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.

5) And calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.

Secondly, further aiming at all reads and the number of repetitions of one STR locus, the invention also relates to a novel method for STR typing based on detecting effective peaks and setting a dynamic threshold value by combining the difference of the number of repetitions, and specifically comprises the following steps:

after the search ratios of flanking sequences are concluded, the following is a new method of typing based on peak detection and dynamic threshold:

6) and counting the number of Reads (RC) detected at each repeat number of each STR locus of the sample based on the comparison of the repeat numbers of the STR loci on the single read of the second-generation sequencing.

7) Selecting a first type: and selecting typing according to the peak position, wherein for each STR locus, the maximum repetition number A of RC is a peak, and if the value of RC (A) is too low, the locus is considered to have no typing detected.

8) Traversing and selecting a second subtype: in order to avoid the interference of the peak which is found to the subsequent peak finding, the RC value of the peak is assigned to be 0 after one peak is found, namely RC (A) is assigned to be 0; and if the following two conditions are met, judging the repetition number B as a peak:

a. a gradient threshold T of the to-be-detected type B is calculated, T ═ initial gradient +0.25 × (left or right gradient) × (a-B). Where A represents the first type detected, B represents the type being detected, and the ACR value represents the ratio between the RC of the B type and the A type. RC of the repetition number B is higher than the gradient threshold T of the typing.

b. The RC of the repetition number B needs to be higher than the RC of the adjacent repetition number.

9) Outputting a typing result, and if a first typing and a second typing which meet the requirements are simultaneously found A, B, the typing result is (A, B); if a first subtype A meeting the requirement is found but a second subtype meeting the requirement is not found, the subtype result is (A); if the first and second typing satisfying the requirement are not found, no typing is detected.

The following are specific embodiments.

Example 1 inventive Process architecture construction

The flow of the overall analysis of the second-generation sequencing STR is shown in figure 1, and in the flow, firstly, the BCL file of the offline data is split into fastq, and then, the reads are compared to obtain the STR locus to which each read belongs. And (3) for all reads aligned to a certain locus, matching the flanking sequences, and if the flanking sequences are matched, calculating the number of repeats contained in the repeated region by using a repeat number calculation formula. Then counting the number of reads of each type in each locus, and finally obtaining correct type by using a type division algorithm. The key step is the detection of the repeat region and the determination of the type, as follows.

1. STR flanking sequence searching/comparing method construction

When using flanking sequence matched reads, the flanking sequences may contain mismatches, insertions and deletions, for example 2 SNPs around D13S 317. To take these factors into account, the present invention improves the Needleman-Wunsch algorithm (NW algorithm) to be able to search for both left and right flanking sequences simultaneously and to be able to cross the middle repeat region of unknown repeat number. FIG. 2 shows the manner in which the flanking sequences are identified, with the top of the figure being the structure of the read and where the flanking sequences are ideally located, the middle being the point of improvement, and the bottom being an example of an alignment of the two flanking sequences with the read.

Specifically, the improved Needleman-Wunsch alignment algorithm is as follows:

given the left flank sequence flank _ left, the right flank sequence flank _ right, and the read sequence to be analyzed S1, let x be the left flank sequence length (i.e., x be the length of the left flank sequence), let y be x +1, and splice flank _ left, flank _ right into the sequence S2 in order. The x and y values obtained here represent the positions at which the flanking sequences are joined.

The modified Needleman-Wunsch alignment algorithm is demonstrated by an example, assuming that flank _ left is TCA, flank _ right is TC, and S1 is TCATTTC, then S2 is TCATC, x is 3, and y is 4 (see fig. 3 for details of the improved NW algorithm and scoring matrices, with the case scoring matrices on the left and the case scoring matrices on the right in fig. 3 for specific improvements to the NW algorithm).

Initializing a score matrix: the penalty score of column 1 is 0; the punishment of the connection position of the 1 st row of the flanking sequence is-10, and the punishment of the rest positions is-1;

filling a score matrix: assuming that the match score match _ score is 1, the mismatch score mismatch _ score is-1, and the gap score gap _ score is-2; for each cell (i, j), the score calculation formula is:

if j! X and j! F as y_ij＝max(F_i-1,j-1+S(A_i,B_j),F_i,j-1+d,F_i-1,j+d)

If j is x, F_ij＝max(F_i-1,j-1+2S(A_i,B_j),F_i,j-1+0,F_i-1,j+2×d)

If j is y, F_ij＝max(F_i-1,j-1+2S(A_i,B_j),F_i,j-1+2×d,F_i-1,j+0)

Wherein d is gap _ score, S (a)_i,B_j) Match _ score or mismatch _ score. The specific populated score matrix is shown in fig. 3.

Acquiring a repetitive region sequence: selecting the first maximum score S of the last column from bottom to top as an alignment score, assuming that the length n of the sequence S2 is equal to length (S2), if n-S is equal to or less than 4, S1 and S2 can be aligned if 2 mismatches, insertions or deletions are allowed; and (3) going back to the left by taking the coordinate of S as a source, obtaining the coordinates i1 and i2 of i when j is x and j is y, and finally taking a sequence (TT) between i1 and i2 on S1 as an STR locus repetitive region sequence.

STR site repeat region anchoring: reading each read of the SAM file, and classifying into two cases according to whether it can compare the upper STR reference sequence: 1. if the STR locus is compared, finding out an STR repetitive region by using the flanking sequence of the STR locus and the algorithm; 2. if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and finding an STR repeated region by using the algorithm; if the comparison result does not exceed 2 mismatches, insertions or deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.

And calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.

2. STR parting method construction

After the length of the repeat region of each read in the STR locus is detected, the next step is to find the correct typing. The traditional approach is to look at the percentage of the number of reads that the second highest type accounts for the number of reads of the first highest type (ACR value), and if the ACR value is higher than 40% or 50%, the second highest type is generally considered a correct type. However, in capillary electrophoresis and next-generation sequencing, the longer the repeat region, the lower the read number or fluorescence intensity of the typing. Furthermore, this trend varies with the length of the read sequencing, as shown in FIG. 4. FIG. 4 is a graph of the effect of sequencing read length on the number of reads for longer typing where the ratio of sequencing read length to the number of reads between two typing (ACR value) has been plotted. For two allels at the same STR locus, the shorter the sequencing read length, the greater the difference in the number of reads.

To this end, the present invention has developed an algorithm for peak-based detection typing, which is shown in fig. 5, fig. 5 is a schematic diagram of an algorithm for second generation sequencing STR typing by peak detection, wherein the second row of the graph emphasizes the effect of removing the previously found peaks when two typing neighbors. Specifically, the highest read score must be the peak, and the next step is to assign the read number of the highest peak to 0, in order to avoid the situation where the second highest correct score is adjacent to the highest score. And searching the peak position again, and circulating in sequence.

However, due to sequencing and PCR errors, there will usually be reads at incorrect typing positions, so the present invention adds the number of reads as a threshold, except for a peak for a certain typing, and the number of reads must exceed the threshold. In order to take into account the fact that the larger the difference in the number of reads between the types, the larger the length, the gradient threshold is used, i.e. the threshold varies with the variation in the difference in the number of repetitions of the type to be detected and the type of the highest number of reads, the lower the threshold if the number of repetitions of the type to be detected is larger than the number of repetitions of the highest type, and the higher the threshold otherwise.

In the second generation sequencing, because the imbalance between PCR and sequencing often results in an insufficient number of reads for a certain sample or a certain locus, the position is excluded when the number of reads is less than a certain threshold. The present invention uses actual samples to find a more ideal threshold of 20. As shown in fig. 6, the variation in the inconsistency rate is large when the threshold value of the number of reads is less than 20, and is small when the threshold value of the number of reads is greater than 20. Wherein the number of detected types is gradually reduced with the increase of the read number threshold.

In addition to the threshold for the number of reads, additional key parameters are the left threshold gradient, the right threshold gradient, and the initial gradient. Assuming that the type with the highest number of reads is A, the threshold value of a certain type B is: if B > a, the threshold is the initial threshold + (a-B) multiplied by the right threshold gradient; if B < A, the threshold is the initial threshold + (A-B) multiplied by the left threshold gradient. The present invention determines the optimal values of the three parameters using dozens of real samples and a triple loop traversal, and uses a topographic map to see if the resulting three thresholds are indeed optimal, with the results shown in fig. 7. The positions indicated by arrows in the left and right graphs in fig. 7 are the optimal values of the left and right gradient thresholds. The left and right gradient thresholds in the graph are both multiplied by 4 to make the three values more similar in magnitude. It can be seen that the optimal value of the initial gradient obtained therein is 0.28, the optimal value of the left-side gradient is 0.19, the optimal value of the right-side gradient is 0.33, and the position of the arrow in the figure is the position of the value.

After obtaining the parameters, a specific improved typing method is described as follows, and the pseudo code based on R is shown in Table 1:

selecting a first type: and selecting typing according to the peak value position, wherein for each STR locus, the maximum repeat number A of RC is a peak, if the RC of the repeat number is more than or equal to 20, the repeat number is judged as the first typing, otherwise, the STR locus is judged not to detect any typing.

Traversing and selecting a second subtype: in order to avoid the interference of the found peak to the subsequent peak finding, the RC value of the peak is assigned to 0 after one peak is found, and the RC of the first typing is assigned to 0, namely RC (A) is assigned to 0; assuming that RC of the repetition number A, B is RC (a) and RC (B), respectively, if the following two conditions are satisfied, the repetition number B is determined to be a peak:

a. a gradient threshold T of the to-be-detected type B is calculated, T ═ initial gradient +0.25 × (left or right gradient) × (a-B). A specific formula obtained by optimizing the parameters is T ═ 0.28+0.25 × step × (a-B), where if a-B >0, step ═ 0.19 (left side gradient); if a-B <0, step is 0.33 (right gradient). As known, ACR ≧ RC (B)/RC (A) is required to satisfy ACR ≧ T, ACR ≧ 0.02 and RC (B) ≧ 3; where A represents the first type detected, B represents the type being detected, and the ACR value represents the ratio between the RC of the B type and the A type.

b. The RC of the repetition number B needs to be higher than the RC of the adjacent repetition number. Let the integer repetition numbers adjacent to each other on the left and right of the repetition number B be B _ left and B _ right, respectively, and RC corresponding to the adjacent repetition numbers is RC (left), RC (right), where 0< B-B _ left ≦ 1,0< B _ right-B ≦ 1, RC (left) | | 0, and RC (right) | | | 0. It is necessary to satisfy both RC (B) > RC (left) and RC (B) > RC (right).

If the second fractal is found, the output fractal is (A, B); if only one type is found, the final type is A; if no typing is found, no typing is found.

Table 1: the pseudo code of the specific algorithm for detecting STR typing by next generation sequencing is based on R language.

Example 2 sample detection test

After the algorithm is constructed and the optimal parameter system is obtained, the method provided by the invention is tested by respectively using the standard sample and the real sample. Specifically, the test was performed using standard positive samples 9948 and 9947 cell lines, and 70 sets of authentic samples. The positive control for the real sample test is the result of Capillary Electrophoresis (CE). The specific steps comprise firstly establishing a second generation library of STR loci and high-throughput sequencing; splitting data after off-line, and comparing sequences; finally, the method provided by the invention is applied to detect the number of repetition and type the STR, and the old method (directly type by ACR value) is used as a contrast.

Test results for the standard samples:

STR typing was tested against 66 sites of the 9948 cell line using the new method with 100% agreement between NGS and CE and 98% agreement using the old method. The agreement rate of NGS and CE was 99.9% for 37 sites of the 9947 cell line, and 99.1% using the old method.

Detection results for real samples:

the consensus rate of 66 sites of 73 real samples by using a new flanking sequence searching method and a new detecting STR typing method is 99.6%, wherein a small number of inconsistent parts have reasons of NGS and a small number of inconsistent parts have reasons of CE. While the agreement rate using the old method was 93.8%. Specific results are shown in tables 2 and 3.

Table 2: differentiation of results obtained for real samples at partial STR sites using different methods. The results for the agreement between NGS and CE are shown in grey boxes.

Table 3: and (3) obtaining the typing consistency rate of the NGS and the CE in real samples by using different methods. The format of (a/b) is:

a is a consistent sample-bit pair and b is a total sample-bit pair.

Therefore, the method provided by the invention has strong advantages compared with the traditional method. The analysis consistency rate is obviously improved no matter whether the single new repeat number detection method or the typing of the new STR is adopted independently or the two methods are combined. In particular, the agreement rate increased from 93.8% to 99.6% after the combination of the two methods of the present invention. Also, significant improvements in effect were seen at some STR sites, such as D13S 317.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A searching and comparing method for STR flanking sequences based on second-generation sequencing is characterized by comprising the following steps:

step 2) construction of comparison algorithm:

2. The method for searching and aligning STR flanking sequences based on secondary sequencing of claim 1, wherein the method further comprises:

3. The method for searching and aligning STR flanking sequences based on secondary sequencing of any one of claims 1-2, wherein the step 2) further comprises:

4. An STR typing method for setting a dynamic threshold value based on detection of an effective peak value and combination of a difference of a repetition number is characterized in that:

b.RC(B)≥3；

RC (B) higher than RC of adjacent repeat number;

5. The STR typing method based on detection of valid peaks and dynamic thresholding in combination with the difference in the number of repeats of claim 4 wherein in step 2) RC for the maximum number of repeats A is 20 or more.

6. The STR typing method according to any one of claims 4 to 5, wherein in step 3), the gradient threshold T of the type B to be detected is calculated as follows:

t is 0.28+0.25 × step x (A-B), wherein,

if a-B >0, step ═ 0.19;

if a-B <0, step ═ 0.33;

preferably, T is 0.02.

7. The STR typing method based on the detection of valid peaks and dynamic thresholding combined with difference in the number of repeats as claimed in any one of claims 4 to 6, wherein the number of repeats of STR loci in the single read alignment based on second generation sequencing in step 1) is determined by the method as claimed in any one of claims 1 to 5.

8. An STR typing system based on detecting valid peaks and dynamic thresholding in combination with difference in number of repeats, characterized in that the system comprises modules for performing the steps of the method according to any of claims 4 to 7.

9. A computer-readable medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.

10. An electronic device comprising a processor and a memory, the memory having stored thereon one or more readable instructions that, when executed by the processor, implement the method of any of claims 1-7.