CN113362892A - Method for detecting and typing repetition number of short tandem repeat sequence - Google Patents

Method for detecting and typing repetition number of short tandem repeat sequence Download PDF

Info

Publication number
CN113362892A
CN113362892A CN202110669050.9A CN202110669050A CN113362892A CN 113362892 A CN113362892 A CN 113362892A CN 202110669050 A CN202110669050 A CN 202110669050A CN 113362892 A CN113362892 A CN 113362892A
Authority
CN
China
Prior art keywords
str
typing
sequence
score
repeat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110669050.9A
Other languages
Chinese (zh)
Other versions
CN113362892B (en
Inventor
李梦
郭茂平
胡欢
陈初光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuewei Gene Technology Co ltd
Original Assignee
Beijing Yuewei Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuewei Gene Technology Co ltd filed Critical Beijing Yuewei Gene Technology Co ltd
Priority to CN202111224049.1A priority Critical patent/CN113724783B/en
Priority to CN202110669050.9A priority patent/CN113362892B/en
Publication of CN113362892A publication Critical patent/CN113362892A/en
Application granted granted Critical
Publication of CN113362892B publication Critical patent/CN113362892B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the field of sequencing data generation analysis, and particularly provides a short tandem repeat number detection and typing method based on next generation sequencing. The detection method can cross the middle repetitive region, simultaneously align two flanking sequences and simultaneously consider the mismatching, the insertion and the deletion of the flanking sequences; the typing method is an STR typing method which is based on effective peak detection and combines the difference of the number of repeats to set a dynamic threshold value, and can be applied to the detection of the number of repeats of different next generation sequencing platforms.

Description

Method for detecting and typing repetition number of short tandem repeat sequence
Technical Field
The invention belongs to the field of letter generation analysis, and particularly relates to a method for detecting and typing a repeat number of a short tandem repeat sequence.
Background
Short Tandem Repeat (STR), also called microsatellite DNA, is a DNA sequence synthesized by taking 2-6 bases as a core in tandem, and the STR has the characteristics of high mutation rate, polymorphism, easiness in detection and the like, so that the short tandem repeat is widely applied to detection related to forensics. Since 1985, STR detection has been applied to the forensic field, usually by capillary electrophoresis coupled with fluorescent labeling. Specific primer design is carried out aiming at different STR loci, amplification products with different lengths and different fluorescent labels are obtained through amplification, and the different STR loci are distinguished through capillary electrophoresis.
However, this method based on capillary electrophoresis detection has limited throughput and cannot effectively distinguish fragments of more than 1,000bp in length. In addition, the color of the fluorescence channel used is generally not more than 6, which would otherwise cause leakage problems of fluorescence of different wavelengths. Therefore, the number of STRs detected based on capillary electrophoresis generally does not exceed 60. With the development of high throughput sequencing, more and more research is beginning to be directed to the use of high throughput sequencing to detect and identify molecular markers such as STRs. The second-generation sequencing has the advantages of large flux, multiple detection sites and the like, is very suitable for identity identification and genetic inference by using STR detection, and is also more and more widely applied to the field of forensic science. For example, the kingdom killer, which was recently earned in the united states, is to identify a criminal suspect by applying genetic inference based on second-generation sequencing to find the relative of the criminal suspect.
The STR is an important object for forensic detection, on one hand, the polymorphism of the STR enables the contained information to be richer, and the information content of one STR site is about 6 SNP sites on average; on the other hand, the judicial community has accumulated STR information for tens of millions of individuals over the past few decades, and many of them are criminal suspects. The detection of STR sites using second generation sequencing has many advantages over capillary electrophoresis. The second generation sequencing can obtain the information of the length of the STR repetitive region and the information of the DNA sequence of the repetitive region, and the two kinds of information are added together to further strengthen the function of identifying the STR individuals. Research has shown that there are types with the same length and different sequences in the population, and STR detection based on capillary electrophoresis cannot distinguish different types caused by the sequence difference; the amplicon fragment required by the second-generation sequencing is smaller than that of the first-generation sequencing, and is more suitable for STR detection of some degraded DNA; the second-generation sequencing detection typing does not depend on bin values, so that new and rare typing can be detected, and researches show that the STR repetitive region of some samples can be divided into the same bin by capillary electrophoresis when insertion or deletion of a few bp occurs, so that STR typing errors are brought; by adding different sequence labels to different samples, the second-generation sequencing can detect hundreds or even thousands of samples at the same time, and the detection flux is large; the second generation sequencing technology can fuse target molecule markers such as STR, SNP, mitochondrial DNA and the like into one panel, and can obtain a large amount of useful information in one panel. Tests that have begun to use second generation sequencing instead of capillary electrophoresis to detect STRs in forensic systems in some states and cities in the united states are currently expected to begin to be practiced in some states and cities in recent years.
Although China also conducts active research and investigation in the related field, and some companies have also introduced their own panel products, the existing methods for detecting STR based on second-generation sequencing still have shortcomings. The currently mainstream sequencer is the sequencer of the U.S. illumina company, and the sequencing length of the sequencer does not exceed 250bp generally, which provides a challenge for accurate typing of the long repetitive sequence STR. On the other hand, the second generation sequencing detection method for STR relies on the matching of flanking sequences on both sides, and if the flanking sequences are mismatched, inserted or deleted, the STR region cannot be accurately located, which affects the detection of the number of repeats. In addition, the current confidence-generation analysis method for mainstream STR typing is to check whether the number of reads of the second typing is 40% or more of that of the first typing, if so, the site is considered to be heterozygous, and two different types are provided (namely, the STR site has two repeat numbers). Since the data size of the second-generation sequencing is deviated along with the length of the fragment, and generally, the longer the fragment is, the smaller the data size is, when detecting the heterozygous STR locus, if the difference between the two typing repetition numbers is large, the number of reads on the two allels may be much different (sometimes, the ratio of the two is over 1: 10), and the STR typing with the large repetition number is missed according to the 40% threshold.
Therefore, the invention is provided in view of the problem that the current next generation sequencing still has a plurality of inaccurate typing for STR typing detection.
Disclosure of Invention
The invention aims to solve the problems in the prior art, and at least the following technical innovations are made in order to overcome the problems:
1. aiming at the possible mismatches, insertions and deletions of flanking sequences, the invention provides an improved Needleman-Wunsch algorithm which can realize the simultaneous alignment of two flanking sequences within O (MN) time, simultaneously considers the mismatches, insertions and deletions and crosses the middle repeated region.
2. Aiming at the problem of inaccurate STR typing of second-generation sequencing detection, the invention provides a novel STR typing method for setting a dynamic threshold value based on the detection of an effective peak value (the time complexity is O (T ^2)) and by combining the difference of the repetition number.
Specifically, the invention provides the following technical scheme:
the invention firstly provides a searching and comparing method of STR flanking sequences based on next generation sequencing, which comprises the following steps:
step 1) raw data processing step: comparing the fastq file with an STR reference sequence library;
step 2) construction of comparison algorithm:
a. knowing the left flanking sequence flank _ left, the right flanking sequence flank _ right and the sequence to be analyzed S1, splicing the flank _ left and the flank _ right into a sequence S2 in sequence;
b. initializing a score matrix: constructing a score matrix by taking S1 as a column and S2 as a row, and setting the penalty scores of the 1 st column as 0; setting punishment scores of two positions at the splicing position of the left flanking sequence and the right flanking sequence in the 1 st row S2 as-10 respectively, and setting punishment scores of the rest positions as-1;
c. filling a score matrix: setting a matching score match _ score equal to 1, a mismatch score mismatch _ score equal to-1, and a gap score gap _ score equal to-2; for each cell (i, j) in the matrix, where i represents the index of the matrix row and j represents the index of the matrix column, the score is calculated as:
if j ≠ x and j ≠ y: F _ ij ═ max (F _ (i-1, j-1) + S (a _ i, B _ j), F _ (i, j-1) + d, F _ (i-1, j) + d);
if j ═ x, F _ ij ═ max (F _ (i-1, j-1) +2S (a _ i, B _ j), F _ (i, j-1) +0, F _ (i-1, j) +2 xd);
if j is y, F ij max (F (i-1, j-1) +2S (a _ i, B _ j), F (i, j-1) +2 xd, F (i-1, j) + 0);
where x is the length of the sweep _ left, y is x +1, d is gap _ score, S (a _ i, B _ j) is match _ score or mismatch _ score;
step 3) STR site repeat region anchoring: reading each read aligned in the step 1), and dividing into two cases according to whether the STR reference sequence can be aligned: if the STR locus is compared, searching an STR repetitive region by using the flanking sequence of the STR locus and the comparison algorithm; if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and searching an STR repeated region by using the comparison algorithm; if the result of the flanking sequence comparison does not exceed 2 mismatches, insertions and deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.
Further, the method further comprises:
step 4): and calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.
Further, the step 2) comprises:
d. acquiring a repetitive region sequence: selecting the first maximum score S of the last column in the matrix from bottom to top as the comparison score of the flanking sequences, and if the length of the sequence S2 is n, if n-S is less than or equal to 4, comparing S1 with S2 under the condition of allowing 2 mismatches, insertions or deletions; and (5) tracing back to the left by taking the coordinate of S as a source to obtain coordinates i1 and i2 of a time line when j is x and j is y, and taking a sequence in the middle of positions i1 and i2 on S1 as an STR locus repetitive region sequence.
The invention also provides an STR parting method for setting a dynamic threshold value based on detecting an effective peak value and combining the difference of the number of repetitions, which is characterized in that:
1) counting the number of Reads (RC) detected by each repeat number on each STR locus of the sample based on the repeat number of the STR loci compared by a single read of the second-generation sequencing;
2) selecting a first type: selecting typing according to the peak value position, regarding each STR locus, taking the maximum repeat number A of RC as a peak, judging the repeat number as a first typing, otherwise, judging that the STR locus does not detect any typing;
3) traversing and selecting a second subtype: after one peak is found, the RC value of the peak is assigned to 0, and the RC of the first parting is assigned to 0, namely RC (A) is assigned to 0; assuming that RC of the repetition number A, B is RC (a) and RC (B), respectively, the repetition number B is determined to be the peak if the following 3 conditions are satisfied:
rc (B)/rc (a) ≧ gradient threshold T of to-be-detected typing B, which is initial gradient +0.25 × (left side gradient or right side gradient) × (a-B);
b.RC(B)≥3;
RC (B) higher than RC of adjacent repeat number;
4) outputting a typing result, wherein if a first typing and a second typing which meet the requirements are found at the same time, the typing result is (A, B); if the first classification meeting the requirement is found but the second classification meeting the requirement is not found, the classification result is (A); if the first and second typing satisfying the requirement are not found, no typing is detected.
Further, in the step 2), RC of the maximum repetition number A is more than or equal to 20.
Further, the gradient threshold T of the to-be-detected type B is calculated as follows:
t is 0.28+0.25 × step x (A-B), wherein,
if a-B >0, step ═ 0.19;
if a-B <0, step ═ 0.33;
preferably, T is 0.02.
Further, the number of the STR loci repeated in the single read alignment in step 1) based on the second generation sequencing can be specifically determined by any one of the methods described above.
The invention also provides an STR typing system for performing dynamic threshold setting based on the detection of the effective peak value and the combination of the difference of the number of repetitions, and the system comprises the following modules for executing the steps of any one of the methods.
The invention also provides a computer-readable medium, in which a computer program is stored, which, when executed by a processor, implements any of the methods described above.
The invention also provides an electronic device, which comprises a processor and a memory, wherein one or more readable instructions are stored on the memory, and when the one or more readable instructions are executed by the processor, any one of the methods is realized.
Compared with the prior art, the invention has at least the following advantages:
1) the invention has the advantages of high accuracy, high speed and the like. The temporal complexity of the search for flanking sequences of the present invention is O (MN), while the temporal complexity of the search for typing is O (T)2) Where M is the sum of the lengths of the two flanking sequences, N is the length of read, and T is the number of total typing detected for a particular STR locus. When 1-2 mismatches, insertions or deletions occur in the flanking sequence, the improved Needleman-Wunsch algorithm can still detect the STR repetitive region according to the flanking sequence, thereby effectively avoiding the situation that the read containing the mismatched, inserted or deleted flanking sequence is missed by the conventional detection method, having higher data utilization rate and more accurate detection result.
2) The invention provides an algorithm for detecting peaks, wherein different reads quantity threshold values are set for a second section of an STR locus, the threshold values are reduced along with the increase of the number of the second section and increased along with the reduction of the number of the second section. By setting a dynamic threshold, the influence of the sequencing length on the typing detection can be effectively corrected, and the probability of detecting two sites of the heterozygous STR at the same time is improved.
3) The method provided by the invention obviously improves the application of the second-generation sequencing technology to the typing detection of STR, and has the advantages of strong objectivity and high accuracy of detection results, and popularization and use values.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of the second generation sequencing detection STR of the present invention;
FIG. 2 is a schematic diagram of a method for identifying flanking sequences by the modified NW algorithm;
FIG. 3 details of the modified NW algorithm and an example of a scoring matrix;
FIG. 4 effect of conventional sequencing length on the number of reads for longer typing;
FIG. 5 is a schematic diagram of the algorithm for second generation sequencing by peak detection STR typing according to the present invention;
FIG. 6 is an adjustment of the read quantity threshold parameter of the present invention;
FIG. 7 parameter adjustment of peak and gradient thresholds according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following terms or definitions are provided only to aid in understanding the present invention. These definitions should not be construed to have a scope less than understood by those skilled in the art.
Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present invention are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.
As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.
Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.
The terms "about" and "substantially" in the present invention denote an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.
Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.
The general idea of the method comprises the following two aspects:
the invention firstly provides a searching and comparing method for STR flanking sequences. This method can consider mismatch, insertion and deletion, and search the algorithm of STR site two flanking sequences at the same time, including the following steps:
1. raw data processing
1) And (3) data collection, storing an original file in a BCL format output by the second-generation sequencer in a target path, splitting the BCL file into files in a fastq format by using BCL2fastq, and storing the files in a specified path.
2) And (3) comparing sequences, arranging all STR loci, extending 300bp upwards and downwards respectively to serve as reference sequence libraries, and comparing the fastq file with the reference sequence libraries by using BWA to obtain an output file in an SAM format.
2. Search for flanking sequences in the read:
3) an alignment algorithm is constructed (in the following 5 steps):
a) given the left flank sequence flank _ left, the right flank sequence flank _ right, and the read sequence to be analyzed S1, let x be the left flank sequence length (i.e., x be the length of the left flank sequence), let y be x +1, and splice flank _ left, flank _ right into the sequence S2 in order. The x and y values obtained here represent the positions at which the flanking sequences are joined.
b) Initializing a score matrix: constructing a score matrix by taking S1 as a column and S2 as a row, and setting the penalty scores of the 1 st column to be 0 (namely, the values of the first column of the matrix are 0); the penalty score is set to-10 for the flanking sequence junction in row 1 (i.e., the x and y positions) and to-1 for the remaining positions (i.e., -1 for the values in the first row of the matrix, except for the flanking sequence junction at-10).
c) Filling a score matrix: assuming that the match score match _ score is 1, the mismatch score mismatch _ score is-1, and the gap score gap _ score is-2; for each cell (i, j) in the matrix, where i represents the index of the matrix row and j represents the index of the matrix column, the score is calculated as:
if j is not equal to x and j is not equal to y, Fij=max(Fi-1,j-1+S(Ai,Bj),Fi,j-1+d,Fi-1,j+d);
If j is x, Fij=max(Fi-1,j-1+2S(Ai,Bj),Fi,j-1+0,Fi-1,j+2×d);
If j is y, Fij=max(Fi-1,j-1+2S(Ai,Bj),Fi,j-1+2×d,Fi-1,j+0);
Wherein d is gap _ score, S (a)i,Bj) Match _ score or mismatch _ score.
d) Acquiring a repetitive region sequence: selecting the first maximum score S of the last column from bottom to top as an alignment score, assuming that the length of the sequence S2 is n, if n-S is less than or equal to 4, under the condition of allowing 2 mismatches, insertions or deletions, the sequence S1 and the sequence S2 can be aligned; and (3) going back to the left by taking the coordinate of S as a source, obtaining coordinates i1 and i2 of a time line when j is x and j is y, and finally, taking a sequence in the middle of positions i1 and i2 on S1 as an STR locus repetitive region sequence.
4) STR site repeat region anchoring: reading each read of the SAM file, and classifying into two cases according to whether it can compare the upper STR reference sequence: 1. if the STR locus is compared, finding out an STR repetitive region by using the flanking sequence of the STR locus and the algorithm; 2. if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and finding an STR repeated region by using the algorithm; if the comparison result does not exceed 2 mismatches, insertions or deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.
5) And calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.
Secondly, further aiming at all reads and the number of repetitions of one STR locus, the invention also relates to a novel method for STR typing based on detecting effective peaks and setting a dynamic threshold value by combining the difference of the number of repetitions, and specifically comprises the following steps:
after the search ratios of flanking sequences are concluded, the following is a new method of typing based on peak detection and dynamic threshold:
6) and counting the number of Reads (RC) detected at each repeat number of each STR locus of the sample based on the comparison of the repeat numbers of the STR loci on the single read of the second-generation sequencing.
7) Selecting a first type: and selecting typing according to the peak position, wherein for each STR locus, the maximum repetition number A of RC is a peak, and if the value of RC (A) is too low, the locus is considered to have no typing detected.
8) Traversing and selecting a second subtype: in order to avoid the interference of the peak which is found to the subsequent peak finding, the RC value of the peak is assigned to be 0 after one peak is found, namely RC (A) is assigned to be 0; and if the following two conditions are met, judging the repetition number B as a peak:
a. a gradient threshold T of the to-be-detected type B is calculated, T ═ initial gradient +0.25 × (left or right gradient) × (a-B). Where A represents the first type detected, B represents the type being detected, and the ACR value represents the ratio between the RC of the B type and the A type. RC of the repetition number B is higher than the gradient threshold T of the typing.
b. The RC of the repetition number B needs to be higher than the RC of the adjacent repetition number.
9) Outputting a typing result, and if a first typing and a second typing which meet the requirements are simultaneously found A, B, the typing result is (A, B); if a first subtype A meeting the requirement is found but a second subtype meeting the requirement is not found, the subtype result is (A); if the first and second typing satisfying the requirement are not found, no typing is detected.
The following are specific embodiments.
Example 1 inventive Process architecture construction
The flow of the overall analysis of the second-generation sequencing STR is shown in figure 1, and in the flow, firstly, the BCL file of the offline data is split into fastq, and then, the reads are compared to obtain the STR locus to which each read belongs. And (3) for all reads aligned to a certain locus, matching the flanking sequences, and if the flanking sequences are matched, calculating the number of repeats contained in the repeated region by using a repeat number calculation formula. Then counting the number of reads of each type in each locus, and finally obtaining correct type by using a type division algorithm. The key step is the detection of the repeat region and the determination of the type, as follows.
1. STR flanking sequence searching/comparing method construction
When using flanking sequence matched reads, the flanking sequences may contain mismatches, insertions and deletions, for example 2 SNPs around D13S 317. To take these factors into account, the present invention improves the Needleman-Wunsch algorithm (NW algorithm) to be able to search for both left and right flanking sequences simultaneously and to be able to cross the middle repeat region of unknown repeat number. FIG. 2 shows the manner in which the flanking sequences are identified, with the top of the figure being the structure of the read and where the flanking sequences are ideally located, the middle being the point of improvement, and the bottom being an example of an alignment of the two flanking sequences with the read.
Specifically, the improved Needleman-Wunsch alignment algorithm is as follows:
given the left flank sequence flank _ left, the right flank sequence flank _ right, and the read sequence to be analyzed S1, let x be the left flank sequence length (i.e., x be the length of the left flank sequence), let y be x +1, and splice flank _ left, flank _ right into the sequence S2 in order. The x and y values obtained here represent the positions at which the flanking sequences are joined.
The modified Needleman-Wunsch alignment algorithm is demonstrated by an example, assuming that flank _ left is TCA, flank _ right is TC, and S1 is TCATTTC, then S2 is TCATC, x is 3, and y is 4 (see fig. 3 for details of the improved NW algorithm and scoring matrices, with the case scoring matrices on the left and the case scoring matrices on the right in fig. 3 for specific improvements to the NW algorithm).
Initializing a score matrix: the penalty score of column 1 is 0; the punishment of the connection position of the 1 st row of the flanking sequence is-10, and the punishment of the rest positions is-1;
filling a score matrix: assuming that the match score match _ score is 1, the mismatch score mismatch _ score is-1, and the gap score gap _ score is-2; for each cell (i, j), the score calculation formula is:
if j! X and j! F as yij=max(Fi-1,j-1+S(Ai,Bj),Fi,j-1+d,Fi-1,j+d)
If j is x, Fij=max(Fi-1,j-1+2S(Ai,Bj),Fi,j-1+0,Fi-1,j+2×d)
If j is y, Fij=max(Fi-1,j-1+2S(Ai,Bj),Fi,j-1+2×d,Fi-1,j+0)
Wherein d is gap _ score, S (a)i,Bj) Match _ score or mismatch _ score. The specific populated score matrix is shown in fig. 3.
Acquiring a repetitive region sequence: selecting the first maximum score S of the last column from bottom to top as an alignment score, assuming that the length n of the sequence S2 is equal to length (S2), if n-S is equal to or less than 4, S1 and S2 can be aligned if 2 mismatches, insertions or deletions are allowed; and (3) going back to the left by taking the coordinate of S as a source, obtaining the coordinates i1 and i2 of i when j is x and j is y, and finally taking a sequence (TT) between i1 and i2 on S1 as an STR locus repetitive region sequence.
STR site repeat region anchoring: reading each read of the SAM file, and classifying into two cases according to whether it can compare the upper STR reference sequence: 1. if the STR locus is compared, finding out an STR repetitive region by using the flanking sequence of the STR locus and the algorithm; 2. if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and finding an STR repeated region by using the algorithm; if the comparison result does not exceed 2 mismatches, insertions or deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.
And calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.
2. STR parting method construction
After the length of the repeat region of each read in the STR locus is detected, the next step is to find the correct typing. The traditional approach is to look at the percentage of the number of reads that the second highest type accounts for the number of reads of the first highest type (ACR value), and if the ACR value is higher than 40% or 50%, the second highest type is generally considered a correct type. However, in capillary electrophoresis and next-generation sequencing, the longer the repeat region, the lower the read number or fluorescence intensity of the typing. Furthermore, this trend varies with the length of the read sequencing, as shown in FIG. 4. FIG. 4 is a graph of the effect of sequencing read length on the number of reads for longer typing where the ratio of sequencing read length to the number of reads between two typing (ACR value) has been plotted. For two allels at the same STR locus, the shorter the sequencing read length, the greater the difference in the number of reads.
To this end, the present invention has developed an algorithm for peak-based detection typing, which is shown in fig. 5, fig. 5 is a schematic diagram of an algorithm for second generation sequencing STR typing by peak detection, wherein the second row of the graph emphasizes the effect of removing the previously found peaks when two typing neighbors. Specifically, the highest read score must be the peak, and the next step is to assign the read number of the highest peak to 0, in order to avoid the situation where the second highest correct score is adjacent to the highest score. And searching the peak position again, and circulating in sequence.
However, due to sequencing and PCR errors, there will usually be reads at incorrect typing positions, so the present invention adds the number of reads as a threshold, except for a peak for a certain typing, and the number of reads must exceed the threshold. In order to take into account the fact that the larger the difference in the number of reads between the types, the larger the length, the gradient threshold is used, i.e. the threshold varies with the variation in the difference in the number of repetitions of the type to be detected and the type of the highest number of reads, the lower the threshold if the number of repetitions of the type to be detected is larger than the number of repetitions of the highest type, and the higher the threshold otherwise.
In the second generation sequencing, because the imbalance between PCR and sequencing often results in an insufficient number of reads for a certain sample or a certain locus, the position is excluded when the number of reads is less than a certain threshold. The present invention uses actual samples to find a more ideal threshold of 20. As shown in fig. 6, the variation in the inconsistency rate is large when the threshold value of the number of reads is less than 20, and is small when the threshold value of the number of reads is greater than 20. Wherein the number of detected types is gradually reduced with the increase of the read number threshold.
In addition to the threshold for the number of reads, additional key parameters are the left threshold gradient, the right threshold gradient, and the initial gradient. Assuming that the type with the highest number of reads is A, the threshold value of a certain type B is: if B > a, the threshold is the initial threshold + (a-B) multiplied by the right threshold gradient; if B < A, the threshold is the initial threshold + (A-B) multiplied by the left threshold gradient. The present invention determines the optimal values of the three parameters using dozens of real samples and a triple loop traversal, and uses a topographic map to see if the resulting three thresholds are indeed optimal, with the results shown in fig. 7. The positions indicated by arrows in the left and right graphs in fig. 7 are the optimal values of the left and right gradient thresholds. The left and right gradient thresholds in the graph are both multiplied by 4 to make the three values more similar in magnitude. It can be seen that the optimal value of the initial gradient obtained therein is 0.28, the optimal value of the left-side gradient is 0.19, the optimal value of the right-side gradient is 0.33, and the position of the arrow in the figure is the position of the value.
After obtaining the parameters, a specific improved typing method is described as follows, and the pseudo code based on R is shown in Table 1:
selecting a first type: and selecting typing according to the peak value position, wherein for each STR locus, the maximum repeat number A of RC is a peak, if the RC of the repeat number is more than or equal to 20, the repeat number is judged as the first typing, otherwise, the STR locus is judged not to detect any typing.
Traversing and selecting a second subtype: in order to avoid the interference of the found peak to the subsequent peak finding, the RC value of the peak is assigned to 0 after one peak is found, and the RC of the first typing is assigned to 0, namely RC (A) is assigned to 0; assuming that RC of the repetition number A, B is RC (a) and RC (B), respectively, if the following two conditions are satisfied, the repetition number B is determined to be a peak:
a. a gradient threshold T of the to-be-detected type B is calculated, T ═ initial gradient +0.25 × (left or right gradient) × (a-B). A specific formula obtained by optimizing the parameters is T ═ 0.28+0.25 × step × (a-B), where if a-B >0, step ═ 0.19 (left side gradient); if a-B <0, step is 0.33 (right gradient). As known, ACR ≧ RC (B)/RC (A) is required to satisfy ACR ≧ T, ACR ≧ 0.02 and RC (B) ≧ 3; where A represents the first type detected, B represents the type being detected, and the ACR value represents the ratio between the RC of the B type and the A type.
b. The RC of the repetition number B needs to be higher than the RC of the adjacent repetition number. Let the integer repetition numbers adjacent to each other on the left and right of the repetition number B be B _ left and B _ right, respectively, and RC corresponding to the adjacent repetition numbers is RC (left), RC (right), where 0< B-B _ left ≦ 1,0< B _ right-B ≦ 1, RC (left) | | 0, and RC (right) | | | 0. It is necessary to satisfy both RC (B) > RC (left) and RC (B) > RC (right).
If the second fractal is found, the output fractal is (A, B); if only one type is found, the final type is A; if no typing is found, no typing is found.
Table 1: the pseudo code of the specific algorithm for detecting STR typing by next generation sequencing is based on R language.
Figure BDA0003117270410000111
Figure BDA0003117270410000121
Example 2 sample detection test
After the algorithm is constructed and the optimal parameter system is obtained, the method provided by the invention is tested by respectively using the standard sample and the real sample. Specifically, the test was performed using standard positive samples 9948 and 9947 cell lines, and 70 sets of authentic samples. The positive control for the real sample test is the result of Capillary Electrophoresis (CE). The specific steps comprise firstly establishing a second generation library of STR loci and high-throughput sequencing; splitting data after off-line, and comparing sequences; finally, the method provided by the invention is applied to detect the number of repetition and type the STR, and the old method (directly type by ACR value) is used as a contrast.
Test results for the standard samples:
STR typing was tested against 66 sites of the 9948 cell line using the new method with 100% agreement between NGS and CE and 98% agreement using the old method. The agreement rate of NGS and CE was 99.9% for 37 sites of the 9947 cell line, and 99.1% using the old method.
Detection results for real samples:
the consensus rate of 66 sites of 73 real samples by using a new flanking sequence searching method and a new detecting STR typing method is 99.6%, wherein a small number of inconsistent parts have reasons of NGS and a small number of inconsistent parts have reasons of CE. While the agreement rate using the old method was 93.8%. Specific results are shown in tables 2 and 3.
Table 2: differentiation of results obtained for real samples at partial STR sites using different methods. The results for the agreement between NGS and CE are shown in grey boxes.
Figure BDA0003117270410000131
Table 3: and (3) obtaining the typing consistency rate of the NGS and the CE in real samples by using different methods. The format of (a/b) is:
a is a consistent sample-bit pair and b is a total sample-bit pair.
Figure BDA0003117270410000132
Therefore, the method provided by the invention has strong advantages compared with the traditional method. The analysis consistency rate is obviously improved no matter whether the single new repeat number detection method or the typing of the new STR is adopted independently or the two methods are combined. In particular, the agreement rate increased from 93.8% to 99.6% after the combination of the two methods of the present invention. Also, significant improvements in effect were seen at some STR sites, such as D13S 317.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (10)

1. A searching and comparing method for STR flanking sequences based on second-generation sequencing is characterized by comprising the following steps:
step 1) raw data processing step: comparing the fastq file with an STR reference sequence library;
step 2) construction of comparison algorithm:
a. knowing the left flanking sequence flank _ left, the right flanking sequence flank _ right and the sequence to be analyzed S1, splicing the flank _ left and the flank _ right into a sequence S2 in sequence;
b. initializing a score matrix: constructing a score matrix by taking S1 as a column and S2 as a row, and setting the penalty scores of the 1 st column as 0; setting punishment scores of two positions at the splicing position of the left flanking sequence and the right flanking sequence in the 1 st row S2 as-10 respectively, and setting punishment scores of the rest positions as-1;
c. filling a score matrix: setting a matching score match _ score equal to 1, a mismatch score mismatch _ score equal to-1, and a gap score gap _ score equal to-2; for each cell (i, j) in the matrix, where i represents the index of the matrix row and j represents the index of the matrix column, the score is calculated as:
if j ≠ x and j ≠ y: F _ ij ═ max (F _ (i-1, j-1) + S (a _ i, B _ j), F _ (i, j-1) + d, F _ (i-1, j) + d);
if j ═ x, F _ ij ═ max (F _ (i-1, j-1) +2S (a _ i, B _ j), F _ (i, j-1) +0, F _ (i-1, j) +2 xd);
if j is y, F ij max (F (i-1, j-1) +2S (a _ i, B _ j), F (i, j-1) +2 xd, F (i-1, j) + 0);
where x is the length of the sweep _ left, y is x +1, d is gap _ score, S (a _ i, B _ j) is match _ score or mismatch _ score;
step 3) STR site repeat region anchoring: reading each read aligned in the step 1), and dividing into two cases according to whether the STR reference sequence can be aligned: if the STR locus is compared, searching an STR repetitive region by using the flanking sequence of the STR locus and the comparison algorithm; if any STR locus is not compared, traversing the flanking sequences of all the STR loci, and searching an STR repeated region by using the comparison algorithm; if the result of the flanking sequence comparison does not exceed 2 mismatches, insertions and deletions in total, outputting STR loci and repetitive region sequences on the comparison, otherwise, discarding the read; finally, the sequence length of the repeat region anchored by each read is obtained.
2. The method for searching and aligning STR flanking sequences based on secondary sequencing of claim 1, wherein the method further comprises:
step 4): and calculating the number of the repeat of the STR locus on each read alignment according to the sequence length of the repeat region anchored by each read.
3. The method for searching and aligning STR flanking sequences based on secondary sequencing of any one of claims 1-2, wherein the step 2) further comprises:
d. acquiring a repetitive region sequence: selecting the first maximum score S of the last column in the matrix from bottom to top as the comparison score of the flanking sequences, and if the length of the sequence S2 is n, if n-S is less than or equal to 4, comparing S1 with S2 under the condition of allowing 2 mismatches, insertions or deletions; and (5) tracing back to the left by taking the coordinate of S as a source to obtain coordinates i1 and i2 of a time line when j is x and j is y, and taking a sequence in the middle of positions i1 and i2 on S1 as an STR locus repetitive region sequence.
4. An STR typing method for setting a dynamic threshold value based on detection of an effective peak value and combination of a difference of a repetition number is characterized in that:
1) counting the number of Reads (RC) detected by each repeat number on each STR locus of the sample based on the repeat number of the STR loci compared by a single read of the second-generation sequencing;
2) selecting a first type: selecting typing according to the peak value position, regarding each STR locus, taking the maximum repeat number A of RC as a peak, judging the repeat number as a first typing, otherwise, judging that the STR locus does not detect any typing;
3) traversing and selecting a second subtype: after one peak is found, the RC value of the peak is assigned to 0, and the RC of the first parting is assigned to 0, namely RC (A) is assigned to 0; assuming that RC of the repetition number A, B is RC (a) and RC (B), respectively, the repetition number B is determined to be the peak if the following 3 conditions are satisfied:
rc (B)/rc (a) ≧ gradient threshold T of to-be-detected typing B, which is initial gradient +0.25 × (left side gradient or right side gradient) × (a-B);
b.RC(B)≥3;
RC (B) higher than RC of adjacent repeat number;
4) outputting a typing result, wherein if a first typing and a second typing which meet the requirements are found at the same time, the typing result is (A, B); if the first classification meeting the requirement is found but the second classification meeting the requirement is not found, the classification result is (A); if the first and second typing satisfying the requirement are not found, no typing is detected.
5. The STR typing method based on detection of valid peaks and dynamic thresholding in combination with the difference in the number of repeats of claim 4 wherein in step 2) RC for the maximum number of repeats A is 20 or more.
6. The STR typing method according to any one of claims 4 to 5, wherein in step 3), the gradient threshold T of the type B to be detected is calculated as follows:
t is 0.28+0.25 × step x (A-B), wherein,
if a-B >0, step ═ 0.19;
if a-B <0, step ═ 0.33;
preferably, T is 0.02.
7. The STR typing method based on the detection of valid peaks and dynamic thresholding combined with difference in the number of repeats as claimed in any one of claims 4 to 6, wherein the number of repeats of STR loci in the single read alignment based on second generation sequencing in step 1) is determined by the method as claimed in any one of claims 1 to 5.
8. An STR typing system based on detecting valid peaks and dynamic thresholding in combination with difference in number of repeats, characterized in that the system comprises modules for performing the steps of the method according to any of claims 4 to 7.
9. A computer-readable medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
10. An electronic device comprising a processor and a memory, the memory having stored thereon one or more readable instructions that, when executed by the processor, implement the method of any of claims 1-7.
CN202110669050.9A 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence Active CN113362892B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111224049.1A CN113724783B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence
CN202110669050.9A CN113362892B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110669050.9A CN113362892B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202111224049.1A Division CN113724783B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence

Publications (2)

Publication Number Publication Date
CN113362892A true CN113362892A (en) 2021-09-07
CN113362892B CN113362892B (en) 2021-12-17

Family

ID=77534710

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111224049.1A Active CN113724783B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence
CN202110669050.9A Active CN113362892B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111224049.1A Active CN113724783B (en) 2021-06-16 2021-06-16 Method for detecting and typing repetition number of short tandem repeat sequence

Country Status (1)

Country Link
CN (2) CN113724783B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990392A (en) * 2021-10-28 2022-01-28 中南大学湘雅医院 Pathogenicity analysis method and device of short tandem repeat sequence and server
CN114530200A (en) * 2022-03-18 2022-05-24 北京阅微基因技术股份有限公司 Mixed sample identification method based on calculation of SNP entropy

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272311A (en) * 2012-02-08 2015-01-07 陶氏益农公司 Data analysis of DNA sequences
US20180163265A1 (en) * 2014-12-19 2018-06-14 The Broad Institute Inc. Unbiased identification of double-strand breaks and genomic rearrangement by genome-wide insert capture sequencing
CN110570901A (en) * 2019-09-03 2019-12-13 北京市农林科学院 method and system for SSR typing based on sequencing data
CN110870020A (en) * 2017-10-16 2020-03-06 因美纳有限公司 Aberrant splicing detection using Convolutional Neural Network (CNNS)
CN111508561A (en) * 2019-07-04 2020-08-07 北京希望组生物科技有限公司 Method for detecting homologous sequence and tandem repeat sequence in homologous sequence, computer readable medium and application
CN111933215A (en) * 2020-06-08 2020-11-13 西安电子科技大学 Transcription factor binding site searching method, system, storage medium and terminal

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104152568B (en) * 2014-08-19 2016-03-16 东南大学 High-throughput STR sequence core repeat number detection method
CN107122625B (en) * 2016-02-24 2020-10-09 北京爱普益生物科技有限公司 Method for processing high-throughput sequencing information of human short segment tandem repeat sequence
WO2017193198A1 (en) * 2016-05-13 2017-11-16 Deep Genomics Incorporated Neural network architectures for scoring and visualizing biological sequence variations using molecular phenotype, and systems and methods therefor
CN110491441B (en) * 2019-05-06 2022-04-22 西安交通大学 Gene sequencing data simulation system and method for simulating crowd background information
CN110373458B (en) * 2019-06-27 2020-05-19 东莞博奥木华基因科技有限公司 Kit and analysis system for thalassemia detection
CN112669906B (en) * 2020-11-25 2021-09-28 深圳华大基因股份有限公司 Detection method, device, terminal device and computer-readable storage medium for measuring genome instability

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272311A (en) * 2012-02-08 2015-01-07 陶氏益农公司 Data analysis of DNA sequences
US20180163265A1 (en) * 2014-12-19 2018-06-14 The Broad Institute Inc. Unbiased identification of double-strand breaks and genomic rearrangement by genome-wide insert capture sequencing
CN110870020A (en) * 2017-10-16 2020-03-06 因美纳有限公司 Aberrant splicing detection using Convolutional Neural Network (CNNS)
CN111508561A (en) * 2019-07-04 2020-08-07 北京希望组生物科技有限公司 Method for detecting homologous sequence and tandem repeat sequence in homologous sequence, computer readable medium and application
CN110570901A (en) * 2019-09-03 2019-12-13 北京市农林科学院 method and system for SSR typing based on sequencing data
CN111933215A (en) * 2020-06-08 2020-11-13 西安电子科技大学 Transcription factor binding site searching method, system, storage medium and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
叶亚新 等: "分子生物学技术在环境微生物多相分类中的应用", 《苏州科技学院学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990392A (en) * 2021-10-28 2022-01-28 中南大学湘雅医院 Pathogenicity analysis method and device of short tandem repeat sequence and server
CN114530200A (en) * 2022-03-18 2022-05-24 北京阅微基因技术股份有限公司 Mixed sample identification method based on calculation of SNP entropy

Also Published As

Publication number Publication date
CN113724783A (en) 2021-11-30
CN113724783B (en) 2022-04-12
CN113362892B (en) 2021-12-17

Similar Documents

Publication Publication Date Title
US6681186B1 (en) System and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms
CN113362892B (en) Method for detecting and typing repetition number of short tandem repeat sequence
CN108073791B (en) Method based on two generation sequencing datas detection target gene structure variation
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
WO2010039553A1 (en) Method and system for determining the accuracy of dna base identifications
KR20140006846A (en) Data analysis of dna sequences
CN112086131B (en) Screening method for false positive variation sites in resequencing database
CN110993023A (en) Detection method and detection device for complex mutation
CN110970091A (en) Label quality control method and device
CN106404878B (en) Protein secondary Mass Spectrometric Identification method based on multiple groups abundance messages
CN107563148B (en) Ion index-based integral protein identification method and system
CN1360057A (en) Method based on repetitive sequence recognition for splicing sequencing data of whole genome
CN109698011A (en) Indel regional correction method and system based on short sequence alignment
CN114530200B (en) Mixed sample identification method based on calculation of SNP entropy
CN110232951A (en) Judge method, computer-readable medium and the application of sequencing data saturation
CN114078568B (en) Metagenome sequencing data processing system and processing method based on IIB type restriction endonuclease characteristics
CN112420129A (en) Method and system for removing redundancy of optical spectrum auxiliary assembly result
CN110544510A (en) contig integration method based on adjacent algebraic model and quality grade evaluation
CN113764041B (en) Searching method and device for species gene identification tag and electronic equipment
US20050001160A1 (en) Determination of sample purity through mass spectroscopy analysis
CN115331736B (en) Splicing method for extending high-throughput sequencing genes based on text matching
KR100537636B1 (en) Apparatus for predicting transcription factor binding sites based on similar sequences and method thereof
Lysiak et al. SpecGlob: rapid and accurate alignment of mass spectra differing from their peptide models by several unknown modifications
CN111883212B (en) Construction method and construction device of DNA fingerprint spectrum and terminal equipment
KR102380935B1 (en) System and method for searching genomic regions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant