CN110491445B - UID sequencing, UID sequence design, UID duplicate removal quality value correction method and application

Publication number
CN110491445B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810450617.1A
Other languages
Chinese (zh)
Other versions
CN110491445A (en
Inventor
刘继龙
刘足
叶明芝
程少敏
谭美华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huada Medical Laboratory
Tianjin Medical Laboratory Bgi
Bgi Guangzhou Medical Laboratory Co ltd
BGI Shenzhen Co Ltd
Original Assignee
Shenzhen Huada Medical Laboratory
Tianjin Medical Laboratory Bgi
Bgi Guangzhou Medical Laboratory Co ltd
BGI Shenzhen Co Ltd

Abstract

The application discloses a UID sequencing method, a UID sequence design method, a UID deduplication quality value correction method, and applications thereof. The UID sequencing method comprises a UID sequence design step and a UID deduplication quality value correction step. The UID sequence design step comprises: adding a relatively long UID to the sample to be tested in advance; counting the total number of sequences in each repeated sequence group obtained by conventional deduplication; performing UID deduplication and counting the number of UID groups in each repeated sequence group; fitting the total number of sequences against the corresponding UID group number; obtaining the expected UID group number from the fitting function according to the required data volume; and simulating, by R language programming, the addition of UIDs to all expected UID groups for different values of the UID length n, the optimal UID length being the minimum n for which the probability that all groups are attached to different UIDs is 95% or more. The UID length can thus be designed dynamically, so that the UID occupies less sequencing data while still meeting the randomness requirement. The quality value is corrected, and its improvement reflects the accuracy of UID sequencing; the corrected quality value can be applied in a variation detection algorithm to support the detection of lower-frequency variation.

Description

UID sequencing, UID sequence design, UID duplicate removal quality value correction method and application
Technical Field
The application relates to the technical field of UID sequencing, and in particular to a UID sequencing method, a UID sequence design method, a UID deduplication quality value correction method, and applications thereof.
Background
With the rise and maturation of liquid biopsy technology, the detection of low-frequency variation has become a major challenge for high-throughput sequencing. To improve low-frequency variation detection performance, various new experimental methods have been developed, among which UID sequencing is the most influential and the fastest developing. UID stands for unique identifier. In UID sequencing, a random sequence of fixed length is added to each DNA fragment before PCR and serves as the unique identifier of that fragment. After sequencing, all PCR duplicate fragments derived from the same original DNA fragment can be accurately identified from the alignment position of the sequenced fragment against the human reference genome hg19, its UID sequence, its alignment direction, and its fragment length. Combined with a UID-specific deduplication algorithm, sequencing errors and PCR errors can be filtered out during deduplication, leaving a single, most accurate deduplicated fragment.
However, current UID sequencing techniques remain to be optimized and improved.
Disclosure of Invention
The invention aims to provide a novel UID sequencing method, the UID sequence design method and UID deduplication quality value correction method used in it, and applications of these methods.
The application adopts the following technical scheme:
In one aspect, the application discloses a UID sequencing method comprising a UID sequence design step and a UID deduplication quality value correction step;
the UID sequence design step comprises: adding an 8-20 bp UID sequence to the DNA sample to be tested in advance; performing conventional deduplication on the sequencing result and counting the total number of sequences contained in each conventional-deduplication repeated sequence group; performing secondary deduplication on the conventional-deduplication repeated sequence groups with a UID deduplication algorithm and counting the number of UID groups contained in each such group; and fitting the total number of sequences in each conventional-deduplication repeated sequence group against the corresponding UID group number to obtain the fitting function relating the two. The 8-20 bp UID sequence added in advance is deliberately long: its purpose is to ensure the randomness of the UID sequence as far as possible, that is, to maximize the probability that the UID sequences attached to the original DNA templates are all different;
obtaining the expected UID group number from the fitting function, according to the total number of sequences per conventional-deduplication repeated sequence group required for sequencing the DNA sample to be tested;
in the application, the total number of sequences per conventional-deduplication repeated sequence group required for sequencing the DNA sample to be tested is the data volume before deduplication by the UID deduplication algorithm; because in UID sequencing a unique identifier is added to every original DNA template, the expected UID group number is in fact the expected number of original templates in each conventional-deduplication repeated sequence group;
if the length of the UID sequence is n, the number of possible UID sequences is $4^n$. Taking the expected UID group number as the ordinate, R language programming is used to simulate, for different values of n, the random addition of the $4^n$ possible UID sequences to the original templates of the expected UID groups; the minimum n for which the probability that all original templates of the expected UID groups are attached to different UID sequences is 95% or more is the optimal UID sequence length, which is used to design the UID sequence.
It can be understood that the probability threshold is 95% or more; the larger the threshold, the larger the corresponding minimum n and the longer the UID sequence. A longer UID sequence improves the uniqueness of the UID added to each DNA fragment in the sample to be tested, but wastes correspondingly more sequencing data. The probability threshold is therefore usually set to 95%.
It should be noted that, according to the UID sequencing method, through the UID sequence design step, the most reasonable UID sequence length can be designed according to different sequencing objects or sequencing requirements, so that randomness and uniqueness of the UID sequence added by each DNA fragment in the object to be tested are guaranteed, sequencing data waste caused by UID sequence introduction is reduced to the greatest extent, and sequencing detection quality and efficiency are improved.
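The length selection described above can be illustrated with a short sketch. The source states only that the calculation is simulated in R; the Python code below is a substitute illustration, not the applicants' program. It assumes that every template receives a UID drawn uniformly and independently from the $4^n$ possible sequences, in which case the probability that all templates receive different UIDs has a closed form. The function names `prob_all_distinct` and `min_uid_length` are hypothetical.

```python
from math import prod  # available in Python 3.8+


def prob_all_distinct(num_templates: int, uid_length: int) -> float:
    """Probability that all templates get different UIDs.

    Assumption (not stated in the source): each template draws its UID
    uniformly and independently from the 4**uid_length possible sequences.
    """
    num_uids = 4 ** uid_length
    if num_templates > num_uids:
        return 0.0  # more templates than UIDs: a collision is certain
    # Template i (0-based) must avoid the i UIDs already taken.
    return prod(1 - i / num_uids for i in range(num_templates))


def min_uid_length(num_templates: int, target: float = 0.95,
                   max_length: int = 20) -> int:
    """Smallest UID length n whose all-distinct probability reaches target."""
    for n in range(1, max_length + 1):
        if prob_all_distinct(num_templates, n) >= target:
            return n
    raise ValueError("no UID length up to max_length reaches the target")
```

Under these assumptions, `min_uid_length(expected_uid_groups)` returns the shortest length that meets the 95% criterion; raising `target` yields a longer UID, which matches the trade-off between uniqueness and wasted sequencing data described above.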
Preferably, the UID deduplication quality value correction step comprises: in the compression deduplication algorithm of the UID deduplication algorithm, selecting at each position the base whose proportion of occurrence is greater than or equal to a set threshold, and calculating, by R language programming, the probability that sequencing errors occur in a proportion of the bases at that position greater than or equal to the set threshold, denoted P1;
assuming that a PCR error occurs in the j-th round, with the corresponding proportion of PCR errors being fj and the corresponding PCR error rate being p, and jointly considering the cases in which both reads of the first round of PCR carry errors simultaneously or at least one PCR amplification carries an error, calculating by R language programming all fj satisfying the condition and the corresponding p, as shown in Table 1; preferably, the fj satisfying the condition are those for which the proportion of PCR errors is greater than or equal to the set threshold of the application,
TABLE 1
fj        p
0.625     1.788456e-06
0.6875    9.504819e-09
0.75      8.911369e-07
0.8125    4.752641e-09
0.8750    2.384031e-09
0.9375    1.266989e-11
1         1.115584e-07
calculating the proportion of PCR errors in the actual DNA sample to be tested, that is, the actual fj', finding the fj value in Table 1 nearest to the actual fj' (nearest-value principle), and reading the corresponding PCR error rate from Table 1, denoted P2;
the total probability that jointly accounts for sequencing errors and PCR errors after UID deduplication is P,
P=P1×(1-P2)+(1-P1)×P2
and the total probability P is converted into a quality value $Q = -10 \log_{10}(P)$, which is the quality value after UID deduplication.
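The lookup and conversion above can be sketched in a few lines. The source performs these calculations in R; the Python code below is a substitute illustration under stated assumptions. The values in `TABLE_1` are copied from Table 1; the function names `lookup_p2` and `corrected_quality` are hypothetical, and P1 is taken as an input because the source does not give the formula used to compute it.

```python
import math

# Table 1 of the source: PCR error proportion fj -> PCR error rate p.
TABLE_1 = {
    0.625: 1.788456e-06,
    0.6875: 9.504819e-09,
    0.75: 8.911369e-07,
    0.8125: 4.752641e-09,
    0.8750: 2.384031e-09,
    0.9375: 1.266989e-11,
    1.0: 1.115584e-07,
}


def lookup_p2(actual_fj: float) -> float:
    """Return p for the tabulated fj nearest to the observed fj'.

    Ties are broken in favor of the smaller fj (dict order); the source
    does not say how ties should be handled, so this is an assumption.
    """
    nearest_fj = min(TABLE_1, key=lambda fj: abs(fj - actual_fj))
    return TABLE_1[nearest_fj]


def corrected_quality(p1: float, actual_fj: float) -> float:
    """Quality value after UID deduplication, Q = -10 * log10(P)."""
    p2 = lookup_p2(actual_fj)
    # Exactly one of the two error events occurs.
    p = p1 * (1 - p2) + (1 - p1) * p2
    return -10 * math.log10(p)
```

A caller would pass the sequencing-error probability P1 for the site and the observed proportion fj'; the returned Q can then replace the original base quality in downstream variation detection.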
It should be noted that the total probability P calculated in this way is generally very small, so the converted quality value Q is higher than the original sequencing quality value, and this improvement reflects the accuracy of UID sequencing. In the UID sequencing method of the application, the UID deduplication quality value correction step provides a quantifiable measure of the effect of UID sequencing and deduplication; UID sequencing can be further optimized by raising the quality value Q, and applying the corrected quality value in a variation detection algorithm further supports the detection of lower-frequency variation. The UID sequencing method can therefore improve both the accuracy of UID sequencing and the low-frequency variation detection capability through the UID deduplication quality value correction step.
Theoretically, for the same PCR amplification enzyme, excluding human operation factors, p corresponding to fj is the same, but p corresponding to fj is different for different PCR amplification enzymes; fj and the corresponding p shown in Table 1 of the present application are typical values of PCR amplification enzymes, and thus are generally applicable to general PCR amplification assays; of course, in order to obtain a more accurate PCR error rate, fj and the corresponding p satisfying all conditions may be recalculated according to the inventive concept of the present application, which is not specifically limited herein.
It should further be noted that UID deduplication is conventionally performed by compression, which removes PCR errors and sequencing errors to some extent; the deduplicated bases are therefore more reliable, which means a lower error rate and hence a higher quality value. Traditional probability-based variation detection algorithms mostly compute the probability that a site is positive and the probability that it is negative, test the significance of the difference between the two, and call the site positive or negative by comparison with a cutoff. If the UID deduplication quality value correction step of the application is not performed, that is, if an uncorrected quality value is used, the probability calculated from the uncorrected quality value does not match the compression-corrected base distribution, which impairs variation detection. The UID sequencing method including the UID deduplication quality value correction step can therefore improve the accuracy of UID sequencing and the low-frequency variation detection capability.
Preferably, the threshold is set to 60%.
It should be noted that the UID deduplication algorithm generally uses a compression deduplication algorithm in which, at each position, the base with a proportion of occurrence >= 60% is retained; the threshold is therefore preferably set to 60% in one implementation of the application. Correspondingly, P1 is the probability that sequencing errors occur in 60% or more of the bases at the site, and the fj satisfying the condition are those for which the proportion of PCR errors is greater than or equal to 60%.
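The per-position 60% rule can be illustrated with a minimal consensus sketch in Python. This is not the compression deduplication algorithm referred to in the source, whose details are not given; it only shows the thresholding step. It assumes that all reads in a UID group have equal length and are already aligned column by column, and the `N` emitted when no base reaches the threshold is an assumption, since the source does not say what happens in that case. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(reads, threshold=0.60, no_call="N"):
    """Collapse the reads of one UID group into a single sequence.

    At each position the most frequent base is kept if its proportion is
    at least `threshold`; otherwise `no_call` is written (an assumption).
    """
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads in a group must have the same length")
    result = []
    for column in zip(*reads):  # iterate over positions
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(column) >= threshold else no_call)
    return "".join(result)
```

For a group of five reads in which one read differs at a single position, the majority base is present in 80% of the reads at that position and is therefore retained.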
Preferably, in the UID sequencing method of the application, conventional deduplication specifically comprises marking repeated sequences with Picard software according to alignment position, alignment direction, and fragment length.
It should be noted that Picard software is used for conventional deduplication because it does not consider UID sequence information and marks repeated sequences only according to alignment position, alignment direction, fragment length, and similar information. A group of repeated sequences marked by Picard may therefore contain many UID groups; that is, each conventional-deduplication repeated sequence group may contain multiple UIDs, and UID deduplication can split one such group into many smaller groups. This is also an advantage of UID sequencing: deduplication retains more of the true original templates.
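The two-level grouping can be made concrete with a small Python sketch. It does not call Picard; it only mimics the grouping key (alignment position, direction, fragment length) on an in-memory list and then counts distinct UIDs within each group. The `Read` record and the function `group_sizes` are hypothetical, and exact UID matching is a simplification, since a real UID deduplication algorithm may tolerate mismatches.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record used only for this illustration."""
    chrom: str
    pos: int
    strand: str
    fragment_length: int
    uid: str


def group_sizes(reads):
    """Map each conventional duplicate key to (total reads, UID groups).

    The key ignores the UID, as conventional deduplication does; the UID
    group count is the number of distinct UIDs seen under that key.
    """
    uids_by_key = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.fragment_length)
        uids_by_key[key].append(read.uid)
    return {key: (len(uids), len(set(uids)))
            for key, uids in uids_by_key.items()}
```

Each value pairs the x (total sequences) and y (UID groups) of one conventional-deduplication repeated sequence group, which are the two quantities that the method fits against each other.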
Preferably, in the UID sequencing method of the application, the fitting function is y = 0.0053x + 1.3158,
where y is the number of UID groups contained in a conventional-deduplication repeated sequence group and x is the total number of sequences contained in that group.
It should be noted that the fitting function may differ slightly between sequencing projects or between DNA samples to be tested, because indicators of the library construction stage such as adapter ligation efficiency, PCR efficiency, and capture efficiency may differ and affect the fitting function finally obtained; in general, however, the fitting functions of different projects do not differ greatly. The function y = 0.0053x + 1.3158 is the fitting function constructed in one implementation of the application. It can be understood that different sequencing projects or different DNA samples to be tested can use the fitting function provided by the application for UID design; to obtain a more accurate design, a new fitting function can be obtained by re-fitting according to the inventive concept of the application, which is not specifically limited here.
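Applying the fitting function is a one-line calculation; the Python sketch below shows it with the coefficients reported for one implementation as defaults. The function name `expected_uid_groups` is hypothetical, and the coefficients should be replaced when a project-specific fit is available. How the result is rounded to a whole number of templates is not stated in the source, so it is left to the caller.

```python
def expected_uid_groups(total_sequences: float,
                        slope: float = 0.0053,
                        intercept: float = 1.3158) -> float:
    """Expected UID group number y for a duplicate group of x sequences.

    Defaults are the coefficients of y = 0.0053x + 1.3158 from the source;
    pass other values after re-fitting on project-specific data.
    """
    return slope * total_sequences + intercept
```

The returned value is the expected number of original templates per conventional-deduplication repeated sequence group, which is the input needed when choosing the UID length.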
Another aspect of the application discloses the use of the UID sequencing method of the application in UID sequence design or UID deduplication quality value correction.
It can be understood that in the UID sequencing method of the present application, the UID sequence design step and the UID deduplication quality value correction step can be performed separately, that is, separately for UID sequence design or UID deduplication quality value correction.
Accordingly, another aspect of the application discloses a UID sequence design method, comprising: adding an 8-20 bp UID sequence to the DNA sample to be tested in advance; performing conventional deduplication on the sequencing result and counting the total number of sequences contained in each conventional-deduplication repeated sequence group; performing secondary deduplication on the conventional-deduplication repeated sequence groups with a UID deduplication algorithm and counting the number of UID groups contained in each such group; fitting the total number of sequences in each conventional-deduplication repeated sequence group against the corresponding UID group number to obtain the fitting function relating the two; obtaining the expected UID group number from the fitting function according to the total number of sequences per conventional-deduplication repeated sequence group required for sequencing the DNA sample to be tested; if the length of the UID sequence is n, the number of possible UID sequences is $4^n$; taking the expected UID group number as the ordinate, using R language programming to simulate, for different values of n, the random addition of the $4^n$ possible UID sequences to the original templates of the expected UID groups; and designing the UID sequence with the minimum n for which the probability that all original templates of the expected UID groups are attached to different UID sequences is 95% or more, this n being the optimal UID sequence length; preferably, conventional deduplication specifically comprises marking repeated sequences with Picard software according to alignment position, alignment direction, and fragment length.
It will be appreciated that the UID sequence design method of the present application is actually a UID sequence design step taken from the UID sequencing method of the present application; after obtaining the optimal UID sequence length, a specific UID sequence may be obtained by conventional random sequence generation software, which is not specifically limited herein.
A further aspect of the application discloses a UID deduplication quality value correction method, comprising: in the compression deduplication algorithm of the UID deduplication algorithm, selecting at each position the base whose proportion of occurrence is greater than or equal to a set threshold, and calculating, by R language programming, the probability that sequencing errors occur in a proportion of the bases at that position greater than or equal to the set threshold, denoted P1; assuming that a PCR error occurs in the j-th round, with the corresponding proportion of PCR errors being fj and the corresponding PCR error rate being p, and jointly considering the cases in which both reads of the first round of PCR carry errors simultaneously or at least one PCR amplification carries an error, calculating by R language programming all fj satisfying the condition and the corresponding p, as shown in Table 1; preferably, the fj satisfying the condition are those for which the proportion of PCR errors is greater than or equal to the set threshold of the application,
TABLE 1
fj        p
0.625     1.788456e-06
0.6875    9.504819e-09
0.75      8.911369e-07
0.8125    4.752641e-09
0.8750    2.384031e-09
0.9375    1.266989e-11
1         1.115584e-07
calculating the proportion of PCR errors in the actual DNA sample to be tested, that is, the actual fj', finding the fj value in Table 1 nearest to the actual fj' (nearest-value principle), and reading the corresponding PCR error rate from Table 1, denoted P2; the total probability that jointly accounts for sequencing errors and PCR errors after UID deduplication is P,
P=P1×(1-P2)+(1-P1)×P2
and the total probability P is converted into a quality value $Q = -10 \log_{10}(P)$, which is the quality value after UID deduplication;
preferably, the threshold is set to 60%.
Similarly, the UID duplicate removal quality value correction method is also obtained from the UID duplicate removal quality value correction step in the UID sequencing method.
It can be understood that the UID sequencing method, the UID sequence design method and the UID duplicate removal quality value correction method can all adopt special equipment to carry out UID sequencing, UID sequence design or UID duplicate removal quality value correction according to the thought of each method of the application.
Thus, in yet another aspect of the present application, an apparatus for UID sequence design is disclosed, the apparatus comprising,
a fitting function acquisition module, configured to perform conventional deduplication on the sequencing result of the DNA sample to be tested carrying an 8-20 bp UID sequence and count the total number of sequences contained in each conventional-deduplication repeated sequence group; perform secondary deduplication on the conventional-deduplication repeated sequence groups with a UID deduplication algorithm and count the number of UID groups contained in each such group; and fit the total number of sequences in each conventional-deduplication repeated sequence group against the corresponding UID group number to obtain the fitting function relating the two;
an expected UID group number acquisition module, configured to obtain the expected UID group number from the fitting function, using the total number of sequences per conventional-deduplication repeated sequence group required for sequencing the DNA sample to be tested;
an optimal UID sequence length acquisition module, configured to simulate, for different values of the UID sequence length n, the random addition of the $4^n$ possible UID sequences to the original templates of the expected UID groups, the minimum n for which the probability that all original templates of the expected UID groups are attached to different UID sequences is 95% or more being the optimal UID sequence length.
Still another aspect of the present application also discloses an apparatus for UID de-duplication quality value correction, the apparatus comprising,
a sequencing error probability extraction module, configured to calculate, by R language programming, the probability that sequencing errors occur in a proportion of the bases at the site to be tested greater than or equal to the set threshold, denoted P1;
a PCR error probability extraction module, configured to calculate, by R language programming, all fj satisfying the condition and the corresponding p, as shown in Table 1; calculate the proportion of PCR errors in the actual DNA sample to be tested, that is, the actual fj'; find the fj value in Table 1 nearest to the actual fj' (nearest-value principle); and read the corresponding PCR error rate from Table 1, denoted P2; where fj is the proportion of PCR errors when the PCR error occurs in the j-th round, and p is the corresponding PCR error rate; preferably, the fj satisfying the condition are those for which the proportion of PCR errors is greater than or equal to the set threshold of the application;
a UID deduplication corrected quality value extraction module, configured to obtain, from the formula P=P1×(1-P2)+(1-P1)×P2, the total probability P that jointly accounts for sequencing errors and PCR errors after UID deduplication, and to obtain the quality value Q after UID deduplication from the formula $Q = -10 \log_{10}(P)$.
The beneficial effects of this application lie in:
according to the UID sequencing method, through the UID sequence design step, the length of the UID sequence can be dynamically designed according to the requirement of the DNA sample data volume to be tested, and the sequencing data volume can be occupied less on the premise of meeting the randomness of the UID sequence. Furthermore, through the UID duplicate removal quality value correction step, the accuracy of UID sequencing can be embodied according to the improvement of the quality value; and, the improvement of the quality value is applied to the variation detection algorithm, so that the lower-frequency variation detection can be further supported.
Drawings
Fig. 1 is a block diagram of the UID sequence design apparatus in an embodiment of the application;
Fig. 2 is a block diagram of the UID deduplication quality value correction apparatus in an embodiment of the application;
Fig. 3 is a fitted curve of the total number of sequences in each conventional-deduplication repeated sequence group against the corresponding UID group number in an embodiment of the application;
Fig. 4 shows the simulated probability, for 54 original templates, that the UID sequences added to the templates are all different, in an embodiment of the application;
Fig. 5 shows the error rate analysis result before quality value correction in an embodiment of the application;
Fig. 6 shows the error rate analysis result after quality value correction in an embodiment of the application;
Fig. 7 compares the numbers of UID groups obtained by secondary deduplication of the same DNA sample to be tested with an 8 bp UID and with a 6 bp UID in an embodiment of the application.
Detailed Description
Existing UID sequencing methods focus on the UID deduplication algorithm, on which there is considerable research. However, extensive practice and research have shown that the UID sequence introduced by UID sequencing wastes a large amount of sequencing data, and the waste is most pronounced for long UID sequences with stronger randomness. Moreover, current UID deduplication algorithms do not involve the calculation of quality values and therefore cannot reflect, quantitatively or intuitively, the effect of UID deduplication on low-frequency variation detection, so the lower limit of low-frequency variation detection cannot be further optimized.
In long-term practice and research on UID sequencing, three key points of the UID sequencing technology were identified: first, the randomness of the UID sequence, where in theory a longer UID sequence has stronger randomness but also wastes more sequencing data; second, the UID deduplication algorithm matched to the UID sequencing technology; third, how to carry the deduplication accuracy brought by UID sequencing into the variation detection algorithm.
Of these three key points, the second is relatively mature: fairly complete UID deduplication algorithms already exist. There is not yet a good solution for the first and third points. For the first point, the UID sequence length is usually fixed, for example uniformly at 14 bp; this ensures the randomness of the UID sequence as far as possible but cannot minimize the loss of data. For the third point, there is currently no research or report on carrying the deduplication accuracy of UID sequencing into a variation detection algorithm. UID sequencing necessarily improves low-frequency variation detection performance, but for a specific variation detection there is no quantitative, intuitive criterion for how much the UID deduplication algorithm improves that performance. Because the quality value is not corrected, existing UID deduplication algorithms cannot carry the accuracy of UID deduplication into the variation detection algorithm, and the detection limit for low-frequency variation cannot be lowered or further optimized.
Based on the above research and knowledge, the present application develops and proposes a UID sequencing method, including a UID sequence design step and a UID duplicate-removal quality value correction step.
UID sequence design:
The step comprises adding a relatively long UID sequence, for example about 8-20 bp, to the DNA sample to be tested in advance; performing conventional deduplication on the sequencing result and counting the total number of sequences contained in each conventional-deduplication repeated sequence group; performing secondary deduplication on the conventional-deduplication repeated sequence groups with a UID deduplication algorithm and counting the number of UID groups contained in each such group; and fitting the total number of sequences in each conventional-deduplication repeated sequence group against the corresponding UID group number to obtain the fitting function relating the two. In one implementation of the application, conventional deduplication is performed with Picard software, and the UID deduplication algorithm is the compression deduplication algorithm in current use.
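The fitting step can be sketched as an ordinary least-squares straight-line fit. The source does not name the fitting method; a straight line is assumed here only because the fitting function reported elsewhere in the application is linear. The Python function `fit_line` is hypothetical and uses no external libraries.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    xs: total number of sequences per conventional duplicate group.
    ys: number of UID groups found in the same group.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The returned pair plays the role of the coefficients of the fitting function; each input pair comes from one conventional-deduplication repeated sequence group.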
And obtaining the required expected UID group number according to the fitting function according to the total number of sequences in the repeated sequence group after conventional de-duplication required by the sequencing of the DNA sample to be tested.
If the length of the UID sequence is n, the number of possible UID sequences is $4^n$. Taking the expected UID group number as the ordinate, R language programming is used to simulate, for different values of n, the random addition of the $4^n$ possible UID sequences to the original templates of the expected UID groups; the minimum n for which the probability that all original templates of the expected UID groups are attached to different UID sequences is 95% or more is the optimal UID sequence length, which is used to design the UID sequence.
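A Monte Carlo version of this simulation might look as follows. The source uses R; this Python sketch is a substitute illustration, and it assumes that UIDs are drawn uniformly and independently, which the source does not state explicitly. The function name `simulate_prob_all_distinct`, the number of trials, and the fixed seed are all illustrative choices.

```python
import random


def simulate_prob_all_distinct(num_templates: int, uid_length: int,
                               trials: int = 2000, seed: int = 0) -> float:
    """Estimate the probability that all templates get different UIDs.

    Each UID is represented by an integer in [0, 4**uid_length), drawn
    uniformly and independently (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed for reproducible estimates
    num_uids = 4 ** uid_length
    successes = 0
    for _ in range(trials):
        drawn = {rng.randrange(num_uids) for _ in range(num_templates)}
        if len(drawn) == num_templates:  # no two templates share a UID
            successes += 1
    return successes / trials
```

Running the estimate for increasing n and stopping at the first n whose estimate reaches 0.95 gives the optimal length in the sense described above; more trials narrow the sampling error of each estimate.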
UID deduplication quality value correction:
In the compression deduplication algorithm of the UID deduplication algorithm, each position selects the base whose proportion of occurrence is greater than or equal to a set threshold, and R language programming is used to calculate the probability that sequencing errors occur in a proportion of the bases at that position greater than or equal to the set threshold, denoted P1. In one implementation of the application, each position of the compression deduplication algorithm selects the base whose proportion of occurrence is greater than or equal to 60%, that is, the threshold is set to 60%.
Assuming that a PCR error occurs in the j-th round, with the corresponding proportion of PCR errors being fj and the corresponding PCR error rate being p, and jointly considering the cases in which both reads of the first round carry errors simultaneously or at least one PCR amplification in the multiple rounds carries an error, R language programming is used to calculate all fj satisfying the condition and the corresponding p, as shown in Table 1, where the fj satisfying the condition are those for which the proportion of PCR errors is greater than or equal to 60%. The proportion of PCR errors in the actual DNA sample to be tested, that is, the actual fj', is then calculated; the fj value in Table 1 nearest to the actual fj' is found (nearest-value principle), and the corresponding PCR error rate is read from Table 1 and denoted P2.
After UID deduplication, the total probability P, which jointly accounts for sequencing errors and PCR errors, is
P=P1×(1-P2)+(1-P1)×P2
The total probability P is converted into a quality value Q = -10 × lg(P), which is the quality value after UID deduplication.
The UID sequence design step and the UID duplicate removal quality value correction step can be independently executed to respectively realize UID sequence design and UID duplicate removal quality value correction. Thus, in some implementations of the present application, a UID sequence design method and a UID deduplication quality value correction method are provided, respectively.
It will be appreciated by those skilled in the art that all or part of the functions of each step in the above methods may be implemented by means of hardware, or may be implemented by means of a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc., and the program is executed by a computer to realize the above-mentioned functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied into the memory of a local device, or used to update the system version of the local device; all or part of the functions in the above embodiments are then realized when the program in the memory is executed by a processor.
Thus, as shown in fig. 1, the apparatus for UID sequence design in one implementation of the present application includes a fitting function obtaining module 101, an expected UID group number obtaining module 102, and an optimal UID sequence length obtaining module 103.
The fitting function obtaining module 101 is configured to perform conventional deduplication on the sequencing result of a DNA sample to be tested to which an 8-20 bp UID sequence has been added, and count the total number of sequences contained in each conventional deduplication repeated sequence group; perform secondary deduplication on the conventional deduplication repeated sequence groups with a UID deduplication algorithm, and count the number of UID groups contained in each conventional deduplication repeated sequence group; and fit the total number of sequences in each conventional deduplication repeated sequence group against the corresponding UID group number to obtain a fitting function relating the two quantities.
The expected UID group number obtaining module 102 is configured to obtain the required expected UID group number from the fitting function, using the total number of sequences in a repeated sequence group after conventional deduplication that the sequencing of the DNA sample to be tested requires.
The optimal UID sequence length obtaining module 103 is configured to simulate different UID sequence lengths n, randomly adding the 4^n possible UID sequences to the original templates of the expected UID group number; the smallest n for which the probability that these original templates are all connected to different UID sequences is 95% or more is the optimal UID sequence length.
As shown in fig. 2, the apparatus for UID deduplication quality value correction in one implementation of the present application includes a sequencing error probability extraction module 201, a PCR error probability extraction module 202, and a UID deduplication correction quality value extraction module 203.
The sequencing error probability extraction module 201 is configured to calculate, by R programming, the probability that a proportion of bases at a site greater than or equal to the set threshold carries a sequencing error, and to denote it P1.
The PCR error probability extraction module 202 is configured to calculate, by R programming, all fj satisfying the condition and the corresponding p, as shown in table 1; to calculate the proportion of PCR errors in the actual DNA sample to be tested, i.e., the actual fj'; and to look up in table 1, by the nearest-value principle, the fj value nearest to the actual fj' and read the corresponding PCR error rate from table 1, denoted P2. Here fj is the proportion of PCR errors when a PCR error occurs in the jth round, and p is the corresponding PCR error rate; an fj satisfying the condition is a PCR error proportion greater than or equal to the set threshold. The set threshold is 60% in one implementation of the present application.
The UID de-duplication correction quality value extraction module 203 is configured to obtain a total probability P of comprehensively considering sequencing errors and PCR errors after UID de-duplication according to a formula p=p1× (1-P2) + (1-P1) ×p2, and obtain a UID de-duplication quality value Q according to a formula q= -10×lg (P).
Based on the above method and apparatus, in one implementation of the present application, another UID sequence design apparatus is also provided, which includes a memory and a processor; the memory is used for storing a program, and the processor is used for executing the program stored in the memory to realize the UID sequence design method.
Similarly, in one implementation manner of the application, another UID deduplication quality value correction apparatus is also provided, and the apparatus includes a memory and a processor; wherein, the memory is used for storing programs; and the processor is used for executing a program stored in the memory to realize the UID duplicate removal quality value correction method.
All or part of the functions of each step in each method of the present application may be implemented by means of a computer program; thus, in one implementation of the present application, there is also provided a computer-readable storage medium including a program executable by a processor to implement the UID sequencing method, the UID sequence design method, or the UID deduplication quality value correction method of the present application.
The present application is described in further detail below by way of specific examples. The following examples are merely illustrative of the present application and should not be construed as limiting the present application.
Example 1
This example mainly studies and tests the UID sequence design step and the UID deduplication quality value correction step of UID sequencing, as follows:
UID sequence design:
The UID sequence is a random sequence of a given length composed of the bases A, T, C and G. For a given amount of sequencing data, the length of the UID sequence determines whether its randomness is sufficient; and because the UID sequence itself consumes sequencing data, its length also determines how much data is lost. The design of the UID sequence length therefore requires a trade-off between sequence randomness and loss of data volume.
1. Relationship between pre-deduplication site depth and number of UID groups contained
1) Sample selection
A batch of samples carrying a longer UID sequence, for example a UID sequence of 8 bp or more, is selected; data with a longer UID sequence are chosen to ensure the randomness of the UID sequence. This example uses conventional clinical sample data with a UID sequence length of 20 bp, provided by Huada Gene.
2) Conventional deduplication by picard software
Conventional deduplication is performed with picard software. picard does not consider UID sequence information; it marks repeated sequences only according to information such as alignment position, alignment direction and fragment length. The total number of sequences contained in each conventional deduplication repeated sequence group is then counted. In this example, 1316515 repeated sequence groups were obtained, and the total number of sequences in each group was counted.
3) Second deduplication by the UID deduplication algorithm
On the basis of 2), a second deduplication is performed with the UID deduplication algorithm, and the number of groups produced by UID deduplication is counted, i.e., the number of UID groups contained in each conventional deduplication repeated sequence group from 2). One group of repeated sequences marked by picard may contain several UIDs and can be split into several smaller groups during UID deduplication; this is also an advantage of UID sequencing, since more of the true original templates are retained. In this example, the number of UID groups in each of the 1316515 repeated sequence groups was counted.
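The two counting steps in 2) and 3) can be sketched as follows. This is a minimal Python illustration rather than the picard or UID deduplication code used in the example: the read tuple layout is hypothetical, and exact UID matching stands in for the compression deduplication algorithm.

```python
from collections import defaultdict


def count_groups(reads):
    """Return {conventional_key: (total_reads, n_uid_groups)}.

    Each read is a hypothetical tuple (chrom, pos, strand, frag_len, uid).
    """
    uids_by_key = defaultdict(list)
    for chrom, pos, strand, frag_len, uid in reads:
        # Conventional duplicate key: alignment position, direction, fragment length.
        uids_by_key[(chrom, pos, strand, frag_len)].append(uid)
    # Assumption: reads with identical UID strings form one UID group.
    return {key: (len(uids), len(set(uids))) for key, uids in uids_by_key.items()}


reads = [
    ("chr7", 100, "+", 160, "ACGTACGT"),
    ("chr7", 100, "+", 160, "ACGTACGT"),
    ("chr7", 100, "+", 160, "TTGACCAA"),
    ("chr7", 250, "-", 158, "GGATCCTA"),
]
stats = count_groups(reads)
```

Each value pairs the x (total sequences) and y (UID groups) that enter the fit in step 4).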
4) The total number of sequences in each conventional deduplication repeated sequence group from 2) is fitted against the corresponding UID group number from 3) to find a fitting function. The fitting plot for this example is shown in fig. 3, and the fitting function is
y=0.0053x+1.3158
where y is the number of UID groups contained in a conventional deduplication repeated sequence group, and x is the total number of sequences contained in that group.
The results in fig. 3 show R^2 = 0.8156, i.e., the UID group number is well described as a function of the total number of sequences.
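A straight-line fit of this kind can be sketched with ordinary least squares. The patent does not state which fitting procedure was used, so least squares is an assumption, and the points below are generated from the reported line itself rather than taken from the 1316515 groups.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Illustrative points lying exactly on the reported line (not real data).
xs = [100, 500, 1000, 5000, 10000]
ys = [0.0053 * x + 1.3158 for x in xs]
slope, intercept, r2 = fit_line(xs, ys)
```

On real per-group counts the points scatter around the line, which is what the reported R^2 of 0.8156 reflects.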
2. Expected UID group number extraction
The required expected UID group number is obtained from the fitting function, using the total number of sequences in a repeated sequence group after conventional deduplication that the sequencing of the DNA sample to be tested requires. That total number of sequences is the data amount before deduplication by the UID deduplication algorithm; the expected UID group number is the number of original templates in each conventional deduplication repeated sequence group.
For example, if 10000× is assumed as the required data amount before deduplication, the abscissa is at most 10000 and its maximum, 10000, is used; the fitting function from 4) then gives an ordinate of about 54, i.e., the expected required UID group number is 54.
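The arithmetic of this example can be checked in a few lines. The function name is hypothetical, and rounding to the nearest integer is an assumption; the text only states that the result is 54.

```python
def expected_uid_groups(total_sequences, slope=0.0053, intercept=1.3158):
    """Evaluate the fitted line y = 0.0053x + 1.3158 and round to a whole group count."""
    return round(slope * total_sequences + intercept)


n_groups = expected_uid_groups(10000)  # 0.0053 * 10000 + 1.3158 = 54.3158
```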
3. Optimal UID sequence length calculation
The purpose of calculating the optimal UID sequence length is to find the balance point between UID sequence randomness and data loss, i.e., to lose as little data as possible while still satisfying the randomness requirement.
If the length of the UID sequence is n, the number of possible UID sequences is 4^n. Taking the expected UID group number as the ordinate, a simulation programmed in R is run for different values of n: the 4^n possible UID sequences are randomly added to the original templates of the expected UID group number, and the smallest n for which the probability that these original templates are connected to different UID sequences is 95% or more is taken as the optimal UID sequence length, according to which the UID sequence is designed.
For example, with an ordinate of 54, R programming is used to compute, for different lengths n, the probability that the UID sequences added to the templates are different when the 4^n possible UID sequences are randomly added to 54 original templates; the result is shown in fig. 4.
The results of fig. 4 show that, to ensure a probability of 95% or more that the 54 original templates are connected to different UIDs while keeping the UID sequence as short as possible, the optimal UID sequence length is 6. That is, for the 10000× data amount in this example, the optimal UID sequence length is 6; at this length the original templates receive different UID sequences with the required probability, and the waste of sequencing data is minimized.
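The length search can be sketched in closed form instead of by simulation. The R program is not reproduced in the text, so the criterion below is an assumption: the probability that any one template carries a UID not shared with any of the other templates, with UIDs drawn uniformly and independently. Under that reading, 54 templates give a minimum length of 6, in line with fig. 4 as described; a joint criterion requiring all templates to be pairwise distinct at once would be stricter.

```python
def p_template_unique(n_templates, uid_len):
    """Probability that one template's UID differs from every other template's UID,
    with each UID drawn uniformly and independently from the 4**uid_len sequences."""
    return (1.0 - 1.0 / 4 ** uid_len) ** (n_templates - 1)


def optimal_uid_length(n_templates, target=0.95, max_len=20):
    """Smallest UID length whose per-template uniqueness probability reaches target."""
    for uid_len in range(1, max_len + 1):
        if p_template_unique(n_templates, uid_len) >= target:
            return uid_len
    return None  # no length up to max_len satisfies the target


best_len = optimal_uid_length(54)
```

With these assumptions a length of 5 falls just short of 95%, which is why the search moves on to 6.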
After the optimal UID sequence length is obtained, the specific UID sequence may be obtained by conventional random sequence generation software, which is not described here.
UID duplicate removal quality value correction:
In general, the advantage of UID sequencing is that it allows more accurate deduplication, i.e., the sequences obtained after deduplication deserve a higher quality value. Reflecting that higher quality requires a quality value correction. In this example, the original sequencing quality values of a sequence are corrected by combining a sequencing error model with a PCR error evolution model, so that the accuracy gained from UID deduplication is expressed in the quality values and detection of lower-frequency mutations is supported. The method is as follows:
The UID deduplication method is a compression deduplication method: repeated sequences carrying the same UID are compressed, and at each site the base with a proportion of occurrence >= 60% is kept, so that sequencing errors and PCR errors are removed and the most accurate sequence is retained. The accuracy of this operation, however, depends on the sequencing errors and PCR errors, so both must be considered when correcting the quality values. The correction is carried out site by site.
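The per-site compression rule can be sketched as below. This is an illustrative Python version, not the patent's implementation: reads are assumed to be equal-length strings from one UID group, and writing "N" where no base reaches the threshold is an assumption, since the text does not say what is done in that case.

```python
from collections import Counter


def compress_reads(reads, threshold_pct=60):
    """Collapse one UID group into a consensus, site by site."""
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        # Integer comparison avoids floating-point trouble at exactly 60%.
        keep = count * 100 >= threshold_pct * len(column)
        consensus.append(base if keep else "N")  # "N" is a placeholder assumption
    return "".join(consensus)


group = ["ACGT", "ACGT", "ACTT", "ACGA", "TCGT"]
consensus = compress_reads(group)
```

In the toy group every site has one base at 80% or more, so isolated errors in single reads are voted out.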
1. Sequencing error model:
The UID deduplication algorithm is a compression deduplication algorithm: each site keeps the base whose proportion is >= 60%. Taking sequencing errors into account, the probability that 60% or more of the bases at a site carry a sequencing error is calculated by R programming and denoted P1.
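One way to compute such a probability is a binomial tail, sketched below. The patent's R code is not reproduced in the text, so this model is an assumption: reads are treated as independent with a common per-base error rate, and every error is counted toward the threshold even though real errors may yield different wrong bases, which makes the result an upper bound.

```python
from math import comb


def p_sequencing_error_majority(n_reads, per_base_error, threshold_pct=60):
    """P(at least threshold_pct percent of n_reads carry an error at one site)."""
    k_min = (threshold_pct * n_reads + 99) // 100  # integer ceiling of the threshold count
    return sum(
        comb(n_reads, k) * per_base_error ** k * (1 - per_base_error) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )


p1 = p_sequencing_error_majority(5, 0.001)  # five reads at Q30
```

Even for a small group the tail is far below the single-read error rate, which matches the remark that sequencing errors alone rarely reach the 60% threshold.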
2. PCR error model:
In general, the probability that 60% or more of the bases at the same site carry sequencing errors is extremely low; errors affecting 60% or more of the bases are mostly caused by PCR errors, so the PCR error model carries most of the weight in the quality value correction model.
2.1 Let the total number of PCR cycles be N;
2.2 Let a PCR error occur in the jth round, with the corresponding proportion of PCR errors being fj;
2.3 Let the error rate corresponding to a PCR error occurring in a given round be p, and examine the correspondence between fj and p;
Let the probability of a single PCR error at each site in each round be p0;
When a PCR error occurs on a single read in a single round, the correspondence between N, j, fj, p0 and p is as follows:
j 1 2 3 4 N
fj 0.5 0.25 0.125 0.0625 1/(2^N)
p p0 2^2*p0 2^3*p0 2^4*p0 2^N*p0
In all of these cases fj is less than 0.6, so they are not considered in this example: compression deduplication keeps bases at a proportion of 60% or more, and its accuracy is unaffected even if these cases occur.
The PCR error cases that must be considered are therefore not single errors on a single read in a single round, but combinations of errors, for example:
j 1*2 1&2 1&3 1&2&3
fj 1 0.75 0.625 0.875
p p0^2 p0^2 p0^2 p0^3
where 1*2 indicates that two reads in the first round are in error at the same time, 1&2 indicates that one error occurred in the first round and one error occurred in the second round.
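The combined proportions can be enumerated mechanically. The sketch below assumes that an error arising in round j affects a fraction 1/2^j of the final copies, that combined errors add their fractions, and that only rounds 1 to 4 are enumerated; the round limit is an assumption chosen because it yields exactly the seven fj values of Table 1. The corresponding p values are not derived here.

```python
from itertools import combinations


def candidate_fj(max_round=4, threshold=0.6):
    """List the combined PCR error proportions fj that reach the threshold."""
    values = {1.0}  # the "1*2" case: both first-round reads are wrong
    rounds = range(1, max_round + 1)
    for size in range(1, max_round + 1):
        for subset in combinations(rounds, size):
            fj = sum(0.5 ** j for j in subset)  # e.g. 1&3 gives 0.5 + 0.125
            if fj >= threshold:
                values.add(fj)
    return sorted(values)


fj_values = candidate_fj()
```

Every qualifying combination includes round 1, since rounds 2 to 4 together contribute only 0.4375.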
All fj and corresponding p satisfying the condition, i.e., greater than 0.6, were calculated using R language programming as shown in table 1.
Table 1. fj values satisfying the condition and the corresponding p values
fj p
0.625 1.788456e-06
0.6875 9.504819e-09
0.75 8.911369e-07
0.8125 4.752641e-09
0.8750 2.384031e-09
0.9375 1.266989e-11
1 1.115584e-07
Then the proportion of PCR errors in the actual DNA sample to be tested, i.e., the actual fj', is calculated; the fj value nearest to the actual fj' is looked up in table 1 by the nearest-value principle, and the corresponding PCR error rate is read from table 1 and denoted P2.
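The nearest-value lookup can be sketched as follows, with the Table 1 values copied into a dictionary. The function name is hypothetical, and when fj' lies exactly halfway between two entries this sketch returns the smaller one, a choice the text does not specify.

```python
# fj -> p, as listed in Table 1.
TABLE_1 = {
    0.625: 1.788456e-06,
    0.6875: 9.504819e-09,
    0.75: 8.911369e-07,
    0.8125: 4.752641e-09,
    0.875: 2.384031e-09,
    0.9375: 1.266989e-11,
    1.0: 1.115584e-07,
}


def nearest_pcr_error_rate(actual_fj, table=TABLE_1):
    """Return (nearest tabulated fj, its p); p plays the role of P2."""
    nearest = min(table, key=lambda fj: abs(fj - actual_fj))
    return nearest, table[nearest]


fj_hit, p2 = nearest_pcr_error_rate(0.7)
```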
3. Quality value correction model
For a site after compression deduplication, the total probability P after UID deduplication, jointly accounting for sequencing errors and PCR errors, is
P=P1×(1-P2)+(1-P1)×P2
The total probability P is converted into a quality value Q = -10 × lg(P), which is the corrected quality value after UID deduplication.
The calculated P is generally very small, so the converted quality value Q is larger than the original sequencing quality value; this increase reflects the accuracy of UID sequencing, and the improved quality value can be used in a variation detection algorithm to support detection of lower-frequency variation.
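The two formulas can be combined in a short sketch. The inputs below are illustrative: P1 is taken as negligible and P2 is the Table 1 entry for fj = 1, which gives a Q of about 69.5. The text does not say which inputs produced the corrected value of 69 reported for SEQ ID NO.1, so this pairing is only an assumption.

```python
from math import log10


def corrected_quality(p1, p2):
    """Q = -10 * lg(P), with P = P1*(1-P2) + (1-P1)*P2."""
    p = p1 * (1 - p2) + (1 - p1) * p2
    return -10 * log10(p)


q = corrected_quality(0.0, 1.115584e-07)
```

The expression for P is the probability that exactly one of the two error sources is present, so it reduces to P2 when P1 is zero and to P1 when P2 is zero.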
Quality value correction with the quality value correction model of this example is illustrated on one read, whose sequence is shown in SEQ ID NO.1.
SEQ ID NO.1:
GCTATTATTGATGGCAAATACACAGAGGAAGCCTTCGCCTGTCCTCATGTATTGGTCTCTCATGGCACTGTACTCTTCTTGTCCAGCTGTATCCAGTA
The original sequencing quality string of the read shown in SEQ ID NO.1 is as follows:
FDFEFFDGFFGFFGFEAGFEGGFFFGFFCGFFFFDGFFGGFFFGGEFFAEEEGFG@GGFGGFGFEGGEGFFGFFFFDFFGFFFDFFGGGG=GFGFF
the corresponding quality values are:
“37,35,37,36,37,37,35,38,37,37,38,37,37,38,37,36,32,38,37,36,38,38,37,37,37,30,38,37,37,34,38,37,37,37,37,35,38,37,37,38,38,37,37,37,38,38,36,37,37,32,36,36,36,38,37,38,31,38,38,37,38,38,37,38,37,36,38,38,36,38,37,37,38,37,37,37,37,35,37,37,38,37,30,37,37,35,37,37,38,38,38,38,28,38,37,38,37,37”
the corresponding error rates are shown in fig. 5.
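The relation between the quality string, the quality values and the error rates can be sketched as below. The offset of 33 is an assumption inferred from the listed values (for instance "F" maps to 37 and "f" to 69); the text does not name the encoding.

```python
def decode_phred33(quality_string):
    """Convert quality characters to integer quality values (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_rates(quality_values):
    """Per-base error probability 10^(-Q/10) for each quality value."""
    return [10 ** (-q / 10) for q in quality_values]


first_four = decode_phred33("FDFE")  # first four characters of the string above
rates = error_rates(first_four)
```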
The corrected quality string is:
“ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff”
the corresponding quality values are:
“69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69,69”
the corresponding error rates are shown in fig. 6.
The above results show that the corrected quality value Q is higher, which matters for supporting detection of lower-frequency mutations when the value is used in a mutation detection algorithm.
Example 2
1. Balance test
The optimal UID sequence length is calculated according to the data volume required by the project. This example uses data with an 8 bp UID from a BGI tumor research and development liquid biopsy project, provided by Huada Gene. The depth after UID deduplication is required to be 4000×; from this, the number of sequences contained in each group of repeated sequences marked by picard is calculated to be within 10000. Combined with the fitting function, the ordinate is 54, so the UID sequence must be random enough to distinguish 54 original templates, and the optimal UID sequence length is calculated to be 6.
Data for a 6 bp UID were simulated by truncating the 8 bp UID sequences, and the numbers of groups produced by secondary deduplication with the 8 bp UID and with the 6 bp UID were compared; the result is shown in fig. 7.
It should be noted that the UID sequence occupies a fixed position in the sequencing data; for example, an 8 bp UID sequence occupies the first 8 bp of the sequenced fragment. Extracting all 8 bp gives the 8 bp UID data, while extracting only the first 6 bp gives data equivalent to a 6 bp UID sequence; in this way the 6 bp UID data are simulated from the 8 bp UID data.
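The truncation simulation can be sketched as follows for the UIDs of one conventional deduplication repeated sequence group. The function name and the toy UIDs are hypothetical, and exact string matching stands in for the UID deduplication algorithm.

```python
def uid_group_counts(uids, short_len=6):
    """Count UID groups with full-length UIDs and with UIDs cut to short_len bases."""
    full = len(set(uids))
    truncated = len({uid[:short_len] for uid in uids})
    return full, truncated


# Two of the three toy UIDs share their first 6 bases.
full_groups, short_groups = uid_group_counts(["ACGTACGT", "ACGTACGA", "TTGACCAA"])
```

Comparing the two counts over all groups gives the kind of distribution comparison described for fig. 7; groups merge only when UIDs differ solely in the last two bases.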
The results of fig. 7 show that the distributions of the number of groups produced by secondary deduplication with the 8 bp UID and with the 6 bp UID are essentially the same, while the 6 bp UID reduces the data loss rate; 6 bp is therefore better than the original 8 bp and is the optimal UID length.
The above experiment demonstrates the balance: less sequencing data is lost while UID randomness is still ensured.
2. Variation detection test before and after UID deduplication quality value correction
The test carries out mutation detection based on UID sequencing on tumor standards with mixed mutation rates of 0.10%, 0.20% and 0.50%. The tumor standard is purchased from horizons corporation.
The statistical variation detection results using and not using the quality value correction at the time of UID deduplication are shown in table 2.
TABLE 2 detection results of variation before and after UID duplicate removal quality value correction
[Table 2 appears as an image in the original publication (Figure GDA0001752530960000151).]
In table 2, under "detection results", NEG indicates negative, WCX indicates weak positive, and SCX indicates strong positive. "LOD value" is the result of the LOD variation detection algorithm with a threshold of 2: a value greater than or equal to 2 is positive and a value less than 2 is negative. "Poisson probability value" is the result of the variation detection algorithm based on the Poisson distribution with a threshold of 0.95: a value greater than or equal to 0.95 is positive and a value less than 0.95 is negative. An empty LOD value and Poisson probability value indicate that the number of reads supporting the mutation at that site is 0.
The results in table 2 show that UID deduplication quality value correction raises the LOD values and Poisson probability values overall. After the correction, a site originally reported as NEG, for example KRAS_G12D in #5, can be detected as weak positive, and a site originally reported as WCX, for example EGFR_L858R in #2, can be detected as strong positive. UID deduplication quality value correction therefore expresses the accuracy of UID sequencing and supports detection of lower-frequency mutations.
The LOD value is calculated by a variation detection algorithm based on a Bayesian probability model, and the Poisson probability value is calculated by a hypothesis test based on the Poisson distribution. Both are common mutation detection algorithms. The factor that most affects the LOD value and the Poisson probability value is the quality value of the bases, i.e., their error rate, because the algorithms weigh the observed bases against their error rates to decide whether a site is positive or negative. Correcting the quality value Q makes it more accurate, which in turn affects the LOD value and the Poisson probability value and improves the accuracy of variation detection.
This example can support lower-frequency detection for the following reason. If no quality value correction is performed, i.e., an uncorrected quality value such as Q30 is used, the quality value is 30 and the error rate is 1/1000; at high depth the theoretical background noise is then high, e.g., at 10000× an error rate of 1/1000 gives a theoretical noise of 10 reads. These 10 noise reads, however, are essentially removed during UID deduplication, so the theoretical noise no longer matches the actual noise and low-frequency variation cannot be detected. With the quality value correction of this example, for example when the original Q30 becomes Q60, the error rate is 1/1000000 and the corresponding theoretical noise is less than 1, which agrees with the actual data; quality value correction can therefore support low-frequency detection.
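The noise arithmetic in this paragraph can be reproduced in a few lines. The function name is hypothetical; the calculation is simply depth multiplied by the error rate implied by the quality value.

```python
def theoretical_noise(depth, quality):
    """Expected number of erroneous reads at one site: depth * 10^(-Q/10)."""
    return depth * 10 ** (-quality / 10)


noise_q30 = theoretical_noise(10000, 30)  # uncorrected quality value
noise_q60 = theoretical_noise(10000, 60)  # corrected quality value
```

At 10000× the expected noise falls from about 10 reads at Q30 to about 0.01 reads at Q60, i.e., below one read.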
Example 3
The functions of the UID sequencing method of this example may be implemented by a computer program. The program, executable by a processor to implement the UID sequencing method, the UID sequence design method, or the UID deduplication quality value correction method of this application, can run in a windows or linux environment: running under windows requires an R IDE such as "Rgui" or "Rstudio", and running under linux requires R to be installed. Running under linux is taken as the example below; R must first be started in the linux environment:
UID length design
(1) Load the fitting function relating depth before deduplication to UID group number
source("depth_and_uid_num.r")
The fitting function loaded in this example is y=0.0053x+1.3158, obtained as described in example 1.
(2) Calculate the UID group number corresponding to the depth before deduplication required by the project; let the required depth be D
depth_and_uid_num(D), where the calculated UID group number is denoted N
This step obtains, from the fitting function, the required expected UID group number (the number of original templates in each conventional deduplication repeated sequence group) from the data amount before deduplication required for sequencing the DNA sample to be tested (the expected total number of sequences in each conventional deduplication repeated sequence group). Here D equals the total number of sequences in a conventional deduplication repeated sequence group, and N is the expected UID group number.
(3) Loading a function that calculates the optimal UID length
source("uid_length.r")
This step implements the following: if the length of the UID sequence is n, the number of possible UID sequences is 4^n; taking the expected UID group number as the ordinate, an R simulation is run for different values of n, in which the 4^n possible UID sequences are randomly added to the original templates of the expected UID group number.
(4) Calculating the optimal UID length
uid_length(N)
This step is used to obtain a minimum n value, i.e. the optimal UID sequence length, that ensures that the probability that the original templates of the expected UID sets all connect different UID sequences is 95% or more.
The program for calculating the optimal UID length is as follows:
[The program listing appears as images in the original publication (Figures GDA0001752530960000171 and GDA0001752530960000181).]
UID duplicate removal quality value correction
(1) Loading a function that calculates the proportion of PCR errors and the corresponding probability
source("fj_P.r")
This function calculates the sequencing error probability by R programming.
The program for calculating the sequencing error probability is as follows:
[The program listing appears as images in the original publication (Figures GDA0001752530960000182 and GDA0001752530960000191).]
(2) Calculating PCR error proportion and corresponding probability
fj_p()
This function calculates, by R programming, all fj satisfying the condition (fj greater than 0.6) and the corresponding p, giving the results shown in table 1.
The program for calculating p for all fj greater than 0.6 is as follows:
[The program listing appears as images in the original publication (Figures GDA0001752530960000192 and GDA0001752530960000201).]
(3) According to the PCR error proportion in the actual DNA sample data to be tested, the fj value nearest to the actual fj' is looked up in table 1 by the nearest-value principle, and the corresponding PCR error rate is read from table 1;
The total probability P is calculated from the sequencing error rate P1 and the PCR error rate P2:
P=P1×(1-P2)+(1-P1)×P2
The total probability P is converted into a quality value Q = -10 × lg(P), which is the corrected quality value after UID deduplication.
The total probability and the quality value Q are computed by the UID deduplication program of this example, the perl script "uid-rmdup-for-any-len.pl"; its main function is UID deduplication, with quality value correction performed along the way. The perl command is as follows:
perl uid-rmdup-for-any-len.pl -i bam -o rmdup.bam -u uid_length -l read_length -q quality_system -p PCR_erro_rate_matrix -a anchor_length
The UID sequencing method, the UID sequence design method and the UID duplicate removal quality value correction method are realized through the computer program which can be executed by the processor, and the computer program can be operated in a windows or linux environment, so that the method is simple and convenient to use.
The foregoing is a further detailed description of the present application in connection with the specific embodiments, and it is not intended that the practice of the present application be limited to such descriptions. It should be understood that those skilled in the art to which the present application pertains may make several simple deductions or substitutions without departing from the spirit of the present application, and all such deductions or substitutions should be considered to be within the scope of the present application.
SEQUENCE LISTING
<110> Guangzhou Hua big Gene medical examination all Limited
Shenzhen Huada clinical laboratory center
Shenzhen Huada Gene Co., Ltd.
TIANJIN BGI MEDICAL LABORATORY Co.,Ltd.
<120> UID sequencing, UID sequence design, UID duplicate removal quality value correction method and application
<130> 18I26068
<160> 1
<170> PatentIn version 3.3
<210> 1
<211> 98
<212> DNA
<213> artificial sequence
<400> 1
gctattattg atggcaaata cacagaggaa gccttcgcct gtcctcatgt attggtctct 60
catggcactg tactcttctt gtccagctgt atccagta 98

Claims (8)

1. A UID sequencing method, characterized in that: the method comprises a UID sequence design step and a UID duplicate removal quality value correction step;
the UID sequence design step comprises the steps of adding an 8-20bp UID sequence into a DNA sample to be tested in advance; performing conventional deduplication on the sequencing result, and counting the total number of sequences contained in each conventional deduplication repeated sequence group; performing secondary deduplication on the conventional deduplication repeated sequence groups by adopting a UID deduplication algorithm, and counting the number of UID groups contained in each conventional deduplication repeated sequence group; fitting the total number of sequences in each conventional duplicate sequence group and the corresponding UID group number to obtain fitting functions of the sequences in each conventional duplicate sequence group;
obtaining the required expected UID group number from the fitting function, based on the total number of sequences in the conventionally deduplicated repeated sequence group required for sequencing the DNA sample to be tested;
if the length of the UID sequence is n, the number of possible UID sequences is 4^n; taking the expected UID group number as the ordinate, programming in the R language to simulate, for different lengths n, the random addition of the 4^n possible UID sequences to the original templates of the expected UID group number; ensuring that the probability that the original templates of the expected UID group number are all connected with different UID sequences is 95% or more; and designing the UID sequence according to the minimum n value satisfying this condition, namely the optimal length of the UID sequence;
the UID duplicate removal quality value correction step comprises: in the compression deduplication algorithm of the UID deduplication algorithm, selecting at each position the base whose occurrence proportion is greater than or equal to a set threshold; and calculating, by R language programming, the probability that the bases accounting for the set threshold proportion or more at that position are sequencing errors, this probability being denoted P1;
setting that PCR errors occur in the jth round, wherein the proportion of the corresponding PCR errors is fj, the corresponding PCR error rate is p, comprehensively considering the condition that two reads of the first round of PCR have errors simultaneously or at least one PCR amplification has errors, and calculating all fj and corresponding p meeting the conditions by using R language programming, wherein the fj and the p are shown in table 1;
TABLE 1
fj        p
0.625     1.788456e-06
0.6875    9.504819e-09
0.75      8.911369e-07
0.8125    4.752641e-09
0.8750    2.384031e-09
0.9375    1.266989e-11
1         1.115584e-07
2. The UID sequencing method of claim 1, wherein:
the fj meeting the conditions is an fj for which the proportion of PCR errors is greater than or equal to the set threshold.
3. The UID sequencing method of claim 2, wherein: the set threshold is 60%.
4. A UID sequencing method according to any one of claims 1-3, wherein: the conventional deduplication specifically comprises marking repeated sequences with the Picard software according to alignment position, alignment direction and fragment length.
5. A UID sequencing method according to any one of claims 1-3, wherein: the fitting function is y = 0.0053x + 1.3158,
wherein y is the number of UID groups contained in a conventionally deduplicated repeated sequence group, and x is the total number of sequences contained in that conventionally deduplicated repeated sequence group.
6. Use of the UID sequencing method of any one of claims 1-5 in UID sequence design or UID deduplication quality value correction.
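For illustration only, and not as part of the claims, the fitting function of claim 5 can be applied as in the following Python sketch. The coefficients come from claim 5; the function names `expected_uid_groups` and `fit_line` are hypothetical, and the use of ordinary least squares to obtain such a line is an assumption, since the claims do not state which fitting procedure is used.

```python
# Coefficients taken from claim 5: y = 0.0053x + 1.3158
SLOPE = 0.0053
INTERCEPT = 1.3158


def expected_uid_groups(total_sequences):
    """Expected number of UID groups (y) for a conventionally
    deduplicated repeated sequence group holding total_sequences reads (x)."""
    return SLOPE * total_sequences + INTERCEPT


def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares.

    Assumption: the claims only say the two counts are "fitted";
    a straight-line least-squares fit is used here as one plausible choice.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Under this sketch, a group of 1000 sequences would give an expected UID group number of about 6.6; how that value is rounded before use is not specified in the claims.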
7. A UID sequence design method is characterized in that: adding a UID sequence of 8-20bp into a DNA sample to be detected in advance; performing conventional deduplication on the sequencing result, and counting the total number of sequences contained in each conventional deduplication repeated sequence group; performing secondary deduplication on the conventional deduplication repeated sequence groups by adopting a UID deduplication algorithm, and counting the number of UID groups contained in each conventional deduplication repeated sequence group; fitting the total number of sequences in each conventional duplicate sequence group and the corresponding UID group number to obtain fitting functions of the sequences in each conventional duplicate sequence group;
obtaining the required expected UID group number from the fitting function, based on the total number of sequences in the conventionally deduplicated repeated sequence group required for sequencing the DNA sample to be tested;
if the length of the UID sequence is n, the number of possible UID sequences is 4^n; taking the expected UID group number as the ordinate, programming in the R language to simulate, for different lengths n, the random addition of the 4^n possible UID sequences to the original templates of the expected UID group number; ensuring that the probability that the original templates of the expected UID group number are all connected with different UID sequences is 95% or more; and designing the UID sequence according to the minimum n value satisfying this condition, namely the optimal length of the UID sequence.
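For illustration only, and not as part of the claims, the length selection in claim 7 can be sketched in Python. The claim describes a simulation in R; this sketch swaps in a closed-form "birthday problem" calculation, under the assumption that each original template receives one of the 4^n UID sequences uniformly and independently at random. The function names and the `min_length`/`max_length` parameters are hypothetical.

```python
def prob_all_distinct(num_templates, uid_length):
    """Probability that num_templates templates all carry different UIDs
    when each UID is drawn uniformly at random from 4**uid_length sequences."""
    pool = 4 ** uid_length
    if num_templates > pool:
        return 0.0  # more templates than UIDs: a collision is certain
    prob = 1.0
    for i in range(num_templates):
        # The (i+1)-th template must avoid the i UIDs already in use.
        prob *= 1.0 - i / pool
    return prob


def min_uid_length(num_templates, target=0.95, min_length=1, max_length=20):
    """Smallest n in [min_length, max_length] whose all-distinct
    probability reaches the target, or None if no length qualifies."""
    for n in range(min_length, max_length + 1):
        if prob_all_distinct(num_templates, n) >= target:
            return n
    return None
```

With 7 expected UID groups, `min_uid_length(7)` returns 5 under these assumptions. Because the claims restrict the UID to 8-20 bp, a caller might instead pass `min_length=8`.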
8. A UID sequence design device, characterized in that the device comprises:
a fitting function acquisition module, used for performing conventional deduplication on the sequencing result of a DNA sample to be tested to which an 8-20 bp UID sequence has been added, and counting the total number of sequences contained in each conventional deduplication repeated sequence group; performing secondary deduplication on the conventional deduplication repeated sequence groups by using a UID deduplication algorithm, and counting the number of UID groups contained in each conventional deduplication repeated sequence group; fitting the total number of sequences in each conventional duplicate sequence group and the corresponding UID group number to obtain fitting functions of the sequences in each conventional duplicate sequence group;
an expected UID group number acquisition module, used for obtaining the required expected UID group number from the fitting function, based on the total number of sequences in the conventionally deduplicated repeated sequence group required for sequencing the DNA sample to be tested;
an optimal UID sequence length acquisition module, used for simulating, for different UID sequence lengths n, the random addition of the 4^n possible UID sequences to the original templates of the expected UID group number, and ensuring that the probability that the original templates of the expected UID group number are all connected with different UID sequences is 95% or more, the minimum n value satisfying this condition being the optimal length of the UID sequence;
a UID deduplication quality value correction module, used for selecting, at each position in the compression deduplication algorithm of the UID deduplication algorithm, the base whose occurrence proportion is greater than or equal to a set threshold, and calculating, by R language programming, the probability that the bases accounting for the set threshold proportion or more at that position are sequencing errors, this probability being denoted P1;
setting that PCR errors occur in the jth round, wherein the proportion of the corresponding PCR errors is fj, the corresponding PCR error rate is p, comprehensively considering the condition that two reads of the first round of PCR have errors simultaneously or at least one PCR amplification has errors, and calculating all fj and corresponding p meeting the conditions by using R language programming, wherein the fj and the p are shown in table 1;
TABLE 1
fj        p
0.625     1.788456e-06
0.6875    9.504819e-09
0.75      8.911369e-07
0.8125    4.752641e-09
0.8750    2.384031e-09
0.9375    1.266989e-11
1         1.115584e-07
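For illustration only, and not as part of the claims, the per-position base selection used in the quality value correction can be sketched in Python. The 60% threshold comes from claim 3. The function name `consensus`, the assumption that all reads in one UID group have equal length, and the use of 'N' where no base reaches the threshold are all assumptions of this sketch. The calculation of P1 and of the p values in Table 1 is not reproduced here.

```python
from collections import Counter


def consensus(reads, threshold=0.6):
    """Collapse the reads of one UID group into a single sequence.

    At each position, the base whose share of the reads is greater than
    or equal to the threshold is kept; otherwise 'N' is emitted
    (the 'N' fallback is an assumption, not taken from the claims).
    """
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all reads in a UID group must have equal length")
    total = len(reads)
    result = []
    for pos in range(length):
        # Count how often each base occurs at this position.
        base, count = Counter(r[pos] for r in reads).most_common(1)[0]
        result.append(base if count / total >= threshold else "N")
    return "".join(result)
```

For the five reads `ACGT`, `ACGT`, `ACGA`, `ACTT`, `ACGT`, every position has a base present in at least four of five reads, so the sketch returns `ACGT`.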
CN201810450617.1A 2018-05-11 2018-05-11 UID sequencing, UID sequence design, UID duplicate removal quality value correction method and application Active CN110491445B (en)

Publications (2)

Publication Number Publication Date
CN110491445A CN110491445A (en) 2019-11-22
CN110491445B CN110491445B (en) 2023-05-30

Family

ID=68543212

Country Status (1)

Country Link
CN (1) CN110491445B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510006 Huada gene, 3 / F, zone B, national digital home industry application demonstration base, No. 22, Xiaoguwei street, University Town, Panyu District, Guangzhou City, Guangdong Province

Applicant after: BGI-GUANGZHOU MEDICAL LABORATORY Co.,Ltd.

Applicant after: Shenzhen Huada Medical Laboratory

Applicant after: BGI SHENZHEN Co.,Ltd.

Applicant after: TIANJIN MEDICAL LABORATORY, BGI

Address before: 510006 Huada gene, 3 / F, zone B, national digital home industry application demonstration base, No. 22, Xiaoguwei street, University Town, Panyu District, Guangzhou City, Guangdong Province

Applicant before: BGI-GUANGZHOU MEDICAL LABORATORY Co.,Ltd.

Applicant before: SHENZHEN HUADA CLINIC EXAMINATION CENTER

Applicant before: BGI SHENZHEN Co.,Ltd.

Applicant before: TIANJIN MEDICAL LABORATORY, BGI

GR01 Patent grant