CN111696622A

CN111696622A - Method for correcting and evaluating detection result of mutation detection software

Info

Publication number: CN111696622A
Application number: CN202010456693.0A
Authority: CN
Inventors: 王旭文; 杨玲; 易鑫; 黄毅; 吴玲清; 林浩翔
Original assignee: Shenzhen Guiinga Medical Laboratory; Beijing Jiyinjia Medical Laboratory Co ltd
Current assignee: Shenzhen Guiinga Medical Laboratory; Beijing Jiyinjia Medical Laboratory Co ltd
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2020-09-22
Anticipated expiration: 2040-05-26
Also published as: CN111696622B

Abstract

The invention relates to a method for correcting and evaluating detection results of mutation detection software, which comprises the following steps: inputting a detection file, identifying and dividing polynucleotide variation in the detection file, performing duplication removal and integration on variation results in the detection file after division processing to obtain a corrected detection result, and performing consistency evaluation on the variation results and/or the corrected detection result in the detection file by taking the variation detection result of reference software as a gold standard. The method for correcting and evaluating the detection result of the mutation detection software can be used for correcting and evaluating the detection result of any mutation detection software based on the result file of the mutation detection software as input, and can improve the final mutation detection rate.

Description

Method for correcting and evaluating detection result of mutation detection software

Technical Field

The invention belongs to the technical field of gene detection, and particularly relates to a method for correcting and evaluating a detection result of mutation detection software.

Background

Genes undergo many types of mutation, most commonly single nucleotide variations (SNV) and insertions and deletions of DNA fragments (indels), but polynucleotide variations (MNV) also occur frequently. A polynucleotide variation consists of several SNPs or indels within one block. For example, '1, 1289564, AGCT, CGCC' means that at position 1289564 on chromosome 1 the reference sequence (REF) AGCT is mutated to the sequence (ALT) CGCC; in fact, only the bases at the head and tail ends are substituted, so the record is really two SNP mutations. As another example, in '2, 56892445, TGGCTGCAA, CGGCGGCA' base substitutions occur at the head and in the middle of the sequence, while a deletion occurs at its end. In practical research, polynucleotide variations need to be segmented so that the variation information can be rearranged; otherwise, the accuracy of downstream analysis of the genetic data is affected.
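The first example above can be checked with a short position-by-position comparison. This is only an illustrative sketch: the function name `split_equal_length` and the tuple layout (chromosome, position, REF, ALT) are assumptions made here, and the comparison covers only the simple case where REF and ALT have the same length.

```python
def split_equal_length(chrom, pos, ref, alt):
    """Split an equal-length REF/ALT block into single-base substitutions.

    Hypothetical helper for illustration; it does not handle indels.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length REF/ALT")
    # Keep only the positions where the reference and mutated bases differ.
    return [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


snps = split_equal_length("1", 1289564, "AGCT", "CGCC")
print(snps)
```

For the record '1, 1289564, AGCT, CGCC' this yields one substitution at the head (position 1289564, A to C) and one at the tail (position 1289567, T to C).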

Gene mutation is an important cause of cancer, and different cancer types have different characteristic mutation types. At present, the software most commonly used for SNV detection in tissue is GATK-mutect2. It performs strict quality correction on the sequencing data and detects SNV variation using a reliable Bayesian model and a Markov model trained on a large amount of clinical medical data, so its detection results are accurate.

However, the GATK algorithm is relatively slow and has some defects in detecting mutations in blood samples. First, it is not sensitive enough to detect sites with an extremely low mutation rate in blood. Second, the model parameters used by GATK are trained on tissue data and are not suitable for blood samples.

Depending on the type of mutation analysis required, other software must be used for mutation detection; for example, FreeBayes has high detection sensitivity, and Platypus can perform rapid mutation detection. However, the detection results of such software often contain a large amount of polymorphic site information that has not been filtered, so the false positive rate of the detected mutation information is high, the detection results are inaccurate, a consistency comparison with the analysis results of the mutect2 software is difficult, and the detection results cannot be confirmed. When the detection results of software such as mutect2 are used as the detection standard, a method for consistency comparison with the analysis results of the reference software is lacking.

Disclosure of Invention

In view of the above problems, the present invention provides a method for correcting and evaluating the detection result of mutation detection software.

A method for correcting and evaluating the detection results of mutation detection software, comprising:

inputting a detection file, and identifying and segmenting polynucleotide variation in the detection file;

carrying out duplication removal and integration on the variation results in the detection files after the segmentation processing is carried out, and obtaining a correction detection result;

and taking the variation detection result of the reference software as a gold standard, and carrying out consistency evaluation on the variation result and/or the correction detection result in the detection file.

Further, the detection file is a result file of any mutation detection software, and the result file is corrected and evaluated.

Further, the identifying and segmenting the polynucleotide variation in the test file comprises the following steps:

step (1): acquiring a consensus sequence of variant reads in a reference genome and the detection file;

step (2): determining a selected consensus sequence according to a principle of preferentially selecting a longest consensus sequence, and segmenting two ends of the selected consensus sequence to obtain two new variation information M and N;

step (3): repeating step (2) for M and N, so that the polynucleotide variation sites are identified and segmented by a recursive algorithm;

step (4): calculating, for M and N respectively, the length of the consensus sequence with the reference sequence to obtain the segment length P of the variation information M and the segment length Q of the variation information N, and continuing the identification and segmentation of polynucleotide variation sites in the order determined by the values of P and Q until the variation sites at both ends of the consensus sequences have all been identified and segmented.
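Steps (1) to (3) can be sketched as a recursion around the longest shared substring. This is a minimal sketch under stated assumptions, not the patented implementation: the standard-library `difflib.SequenceMatcher` stands in for the unspecified pattern recognition algorithm, the name `split_variant` is hypothetical, and a pure deletion or insertion is written with an empty string rather than with an anchor base as a VCF file would require.

```python
from difflib import SequenceMatcher


def split_variant(pos, ref, alt):
    """Recursively split (pos, ref, alt) around the longest consensus sequence."""
    if ref == alt:
        return []  # nothing differs in this piece
    if not ref or not alt:
        return [(pos, ref, alt)]  # pure insertion or deletion
    # Assumption: longest common substring plays the role of the consensus.
    match = SequenceMatcher(None, ref, alt, autojunk=False).find_longest_match(
        0, len(ref), 0, len(alt)
    )
    if match.size == 0:
        return [(pos, ref, alt)]  # no shared bases, cannot split further
    end_ref = match.a + match.size
    end_alt = match.b + match.size
    # M is the piece left of the consensus, N the piece right of it.
    left = split_variant(pos, ref[: match.a], alt[: match.b])
    right = split_variant(pos + end_ref, ref[end_ref:], alt[end_alt:])
    return left + right


pieces = split_variant(56892445, "TGGCTGCAA", "CGGCGGCA")
print(pieces)
```

Applied to the background example '2, 56892445, TGGCTGCAA, CGGCGGCA', the sketch returns a substitution at the head, a substitution in the middle and a deletion at the end, which matches the decomposition described for that record.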

Further, the obtaining of the consensus sequences of the variant reads in the reference genome and the test file comprises:

if the lengths of the reference sequence and the variant site are both larger than 2, searching a consensus sequence of the reference genome and the variant reading based on a pattern recognition algorithm;

if the lengths of the reference sequence and the variation site are both 2 and the bases of the reference genome and the variation site are different, splitting the polymorphic variation site into two SNPs;

if the length of the reference sequence is more than or equal to 2 and the length of the variation site is more than 2, searching a consensus sequence of the reference genome and the variation reading based on a pattern recognition algorithm.
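The length-based rules above can be read as a small dispatcher. The sketch below is an interpretation only: the function name and the returned labels are hypothetical, and the final "leave_unchanged" branch is an assumption, since the text does not say what happens to records that match none of the listed cases.

```python
def choose_strategy(ref, alt):
    """Pick a handling strategy from the lengths of REF and ALT."""
    both_bases_differ = (
        len(ref) == 2 and len(alt) == 2 and ref[0] != alt[0] and ref[1] != alt[1]
    )
    if both_bases_differ:
        # Two-base block in which both bases differ: emit two SNPs directly.
        return "split_into_two_snps"
    if len(ref) >= 2 and len(alt) > 2:
        # Also covers the case where both lengths are larger than 2.
        return "search_consensus"
    # Assumption: anything else is passed through untouched.
    return "leave_unchanged"


print(choose_strategy("AG", "CT"))
print(choose_strategy("AGCT", "CGCC"))
```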

Further, the segmentation of the two ends of the selected consensus sequence is based on a character string segmentation technology to segment the two ends of the consensus sequence.

Further, the method for continuing the recognition and segmentation of the polynucleotide variation sites through the judgment of the length values of P and Q comprises the following steps:

if P > Q or P < Q, polynucleotide variation site identification and segmentation are carried out first on the variation information with the longer consensus sequence;

when P = Q, the above steps (1) to (4) are repeated from left to right according to the coordinates of M and N on the genome until all polynucleotide variation sites are completely segmented.
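The ordering rule for M and N can be sketched as follows. This is one possible reading, with assumptions labeled: P and Q are taken to be the lengths of the longest substring shared by REF and ALT within M and N, `difflib.SequenceMatcher` is used as a stand-in for the consensus search, and the names `consensus_length` and `processing_order` are hypothetical.

```python
from difflib import SequenceMatcher


def consensus_length(ref, alt):
    """Length of the longest substring shared by ref and alt."""
    match = SequenceMatcher(None, ref, alt, autojunk=False).find_longest_match(
        0, len(ref), 0, len(alt)
    )
    return match.size


def processing_order(m, n):
    """Return the (pos, ref, alt) tuples m and n in the order to process them."""
    p = consensus_length(m[1], m[2])
    q = consensus_length(n[1], n[2])
    if p != q:
        # The piece with the longer consensus sequence goes first.
        return [m, n] if p > q else [n, m]
    # P = Q: fall back to left-to-right genomic coordinates.
    return sorted([m, n], key=lambda piece: piece[0])


order = processing_order((56892445, "T", "C"), (56892449, "TGCAA", "GGCA"))
print(order)
```

Here the left piece has no shared bases (P = 0) and the right piece shares "GCA" (Q = 3), so the right piece is processed first.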

Further, the mutation result in the detection file comprises segmented and non-segmented mutation information;

the segmented and non-segmented variation information comprises mutated chromosomes, mutated positions, reference base sequences and mutated base sequences;

the variation results in the detection file after the segmentation processing are integrated as follows: the variation information is merged, and variations whose chromosome, mutation position and reference sequence are all the same are integrated into one line and treated as the variation information of one site.
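The integration rule can be illustrated with a dictionary keyed on chromosome, position and reference sequence. This is a sketch under assumptions: the function name `integrate` is hypothetical, and joining the mutated sequences with commas follows the common VCF convention rather than anything stated in the text.

```python
def integrate(records):
    """Merge (chrom, pos, ref, alt) records that share chrom, pos and ref."""
    merged = {}
    for chrom, pos, ref, alt in records:
        alts = merged.setdefault((chrom, pos, ref), [])
        if alt not in alts:  # avoid listing the same mutated sequence twice
            alts.append(alt)
    # One output line per site; dicts keep first-seen order in Python 3.7+.
    return [
        (chrom, pos, ref, ",".join(alts))
        for (chrom, pos, ref), alts in merged.items()
    ]


sites = integrate([
    ("1", 100, "A", "C"),
    ("1", 100, "A", "G"),
    ("2", 5, "T", "C"),
])
print(sites)
```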

Further, the deduplication is performed on the variation result in the detection file after the segmentation processing, specifically:

for variation results meeting the preset de-duplication standard, one piece of variation information is retained by a random algorithm, and the de-duplicated variation results are taken as the corrected detection result;

the preset de-duplication standard is as follows: whether the chromosome, the mutation position, the reference sequence and the mutated sequence in the variation information are the same is used as the basis for judging whether variation results are repeated; if all of this information is identical, the compared variation information is judged to be repeated; otherwise, it is not repeated.
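A minimal sketch of this de-duplication step is given below. The record layout (dictionaries with `chrom`, `pos`, `ref`, `alt` keys plus any extra fields), the function name `deduplicate` and the optional `seed` parameter are assumptions for illustration; the text only specifies the four-field comparison and the random retention of one record.

```python
import random
from collections import defaultdict


def deduplicate(records, seed=None):
    """Keep one randomly chosen record per (chrom, pos, ref, alt) group."""
    rng = random.Random(seed)  # seed is only there to make runs repeatable
    groups = defaultdict(list)
    for record in records:
        key = (record["chrom"], record["pos"], record["ref"], record["alt"])
        groups[key].append(record)
    # Records with identical keys count as repeats; retain one at random.
    return [rng.choice(group) for group in groups.values()]


calls = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "C", "source": "segmented"},
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "C", "source": "original"},
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "source": "original"},
]
unique_calls = deduplicate(calls, seed=0)
print(len(unique_calls))
```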

Further, the criteria for the consistency assessment are: whether the mutated chromosomes are the same, whether the coordinate positions of the mutations are the same, whether the reference sequences of the mutations are the same, whether the mutated sequences are the same, and whether the difference in mutation frequency is within 0.01; if all of these conditions are met, the mutation is judged to be a true positive mutation;

the index of consistency assessment is sensitivity;

the sensitivity is calculated as follows: the variation results in the detection file are compared with the variation detection results filtered by the reference software, and sensitivity = number of consistent variations in the detection file / total number of variations detected by the reference software;

and/or,

the corrected detection result is compared with the variation detection results filtered by the reference software, and sensitivity = number of consistent variations in the corrected detection result / total number of variations detected by the reference software.
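The consistency criteria and the sensitivity ratio can be sketched together. The following is an illustrative sketch only: the dictionary layout with a `freq` field, the function names `is_consistent` and `sensitivity`, and the choice to treat "within 0.01" as an inclusive bound are all assumptions.

```python
def is_consistent(call, truth, max_freq_diff=0.01):
    """True if a call matches a reference-software variant on all five criteria."""
    return (
        call["chrom"] == truth["chrom"]
        and call["pos"] == truth["pos"]
        and call["ref"] == truth["ref"]
        and call["alt"] == truth["alt"]
        and abs(call["freq"] - truth["freq"]) <= max_freq_diff
    )


def sensitivity(calls, reference):
    """Consistent variations divided by all variations in the reference set."""
    if not reference:
        raise ValueError("reference set is empty")
    true_positives = sum(
        1 for truth in reference if any(is_consistent(c, truth) for c in calls)
    )
    return true_positives / len(reference)


reference = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "C", "freq": 0.30},
    {"chrom": "2", "pos": 200, "ref": "T", "alt": "G", "freq": 0.10},
]
calls = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "C", "freq": 0.305},
    {"chrom": "2", "pos": 200, "ref": "T", "alt": "G", "freq": 0.20},
]
print(sensitivity(calls, reference))
```

In this toy input the second call has the right site and bases but a frequency difference of 0.10, so only one of the two reference variations counts as a true positive.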

The method for correcting and evaluating the detection result of the variation detection software can be applied to the correction and evaluation of variation detection results from whole genome sequencing, whole exon sequencing and target region capture sequencing data.

The method for correcting and evaluating the detection result of the mutation detection software provided by the invention has the following advantages:

the method is suitable for whole genome sequencing, whole exon sequencing and target region capture sequencing data;

the method can correct and evaluate detection results using the result file of any variation detection software as input, and can improve the final variation detection rate; the sensitivity is improved by 1-1.5% after correction;

moreover, when polynucleotide variation sites are identified, the consensus sequence is searched for by a pattern recognition algorithm, so there is no need to search for other matching sub-consensus sequences within a user-defined distance of the consensus sequence, and the search is not limited by the size of an extension window;

the method is suitable for all variation detection results that contain unprocessed polynucleotide variation sites, and ensures the accuracy of the detection results;

in the process of repeatedly identifying and segmenting polynucleotide variation sites, the invention adopts a recursive algorithm, which saves both development time and memory.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart illustrating a method for correcting and evaluating the detection results of mutation detection software according to the present invention;

FIG. 2 is a flowchart for correcting and evaluating the detection result of the Platypus mutation detection software, using its result file as the input file, according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A method for correcting and evaluating the detection result of mutation detection software, as shown in FIG. 1, comprises the following steps:

The method is suitable for whole genome sequencing, whole exon sequencing and target region capture sequencing data. The method can be used for correcting and evaluating the detection result based on the result file of any mutation detection software as input.

Identifying and segmenting polynucleotide variations in the test file in the present method comprises:

(1) obtaining consensus sequences of variant reads in the reference genome and the test file:

(2) Determining a selected consensus sequence according to a principle of preferentially selecting a longest consensus sequence, and segmenting two ends of the selected consensus sequence to obtain two new variation information M and N;

(3) repeating the identification and segmentation of the polynucleotide variation sites of M and N by a recursive algorithm according to the step (2);

the two ends of the selected consensus sequence are segmented based on a character string segmentation technology.

(4) Calculating, for M and N respectively, the length of the consensus sequence with the reference sequence to obtain the segment length P of the variation information M and the segment length Q of the variation information N, and continuing the identification and segmentation of polynucleotide variation sites in the order determined by the values of P and Q until the variation sites at both ends of the consensus sequences have all been identified and segmented.

When P = Q, steps (1) to (4) are repeated from left to right according to the coordinates of M and N on the genome until all polynucleotide variation sites are completely segmented.

In the method, the variation result in the detection file comprises variation information after segmentation and variation information without segmentation;

the segmented and non-segmented variation information comprises a mutated chromosome, a mutation position, a reference base sequence and a mutated base sequence;

The removing duplication of the variation result in the detection file after the segmentation processing specifically comprises:

The consistency evaluation of the detection result by correcting the detection result comprises:

the criteria for the consistency assessment are: whether the mutated chromosomes are the same, whether the coordinate positions of the mutations are the same, whether the reference sequences of the mutations are the same, whether the mutated sequences are the same, and whether the difference in mutation frequency is within 0.01; if all of these conditions are met, the mutation is judged to be a true positive mutation;

the index of consistency assessment is sensitivity;

the sensitivity is calculated as follows: the variation results in the detection file are compared with the variation detection results filtered by the reference software, and sensitivity = number of consistent variations in the detection file / total number of variations detected by the reference software;

and/or,

the corrected detection result is compared with the variation detection results filtered by the reference software, and sensitivity = number of consistent variations in the corrected detection result / total number of variations detected by the reference software.

Example 1

FIG. 2 is a flowchart for correcting and evaluating the detection result of the Platypus mutation detection software, using its result file as the input file. The procedure is as follows:

Three different tumor tissues were selected, and the control for each tumor tissue was peripheral blood leukocytes (supplied by Beijing Jiyinjia Medical Laboratory).

1. Respectively carrying out nucleic acid extraction on the tumor tissues, constructing a nucleic acid library, and sequencing a target capture region.

To ensure the accuracy of mutation detection, the average sequencing depth of the target capture region is more than 500× for the tumor tissue and more than 200× for the control group.

2. Aligning the sequencing data of the tumor tissue and of the control group to the reference genome, respectively, to obtain alignment result files.

The sequencing data of the detection group and the control group are aligned to the reference genome using BWA-MEM software;

the alignment result files comprise the alignment result of the tumor tissue and the alignment result of the control group.

3. Performing mutation detection analysis on the alignment result of the tumor tissue and the alignment result of the control group using GATK-mutect2 software and Platypus mutation detection software, respectively; comparison of the detection results shows that the number of mutations detected by the Platypus mutation detection software is inaccurate and that a large number of polynucleotide variation sites are present.

The alignment result of each tumor tissue and that of its control group were analyzed with GATK-mutect2 software and Platypus mutation detection software respectively, the mutations of the tumor tissue were searched for with the control group as background, and the detection results are shown in Table 1:

TABLE 1 Data of three samples tested using Mutect2 and Platypus software

Table 2 shows the resource consumption of the Platypus software and the GATK-mutect2 software; compared with GATK-mutect2, Platypus completes mutation detection in a shorter time. As can be seen from Table 1, the numbers of SNP and indel variations detected by the Platypus software, both before and after correction, are far greater than those detected by the GATK-mutect2 software, and a large number of polynucleotide variation sites are also present in the Platypus detection results.

TABLE 2 Comparison of the resource consumption of the two software packages

Software	Number of passes	Memory	Time consumed
Platypus	6	0.5G	20 minutes
GATK-mutect2	6	10G	700 minutes

4. Identifying and segmenting the polynucleotide variations in the variation detection result of the Platypus variation detection software.

Identification and segmentation of polynucleotide variations requires the following steps:

(1) Obtaining the consensus sequences of the reference genome and the variant reads.

The method for obtaining the consensus sequence is chosen according to the lengths of the reference sequence and of the variation site.

If the lengths of the reference sequence and the variation site are both 2, and both bases differ between the reference genome and the variation site, the polynucleotide variation site is split into two SNP sites without polynucleotide variation identification;

if the length of the reference sequence is greater than or equal to 2 and the length of the variation site is greater than 2, the consensus sequence of the reference genome and the variant read is searched for based on a pattern recognition algorithm.

(2) After the consensus sequences are obtained, a consensus sequence is selected according to the principle of preferring the longest one, and the two ends of the selected consensus sequence are segmented to obtain two new pieces of variation information, M and N.

The segmentation of the two ends of the selected consensus sequence is based on a string segmentation technique.

(3) Repeating step (2) for M and N, so that the polynucleotide variation sites are identified and segmented by a recursive algorithm.

(4) Calculating, for M and N respectively, the length of the consensus sequence with the reference sequence to obtain the segment length P of the variation information M and the segment length Q of the variation information N; if P > Q or P < Q, polynucleotide variation site identification and segmentation are carried out first on the variation information with the longer consensus sequence, until the variation sites at both ends of the consensus sequences have all been identified and segmented;

when P = Q, steps (1) to (4) are repeated from left to right according to the coordinates of M and N on the genome until all polynucleotide variation sites are completely segmented.

5. Integrating and de-duplicating variation detection results

The segmented and non-segmented variation information includes mutated chromosomes, mutated positions, reference base sequences, and mutated base sequences;

The variation results are integrated as follows: the variation information is merged, that is, for variations whose chromosome, mutation position and reference sequence are the same, the mutated base sequences are integrated into one line, which is taken as the variation information of one site.

The integrated variation information is then de-duplicated: for variation results meeting the preset de-duplication standard, one piece of variation information is retained by a random algorithm.

The preset de-duplication standard is as follows: if the chromosome, the mutation position, the reference sequence and the mutated sequence are all the same, the compared variation information is repeated, and only one piece of variation information is retained.

After correction of the polynucleotide variation sites, a large number of single-base polymorphic sites and insertion-deletion variations were recovered from the three samples; the specific values are shown in Table 3.

TABLE 3 Comparison of correction data of three samples using Platypus software

6. Carrying out consistency evaluation on the variation detection results.

The consistency evaluation is as follows: taking the variation detection result of the GATK-mutect2 software as the gold standard, consistency evaluation is performed on the variation detection result of the same sample.

The specific criteria for evaluation are: whether the mutated chromosomes are the same, whether the coordinate positions of the mutations are the same, whether the reference sequences of the mutations are the same, whether the mutated sequences are the same, and whether the difference in mutation frequency is within 0.01; if all of these conditions are met, the mutation is judged to be a true positive.

The specific evaluation index is sensitivity: the corrected variation detection results of the Platypus mutation detection software are compared with the filtered variation detection results of the GATK-mutect2 software, and sensitivity = number of consistent variations detected / total number of variations detected by the GATK-mutect2 software.

According to the analysis results in Table 1, the sensitivity of the three samples is improved by 1-1.5% after correction of the polynucleotide variation sites. Applying the consistency evaluation method of the invention to the results of the Platypus mutation detection software therefore improves its sensitivity while retaining its shorter detection time.

In the present embodiment, only GATK-mutect2 detection software is used as reference software, and the detection result of Platypus software is exemplified, but the reference software and the input detection result file are not limited thereto.

The consistency evaluation method can improve the consistency of the detection results of the existing mutation information detection software and the GATK-mutect2, can ensure that the existing mutation information detection software can quickly and accurately obtain the detection results, has no requirement on the types of detection samples, and has wide application range.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for correcting and evaluating detection results of mutation detection software is characterized by comprising the following steps:

2. The method of claim 1, wherein the test file is a test file of any mutation test software, and the test file is calibrated and evaluated.

3. The method of claim 1, wherein the identifying and segmenting the polynucleotide variants in the test file comprises the steps of:

4. The method of claim 3, wherein the obtaining of the consensus sequence of variant reads in the reference genome and the test file comprises:

5. The method of claim 3, wherein the segmentation of the two ends of the selected consensus sequence is based on string segmentation.

6. The method of claim 3, wherein continuing the identification and segmentation of polynucleotide variation sites according to the length values of P and Q comprises:

7. The method of claim 1, wherein the mutation results in the test file include segmented and non-segmented mutation information;

8. The method according to claim 7, wherein the de-duplication of the mutation result in the detection file after the segmentation process is specifically:

9. The method for correcting and evaluating the detection result of mutation detection software according to claim 1, wherein

the index of consistency assessment is sensitivity;

and/or,

10. Use of the method for correcting and evaluating the detection result of variation detection software, characterized in that the method is applied to the correction and evaluation of variation detection results from whole genome sequencing, whole exon sequencing and target region capture sequencing data.