Comprehensive strategy for identifying somatic mutations
Technical Field
The invention relates to a method for identifying somatic mutations, in particular to a comprehensive strategy for identifying somatic mutations.
Background
A "somatic mutation" as referred to herein is a mutation that occurs in tumor tissue but not in the matched normal control; the term covers SNVs (single nucleotide variations) and Indels (insertions and deletions).
Numerous tools currently exist for somatic mutation identification, including MuTect2, VarScan2, RADIA, SomaticSniper, Strelka and MuSE for SNV identification, and VarScan2, MuTect2, Strelka, Pindel and Indelocator for Indel identification. Each tool has its own strengths and weaknesses, and none gives the best possible mutation results when used alone. Two of the more widely used tools are described below to illustrate the strengths and weaknesses of a single tool.
MuTect2 is part of GATK (Genome Analysis Toolkit), the best-known software suite in the field of mutation identification. Its advantage is that it uses a Bayesian classifier with known mutations as background knowledge, and statistically tests both the likelihood that a mutation in the tumor tissue is somatic and the likelihood that the corresponding mutation is absent from the normal tissue, so that somatic status is called with higher accuracy. In addition, it locally reassembles the sequence around each mutation site, so that Indels and SNVs near Indels can be identified more accurately. Its disadvantage is that it has not yet seen extensive practical use, so false negatives or false positives may be present. In addition, the software is still a beta (test) version, may contain bugs, and is not recommended for use in a production environment.
VarScan2 is a commonly used mutation identification tool in the field. Its advantages are fast running speed and relatively accurate mutation identification. Its disadvantage is that it uses Fisher's exact test to decide whether a mutation is somatic; this is a simple, purely mathematical method that, unlike MuTect2, does not draw on prior knowledge. In addition, some mutations are missed for no apparent reason.
All of these single tools produce false positives and false negatives when identifying somatic mutations. The goal is to control the false positives and false negatives that a single tool produces on a "tumor-normal control" paired sample, so that specificity and sensitivity improve together, somatic mutations are identified accurately, and the position (chromosome) and mutation form (wild type and mutant type) closest to the true somatic mutations of the tumor are finally obtained. No established or published method or tool currently achieves this directly.
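To make the statistical test mentioned above concrete, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of read counts. It is not VarScan2's actual implementation; the table layout (tumor and normal rows, variant-supporting and reference-supporting columns), the function name and the read counts in the usage lines are illustrative assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative only):
        row 1 = tumor,  row 2 = normal
        col 1 = variant-supporting reads, col 2 = reference-supporting reads
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k variant reads in the tumor row
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    probs = [prob(k) for k in range(k_min, k_max + 1)]
    # Sum every table that is no more probable than the observed one
    # (small relative tolerance guards against floating-point ties)
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Hypothetical read counts: 12 of 50 tumor reads and 0 of 40 normal reads
# support the variant.
p_value = fisher_exact_two_sided(12, 38, 0, 40)
print(f"p = {p_value:.2e}")
```

A small p-value indicates that the variant reads are distributed unevenly between tumor and normal, which is the kind of evidence such a test contributes; as the text notes, it uses no prior knowledge about known mutation sites.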
Disclosure of Invention
The invention aims to solve the above problems in the prior art by providing a comprehensive strategy for identifying somatic mutations that controls false negatives and false positives at the same time, and thereby identifies somatic mutations accurately.
To achieve this purpose, the invention adopts the following technical means: a comprehensive strategy for identifying somatic mutations, characterized in that an alignment result is used as the basic data; an SNV identification tool set and an Indel identification tool set perform SNV identification and Indel identification on the basic data at the same time; all SNV data obtained by the SNV identification tool set are compared against set conditions and filtered, and the reliable SNV data meeting the conditions are extracted; all Indel data obtained by the Indel identification tool set are compared against set conditions and filtered, and the reliable Indel data meeting the conditions are extracted; and the reliable SNV data and the reliable Indel data are merged to obtain a reliable somatic mutation result.
Further, the alignment result is stored in BAM (Binary Alignment/Map) format.
Further, the alignment result is produced by the BWA alignment software and is then realigned using the realignment module of GATK.
Further, the SNV identification tool set consists of five somatic mutation identification tools: MuSE, MuTect2, RADIA, SomaticSniper and VarScan2.
Further, the Indel identification tool set consists of three somatic mutation identification tools: Pindel, Strelka and VarScan2.
Further, the comparative filtering means: an SNV must be identified by at least two of the five tools; an Indel must be identified by at least two of the three tools; and for both types of mutation, the sequencing coverage depth must be not less than 8 in the normal sample and not less than 10 in the tumor sample.
Further, the reliable somatic mutation result is obtained by merging the SNVs and Indels that pass the comparative filtering into the final set of reliable mutations.
The beneficial effect of the invention is that, because the SNV identification tool set and the Indel identification tool set are run at the same time and several mutation identification tools are used in combination, false positives and false negatives in mutation identification are reduced together, so that somatic mutations are identified accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a flow chart of the somatic mutation identification of the present invention.
Detailed Description
As shown in FIG. 1, in the comprehensive strategy for identifying somatic mutations, an alignment result is used as the basic data; an SNV identification tool set and an Indel identification tool set perform SNV identification and Indel identification on the basic data at the same time; all SNV data obtained by the SNV identification tool set are compared against set conditions and filtered, and the reliable SNV data meeting the conditions are extracted; all Indel data obtained by the Indel identification tool set are compared against set conditions and filtered, and the reliable Indel data meeting the conditions are extracted; and the reliable SNV data and the reliable Indel data are merged to obtain a reliable somatic mutation result.
The alignment result is stored in BAM (Binary Alignment/Map) format.
The alignment result is produced by the BWA alignment software and is then realigned using the realignment module of GATK.
The SNV identification tool set consists of five somatic mutation identification tools: MuSE, MuTect2, RADIA, SomaticSniper and VarScan2.
The Indel identification tool set consists of three somatic mutation identification tools: Pindel, Strelka and VarScan2.
The comparative filtering is as follows: an SNV must be identified by at least two of the five tools; an Indel must be identified by at least two of the three tools; and for both types of mutation, the sequencing coverage depth must be not less than 8 in the normal sample and not less than 10 in the tumor sample.
The reliable somatic mutation result is obtained by merging the SNVs and Indels that pass the comparative filtering into the final set of reliable mutations.
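The filtering and merging rules above can be sketched as follows. This is an illustration only, not the invention's script: the `Call` record layout, the function name and the sample calls are assumptions, and the sketch assumes each tool reports its own normal and tumor depth for a call and that the depth thresholds are applied to each call before tool agreement is counted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Call(NamedTuple):
    """One mutation call from one tool (hypothetical record layout)."""
    tool: str
    chrom: str
    pos: int
    ref: str
    alt: str
    normal_depth: int
    tumor_depth: int


Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def consensus_filter(calls: Iterable[Call], min_tools: int = 2,
                     min_normal_depth: int = 8,
                     min_tumor_depth: int = 10) -> List[Site]:
    """Keep sites called by at least `min_tools` tools with sufficient depth."""
    tools_by_site: Dict[Site, Set[str]] = defaultdict(set)
    for call in calls:
        # Depth thresholds from the text: normal >= 8, tumor >= 10
        if call.normal_depth < min_normal_depth or call.tumor_depth < min_tumor_depth:
            continue
        tools_by_site[(call.chrom, call.pos, call.ref, call.alt)].add(call.tool)
    # A set per site means a tool is counted once even if it reports a site twice
    return sorted(site for site, tools in tools_by_site.items()
                  if len(tools) >= min_tools)


# Made-up calls for illustration
snv_calls = [
    Call("MuSE", "chr1", 100, "A", "G", 20, 30),
    Call("VarScan2", "chr1", 100, "A", "G", 20, 30),
    Call("MuTect2", "chr2", 500, "C", "T", 25, 40),       # reported by one tool only
    Call("RADIA", "chr3", 700, "G", "A", 5, 50),           # normal depth below 8
    Call("SomaticSniper", "chr3", 700, "G", "A", 5, 50),   # normal depth below 8
]
indel_calls = [
    Call("Pindel", "chr1", 200, "AT", "A", 12, 15),
    Call("Strelka", "chr1", 200, "AT", "A", 12, 15),
]

# Merge the reliable SNVs and the reliable Indels into one result list
reliable = consensus_filter(snv_calls) + consensus_filter(indel_calls)
print(reliable)
```

Because the same function serves both tool sets, the "at least two tools" rule covers the five-tool SNV set and the three-tool Indel set without separate code paths.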
Application Example
The method was implemented as a pipeline in Bash (Bourne-Again Shell), and the resulting tool consists of the following modules:
Main control module: a Bash script whose main function is to call each mutation identification module to perform mutation identification and then to integrate the mutation results. When the mutation identification modules are called, computing resources (threads) are allocated in proportion to the running time of each mutation identification tool, so that the tools finish at as nearly the same time as possible (downstream steps cannot continue until every tool has finished). The parameters that can be specified for the program are: the output directory, the total number of threads, and whether the data are from whole-genome sequencing. The input files required by the program are: the reference genome sequence, the dbSNP site file, the COSMIC site file, the GATK toolkit file path, the VarScan2 toolkit file path, and a sample information file. The sample information file is a table with 5 columns: tumor sample name, BAM file of the tumor sample alignment result, normal control sample name, BAM file of the normal control sample alignment result, and the capture region file used in sequencing (which may be omitted for whole-genome sequencing).
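The proportional thread allocation described above can be sketched as follows. The source does not give the allocation formula, so this sketch assumes a largest-remainder split in which every tool first receives one thread; the function name and the relative running times are hypothetical.

```python
from typing import Dict


def allocate_threads(runtimes: Dict[str, float], total_threads: int) -> Dict[str, int]:
    """Split `total_threads` across tools in proportion to their running times.

    Every tool gets at least one thread; the spare threads are divided
    proportionally, and any remainder goes to the largest fractional shares.
    """
    if total_threads < len(runtimes):
        raise ValueError("need at least one thread per tool")
    total_time = sum(runtimes.values())
    if total_time <= 0:
        raise ValueError("running times must be positive")

    spare = total_threads - len(runtimes)
    shares = {tool: spare * t / total_time for tool, t in runtimes.items()}
    alloc = {tool: 1 + int(share) for tool, share in shares.items()}

    # Hand out threads lost to rounding, largest fractional part first
    leftover = total_threads - sum(alloc.values())
    by_fraction = sorted(runtimes, key=lambda tool: shares[tool] - int(shares[tool]),
                         reverse=True)
    for tool in by_fraction[:leftover]:
        alloc[tool] += 1
    return alloc


# Hypothetical relative single-thread running times (not measured values)
example = allocate_threads({"MuSE": 2.0, "MuTect2": 6.0, "VarScan2": 2.0}, 13)
print(example)
```

Giving slower tools more threads shortens the wait for the slowest tool, which matters because the integration step cannot start until every tool has finished.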
Mutation identification module: the mutation identification modules are the Bash scripts corresponding to each mutation identification tool. Taking the MuSE tool as an example, the "MuSE SNV identification module" comprises the MuSE standard analysis flow, sequencing depth filtering (the vccffilter.pl script) and mutation frequency plotting (the hist.r script). The standard analysis flow differs from tool to tool, whereas the filtering conditions of the sequencing depth filtering and the plot style of the mutation frequency plotting are kept the same across all tools.
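As an illustration of the mutation frequency plotting step, the sketch below bins mutation frequencies into a histogram. It does not reproduce the hist.r script; it assumes that "mutation frequency" means the fraction of reads at a site that support the mutation, and the function names, bin count and read counts are illustrative.

```python
from typing import Iterable, List


def mutation_frequency(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a site that support the mutation (assumed definition)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def frequency_histogram(frequencies: Iterable[float], n_bins: int = 10) -> List[int]:
    """Count frequencies in `n_bins` equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for freq in frequencies:
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"frequency out of range: {freq}")
        # A frequency of exactly 1.0 belongs to the last bin
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up (mutant reads, depth) pairs for illustration
sites = [(1, 20), (3, 25), (9, 50), (15, 30), (40, 40)]
counts = frequency_histogram(mutation_frequency(alt, depth) for alt, depth in sites)
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```

Using the same bins for every tool mirrors the requirement that the plot style stay consistent across tools, so the per-tool distributions can be compared directly.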
Mutation integration module: the mutation integration module uses integration.pl to integrate the mutations obtained by each tool in the mutation identification modules, applies the set filtering conditions to screen the mutations, and finally outputs a list of reliable somatic mutations in VCF (Variant Call Format) and TSV (Tab-Separated Values) formats.
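The two output formats can be illustrated with a minimal sketch that writes a list of reliable mutations as TSV and as a bare-bones VCF. It does not reproduce integration.pl; the TSV column names, the function names and the example mutations are assumptions, and the VCF carries only the eight fixed columns with no sample or annotation fields.

```python
import csv
import io
from typing import Iterable, Tuple

Mutation = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def to_tsv(mutations: Iterable[Mutation]) -> str:
    """Render mutations as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "pos", "ref", "alt"])  # hypothetical column names
    for mutation in mutations:
        writer.writerow(mutation)
    return buffer.getvalue()


def to_vcf(mutations: Iterable[Mutation]) -> str:
    """Render mutations as a minimal VCF with the eight fixed columns."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for chrom, pos, ref, alt in mutations:
        # "." marks a missing value for ID, QUAL and INFO
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", "PASS", "."]))
    return "\n".join(lines) + "\n"


# Made-up reliable mutations: one SNV and one deletion
reliable_mutations = [("chr1", 100, "A", "G"), ("chr1", 200, "AT", "A")]
print(to_tsv(reliable_mutations))
print(to_vcf(reliable_mutations))
```

Producing both formats from one list keeps the machine-readable VCF and the spreadsheet-friendly TSV consistent with each other.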
The invention runs the SNV identification tool set and the Indel identification tool set at the same time and uses several mutation identification tools in combination, so that false positives and false negatives in mutation identification are reduced together and somatic mutations are identified accurately. The comprehensive strategy also makes it possible to use tools that are still test versions in production, which gives the invention high application value.
The above description covers only specific embodiments of the present invention, and the invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention shall fall within the protection scope of the present invention.