CN108021789B - Comprehensive strategy for identifying somatic mutation

Publication number: CN108021789B
Authority: CN (China)
Prior art keywords: snv, identification, mutation, indel, reliable
Legal status: Active
Application number: CN201711355434.3A
Other languages: Chinese (zh)
Other versions: CN108021789A (en)
Inventor: 张仕坚, 严建龙
Current Assignee: Predatum Biomedicine Suzhou Co ltd
Original Assignee: Predatum Biomedicine Suzhou Co ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention discloses a comprehensive strategy for identifying somatic mutations. Taking the alignment result as the base data, an SNV identification tool set and an Indel identification tool set are run on the base data at the same time. All SNV data produced by the SNV identification tool set are compared against set conditions and filtered, and the reliable SNV data that meet the conditions are extracted; all Indel data produced by the Indel identification tool set are compared against the set conditions and filtered in the same way, and the reliable Indel data are extracted. The reliable SNV data and the reliable Indel data are then combined to give a reliable somatic mutation result. By using several mutation identification tools together, the invention reduces false positives and false negatives at the same time and so identifies somatic mutations accurately. The strategy has been implemented as a tool intended for production use and therefore has high application value.

Description

Comprehensive strategy for identifying somatic mutation
Technical Field
The invention relates to a method for identifying somatic mutations, in particular to a comprehensive strategy for identifying somatic mutations.
Background
The "somatic mutation" referred to herein is a mutation that occurs in tumor tissue but not in the normal control, and includes SNVs (Single Nucleotide Variations) and Indels (Insertions and Deletions).
Numerous tools currently exist for somatic mutation identification, including MuTect2, VarScan2, RADIA, SomaticSniper, Strelka and MuSE for SNV identification, and VarScan2, MuTect2, Strelka, Pindel and Indelocator for Indel identification. Each has its own strengths and weaknesses, and none gives the best possible mutation results when used alone. The strengths and weaknesses of a single tool are illustrated below with two of the more widely used programs.
MuTect2 is part of GATK (Genome Analysis Toolkit), the best-known software suite in the field of mutation identification. Its advantage is that it uses a Bayesian classifier with known mutations as background knowledge, and statistically tests both the likelihood that a mutation in the tumor tissue is somatic and the likelihood that the corresponding mutation is absent from the normal tissue, so its somatic calls are more accurate. In addition, it locally reassembles the sequence around each mutation site, which allows Indels and SNVs near Indels to be identified more accurately. Its disadvantage is that it has not yet seen extensive practical use, so false negatives or false positives may occur. The software is also still a beta (test) version, may contain bugs, and is not recommended for use in a production environment.
VarScan2 is a commonly used mutation identification program. Its advantages are fast running speed and relatively accurate mutation identification. Its disadvantage is that it uses Fisher's Exact Test to test statistically whether a mutation is somatic; this is a simple, purely mathematical method that, unlike MuTect2, does not draw on prior knowledge. In addition, it sometimes misses mutations for no apparent reason.
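To make the statistical step concrete, the following is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table of reference and variant read counts in the tumor and the normal control. It is not VarScan2's code: the function name, the one-sided alternative and the example read counts are illustrative assumptions.

```python
from math import comb


def fisher_one_sided(tumor_ref, tumor_var, normal_ref, normal_var):
    """One-sided Fisher's exact test p-value.

    Probability, under the null hypothesis that variant reads are spread
    evenly between tumor and normal (hypergeometric model), of seeing at
    least `tumor_var` variant reads in the tumor sample.
    """
    total_var = tumor_var + normal_var
    tumor_total = tumor_ref + tumor_var
    normal_total = normal_ref + normal_var
    total = tumor_total + normal_total

    # Number of ways to choose which reads carry the variant.
    denominator = comb(total, total_var)
    # The tumor cannot hold more variant reads than it has reads.
    upper = min(total_var, tumor_total)

    numerator = 0
    for k in range(tumor_var, upper + 1):
        # k variant reads in the tumor, the rest in the normal sample.
        numerator += comb(tumor_total, k) * comb(normal_total, total_var - k)
    return numerator / denominator


# Hypothetical counts: 10 of 20 tumor reads carry the variant, 0 of 20 normal reads do.
p_value = fisher_one_sided(tumor_ref=10, tumor_var=10, normal_ref=20, normal_var=0)
print(f"p = {p_value:.6f}")
```

A small p-value only says that the variant reads are unlikely to be evenly shared between the two samples; as the text notes, the test uses no prior knowledge about known mutation sites.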
All of these single tools suffer from false positives and false negatives when identifying somatic mutations. The aim, for "tumor-normal control" paired samples, is to control the false positives and false negatives that single tools in the field produce, so that specificity and sensitivity improve together, somatic mutations are identified accurately, and the resulting positions (chromosome) and mutation forms (wild type and mutant type) are as close as possible to the true somatic mutations of the tumor. No established method or tool, published or otherwise, directly provides this function.
Disclosure of Invention
The invention aims to solve the problems in the prior art by providing a comprehensive strategy for identifying somatic mutations that controls false negatives and false positives at the same time and, by addressing both, identifies somatic mutations accurately.
To achieve this purpose, the invention adopts the following technical means: a comprehensive strategy for identifying somatic mutations in which the alignment result is used as the base data; an SNV identification tool set and an Indel identification tool set carry out SNV and Indel identification on the base data at the same time; all SNV data obtained by the SNV identification tool set are compared against set conditions and filtered, and the reliable SNV data that meet the conditions are extracted; all Indel data obtained by the Indel identification tool set are compared against the set conditions and filtered, and the reliable Indel data that meet the conditions are extracted; and the reliable SNV data and the reliable Indel data are combined to obtain a reliable somatic mutation result.
Further, the alignment result is stored in BAM (Binary Alignment/Map) format.
Further, the alignment result is produced by the BWA alignment software and is then realigned using the Re-align module of GATK.
Further, the SNV identification tool set consists of five somatic mutation identification tools: MuSE, MuTect2, RADIA, SomaticSniper and VarScan2.
Further, the Indel identification tool set consists of three somatic mutation identification tools: Pindel, Strelka and VarScan2.
Further, the comparison filtering means: an SNV must be identified by at least two of the five tools; an Indel must be identified by at least two of the three tools; and, for both types of mutation, the sequencing coverage depth must be no less than 8 in the normal sample and no less than 10 in the tumor sample.
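The filtering rule can be sketched as follows. This is a minimal illustration, not the patent's integration script: the data layout (one set of `(chrom, pos, ref, alt)` keys per tool, and a separate depth lookup) and all names are assumptions made for the example. Only the thresholds (two tools, depth 8 in normal, depth 10 in tumor) come from the text.

```python
from collections import defaultdict

MIN_TOOLS = 2          # a call must be reported by at least two tools
MIN_NORMAL_DEPTH = 8   # minimum coverage depth in the normal sample
MIN_TUMOR_DEPTH = 10   # minimum coverage depth in the tumor sample


def consensus_filter(calls_by_tool, depths):
    """Return the calls supported by enough tools and enough coverage.

    calls_by_tool: {tool_name: set of (chrom, pos, ref, alt)}  (assumed layout)
    depths:        {(chrom, pos, ref, alt): (normal_depth, tumor_depth)}
    """
    # Record which tools reported each call.
    support = defaultdict(set)
    for tool, calls in calls_by_tool.items():
        for key in calls:
            support[key].add(tool)

    reliable = []
    for key, tools in support.items():
        # A call with no depth information is treated as uncovered.
        normal_depth, tumor_depth = depths.get(key, (0, 0))
        if (len(tools) >= MIN_TOOLS
                and normal_depth >= MIN_NORMAL_DEPTH
                and tumor_depth >= MIN_TUMOR_DEPTH):
            reliable.append(key)
    return sorted(reliable)
```

The same function serves both mutation types: it is called once with the five SNV tools and once with the three Indel tools, since the "at least two" rule and the depth thresholds are shared.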
Further, the combined reliable somatic mutation result means: the SNVs and Indels that pass the comparison filtering are merged to give the final reliable mutations.
The beneficial effects of the invention are as follows: the SNV identification tool set and the Indel identification tool set run at the same time, and several mutation identification tools are used together, so false positives and false negatives are reduced simultaneously during mutation identification and somatic mutations are identified accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a flow chart of the somatic mutation identification of the present invention.
Detailed Description
As shown in FIG. 1, in the comprehensive strategy for identifying somatic mutations, the alignment result is used as the base data; an SNV identification tool set and an Indel identification tool set identify SNVs and Indels in the base data at the same time; all SNV data obtained by the SNV identification tool set are compared against set conditions and filtered, and the reliable SNV data that meet the conditions are taken; all Indel data obtained by the Indel identification tool set are compared against the set conditions and filtered, and the reliable Indel data that meet the conditions are taken; and the reliable SNV data are combined with the reliable Indel data to obtain a reliable somatic mutation result.
The alignment result is stored in BAM (Binary Alignment/Map) format.
The alignment result is produced by the BWA alignment software and is then realigned using the Re-align module of GATK.
The SNV identification tool set consists of five somatic mutation identification tools: MuSE, MuTect2, RADIA, SomaticSniper and VarScan2.
The Indel identification tool set consists of three somatic mutation identification tools: Pindel, Strelka and VarScan2.
The comparison filtering is: an SNV must be identified by at least two of the five tools; an Indel must be identified by at least two of the three tools; and, for both types of mutation, the sequencing coverage depth must be no less than 8 in the normal sample and no less than 10 in the tumor sample.
The combined reliable somatic mutation result means: the SNVs and Indels that pass the comparison filtering are merged to give the final reliable mutations.
Application Example
The method was developed as a pipeline in the Bash (Bourne-Again Shell) language, and the resulting tool consists of the following modules:
Main control module: a Bash script whose main function is to call each mutation identification module and to integrate the mutation results. When the modules are called, computing resources (threads) are allocated in proportion to the running time of each mutation identification tool, so that the tools finish at as close to the same time as possible (downstream steps cannot continue until every tool has finished). The parameters the program accepts are: output directory, total number of threads, and whether the data are whole genome sequencing. The input files the program requires are: reference genome sequence, dbSNP site file, COSMIC site file, GATK toolkit file path, VarScan2 toolkit file path, and sample information file. The sample information file is a table with 5 columns: tumor sample name, tumor sample alignment BAM file, normal control sample name, normal control sample alignment BAM file, and the capture region file used in sequencing (which may be omitted for whole genome sequencing).
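The proportional thread allocation could look like the sketch below. The patent's tool is a Bash script whose exact rule is not given, so this is only one plausible reading: every tool gets at least one thread and the remaining threads are shared out by relative running time using largest-remainder rounding. The function name and the example runtimes are hypothetical, and the sketch assumes positive runtimes that shrink roughly in proportion to the threads assigned.

```python
def allocate_threads(runtimes, total_threads):
    """Split `total_threads` across tools in proportion to their runtimes.

    runtimes: {tool_name: estimated single-thread runtime}, all positive
              (hypothetical estimates supplied by the user).
    Returns {tool_name: thread_count}; every tool gets at least one thread.
    """
    if total_threads < len(runtimes):
        raise ValueError("need at least one thread per tool")

    total_runtime = sum(runtimes.values())
    allocation = {tool: 1 for tool in runtimes}
    spare = total_threads - len(runtimes)

    # Ideal (fractional) share of the spare threads for each tool.
    shares = {tool: spare * rt / total_runtime for tool, rt in runtimes.items()}
    for tool, share in shares.items():
        allocation[tool] += int(share)

    # Hand out threads lost to rounding, largest fractional part first.
    leftover = total_threads - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda t: shares[t] - int(shares[t]), reverse=True)
    for tool in by_remainder[:leftover]:
        allocation[tool] += 1
    return allocation


# Hypothetical relative runtimes for three of the tools.
print(allocate_threads({"MuTect2": 6.0, "MuSE": 3.0, "VarScan2": 1.0}, total_threads=13))
```

Giving slower tools more threads is what lets all tools finish at about the same time, which matters because the integration step waits for the slowest one.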
Mutation identification module: a Bash script for each mutation identification tool. Taking the MuSE tool as an example, the "MuSE SNV identification module" comprises the standard MuSE analysis workflow, sequencing depth filtering (the vccffilter.pl script) and mutation frequency plotting (the hist.r script). The standard analysis workflow differs from tool to tool, while the filtering conditions for sequencing depth filtering and the plot style for mutation frequency plotting are kept the same across all tools.
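As an illustration of the mutation frequency step, the sketch below bins calls by variant allele frequency, assuming that "mutation frequency" means the fraction of reads at a site that carry the mutant allele. It does not reproduce the hist.r script, whose contents are not given; the function name, the input layout and the bin count are assumptions.

```python
def vaf_histogram(records, bins=10):
    """Count calls per equal-width variant allele frequency bin on [0, 1].

    records: iterable of (ref_reads, alt_reads) pairs in the tumor sample
             (assumed input layout).
    """
    counts = [0] * bins
    for ref_reads, alt_reads in records:
        depth = ref_reads + alt_reads
        if depth == 0:
            continue  # no coverage, frequency undefined
        vaf = alt_reads / depth
        # A frequency of exactly 1.0 belongs in the last bin.
        index = min(int(vaf * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical read counts for four sites, shown as a text histogram.
for i, n in enumerate(vaf_histogram([(90, 10), (50, 50), (0, 20), (0, 0)], bins=4)):
    print(f"bin {i}: {'#' * n}")
```

Using the same bins for every tool mirrors the text's requirement that the plot style be kept consistent, so the frequency distributions from different tools can be compared directly.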
Mutation integration module: uses integration.pl to integrate the mutations obtained by each tool in the mutation identification modules, applies the filtering conditions for mutation screening, and finally produces a list of reliable somatic mutations in VCF (Variant Call Format) and TSV (Tab-Separated Values) formats.
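The output step can be sketched as below. This is not integration.pl, which is a Perl script whose contents are not given; it is a minimal Python illustration of writing the same call list as TSV and as a bare-bones VCF. The record layout, the TSV column names and the `TOOLS` INFO key are all assumptions made for the example.

```python
import csv
import io


def write_tsv(calls, handle):
    """Write calls as tab-separated values with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "pos", "ref", "alt", "type", "tools"])
    for call in calls:
        writer.writerow([call["chrom"], call["pos"], call["ref"], call["alt"],
                         call["type"], ",".join(call["tools"])])


def write_vcf(calls, handle):
    """Write calls as a minimal VCF; TOOLS is a hypothetical INFO key."""
    handle.write("##fileformat=VCFv4.2\n")
    handle.write('##INFO=<ID=TOOLS,Number=.,Type=String,'
                 'Description="Callers supporting the variant">\n')
    handle.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    for call in calls:
        info = "TOOLS=" + ",".join(call["tools"])
        fields = [call["chrom"], str(call["pos"]), ".", call["ref"],
                  call["alt"], ".", "PASS", info]
        handle.write("\t".join(fields) + "\n")


# Hypothetical reliable calls: one SNV and one Indel that passed filtering.
reliable_calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T",
     "type": "SNV", "tools": ["MuSE", "MuTect2"]},
    {"chrom": "chr1", "pos": 250, "ref": "AT", "alt": "A",
     "type": "Indel", "tools": ["Pindel", "Strelka"]},
]
buffer = io.StringIO()
write_vcf(reliable_calls, buffer)
print(buffer.getvalue(), end="")
```

Keeping the list of supporting tools with each record makes it possible to check afterwards that every reported mutation met the "at least two tools" condition.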
The invention runs the SNV identification tool set and the Indel identification tool set at the same time and uses several mutation identification tools together, so that false positives and false negatives are reduced simultaneously during mutation identification and somatic mutations are identified accurately. The comprehensive strategy has been implemented as a tool intended for production use and therefore has high application value.
The above describes only specific embodiments of the invention, and the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the invention shall fall within its scope of protection.

Claims (1)

1. A method of identifying somatic mutations, comprising: taking the alignment result as base data; simultaneously carrying out SNV and Indel identification on the base data using an SNV identification tool set and an Indel identification tool set; comparing all SNV data obtained by the SNV identification tool set against set conditions and filtering, and taking the reliable SNV data that meet the conditions; comparing all Indel data obtained by the Indel identification tool set against the set conditions and filtering, and taking the reliable Indel data that meet the conditions; and combining the reliable SNV data and the reliable Indel data to obtain a reliable somatic mutation result; wherein the SNV identification tool set consists of five somatic mutation identification tools, MuSE, MuTect2, RADIA, SomaticSniper and VarScan2, and the Indel identification tool set consists of three somatic mutation identification tools, Pindel, Strelka and VarScan2; when the bash scripts corresponding to the mutation identification tools are called, computing resources or threads are allocated in proportion to the running time of each mutation identification tool, so that the somatic mutation identification tools finish running at close to the same time; the mutation results are integrated, the mutations obtained by each tool being integrated using integration.pl, and filtering conditions are set for mutation screening; the comparison filtering is: for SNVs, the SNV is identified by at least two of the five tools; for Indels, the Indel is identified by at least two of the three tools; for both types of mutation, the sequencing coverage depth is no less than 8 in the normal sample and no less than 10 in the tumor sample; a reliable somatic mutation list is finally obtained in VCF and TSV formats; the reliable somatic mutation result obtained by the combination means that the SNVs and Indels obtained by comparison filtering are combined to give the final reliable mutations; the alignment result is stored in BAM format; and the alignment result is produced by the BWA alignment software and realigned using the Re-align module of GATK.
CN201711355434.3A 2017-12-16 2017-12-16 Comprehensive strategy for identifying somatic mutation Active CN108021789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711355434.3A CN108021789B (en) 2017-12-16 2017-12-16 Comprehensive strategy for identifying somatic mutation

Publications (2)

Publication Number Publication Date
CN108021789A CN108021789A (en) 2018-05-11
CN108021789B true CN108021789B (en) 2022-06-07

Family

ID=62074080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711355434.3A Active CN108021789B (en) 2017-12-16 2017-12-16 Comprehensive strategy for identifying somatic mutation

Country Status (1)

Country Link
CN (1) CN108021789B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109337957A (en) * 2018-12-25 2019-02-15 江苏医联生物科技有限公司 The method for detecting genome multimutation type
CN109698011B (en) * 2018-12-25 2020-10-23 人和未来生物科技(长沙)有限公司 Indel region correction method and system based on short sequence comparison
CN110846411B (en) * 2019-11-21 2020-09-18 上海仁东医学检验所有限公司 Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
CN111793678A (en) * 2020-07-30 2020-10-20 臻悦生物科技江苏有限公司 Method and kit for detecting homologous recombination pathway gene mutation based on next-generation sequencing technology

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106282320A (en) * 2015-05-20 2017-01-04 广州华大基因医学检验所有限公司 The method and apparatus of detection bodies cell mutation
CN106611106A (en) * 2016-12-06 2017-05-03 北京荣之联科技股份有限公司 Gene variation detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090220507A1 (en) * 2005-07-22 2009-09-03 Sucharov Carmen C Inhibition of extracellular signal-regulated kinase 1/2 as a treatment for cardiac hypertrophy and heart failure

Also Published As

Publication number Publication date
CN108021789A (en) 2018-05-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20190425
    Address after: 215129 No. 66 Wangmi Street, Suzhou High-tech Zone, Jiangsu Province
    Applicant after: Predatum Biomedicine (Suzhou) Co.,Ltd.
    Address before: Room 2166, 2nd floor, 23 Building, 72 Sanjie, Qinghe, Haidian District, Beijing
    Applicant before: PRECISION SCIENTIFIC TECHNOLOGY (BEIJING) CO.,LTD.
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
    Denomination of invention: A Comprehensive Strategy for Identifying Somatic Mutations
    Effective date of registration: 20231030
    Granted publication date: 20220607
    Pledgee: Zhongguancun Technology Leasing Co.,Ltd.
    Pledgor: Predatum Biomedicine (Suzhou) Co.,Ltd.
    Registration number: Y2023980063358