CN110846411B - Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing

Info

Publication number: CN110846411B
Application number: CN201911147268.7A
Authority: CN
Inventors: 赵国栋; 乔宗赟; 陈洁
Original assignee: Shanghai Rendong Medical Laboratory Co ltd
Current assignee: SHANGHAI RENDONG MEDICAL LABORATORY Co.,Ltd.; Suzhou Rendong Bioengineering Co.,Ltd.
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2020-09-18
Anticipated expiration: 2039-11-21
Also published as: CN110846411A

Abstract

The invention relates to a method for distinguishing gene mutation types using a single tumor sample based on next generation sequencing (NGS). Libraries are built from tumor tissue samples and normal tissue samples, and both are sequenced by NGS. The strand bias, the frequencies of the different base types, the base alignment quality and the noise frequency of each mutation site, as stored in the BAM intermediate file produced by the bioinformatics analysis of the tumor tissue sample, serve as the training features for machine learning, and the type of the corresponding mutation site in the paired normal tissue sample serves as the mutation type to be predicted. A classification prediction model for distinguishing somatic mutations from germline mutations is thereby constructed and used to distinguish the two, with high detection efficiency and high specificity. Once the model has been built, NGS sequencing and mutation detection can be performed on a single tumor sample alone, which saves the cost of testing a normal sample and also solves the problem that normal tissue is difficult to obtain from patients with certain tumor types.

Description

Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing

Technical Field

The invention belongs to the technical field of gene detection, and particularly relates to a method for distinguishing gene mutation types of single tumor samples based on next generation sequencing.

Background

High-throughput sequencing, also called next generation sequencing (NGS), is a massively parallel sequencing technology. It can detect sequences covering all genes in a sample and, combined with variant detection software, allows geneticists and oncologists to obtain mutation information for many target regions of the genome in a single run. Different types of gene mutations may occur in each cell of an individual, and by origin they are divided into inherited mutations and somatic mutations. Inherited mutations are passed on from the parents; because they are present in the parents' germ cells they are also called germline mutations, and a germline mutation is carried by every cell of the individual. Somatic mutations can arise at any stage of an individual's development from the fertilized egg onward and are present only in certain cells rather than in every cell. They are not inherited from the parents but are caused by environmental and other factors or by errors in DNA replication during cell division.

Analyzing and classifying the gene mutations in a tumor patient's tumor tissue usually requires a matched normal tissue sample, that is, paired-sample testing, with germline and somatic mutations distinguished by a bioinformatics pipeline such as GATK. In theory, the mutant allele fraction is 100% for a homozygous germline mutation and 50% for a heterozygous germline mutation, and germline mutations could be separated from somatic mutations on this basis. In practice, however, the two DNA strands are amplified with different preference during the experimental workflow, so the fraction of sequenced DNA fragments that carry a germline mutation is at most 100%, or only about 50%, rather than exactly the theoretical value.
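To illustrate why the theoretical fractions alone are a weak discriminator, the following minimal sketch computes a mutant allele fraction from read counts and applies a naive window around 50% and 100%. The function names, the 10% tolerance and the read counts are illustrative assumptions and are not part of the invention.

```python
def variant_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the mutant allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def naive_call(vaf: float, tolerance: float = 0.10) -> str:
    """Naive rule: near 100% or near 50% looks germline, otherwise somatic.

    The tolerance is a hypothetical value chosen for illustration only.
    """
    if vaf >= 1.0 - tolerance:
        return "germline_hom"
    if abs(vaf - 0.5) <= tolerance:
        return "germline_het"
    return "somatic"


# A heterozygous germline site skewed by amplification bias (36% mutant reads)
# falls outside the 50% window, so the naive rule does not call it germline.
print(naive_call(variant_allele_fraction(ref_reads=64, alt_reads=36)))
```

A fixed threshold of this kind cannot absorb strand and amplification bias, which is the motivation for learning the decision boundary from several per-site features instead.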

In the prior art, somatic and germline mutations in a single tumor sample subjected to whole-exome sequencing or targeted sequencing are distinguished mainly by comparing the detected mutations against existing mutation databases and filtering out the germline ones. One such method uses the self-developed VariantDx to detect all mutations, filters out mutations with a population frequency above 1% in the 1000 Genomes Project as well as mutations present in dbSNP (version 138), and classifies mutations found in COSMIC or in kinase domains as somatic. The drawback of this method is that the criteria for calling germline and somatic mutations rely mainly on existing databases and take into account neither the forward/reverse strand bias of the sequenced DNA fragments introduced during the experiment nor the mutation-frequency characteristics of germline mutations themselves. As a result, some germline mutations not recorded in the databases escape filtering and are called somatic, while some somatic mutations are filtered out and called germline because their site and variant amino acid coincide with a dbSNP entry. The specificity of the classification is therefore low, only 67%.

Disclosure of Invention

The invention aims to provide a method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing, in which a classification model is established and used as a classifier for predicting mutation types in a single tumor sample, so that somatic mutations and germline mutations can be distinguished accurately and quickly.

In order to achieve the purpose, the invention is realized by the following technical scheme:

A method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing, comprising the following steps:

S1, extracting DNA from a tumor tissue sample and a normal tissue sample, and building libraries from and sequencing the extracted DNA by a probe capture method;

S2, aligning the DNA sequencing data of the tumor tissue sample and of the normal tissue sample obtained in step S1 using the BWA-MEM algorithm, generating an alignment file tumor.dup.bam for the tumor tissue sample and an alignment file normal.dup.bam for the normal tissue sample;

S3, using the GATK standard pipeline, calling the germline mutations in the normal tissue sample and the mixed mutations in the tumor tissue sample separately, and calling the somatic mutations in the tumor tissue sample by analyzing the normal and tumor tissue samples together;

S4, using the mutation site information of the mixed mutations in the tumor tissue sample obtained in step S3 as the input features of the data set for machine learning, and using the germline mutation sites in the normal tissue sample and the somatic mutation sites in the tumor tissue sample obtained in step S3 as the mutation type results to be predicted;

S5, randomly dividing the data set obtained in step S4 into a training data set and a test data set, training a model on the training data set with SVM or KNN, testing it on the test data set to evaluate its predictive performance, and selecting the best trained model, thereby obtaining a machine learning model capable of distinguishing gene mutation types in a tumor tissue sample;

S6, verifying the machine learning model obtained in step S5 by repeatedly splitting the whole data set obtained in step S4, the resulting classification model being usable as a classifier that predicts mutation types in a new single tumor sample and distinguishes somatic mutations from germline mutations.

Further, in step S3, the HaplotypeCaller tool of the GATK standard pipeline processes the normal tissue sample alignment file alone to call the germline mutations in the normal tissue sample; the Mutect2 tool of the GATK standard pipeline processes the tumor tissue sample alignment file alone to call the mixed mutations in the tumor tissue sample; and the Mutect2 tool processes the normal tissue sample alignment file together with the tumor tissue sample alignment file to call the somatic mutations in the tumor tissue sample.
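The three calls of step S3 can be sketched as command lines. The sketch below only builds the argument lists and does not run anything. It assumes GATK4-style invocation (`gatk HaplotypeCaller`, `gatk Mutect2` with `-R`, `-I`, `-O` and `-normal`); the patent does not state the GATK version or arguments, and the output file names and the normal sample name are placeholders.

```python
from typing import Dict, List


def gatk_commands(reference: str, tumor_bam: str, normal_bam: str,
                  normal_sample: str) -> Dict[str, List[str]]:
    """Build (but do not run) the three GATK calls described in step S3.

    Output file names and the normal sample name are placeholders.
    """
    return {
        # Germline mutations from the normal sample alone.
        "germline": ["gatk", "HaplotypeCaller", "-R", reference,
                     "-I", normal_bam, "-O", "normal.germline.vcf.gz"],
        # Mixed mutations from the tumor sample alone (no normal supplied).
        "mixed": ["gatk", "Mutect2", "-R", reference,
                  "-I", tumor_bam, "-O", "tumor.mixed.vcf.gz"],
        # Somatic mutations from the tumor/normal pair.
        "somatic": ["gatk", "Mutect2", "-R", reference,
                    "-I", tumor_bam, "-I", normal_bam,
                    "-normal", normal_sample,
                    "-O", "tumor.somatic.vcf.gz"],
    }


cmds = gatk_commands("hg19.fasta", "tumor.dup.bam", "normal.dup.bam", "NORMAL")
for name, argv in cmds.items():
    print(name, " ".join(argv))
```

The only difference between the mixed call and the somatic call is whether the normal alignment file is supplied, which is what makes the paired result usable as a label for the tumor-only result.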

Further, in step S4, the mutation site information is derived from the mutation sites in the tumor tissue sample alignment file: the bam-readcount software is used to calculate, for each mutation site, the forward-strand and reverse-strand base counts of the reference base, of the mutant allele and of the other (noise) bases, together with the corresponding average base quality and average alignment quality.

Further, the data set built in step S4 for training and testing comprises the A/T/C/G base frequencies, average base quality values and average alignment quality values of the somatic mutation sites and germline mutation sites identified in step S3.

Further, in step S6 the whole data set is split 10 to 20 times.

The invention has the following beneficial effects:

1. The invention builds libraries from tumor tissue samples and normal tissue samples and sequences both by NGS. The strand bias, the frequencies of the different base types, the base alignment quality and the noise frequency of each mutation site, as stored in the BAM intermediate file produced by the bioinformatics analysis of the tumor tissue sample, serve as the training features for machine learning, and the mutation information of the paired normal tissue sample serves as the mutation type to be predicted. A classification prediction model for distinguishing somatic mutations from germline mutations is constructed and used to distinguish the two, so that detection efficiency and specificity are high. Once the model has been built, NGS sequencing and mutation detection can be performed on a single tumor sample alone, which saves the cost of testing a normal sample and also solves the problem that normal tissue is difficult to obtain from patients with certain tumor types.

2. The invention overcomes the prior-art drawback that the criteria for distinguishing germline from somatic mutations depend on existing databases. Because the prior art ignores the forward/reverse strand bias of sequenced DNA fragments introduced by the experimental method and the mutation-frequency characteristics of germline mutations themselves, germline mutations absent from the databases are miscalled as somatic, and some somatic mutations are miscalled as germline because their site and variant amino acid coincide with a dbSNP entry. The invention avoids these errors by using the prediction classification model as a classifier for mutation types in a single tumor sample, and can accurately and rapidly distinguish somatic mutations from germline mutations.

3. By analyzing the targeted NGS sequencing results of 189 tumor tissues, the invention can better identify the germline and somatic mutation types among the mutation sites. The genes targeted by the sequencing are genes related to targeted therapy and chemotherapy drugs for tumors. After model training with the SVM algorithm, the sensitivity for classifying germline mutations reaches 91.5% and the specificity reaches 85.32%.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.

Referring to fig. 1, a method for distinguishing gene mutation types of individual tumor samples based on next generation sequencing specifically comprises the following steps:

S1, 189 target regions are selected; DNA is extracted from the tumor tissue sample and the normal tissue sample, and libraries covering the 189 target regions are built from the extracted DNA and sequenced by a probe capture method, yielding the raw sequencing data of the 189 target regions;

S3, the HaplotypeCaller tool of the GATK standard pipeline processes the normal tissue sample alignment file alone to call the germline mutations in the normal tissue sample; the Mutect2 tool of the GATK standard pipeline processes the normal tissue sample alignment file together with the tumor tissue sample alignment file to call the somatic mutations in the tumor tissue sample; and the Mutect2 tool processes the tumor tissue sample alignment file alone (without the normal tissue sample alignment file) to produce a VCF file containing both somatic mutations and potential germline mutations, all mutations in this VCF file being the mixed mutations;

S4, mutation site information is extracted from the mixed mutations in the tumor tissue sample obtained in step S3 to form the training data set required for machine learning. According to the mutation sites in the tumor tissue sample alignment file, the bam-readcount software is used to calculate the forward-strand and reverse-strand base counts of the reference base, of the mutant allele and of the other (noise) bases, together with the average base quality and average alignment quality at each mutation site. bam-readcount is run as follows:

bam-readcount -f hg19.fasta ${sample_bam} -l ${snpf} -i > ${outdir}/${sample}_snps.txt 2>/dev/null

where ${sample_bam} is the deduplicated alignment file of the tumor tissue sample, and ${snpf} is the name of a file listing the chromosome and base coordinates of each detected mutation site;
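As a hedged illustration of how the bam-readcount output might be turned into per-site features, the sketch below parses one output line and derives strand counts, mutant allele frequency and mean qualities. It assumes the tab-separated layout described in the bam-readcount documentation (chromosome, position, reference base, depth, then one colon-separated entry per base whose first seven fields are base, count, mean mapping quality, mean base quality, mean single-ended mapping quality, plus-strand count and minus-strand count); this should be checked against the installed version. The feature order, the function names and the example line (including its coordinates) are made up for illustration.

```python
from typing import Dict, List


def parse_readcount_line(line: str) -> Dict[str, object]:
    """Parse one bam-readcount line into per-base strand counts and qualities."""
    chrom, pos, ref, depth, *per_base = line.rstrip("\n").split("\t")
    bases = {}
    for entry in per_base:
        f = entry.split(":")
        if f[0] not in ("A", "C", "G", "T"):
            continue  # skip the "=" placeholder, N, and indel entries
        bases[f[0]] = {"count": int(f[1]), "mean_mapq": float(f[2]),
                       "mean_baseq": float(f[3]),
                       "plus": int(f[5]), "minus": int(f[6])}
    return {"chrom": chrom, "pos": int(pos), "ref": ref.upper(),
            "depth": int(depth), "bases": bases}


def site_features(site: Dict[str, object], alt: str) -> List[float]:
    """Hypothetical feature vector for one mutation site and mutant allele."""
    empty = {"count": 0, "mean_mapq": 0.0, "mean_baseq": 0.0, "plus": 0, "minus": 0}
    b = site["bases"]
    ref_c, alt_c = b.get(site["ref"], empty), b.get(alt, empty)
    noise = [b.get(x, empty) for x in "ACGT" if x not in (site["ref"], alt)]
    depth = max(site["depth"], 1)
    return [ref_c["plus"], ref_c["minus"],            # reference base per strand
            alt_c["plus"], alt_c["minus"],            # mutant allele per strand
            sum(n["plus"] for n in noise),            # noise bases, plus strand
            sum(n["minus"] for n in noise),           # noise bases, minus strand
            alt_c["count"] / depth,                   # mutant allele frequency
            alt_c["mean_baseq"], alt_c["mean_mapq"]]  # mean qualities


# Made-up example line; TAIL stands in for the remaining per-base columns.
TAIL = ":0.50:0.01:30.00:0:0.00:100.00:0.50"
example = "\t".join([
    "chr7", "55259515", "T", "100",
    "=:0:0.00:0.00:0.00:0:0" + TAIL,
    "A:1:60.00:30.00:0.00:1:0" + TAIL,
    "C:0:0.00:0.00:0.00:0:0" + TAIL,
    "G:39:60.00:35.00:0.00:20:19" + TAIL,
    "T:60:60.00:36.00:0.00:31:29" + TAIL,
    "N:0:0.00:0.00:0.00:0:0" + TAIL,
])
site = parse_readcount_line(example)
print(site_features(site, alt="G"))
```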

The mutation sites in the mixed mutations are extracted and labeled as homozygous germline, heterozygous germline or somatic according to the paired sample, that is, according to the germline mutation sites in the normal tissue sample and the somatic mutation sites in the tumor tissue sample. This yields 13130 mutation sites, whose A/T/C/G base frequencies, average base quality values and average alignment quality values form the input features of the data set;
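The labeling step can be sketched as a lookup of each mixed-mutation site in the two paired call sets. The data structures, the label strings, the example sites and the rule that a site found in neither call set is dropped (and that a somatic call takes precedence over a germline one) are assumptions made for this illustration; the patent does not specify them.

```python
from typing import Dict, Iterable, Set, Tuple

Site = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def label_mixed_sites(mixed: Iterable[Site],
                      germline: Dict[Site, str],
                      somatic: Set[Site]) -> Dict[Site, str]:
    """Assign a mutation type to each tumor-only (mixed) site.

    germline maps a site to "hom" or "het" as genotyped in the normal sample;
    somatic is the set of sites from the paired tumor/normal call.
    Sites found in neither call set are left unlabeled and dropped.
    """
    labels = {}
    for site in mixed:
        if site in somatic:
            labels[site] = "somatic"
        elif site in germline:
            labels[site] = "germline_" + germline[site]
    return labels


# Made-up sites for illustration.
mixed_sites = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"),
               ("chr3", 300, "G", "A"), ("chr4", 400, "T", "C")]
germline_calls = {("chr1", 100, "A", "G"): "het", ("chr2", 200, "C", "T"): "hom"}
somatic_calls = {("chr3", 300, "G", "A")}
labels = label_mixed_sites(mixed_sites, germline_calls, somatic_calls)
for site, label in labels.items():
    print(site, label)
```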

S5, the feature values and corresponding mutation types of the 13130 mutation sites are assembled into the whole data set for machine learning. 3/4 of the data are randomly selected as the training data set and the remainder serve as the test data set. Training is performed with the kNN and SVM algorithms of sklearn (scikit-learn), the open-source Python machine learning toolkit, yielding a test model capable of distinguishing gene mutation types in a tumor tissue sample. The code blocks used for training are as follows:

using the KNN algorithm:

using the SVM algorithm:
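As a hedged illustration of the training step, and not the original listings, the sketch below trains a kNN and an SVM classifier with scikit-learn on a 3/4 to 1/4 random split. The synthetic two-feature data, the standardization step and all hyperparameters (`n_neighbors=5`, RBF kernel, `C=1.0`) are assumptions chosen for the example; the patent does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix: two features per site
# (mutant allele frequency, mean base quality); 1 = germline, 0 = somatic.
rng = np.random.default_rng(0)
germline = np.column_stack([rng.normal(0.50, 0.05, 300), rng.normal(35, 2, 300)])
somatic = np.column_stack([rng.normal(0.15, 0.05, 300), rng.normal(33, 2, 300)])
X = np.vstack([germline, somatic])
y = np.array([1] * 300 + [0] * 300)

# 3/4 of the sites for training, the remaining 1/4 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out quarter
    print(name, round(scores[name], 3))
```

Standardizing the features first is a design choice of this sketch: both kNN and RBF-kernel SVM are distance based, so read counts and quality values on different scales would otherwise dominate one another.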

S6, the test model obtained in step S5 is verified by repeated random splitting (preferably 10 to 20 times) of the whole data set obtained in step S5, finally giving a classification model that can be used as a classifier to predict mutation types in a new single tumor sample and distinguish somatic mutations from germline mutations. The numbers of sites predicted by the classification models trained with the two algorithms are as follows:
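The repeated random splitting of step S6 can be sketched with the standard library alone. The function name, the fixed seed and the 1/4 test fraction (taken over from the 3/4 training share of step S5) are assumptions of this sketch; scikit-learn's `ShuffleSplit` offers comparable behavior.

```python
import random
from typing import Iterator, List, Tuple


def repeated_random_splits(n_sites: int, n_repeats: int = 10,
                           test_fraction: float = 0.25,
                           seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each independent random split."""
    rng = random.Random(seed)
    n_test = int(round(n_sites * test_fraction))
    for _ in range(n_repeats):
        order = list(range(n_sites))
        rng.shuffle(order)
        yield order[n_test:], order[:n_test]


splits = list(repeated_random_splits(n_sites=13130, n_repeats=10))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```

Each split would be used to retrain and re-evaluate the model, so that the reported performance does not depend on one particular partition of the sites.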

Method	TP	TN	FP	FN(het)	FN(hom)	Sensitivity	Specificity
SVM	8614	3174	546	706	90	91.5%	85.32%
kNN	8385	3180	540	858	167	89.1%	85.48%

The table shows that the SVM gives the higher prediction accuracy: 8614 of the 9410 germline mutations and 3174 of the 3720 somatic mutations are predicted correctly, for a sensitivity of 91.5% and a specificity of 85.32%.
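The sensitivity and specificity follow directly from the counts in the table. The sketch below shows the arithmetic, treating germline as the positive class and adding the heterozygous and homozygous false negatives together; that reading of the FN(het) and FN(hom) columns is an assumption, although it reproduces the totals of 9410 germline and 3720 somatic sites.

```python
def sensitivity_specificity(tp: int, tn: int, fp: int,
                            fn_het: int, fn_hom: int) -> tuple:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    fn = fn_het + fn_hom
    return tp / (tp + fn), tn / (tn + fp)


# Counts copied from the table: (TP, TN, FP, FN(het), FN(hom)).
counts = {"SVM": (8614, 3174, 546, 706, 90),
          "kNN": (8385, 3180, 540, 858, 167)}
results = {}
for name, c in counts.items():
    results[name] = sensitivity_specificity(*c)
    sens, spec = results[name]
    print(f"{name}: sensitivity={sens:.2%} specificity={spec:.2%}")
```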

The above is only an embodiment of the present invention and does not limit its scope. Any equivalent structure or equivalent process derived from this specification, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing, characterized by comprising the following steps:

S2, aligning the DNA sequencing data of the tumor tissue sample and of the normal tissue sample obtained in step S1 using the BWA-MEM algorithm, generating an alignment file Tumor.dup.bam for the tumor tissue sample and an alignment file Normal.dup.bam for the normal tissue sample;

S4, using the mutation site information of the mixed mutations in the tumor tissue sample obtained in step S3 as the input features of the data set for machine learning, wherein, according to the mutation sites in the tumor tissue sample alignment file, bam-readcount software is used to calculate the forward-strand and reverse-strand base counts of the reference base, of the mutant allele and of the other noise bases, together with the corresponding average base quality and average alignment quality of each mutation site; and using the germline mutation sites in the normal tissue sample and the somatic mutation sites in the tumor tissue sample obtained in step S3 as the mutation type results to be predicted;

S6, verifying the machine learning model obtained in step S5 by repeatedly splitting the whole data set obtained in step S4, the resulting classification model being usable as a classifier that predicts mutation types in a new single tumor sample and distinguishes somatic mutations from germline mutations;

the method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing being used for purposes other than disease diagnosis.

2. The method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing according to claim 1, characterized in that: in step S3, the HaplotypeCaller tool of the GATK standard pipeline processes the normal tissue sample alignment file alone to call the germline mutations in the normal tissue sample; the Mutect2 tool of the GATK standard pipeline processes the tumor tissue sample alignment file alone to call the mixed mutations in the tumor tissue sample; and the Mutect2 tool processes the normal tissue sample alignment file together with the tumor tissue sample alignment file to call the somatic mutations in the tumor tissue sample.

3. The method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing according to claim 1, characterized in that: the data set used for training and testing in step S5 comprises the A/T/C/G base frequencies, average base quality values and average alignment quality values of the somatic mutation sites and germline mutation sites identified in step S3.

4. The method for distinguishing gene mutation types of a single tumor sample based on next generation sequencing according to claim 3, characterized in that: in step S6, the whole data set is split 10 to 20 times.