CN115171781B

CN115171781B - Method, system, device and medium for identifying whether tumor mutation sites are noise

Info

Publication number: CN115171781B
Application number: CN202210822496.5A
Authority: CN
Inventors: 胡天亮; 欧小华; 韩尔康; 欧阳聪; 陈思彪; 王维世; 袁剑颖; 陈嘉昌; 柳俊; 胡朝晖
Original assignee: Guangzhou Jinqirui Biotechnology Co ltd
Current assignee: Guangzhou Jinqirui Biotechnology Co ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2023-04-07
Anticipated expiration: 2042-07-13
Also published as: CN115171781A

Abstract

The invention discloses a method, a system, a device and a medium for identifying whether a tumor mutation site is noise, which can be applied to the technical field of clinical data processing. The method first obtains the target features of a training data set and sets an initial trust degree weight and a definition domain for the target features. It then constructs an integrated linear learning model according to the initial trust degree weight and the definition domain, and optimizes the prediction capability of the integrated linear learning model through the training data set and a verification data set. After the number of iterations of the integrated linear learning model is determined to be greater than a preset number of iterations, the optimized integrated linear learning model is tested according to a test data set, and the tested integrated linear learning model is used to identify whether a tumor mutation site in the next-generation sequencing data to be processed is noise. In this way, the gray-zone problem of feature values can be solved and the identification accuracy of mutation sites is improved; meanwhile, the dependence on manual review can be effectively reduced, thereby improving the efficiency and accuracy of report review.

Description

Method, system, device and medium for identifying whether tumor mutation sites are noise

Technical Field

The invention relates to the technical field of clinical data processing, in particular to a method, a system, a device and a medium for identifying whether a tumor variation site is noise.

Background

The core of next-generation sequencing of tumors lies in the accurate detection of mutation information and the interpretation of the report. The way a sample is stored and transported, the experimental kit and experimental process used, the sequencing process and other factors all introduce noise signals into the search for mutation signals. These noise signals are very similar to mutation signals, so the conventional approach of quality control combined with allele mutation frequency cannot accurately identify and filter them. To ensure the accuracy of the report, reports are currently reviewed manually. Manual review has to examine mutation sites one by one, which requires a great deal of manpower and effort, is slow, and is easily influenced by personal subjectivity, reducing the accuracy of the report.

Disclosure of Invention

The purpose of the invention is: at least to some extent solve some technical problems existing in the prior filtering technology.

In view of the above, the present invention provides a method, a system, a device and a medium for identifying whether a tumor mutation site is noise, which can effectively improve the accuracy of identifying the mutation site and the efficiency and accuracy of reporting the audit result.

In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:

in a first aspect, an embodiment of the present invention provides a method for identifying whether a tumor mutation site is noise, including the following steps:

acquiring a preset data set and a corresponding preset result, wherein the preset data set comprises a training data set, a verification data set and a test data set;

screening target characteristics with the largest influence in the training data set, wherein the target characteristics comprise effective depth, mutation frequency, number of mutant reads, mutation quality, position of the mutation site on the genome, strand bias, and the ratio of soft-clip in the region where the mutation occurs;

determining an initial trust weight and a domain of definition of the target feature;

constructing an integrated linear learning model according to the initial trust degree weight and the definition domain;

optimizing the predictive power of the integrated linear learning model through the training dataset and the validation dataset;

determining that the iteration times of the integrated linear learning model are larger than the preset iteration times or meet a first threshold value, and testing the optimized integrated linear learning model according to the test data set;

and identifying whether the tumor variation site in the second generation sequencing to be processed is noise or not through the tested integrated linear learning model and a second threshold value.

In some embodiments, the obtaining the preset data set and the corresponding preset result includes:

obtaining sample data after repeated verification of a genome browser to form a preset data set, wherein the sample data comprises original data of a tumor sample subjected to second-generation sequencing, mutation sites preliminarily identified by other software and a preset result after repeated verification of the mutation sites.

In some embodiments, the screening out the most influential target feature includes:

calculating and drawing ROC curves of all characteristics in the training data set, and calculating an area AUC surrounded by the ROC curve of each characteristic and a coordinate axis;

calculating the correlation between all the characteristics and a preset result;

and screening all the characteristics to obtain target characteristics according to the area AUC surrounded by the ROC curve and the coordinate axis and the correlation.

In some embodiments, the determining an initial confidence weight and a domain of definition for the target feature comprises:

determining an initial confidence weight of the target feature, wherein the initial confidence weight comprises a maximum value or a minimum value of a scoring function;

and determining the definition domain of the target feature according to the shape of the ROC curve of the target feature.

In some embodiments, said building an integrated linear learning model from said initial confidence weights and said domain of definition comprises:

respectively modeling the target features according to the initial trust degree weight and the definition domain to obtain a preset model corresponding to each target feature, wherein the preset model comprises a logistic regression model, a Bayesian model and a linear model;

comparing a prediction result of a preset model on the training data set with the preset result, and determining the linear model as a target model;

and integrating the target models corresponding to the target features to obtain an integrated linear learning model.

In some embodiments, said optimizing the predictive power of said integrated linear learning model by said training dataset and said validation dataset comprises:

calculating a first score for all mutation sites in the training dataset by the ensemble linear learning model;

calculating a second score for all mutation sites in the validation dataset by the ensemble linear learning model;

acquiring the area under the curve (AUC) enclosed by the ROC (receiver operating characteristic) curve and the coordinate axes for the training data set and the verification data set;

and updating the initial trust degree weight and the domain according to the first score, the second score and the AUC enclosed by the ROC curve and the coordinate axes.

In some embodiments, said optimizing the predictive power of said integrated linear learning model by said training dataset and said validation dataset further comprises:

acquiring a weighting weight;

and adjusting the scoring result of the integrated linear learning model according to the weighting weight.

In a second aspect, an embodiment of the present invention provides a system for identifying whether a tumor mutation site is noisy, including:

an acquisition module, configured to acquire a preset data set and a corresponding preset result, wherein the preset data set comprises a training data set, a verification data set and a test data set;

the screening module is used for screening out the target characteristics which have the largest influence in the training data set, wherein the target characteristics comprise effective depth, mutation frequency, number of mutant reads, mutation quality, position of the mutation site on the genome, strand bias, and the ratio of soft-clip in the region where the mutation occurs;

the determining module is used for determining the initial trust degree weight and the domain of definition of the target feature;

the building module is used for building an integrated linear learning model according to the initial trust degree weight and the definition domain;

an optimization module for optimizing the predictive power of the integrated linear learning model by the training dataset and the validation dataset;

the testing module is used for determining that the iteration times of the integrated linear learning model are larger than the preset iteration times or meet a first threshold value, and testing the integrated linear learning model after optimization according to the testing data set;

and the identification module is used for identifying whether the tumor variation site in the second generation sequencing to be processed is noise or not through the tested integrated linear learning model and a second threshold value.

In a third aspect, an embodiment of the present invention provides an apparatus for identifying whether a tumor variation site is noise, including:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for identifying whether a tumor mutation site is noise.

In a fourth aspect, the present invention provides a storage medium, in which a program executable by a processor is stored, and the program executable by the processor is used for implementing the method for identifying whether a tumor mutation site is noise or not.

The method for identifying whether the tumor variation site is noise provided by the embodiment of the invention has the following beneficial effects:

the method first obtains the target features of a training data set and sets an initial trust degree weight and a definition domain for the target features, then constructs an integrated linear learning model according to the initial trust degree weight and the definition domain, and optimizes the prediction capability of the integrated linear learning model according to the training data set and a verification data set; after the number of iterations of the integrated linear learning model is greater than a preset number of iterations or meets a first threshold value, the optimized integrated linear learning model is tested according to a test data set, and the tested integrated linear learning model is used to identify whether a tumor mutation site in the second-generation sequencing data to be processed is noise; meanwhile, different weights are given to the target features, so that the model can obtain an overall score to distinguish real mutation sites from noise sites, which improves the identification accuracy of mutation sites; in addition, the method can effectively reduce the dependence on manual review, thereby improving the efficiency and accuracy of report review.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made on the drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for identifying whether a tumor mutation site is noisy according to an embodiment of the present invention;

FIG. 2 is a block diagram of a system for identifying whether a tumor mutation site is noisy according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an apparatus for identifying whether a tumor mutation site is noise according to an embodiment of the present invention.

Detailed Description

Terms appearing in the embodiments of the present invention are explained below to assist understanding of the embodiments of the present invention.

Referring to fig. 1, this embodiment discloses a method for identifying whether a tumor mutation site is noise, which can be applied to a server. It should be understood that the server may be a standalone server, a processor corresponding to the device, a platform, and/or a background processor corresponding to the platform. The terminal can be an APP, an applet or a WEB application on the mobile phone of the user.

Referring to fig. 1, the method includes, but is not limited to, the following steps in the application process:

and S110, acquiring a preset data set and a corresponding preset result.

In the embodiment of the application, the preset data set and the corresponding preset result can be formed by obtaining sample data that have been repeatedly verified in a genome browser. The sample data comprise data of tumor samples after second-generation sequencing. The genome browser includes IGV (Integrative Genomics Viewer). IGV is a powerful tool for integrated genomics visualization that allows researchers working with bioinformatics data to easily analyze and view a variety of genomic data. The tool can be used not only to visually inspect sequence information, sequence changes and mutations in the genetic code, but also to view copy number changes, chromatin immunoprecipitation data, epigenetic modifications and the like. In this embodiment, after a large preset data set is obtained, all the preset data may be divided into a training data set DataSet1, a verification data set DataSet2 and a test data set DataSet3. The training data set is used in the training process of the model; the verification data set is used to verify the trained model; and the test data set is used to test the final model.
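By way of illustration only, the division into DataSet1, DataSet2 and DataSet3 might be sketched as follows. The split fractions, the random seed and the function name `split_dataset` are assumptions made for this example; the embodiment does not specify how the preset data are divided.

```python
import random


def split_dataset(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle records and split them into training, verification and test subsets.

    The fractions and the seed are illustrative assumptions, not values
    taken from the embodiment.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    dataset1 = shuffled[:n_train]                   # training
    dataset2 = shuffled[n_train:n_train + n_valid]  # verification
    dataset3 = shuffled[n_train + n_valid:]         # test (the remainder)
    return dataset1, dataset2, dataset3
```

Each record here stands for one IGV-verified mutation site together with its preset result; fixing the seed keeps the three subsets reproducible between runs.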

And S120, screening out the target characteristics with the largest influence in the training data set.

In the embodiment of the application, the target characteristics can be obtained by first plotting the ROC curves of all characteristics in the training data set and calculating the area under each ROC curve (AUC), then calculating the correlation between each characteristic and the preset result, and finally screening all characteristics according to the AUC and the correlation. The ROC curve is the receiver operating characteristic curve, plotted with the true positive rate (sensitivity) on the ordinate and the false positive rate (1 - specificity) on the abscissa over a series of different binary classification cut-offs (cut-off values or decision thresholds). AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes, and its value is not more than 1. AUC is a performance index for measuring the quality of a learner. In this embodiment, 7 representative features may be screened from more than 30 candidate features of the training data set by calculating the AUC and the correlation, and these serve as the target features. The target characteristics comprise effective depth, mutation frequency, number of mutant reads, mutation quality, position of the mutation site on the genome, strand bias, and the ratio of soft-clip in the region where the mutation occurs.
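The screening step can be sketched as below, purely as an illustration. The AUC is computed with the rank-based (Mann-Whitney) identity rather than by drawing the curve, and the ranking rule in `screen_features` (distance of the AUC from 0.5 plus the absolute Pearson correlation) is an assumption of this example; the embodiment states only that AUC and correlation are both used.

```python
from math import sqrt


def auc(values, labels):
    """AUC of one feature against labels (1 = true mutation, 0 = noise).

    Equals the probability that a random positive has a larger feature
    value than a random negative; ties count as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson(values, labels):
    """Pearson correlation between a feature and the 0/1 preset result."""
    n = len(values)
    mx = sum(values) / n
    my = sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    vx = sum((x - mx) ** 2 for x in values)
    vy = sum((y - my) ** 2 for y in labels)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return cov / sqrt(vx * vy)


def screen_features(table, labels, top_k=7):
    """Keep the top_k features under an assumed combined ranking rule."""
    ranked = []
    for name, values in table.items():
        a = auc(values, labels)
        r = pearson(values, labels)
        # max(a, 1 - a) treats a strongly inverted feature as informative too
        ranked.append((max(a, 1.0 - a) + abs(r), name, a, r))
    ranked.sort(reverse=True)
    return [(name, a, r) for _, name, a, r in ranked[:top_k]]
```

Here `table` maps a feature name to its column of values over the training sites, and `labels` holds the IGV-verified result for the same sites.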

S130, determining the initial trust degree weight and the domain of definition of the target feature.

In the embodiment of the application, the initial confidence weight of the target feature can be determined, wherein the initial confidence weight comprises the maximum value or the minimum value of the scoring function. Meanwhile, the definition domain of the target feature is determined according to the shape of the ROC curve of the target feature. It is understood that the present embodiment is a process of parameter initialization and update. Specifically, combining the current data, past experience and test results (parameter updating), each feature is given a confidence weight $W_i$ ($i = 1, \dots, 7$), i.e. the maximum or minimum value of the scoring function of that feature. These weights are of two types: forward encouragement (maximum) and reverse deduction (minimum). The value range (domain) of each feature is determined by combining the analysis of the ROC curve of each feature with the test results. The value range of a feature falls into one of two cases. First, a lower limit $T_0$ and an upper limit $T_1$ are determined, dividing the value range into three intervals: when $x < T_0$, the score is 0; when $x > T_1$, the score is the maximum confidence score; and $x \in [T_0, T_1]$ is the domain of definition of the scoring function $F_i(x)$ of the feature. Second, a threshold $T$ is set: when $x > T$, the score is 0, and $x \in [0, T]$ is the domain of definition of the scoring function. In this embodiment, the initial confidence weight for each target feature may not be the same. However, after model training, the output result (score) of each target feature in the model needs to remain the same.
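The two cases of the value range can be illustrated with the following sketch. The linear shape inside the domain follows the linear model selected later in the description, but the exact form is an assumption: in particular, the ramp used for the second case (from the weight at 0 down to 0 at the threshold) is a guess made for this example, and the function names are hypothetical.

```python
def score_two_sided(x, t0, t1, weight):
    """Case 1: score 0 below t0, `weight` above t1, linear in between."""
    if t1 <= t0:
        raise ValueError("t1 must be greater than t0")
    if x < t0:
        return 0.0
    if x > t1:
        return float(weight)
    return weight * (x - t0) / (t1 - t0)


def score_one_sided(x, t, weight):
    """Case 2: score 0 above t; on [0, t] an assumed linear ramp.

    The ramp runs from `weight` at x = 0 to 0 at x = t. A negative
    `weight` would model a deduction-type feature.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    if x > t:
        return 0.0
    x = max(x, 0.0)  # values below the domain are clamped to its lower edge
    return weight * (1.0 - x / t)
```

Because the score changes continuously across the domain, a site whose feature value lies just short of a limit receives a slightly lower score instead of being rejected outright.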

And S140, constructing an integrated linear learning model according to the initial trust degree weight and the domain.

In the embodiment of the application, the target features can be modeled respectively according to the initial confidence weight and the definition domain to obtain preset models comprising a logistic regression model, a Bayesian model and a linear model; the prediction results of the preset models on the training data set are then compared with the preset results to determine the linear model as the target model, and the target models corresponding to the target features are integrated to obtain the integrated linear learning model. It is understood that this embodiment trains the scoring function $F_i(x)$ of each feature, initialized with the confidence weight and the definition domain of that feature; each feature is modeled separately, and the data are modeled using a plurality of models. For example, the data are modeled separately using logistic regression, Bayesian and linear models, and the results of the linear model ($F_i(x)$, $i = 1, \dots, 7$) are found to be the most intuitive and reliable, so that the integrated linear learning model is obtained by integrating the linear models corresponding to the target features.
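As a minimal sketch of the integration step, the per-feature linear models can be held in a dictionary and summed for each site. The feature names, limits and weights below are invented for illustration and are not values disclosed in the embodiment; `make_linear_scorer` and `ensemble_score` are hypothetical helper names.

```python
def make_linear_scorer(t0, t1, weight):
    """Build a piecewise-linear scoring function for one feature."""
    if t1 <= t0:
        raise ValueError("t1 must be greater than t0")

    def scorer(x):
        if x < t0:
            return 0.0
        if x > t1:
            return float(weight)
        return weight * (x - t0) / (t1 - t0)

    return scorer


def ensemble_score(site, scorers):
    """Overall score of one mutation site: the sum of its per-feature scores."""
    return sum(scorer(site[name]) for name, scorer in scorers.items())


# Hypothetical parameters for two of the seven target features.
scorers = {
    "effective_depth": make_linear_scorer(100, 500, 10),
    "mutant_reads": make_linear_scorer(2, 10, 20),
}
site = {"effective_depth": 300, "mutant_reads": 12}
total = ensemble_score(site, scorers)
```

Keeping one scorer per feature means the weight and limits of a single feature can be updated without touching the others, which is what the later optimization step relies on.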

S150, optimizing the prediction capability of the integrated linear learning model through the training data set and the verification data set; and determining that the iteration number of the integrated linear learning model is larger than a preset iteration number or meets a first threshold value, and testing the optimized integrated linear learning model according to the test data set. The first threshold is a parameter that ends the loop early: when the model already predicts well, the preset number of iterations need not be completed.
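The stopping rule of S150 can be sketched as a generic loop, assuming the "first threshold" is expressed as a minimum change in AUC on the verification data set between successive iterations. The callables `update_step` and `evaluate_auc`, and the default values, are placeholders of this example; the embodiment does not disclose how the parameters are updated.

```python
def train_until_stable(update_step, evaluate_auc, params,
                       max_iterations=50, min_delta=1e-3):
    """Iterate until the verification AUC stops changing or the cap is hit.

    update_step(params) returns new parameters (weights and domains);
    evaluate_auc(params) returns the AUC on the verification data set.
    Both are supplied by the caller and are hypothetical here.
    """
    previous = evaluate_auc(params)
    history = [previous]
    for _ in range(max_iterations):
        params = update_step(params)
        current = evaluate_auc(params)
        history.append(current)
        if abs(current - previous) < min_delta:
            break  # no obvious change any more: stop early
        previous = current
    return params, history
```

The returned `history` records the AUC after each iteration, which makes it easy to check whether training ended by convergence or by reaching the iteration cap.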

In the embodiment of the application, after the integrated linear learning model is constructed, the data of DataSet1 are used to train the scoring functions $F_i(x)$ of the different features and build the integrated linear learning model, which is then used to predict the data of DataSet1 and DataSet2. The prediction result for each mutation site $j$ is the cumulative sum of its scores over the 7 features, $Y_j = \sum_{i=1}^{7} F_i(x_{ij})$, where $x_{ij}$ denotes the value of feature $i$ at site $j$.

Specifically, the present embodiment may calculate the first score of all mutation sites in the training data set through the integrated linear learning model; calculate the second score of all mutation sites in the verification data set through the integrated linear learning model; meanwhile, acquire the area enclosed by the ROC curve and the coordinate axes for the training data set and the verification data set; and then update the initial trust degree weight and the domain according to the first score, the second score and the area enclosed by the ROC curve and the coordinate axes. It is understood that this embodiment can calculate the scores $Y_j$ of all mutation sites in DataSet1 and DataSet2 respectively and, using measures such as the AUC value, the correlation and the accuracy relative to the IGV determination results, compile statistics such as the score distributions (box plots) of mutations occurring on different genes and of different mutation types. The multi-dimensional results of the ensemble learning are analyzed in detail, with the expectation that the gray zone of each individual feature does not coincide with the overall gray zone, so that the model can still obtain a reliable value when one or more features are ambiguous (in their gray zone), without harming the overall prediction capability of the ensemble learning model. The weight of each feature and the division of each feature's value range are then updated according to the results of this detailed analysis. In this embodiment, for some specific sites or specific genes, a weighting weight can also be obtained, and the scoring result of the integrated linear learning model is then adjusted according to the weighting weight; that is, the scored result is adjusted by giving it a higher influence factor (weighting weight).

In the embodiment of the application, the integrated linear learning model is trained iteratively, and iteration ends only when the ROC curve of the model's scores on DataSet2 against the IGV determination results no longer changes noticeably, or when a specified number of iterations is reached. Then, on the test data set DataSet3, the integrated linear learning model is used to score the data set and the quality of the result is verified. Combining the scoring results of DataSet2 and DataSet3, a threshold is set to distinguish true mutations from noise sites. The threshold is chosen at a later stage in light of the actual situation, with reference to the Youden point. The Youden index (also called the correct index) is a method for evaluating the validity of a screening test, and can be applied when false negatives (missed diagnoses) and false positives (misdiagnoses) are equally harmful. The Youden index is the sum of sensitivity and specificity minus 1. It indicates the overall ability of the screening method to identify true patients and non-patients. The larger the index, the better the screening test and the greater its validity.
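Selecting a threshold at the Youden point can be sketched as follows. The convention that a site is called a true mutation when its score is greater than or equal to the threshold is an assumption of this example, as is the function name `youden_threshold`.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 = true mutation, 0 = noise (for example, IGV-verified results).
    A site is called a true mutation when score >= threshold (assumed).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_threshold, best_j = None, float("-inf")
    for t in sorted(set(scores)):  # every observed score is a candidate cut
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j
```

When several thresholds give the same index, this sketch keeps the lowest one, which favors sensitivity; a different tie-breaking rule may suit other review policies.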

And S160, identifying whether the tumor variation site in the second generation sequencing to be processed is noise or not through the tested integrated linear learning model and a second threshold value.

In the present embodiment, the second threshold is needed because, after a mutation site is scored, a threshold is required to decide whether the site is a true mutation or noise. Specifically, after the identification result of whether the tumor mutation site to be processed is noise is obtained, the determination result of the ensemble learning model (the total confidence score) is preliminarily divided into three levels, high, medium and low, according to existing results. Mutation sites whose total confidence score is high are not manually reviewed in the report review system, which reduces part of the workload, while mutation sites whose total confidence score is low require particular attention.
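The three-level division can be illustrated with a small helper. The two cut-offs are hypothetical parameters; the embodiment does not give numerical values, and in practice they would be derived from the score distributions on DataSet2 and DataSet3.

```python
def confidence_tier(total_score, low_cutoff, high_cutoff):
    """Map a total confidence score to 'high', 'medium' or 'low'.

    low_cutoff and high_cutoff are assumed, user-supplied boundaries.
    """
    if high_cutoff <= low_cutoff:
        raise ValueError("high_cutoff must be greater than low_cutoff")
    if total_score >= high_cutoff:
        return "high"
    if total_score < low_cutoff:
        return "low"
    return "medium"
```

For example, with assumed cut-offs of 30 and 70, a site scoring 50 falls in the medium level.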

In summary, other methods for identifying mutation sites and noise sites are basically implemented by layer-by-layer filtering, i.e., in each filtering step a single feature must meet a specified threshold for the site to be retained. This is a filtering approach in which a single factor determines the result: no matter how "excellent" the other features/factors are, a site is filtered out as soon as one feature fails its threshold, and the specified features are never considered together to build an overall filtering model. Therefore, however the feature thresholds are set, the resulting filtering model is never the most suitable one, and it both filters out some real mutation sites and retains a large number of noise sites. The embodiment of the application adopts ensemble learning with a two-layer model. The first layer is adaptive within a given feature weight and specified value range, can be adjusted through that weight and range, and produces a continuous score, so it handles the gray zone of feature values well and avoids the "one-size-fits-all" hard cut (filter or keep) of conventional methods. The second layer is the ensemble learning process, which can give different weights to the 7 features according to historical experience and model results, and integrates the scores of the 7 features into an overall score to distinguish real mutation sites from noise sites. With this two-layer structure, the influence of each single feature on the result is reflected while the 7 features are also considered together as a whole, which benefits the accuracy of the model's prediction results.
Meanwhile, this embodiment also takes individual differences and specificity into account, and applies a weighting to the overall score of some rare mutation sites so that they are not filtered out by mistake. In addition, other existing filtering methods rely heavily on the AF (mutation frequency) parameter, which leads to false positives, namely mutation sites with high AF but low sequencing depth (a small denominator). This embodiment considers the number of mutant reads as well as AF and gives the number of mutant reads the higher feature weight, which effectively prevents high-AF false positive sites from appearing in the report. In addition, this embodiment uses two further parameters, the position of the mutation site on the genome and the ratio of soft-clip in the region where the mutation occurs, both of which play an important role in optimizing the model and adjusting the results. According to the total trust degree score, this embodiment can divide mutation sites into three levels, high, medium and low; mutation sites with high or low trust do not need manual review, and manual review is performed only on the small number of mutation sites with medium trust, which greatly reduces the review content and speeds up review.

In addition, in 40 samples, the prediction results for AF = 5% were completely consistent with the manual interpretation results, and the AUC reached 100%; among the 486 mutation sites with AF <= 5%, the AUC was about 90.5%, and at the specified threshold the sensitivity (sen) was 96.4% and the specificity (spe) was 87.9%, a good prediction result. In addition, because this embodiment adopts ensemble learning, the model has higher stability and generalization ability. Moreover, as an optimization module of the bioinformatics pipeline, the method provides an important guarantee for the authenticity and accuracy of mutation detection, is more specialized, and provides important support for fully automated bioinformatics analysis. For report interpretation, it greatly reduces manual review time and saves the professional resources this step requires; it improves the accuracy and reduces the difficulty of report interpretation, clears obstacles to the clinical adoption of NGS technology, and provides non-professionals with a simple interface to tumor NGS technology.

Referring to fig. 2, an embodiment of the present invention provides a system for identifying whether a tumor mutation site is noise, including:

an obtaining module 210, configured to obtain a preset data set and a corresponding preset result, where the preset data set includes a training data set, a verification data set, and a test data set;

a screening module 220, configured to screen out, in the training data set, the target features that have the largest influence, where the target features include effective depth, mutation frequency, number of mutant reads, mutation quality, position of the mutation site on the genome, strand bias, and the ratio of soft-clip in the region where the mutation occurs;

a determining module 230, configured to determine an initial confidence weight and a domain of definition of the target feature;

a building module 240, configured to build an integrated linear learning model according to the initial confidence weight and the domain;

an optimization module 250 for optimizing the predictive power of the integrated linear learning model by the training data set and the validation data set;

a testing module 260, configured to determine that the number of iterations of the integrated linear learning model is greater than a preset number of iterations or meets a first threshold, and test the optimized integrated linear learning model according to the test data set;

and the identifying module 270 is configured to identify whether the tumor mutation site in the second-generation sequencing to be processed is noise through the tested integrated linear learning model and the second threshold.

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

Referring to fig. 3, an embodiment of the present invention provides an apparatus for identifying whether a tumor mutation site is noise, including:

at least one processor;

at least one memory for storing at least one program;

when executed by the at least one processor, the at least one program causes the at least one processor to implement the method for identifying whether a tumor mutation site is noisy as shown in fig. 1.

The contents of the above method embodiments all apply to the present apparatus embodiment. The functions specifically implemented by the present apparatus embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are likewise the same.

An embodiment of the present invention provides a storage medium in which a processor-executable program is stored, and the processor-executable program, when executed by a processor, is used to implement the method for identifying whether a tumor variation site is noise as shown in fig. 1.

Furthermore, an embodiment of the present invention provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor, causing the computer device to perform the method illustrated in fig. 1.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

The foregoing description is only exemplary of the preferred embodiments of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present invention is not limited to the specific combinations of the above-mentioned features, and also encompasses other embodiments in which the above-mentioned features or their equivalents are combined arbitrarily without departing from the spirit of the invention. For example, technical solutions formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present invention are also encompassed.

Claims

1. A method for identifying whether a tumor mutation site is noise, characterized by comprising the following steps:

acquiring a preset data set and a corresponding preset result, wherein the preset data set is formed from sample data repeatedly verified with a genome browser, the sample data comprises raw data of tumor samples subjected to second-generation sequencing, mutation sites preliminarily identified by other software, and the preset result obtained by repeatedly verifying the mutation sites, and the preset data set comprises a training data set, a validation data set and a test data set;

screening out the target features with the largest influence in the training data set, wherein the target features comprise the effective depth, the mutation frequency, the number of mutant reads, the mutation quality, the position of the mutation site on the genome, the strand bias, and the proportion of soft-clip in the region where the mutation occurs;

constructing an integrated linear learning model according to the initial confidence weight and the domain of definition, comprising: modeling each target feature separately according to the initial confidence weight and the domain of definition to obtain preset models comprising a logistic regression model, a Bayesian model and a linear model; comparing and analyzing the prediction results of the preset models on the training data set to determine the linear model as the target model; and integrating the target models corresponding to the respective target features to obtain the integrated linear learning model;
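By way of a non-limiting sketch, the integration of per-feature linear target models might look as follows. The clamped linear form, the normalization by the sum of weights, the class and function names, and every numeric value are assumptions introduced for illustration only; the claim does not specify them.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FeatureModel:
    """Hypothetical linear target model for one target feature."""
    lower: float   # feature value mapped to score 0 (one end of the assumed domain)
    upper: float   # feature value mapped to score 1 (other end of the assumed domain)
    weight: float  # assumed initial confidence weight of the feature

    def score(self, value: float) -> float:
        """Linear score in [0, 1]; values outside the domain are clamped.

        Setting lower > upper expresses a feature for which smaller
        values are more trustworthy (for example a soft-clip proportion).
        """
        if self.upper == self.lower:
            raise ValueError("domain must have non-zero width")
        t = (value - self.lower) / (self.upper - self.lower)
        return min(1.0, max(0.0, t))


def integrated_score(
    models: Mapping[str, FeatureModel], site: Mapping[str, float]
) -> float:
    """Weighted average of the per-feature linear scores."""
    total_weight = sum(m.weight for m in models.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(m.weight * m.score(site[name]) for name, m in models.items())
    return weighted / total_weight


# Illustrative values only; not taken from the patent.
MODELS = {
    "effective_depth": FeatureModel(lower=100, upper=500, weight=2.0),
    "mutation_frequency": FeatureModel(lower=0.01, upper=0.05, weight=3.0),
    "soft_clip_ratio": FeatureModel(lower=0.5, upper=0.0, weight=1.0),
}

if __name__ == "__main__":
    site = {"effective_depth": 300, "mutation_frequency": 0.05, "soft_clip_ratio": 0.25}
    print(integrated_score(MODELS, site))
```

Keeping one small linear model per feature means each feature contributes a graded score across its domain instead of a hard pass or fail at a single cutoff.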

2. The method of claim 1, wherein the screening out of the target features with the largest influence in the training data set comprises:

calculating and drawing the ROC curves of all features in the training data set, and calculating, for each feature, the area under the curve (AUC) enclosed by its ROC curve and the coordinate axis;
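As a non-limiting illustration, the per-feature ROC curve and AUC calculation might be sketched as follows. The label encoding (1 for a true mutation, 0 for noise), the assumption that a higher feature value points toward a true mutation, the ranking of features by AUC, and the toy data are assumptions for this sketch.

```python
from typing import Dict, List, Mapping, Sequence, Tuple


def roc_points(
    values: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping a cutoff over one feature."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are needed to draw a ROC curve")
    ranked = sorted(zip(values, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume every sample tied at this value before emitting a point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area between the ROC curve and the FPR axis, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def rank_features_by_auc(
    table: Mapping[str, Sequence[float]], labels: Sequence[int]
) -> Dict[str, float]:
    """AUC of every feature, ordered from largest to smallest."""
    result = {name: auc(roc_points(vals, labels)) for name, vals in table.items()}
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]  # toy labels, not real data
    table = {
        "mutation_frequency": [0.30, 0.02, 0.25, 0.03, 0.20, 0.01],
        "effective_depth": [900, 800, 700, 300, 200, 100],
    }
    for name, value in rank_features_by_auc(table, labels).items():
        print(name, round(value, 3))
```

A feature whose AUC falls well below 0.5 under this orientation may still be informative in the opposite direction, so a practical screen might compare each AUC with its distance from 0.5.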

3. The method of claim 2, wherein the determining the initial confidence weight and the domain of the target feature comprises:

4. The method of claim 3, wherein the constructing an integrated linear learning model according to the initial confidence weights and the domain of definition comprises:

and integrating the target models corresponding to the target features to obtain the integrated linear learning model.

5. The method of claim 2, wherein the optimizing the prediction capability of the integrated linear learning model by the training data set and the validation data set comprises:

6. The method of claim 5, wherein the optimizing the prediction capability of the integrated linear learning model by the training data set and the validation data set further comprises:

acquiring a weighting weight;

and adjusting the scoring result of the integrated linear learning model according to the weighting weight.

7. A system for identifying whether a tumor mutation site is noise, comprising:

an acquisition module, configured to acquire a preset data set and a corresponding preset result, including acquiring sample data repeatedly verified with a genome browser to form the preset data set, wherein the sample data comprises raw data of tumor samples subjected to second-generation sequencing, mutation sites preliminarily identified by other software, and the preset result obtained by repeatedly verifying the mutation sites, and the preset data set comprises a training data set, a validation data set and a test data set;

a screening module, configured to screen out the target features with the largest influence in the training data set, wherein the target features comprise the effective depth, the mutation frequency, the number of mutant reads, the mutation quality, the position of the mutation site on the genome, the strand bias, and the proportion of soft-clip in the region where the mutation occurs;

a building module, configured to build an integrated linear learning model according to the initial confidence weight and the domain of definition, including: modeling each target feature separately according to the initial confidence weight and the domain of definition to obtain preset models comprising a logistic regression model, a Bayesian model and a linear model; comparing and analyzing the prediction results of the preset models on the training data set to determine the linear model as the target model; and integrating the target models corresponding to the respective target features to obtain the integrated linear learning model;

8. An apparatus for identifying whether a tumor mutation site is noise, comprising:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for identifying whether a tumor mutation site is noise according to any one of claims 1-6.

9. A storage medium having stored thereon a program executable by a processor, wherein the program, when executed by the processor, is used to implement the method for identifying whether a tumor mutation site is noise according to any one of claims 1-6.