CN111584002B

CN111584002B - Method, computing device and computer storage medium for detecting tumor mutational burden

Info

Publication number: CN111584002B
Application number: CN202010443187.8A
Authority: CN
Inventors: 柳文进; 施巍炜; 车月
Original assignee: Origimed Technology Shanghai Co ltd
Current assignee: Origimed Technology Shanghai Co ltd
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2022-04-29
Anticipated expiration: 2040-05-22
Also published as: CN111584002A

Abstract

The present disclosure relates to a method, computing device, and computer storage medium for detecting tumor mutational burden. The method comprises: obtaining a circulating tumor DNA (ctDNA) sequencing result of a blood sample of a subject to be tested; determining a mutation base type and a mutation frequency associated with each site based on the ctDNA sequencing result, so as to generate mutation site information of the subject to be tested; comparing the mutation site information of the subject to be tested with a healthy subject mutation site dataset to generate a set of differential mutation sites, the healthy subject mutation site dataset being generated based on ctDNA sequencing results of a plurality of healthy subjects; determining point mutation and short insertion/deletion variation information; and filtering the determined point mutation and short insertion/deletion variation information using the set of differential mutation sites to determine the number of positive mutation sites. The method can accurately, conveniently and stably detect the tumor mutational burden.

Description

Method, computing device and computer storage medium for detecting tumor mutational burden

Technical Field

The present disclosure relates generally to bioinformatics processing, and in particular, to methods, computing devices, and computer storage media for detecting tumor mutational burden.

Background

Tumor mutational burden (TMB) is a quantifiable biomarker that indicates the number of mutations contained in tumor cells, typically measured as mutations per megabase of the coding region of the tumor cell genome. The mutation types used to detect tumor mutational burden are mainly single nucleotide variations (SNVs) and small insertions/deletions (also called short insertions/deletions, or indels). Studies have shown that the TMB value correlates with the efficacy of immunotherapy in tumor patients; for example, TMB levels correlate significantly with the response rate to immune checkpoint inhibitor therapy. Accurate detection of TMB therefore provides guidance when determining an immunotherapy regimen for tumor patients. For example, if the detected TMB value is low, immune checkpoint inhibitors are less effective, so the patient is not suited to immunotherapy but may be suited to targeted therapy or other therapies; whereas if the detected TMB value is high, immune checkpoint inhibitors are more effective. Conventional schemes for detecting tumor mutational burden mainly rely on sequencing data from a patient's solid tumor tissue to count the number of point mutations and short insertion/deletion mutations per unit length of gene.

In the conventional scheme for detecting tumor mutational burden, solid tumor tissue cannot be obtained from some patients, so it is difficult to detect the tumor mutational burden value by sequencing a solid tumor tissue sample of the patient.

Disclosure of Invention

The present disclosure provides a method, a computing device, and a computer storage medium for detecting a tumor mutation load, which can accurately, conveniently, and stably detect a tumor mutation load.

According to a first aspect of the present disclosure, a method for detecting tumor mutational burden is provided. The method comprises: obtaining a circulating tumor DNA (ctDNA) sequencing result of a blood sample of a subject to be tested; determining a mutation base type and a mutation frequency associated with each site based on the ctDNA sequencing result, so as to generate mutation site information of the subject to be tested; comparing the mutation site information of the subject to be tested with a healthy subject mutation site dataset generated based on ctDNA sequencing results of a plurality of healthy subjects, to generate a set of differential mutation sites; determining point mutation and short insertion/deletion variation information of the subject to be tested; and filtering the determined point mutation and short insertion/deletion variation information using the set of differential mutation sites to determine the number of positive mutation sites for detecting tumor mutational burden.

According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

Fig. 1 shows a schematic diagram of a system 100 for implementing a method of detecting tumor mutational burden according to an embodiment of the present disclosure;

Fig. 2 shows a flow diagram of a method 200 for detecting tumor mutational burden according to an embodiment of the present disclosure;

Fig. 3 shows a flow diagram of a method 300 for filtering point mutation and short insertion/deletion variation information according to an embodiment of the present disclosure;

Fig. 4 shows a flow diagram of a method 400 for determining a predetermined mutation frequency threshold according to an embodiment of the present disclosure;

Fig. 5 shows a flow diagram of a method 500 for generating a set of differential mutation sites according to an embodiment of the present disclosure;

Fig. 6 shows a flow diagram of a method 600 for detecting tumor mutational burden according to an embodiment of the present disclosure;

Fig. 7 shows a comparison graph of bTMB obtained based on blood ctDNA sequencing results and tTMB obtained based on tissue DNA sequencing results, according to an embodiment of the present disclosure;

Fig. 8 shows a comparison graph of bTMB obtained based on high frequency variation data of blood ctDNA sequencing results and tTMB obtained based on tissue DNA sequencing results, according to an embodiment of the present disclosure; and

Fig. 9 schematically illustrates a block diagram of an electronic device 900 suitable for implementing embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As described above, in the conventional scheme for detecting tumor mutational burden, solid tumor tissue cannot be obtained from some patients, so it is difficult to detect the tumor mutational burden value by sequencing a solid tumor tissue sample of the patient. The conventional scheme therefore has the following disadvantage: when the solid tumor tissue of the subject to be tested cannot be obtained, it is difficult to detect the tumor mutational burden accurately and conveniently.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for detecting tumor mutational burden. The scheme comprises: obtaining a circulating tumor DNA (ctDNA) sequencing result of a blood sample of a subject to be tested; determining a mutation base type and a mutation frequency associated with each site based on the ctDNA sequencing result, so as to generate mutation site information of the subject to be tested; comparing the mutation site information of the subject to be tested with a healthy subject mutation site dataset generated based on ctDNA sequencing results of a plurality of healthy subjects, to generate a set of differential mutation sites; determining point mutation and short insertion/deletion variation information of the subject to be tested; and filtering the determined point mutation and short insertion/deletion variation information using the set of differential mutation sites to determine the number of positive mutation sites for detecting tumor mutational burden.

In the above scheme, sequencing and mutation site selection are performed on cell-free tumor DNA (ctDNA) from the peripheral blood of the subject to be tested, so that the tumor mutational burden can be conveniently detected even when the solid tumor tissue of the subject cannot be obtained. In addition, the set of differential mutation sites used to filter the determined point mutation and short insertion/deletion variation information is generated by comparing the mutation site information determined from the peripheral blood ctDNA of the subject with the healthy subject mutation site dataset. Using the healthy subject mutation data as a reference dataset, the present disclosure can thus eliminate mutation data caused by sequencing errors during sequencing, which enables accurate and stable detection of the tumor mutational burden. Therefore, the present disclosure enables accurate, convenient, and stable detection of the tumor mutational burden.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method of detecting tumor mutational burden according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes: a data acquisition unit 112, a test subject mutation site information determination unit 114, a healthy subject mutation site data set determination unit 116, a differential mutation site set determination unit 118, a filtering unit 120, and a tumor mutation load calculation unit 122. In some embodiments, the system 100 further comprises a bioinformatics server 140 and a network 150.

In some embodiments, the data acquisition unit 112, the test subject mutation site information determination unit 114, the healthy subject mutation site data set determination unit 116, the differential mutation site set determination unit 118, the filtering unit 120, and the tumor mutation load calculation unit 122 may be configured on one or more computing devices 130. The computing device 130 may interact with the bioinformatics server 140 in a wired or wireless manner (e.g., via the network 150).

Regarding the computing device 130, it is configured to determine mutation site information of a subject based on ctDNA sequencing results of a blood sample of the subject, compare the mutation site information of the subject with a healthy subject mutation site dataset to generate a differential mutation site set, and filter the determined point mutation and short insertion/deletion mutation information using the differential mutation site set to determine the number of positive mutation sites, thereby calculating tumor mutation burden. In some embodiments, computing device 130 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, ASICs, and general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device.

Regarding the data acquisition unit 112, it is used for acquiring ctDNA sequencing results of blood samples of the test subject, and acquiring ctDNA sequencing results of blood samples of a plurality of healthy subjects. For example, the data acquisition unit 112 acquires ctDNA sequencing results of blood samples of 41 healthy subjects. ctDNA sequencing results of blood samples of the above-mentioned plurality of healthy subjects are used to generate a reference database for excluding sequencing errors.

Regarding the healthy subject mutation site data set determination unit 116, it is used for generating a healthy subject mutation site data set based on the ctDNA sequencing results of the blood samples of the plurality of healthy subjects acquired by the data acquisition unit 112. For example, the healthy subject mutation site data set determination unit 116 generates variation result data for each of the plurality of healthy subjects based on their ctDNA sequencing results; merges the plurality of variation result data; and counts the number of samples in which each variation occurs, so as to generate the healthy subject mutation site data set.

Regarding the test subject mutation site information determination unit 114, it is used for determining a mutation base type and a mutation frequency associated with each site based on the ctDNA sequencing result of the blood sample of the subject to be tested acquired by the data acquisition unit 112, so as to generate mutation site information of the subject to be tested.

Regarding the differential mutation site set determination unit 118, it is used for comparing the mutation site information of the subject to be tested, generated by the test subject mutation site information determination unit 114, with the healthy subject mutation site data set generated by the healthy subject mutation site data set determination unit 116, so as to generate a set of differential mutation sites. In some embodiments, the differential mutation site set determination unit 118 uses the mutation site information of the healthy subjects as background data to determine whether the associated parameter of each mutation site of the subject to be tested differs significantly from the associated parameter of the corresponding site in the healthy subject mutation site data set; if there is a significant difference, the mutation site of the subject is retained, and the set of differential mutation sites is generated from the retained mutation sites.

Regarding the filtering unit 120, it is used for filtering the point mutation and short insertion/deletion variation information determined based on the ctDNA sequencing result of the subject to be tested. For example, the filtering unit 120 filters the point mutation and short insertion/deletion variation information of the subject using the set of differential mutation sites determined by the differential mutation site set determination unit 118. Specifically, if the filtering unit 120 determines that a point mutation or short insertion/deletion variant of the subject is not present in the set of differential mutation sites, that variant is not retained. In some embodiments, the filtering unit 120 also performs a variety of further filtering processes. For example, the filtering unit 120 filters the point mutation site information that has passed the differential mutation site set against a predetermined SNP data set, and further filters based on at least one of the number of supporting reads, the sequencing depth, and the mutation frequency of the mutation site.

Regarding the tumor mutation load calculation unit 122, it is used for determining the number of positive mutation sites based on the mutation sites of the corresponding region from which driver genes have been removed and a predetermined mutation frequency threshold, and for determining a tumor mutational burden value based on the determined number of positive mutation sites and the length of the probe.
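As a minimal sketch of the final calculation attributed to unit 122, the following Python snippet divides the number of positive mutation sites by the probe length expressed in megabases. The per-megabase normalization follows the usual definition of TMB given in the Background; the function name `tumor_mutation_burden` and the example numbers are illustrative assumptions and are not values from the disclosure.

```python
def tumor_mutation_burden(positive_site_count: int, probe_length_bp: int) -> float:
    """Return mutations per megabase of probe-covered sequence.

    Assumption: TMB is normalized per megabase (1 Mb = 1,000,000 bp).
    """
    if probe_length_bp <= 0:
        raise ValueError("probe_length_bp must be positive")
    return positive_site_count / (probe_length_bp / 1_000_000)


# Hypothetical example: 12 positive sites over a 1.5 Mb probe region.
example_tmb = tumor_mutation_burden(12, 1_500_000)
```

Expressing the probe length in base pairs and converting inside the function keeps the unit conversion in one place.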

A method for detecting tumor mutational burden according to an embodiment of the present disclosure will be described below in conjunction with Fig. 2. Fig. 2 shows a flow diagram of a method 200 for detecting tumor mutational burden according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 900 depicted in Fig. 9, or at the computing device 130 depicted in Fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 202, the computing device 130 obtains a circulating tumor DNA (ctDNA) sequencing result of a blood sample of the subject to be tested. The subject to be tested is, for example, a patient. The ctDNA sequencing result data is, for example, a BAM file produced by sequencing. In some embodiments, prior to performing ctDNA sequencing, a region file (i.e., a probe file) covering the genome may be prepared for sequencing, and a file of the CDS regions covered by the probe file is obtained. The driver variants contained in the probe region are also collected.

At block 204, the computing device 130 determines, based on the ctDNA sequencing result, a mutation base type and a mutation frequency associated with each site, so as to generate mutation site information of the subject to be tested. For example, the computing device 130 uses the known variant detection software samtools to determine, from the ctDNA sequencing result, the mutation base type and mutation frequency corresponding to each site of the subject to be tested.

In some embodiments, the computing device 130 determines whether the mutation frequency of a mutation site of the subject to be tested is greater than or equal to a predetermined frequency threshold; if so, the computing device 130 determines the mutation site information of the subject based on that mutation site, for use in detecting tumor mutational burden. It should be understood that the content of cell-free tumor DNA in the peripheral blood of the subject is low, which makes positive sites difficult to determine. Detecting the tumor mutational burden based on mutation sites with a high mutation frequency in the sequencing result can therefore further improve the accuracy and stability of detection.

At block 206, the computing device 130 compares the mutation site information of the subject to be tested with a healthy subject mutation site dataset generated based on ctDNA sequencing results of a plurality of healthy subjects, to generate a set of differential mutation sites. The healthy subject mutation site dataset serves as a reference dataset for excluding sequencing errors and sequencing artifacts, that is, for eliminating mutation noise caused by sequencing errors when sequencing the sample of the subject to be tested. The set of differential mutation sites is generated, for example, from the mutation sites of the subject to be tested that differ significantly from the corresponding mutation sites of the healthy subjects.

With respect to the healthy subject mutation site dataset, in some embodiments, the manner of generating it includes: the computing device 130 generates variation result data for each of the plurality of healthy subjects based on their ctDNA sequencing results; merges the plurality of variation result data; and counts the number of samples in which each variation occurs, so as to generate the healthy subject mutation site dataset. For example, the computing device 130 obtains ctDNA sequencing results for blood samples of 41 healthy subjects; then uses the common software samtools to obtain corresponding variation result data for each of the 41 sequencing results; thereafter, the computing device 130 merges the variation result data of the 41 healthy subjects and records the number of samples in which each variation in the merged data occurs, so as to form the healthy subject mutation site dataset.
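The merge-and-count step described above can be sketched as follows. This is a minimal illustration that assumes each healthy sample's variation result data has already been parsed into a set of variant keys; the `Variant` tuple layout, the function name `build_healthy_dataset`, and the toy inputs are hypothetical and are not taken from the disclosure.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Assumed variant key: (chromosome, position, reference base, alternate base).
Variant = Tuple[str, int, str, str]


def build_healthy_dataset(per_sample_variants: Iterable[Set[Variant]]) -> Dict[Variant, int]:
    """Merge per-sample variant sets and count the samples carrying each variant."""
    sample_counts: Counter = Counter()
    for variants in per_sample_variants:
        # A set holds each variant once, so a sample adds at most 1 per variant.
        sample_counts.update(variants)
    return dict(sample_counts)


# Hypothetical toy input: three healthy samples instead of 41.
healthy_samples = [
    {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
    {("chr1", 100, "A", "G")},
    {("chr3", 300, "G", "A")},
]
healthy_dataset = build_healthy_dataset(healthy_samples)
```

Storing the sample count per variant, rather than a plain set, preserves how recurrent each background variant is across the healthy cohort.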

Regarding the manner of comparing the mutation site information of the subject to be tested with the healthy subject mutation site dataset, in some embodiments it includes, for example: using the mutation site information of the healthy subjects as background data, determining whether the associated parameter of each mutation site of the subject to be tested differs significantly from the associated parameter of the corresponding site in the healthy subject mutation site dataset; if there is no significant difference, filtering out the mutation site of the subject; if there is a significant difference, retaining the mutation site, so that the set of differential mutation sites is generated from all retained mutation sites of the subject. The reason is that the healthy subject mutation site dataset serves as a reference dataset for excluding sequencing errors: if the associated parameter of a mutation site of the subject does not differ significantly from that of the corresponding site in healthy subjects, the mutation site is likely noise caused by a sequencing error.

Studies have found that sequencing errors follow a Weibull distribution. Thus, in some embodiments, the computing device 130 may filter the mutation site information of the subject to be tested by determining whether it conforms to the Weibull distribution, so as to remove errors arising during sequencing. A method for comparing the mutation site information of the subject to be tested with the healthy subject mutation site dataset to generate the set of differential mutation sites is described below with reference to Fig. 5 and is not detailed here.
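One way to picture a Weibull-based check is sketched below. The disclosure does not specify the statistical test, so this is only an illustration under stated assumptions: the background error frequency at a site is modeled by a Weibull distribution whose `shape` and `scale` have been fitted elsewhere from healthy-subject data, and a site is retained when its observed frequency has an upper-tail probability below a significance level `alpha`. All names and numeric values are hypothetical.

```python
import math


def weibull_tail_probability(value: float, shape: float, scale: float) -> float:
    """Return P(X >= value) for a Weibull(shape, scale) random variable."""
    if value <= 0:
        return 1.0
    return math.exp(-((value / scale) ** shape))


def is_significantly_different(observed_frequency: float, shape: float,
                               scale: float, alpha: float = 0.01) -> bool:
    """Retain a site when its frequency is unlikely under the background model."""
    return weibull_tail_probability(observed_frequency, shape, scale) < alpha


# Hypothetical background parameters for one site (assumed, not from the source).
noise_like = is_significantly_different(0.001, shape=1.0, scale=0.002)
signal_like = is_significantly_different(0.05, shape=1.0, scale=0.002)
```

With these toy parameters, a frequency close to the background scale is treated as noise, while a frequency far in the tail is retained as a differential site.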

At block 208, the computing device 130 determines point mutation and short insertion/deletion variation information of the subject to be tested. For example, the computing device 130 uses the software Lianti and PINDEL, which detect point mutations and short insertion/deletion variants, to call the point mutations and short insertion/deletion variants of the subject based on the ctDNA sequencing result of the subject.

At block 210, the computing device 130 filters the determined point mutations and short insertion/deletion variation information using the set of differential mutation sites to determine a number of positive mutation sites for detecting tumor mutation burden.

In some embodiments, the manner of filtering the determined point mutation and short insertion/deletion variation information includes, for example: the computing device 130 determines whether the mutation sites associated with the point mutations and short insertion/deletion variants belong to the set of differential mutation sites obtained at block 206; and if so, the computing device 130 retains those mutation sites. As described above, the set of differential mutation sites is formed through comparison with the healthy subject mutation site dataset at block 206; retaining only mutation sites that belong to this set therefore excludes point mutation and short insertion/deletion variation information caused by sequencing errors. In some embodiments, the computing device 130 may further filter the retained point mutation and short insertion/deletion variation information. A method for filtering point mutation and short insertion/deletion variation information is described below in conjunction with Fig. 3 and is not detailed here.
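The membership check at block 210 amounts to a set lookup. The sketch below assumes that called variants and differential sites share the same key layout; the `Variant` tuple, the function name `filter_by_differential_set`, and the toy data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumed variant key: (chromosome, position, reference base, alternate base).
Variant = Tuple[str, int, str, str]


def filter_by_differential_set(called_variants: Iterable[Variant],
                               differential_sites: Set[Variant]) -> List[Variant]:
    """Keep only called variants that belong to the set of differential sites."""
    return [variant for variant in called_variants if variant in differential_sites]


# Hypothetical toy data.
differential_sites = {("chr1", 100, "A", "G"), ("chr7", 550, "T", "C")}
called = [
    ("chr1", 100, "A", "G"),   # present in the differential set, retained
    ("chr2", 200, "C", "T"),   # absent, treated as likely sequencing noise
]
retained = filter_by_differential_set(called, differential_sites)
```

Using a set for the differential sites keeps each lookup constant-time on average, which matters when the caller emits many candidate variants.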

In some embodiments, the manner of determining the number of positive mutation sites includes, for example: the computing device 130 obtains, from the filtered mutation sites, the mutation sites located in the corresponding region of the probe file; then removes the driver genes from the mutation sites of the corresponding region; and determines the number of positive mutation sites based on the mutation sites of the corresponding region after driver gene removal and a predetermined mutation frequency threshold. The manner of determining the predetermined mutation frequency threshold is described below with reference to Fig. 4 and is not detailed here.
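A minimal sketch of this counting step follows. It assumes that driver removal is done by gene name and that a site counts as positive when its mutation frequency is greater than or equal to the threshold; the disclosure states neither detail explicitly. The `CalledSite` type, the gene names, and the threshold value are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class CalledSite:
    gene: str          # gene in which the site falls (assumed annotation)
    position: int
    frequency: float   # mutation frequency of the site


def count_positive_sites(sites: Iterable[CalledSite], driver_genes: Set[str],
                         frequency_threshold: float) -> int:
    """Count sites outside driver genes whose frequency reaches the threshold."""
    return sum(
        1
        for site in sites
        if site.gene not in driver_genes and site.frequency >= frequency_threshold
    )


# Hypothetical toy input.
toy_sites = [
    CalledSite("EGFR", 1001, 0.30),   # removed: listed as a driver gene
    CalledSite("TTN", 2002, 0.05),    # counted: non-driver, above threshold
    CalledSite("MUC16", 3003, 0.004), # dropped: below threshold
]
positive_count = count_positive_sites(toy_sites, {"EGFR"}, frequency_threshold=0.01)
```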

In some embodiments, the computing device 130 determines the positive mutation sites from the mutation sites of the corresponding region using a plurality of criteria selected from: the reads in which the mutation site is located, the alignment quality of the reads, the base quality at the mutation site, the forward-to-reverse ratio of the mutated reads, the end orientation of the reads, multiple mutation data at the same site, the length covered by the mutated read and its mate read, and the ratio of the number of bases on the mutated read that do not match the reference genome to the total number of bases in the read; and determines a tumor mutational burden value based on the determined number of positive mutation sites and the length of the probe, for detecting the tumor mutational burden.

In the above scheme, sequencing and mutation site selection are performed on cell-free tumor DNA (ctDNA) from the peripheral blood of the subject to be tested, and a set of differential mutation sites is generated by comparing the mutation site information determined from the peripheral blood ctDNA of the subject with the healthy subject mutation site dataset, so as to filter the determined point mutation and short insertion/deletion variation information.

The effect of the method for detecting tumor mutational burden of the present disclosure will be described below in conjunction with Fig. 7 and Fig. 8. Fig. 7 shows a comparison of bTMB obtained based on blood ctDNA sequencing results and tTMB obtained based on tissue DNA sequencing results, according to an embodiment of the disclosure. In Fig. 7, each point (e.g., points 710 and 712) represents one of 59 new non-small cell lung cancer subjects (e.g., patients), and the abscissa and ordinate of each point respectively represent the bTMB (i.e., the tumor mutational burden calculated based on the ctDNA sequencing result of the blood sample) and the tTMB (i.e., the tumor mutational burden obtained based on the DNA sequencing result of the tissue sample) of the corresponding subject. As shown in Fig. 7, because blood ctDNA and tissue DNA are different biological samples from the same subject, the tumor mutational burdens detected from the two samples show a certain consistency, but the consistency is not very high. For example, the Spearman rank correlation coefficient used in Fig. 7 to indicate the consistency of bTMB with tTMB is 0.70.

Fig. 8 shows a comparison graph of bTMB obtained based on high frequency variation data of blood ctDNA sequencing results and tTMB obtained based on tissue DNA sequencing results, according to an embodiment of the disclosure. Each point in Fig. 8 represents one of a plurality of subjects to be tested. The abscissa of each point indicates the tumor mutational burden (bTMB) obtained based on the ctDNA sequencing result of the blood sample of the subject represented by the point; the ordinate indicates the tumor mutational burden (tTMB) obtained based on the DNA sequencing result of the tumor tissue sample of that subject. Unlike Fig. 7, the bTMB and tTMB of each point in Fig. 8 are calculated based on high frequency mutation data. As shown in Fig. 8, the tTMB of the subjects represented by the points in the first region 810 and the second region 820 is greater than the predetermined tTMB threshold indicated by the dashed line 860. The bTMB of the subjects represented by the points in the second region 820 and the fourth region 840 is greater than the predetermined bTMB threshold indicated by the dashed line 870. The bTMB and tTMB values of the subjects in the second region 820 are both high, so the comparison results are consistent; the bTMB and tTMB values of the subjects in the third region 830 are both relatively low, so the comparison results are likewise consistent. For a small number of subjects in the second region 820, such as the subject indicated by point 850, bTMB and tTMB are inconsistent, i.e., the tTMB is greater than the predetermined tTMB threshold indicated by the dashed line 860 while the bTMB is less than the predetermined bTMB threshold indicated by the dashed line 870. Overall, the consistency of bTMB and tTMB detected based on high frequency mutations, as shown in Fig. 8, is relatively good: the Spearman rank correlation coefficient indicating consistency is 0.93.
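For readers who want to reproduce this kind of consistency measure, the following is a minimal, dependency-free sketch of the Spearman rank correlation, computed as the Pearson correlation of average ranks so that ties are handled. The function names and the toy values are illustrative assumptions; they are not the data behind Fig. 7 or Fig. 8.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all values are equal")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical toy values, not data from the figures.
toy_btmb = [2.0, 5.0, 9.0, 14.0]
toy_ttmb = [1.0, 6.0, 7.0, 20.0]
toy_rho = spearman_correlation(toy_btmb, toy_ttmb)
```

Because the measure depends only on ranks, it captures whether subjects with higher bTMB also tend to have higher tTMB, without requiring the two scales to agree numerically.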

As can be seen from the consistency comparison results of bTMB and tTMB shown in Fig. 7 and Fig. 8, the scheme of the present disclosure, which detects TMB based on the sequencing result of cell-free tumor DNA (ctDNA) in the peripheral blood of the subject to be tested, can accurately, conveniently and stably detect the tumor mutational burden.

Fig. 3 shows a flow diagram of a method 300 for filtering point mutation and short insertion/deletion variation information, according to an embodiment of the present disclosure. It should be understood that the method 300 may be performed, for example, at the electronic device 900 depicted in Fig. 9, or at the computing device 130 depicted in Fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 302, the computing device 130 determines whether the remaining mutation sites belong to a predetermined SNP data set. In some embodiments, the computing device 130 subjects the mutation sites retained after filtering through the set of differential mutation sites to artifact filtering and common SNP database filtering.

At block 304, if the computing device 130 determines that the remaining mutation sites belong to the predetermined SNP data set, the mutation sites are left.

At block 306, the computing device 130 filters the remaining mutation sites associated with the point mutations and the short insertion/deletion variants based on at least one of the number of supporting reads, the sequencing depth, and the mutation frequency of the mutation sites.
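The evidence-based filter at block 306 can be sketched as a conjunction of simple cutoffs. The disclosure gives no cutoff values, so the defaults below are hypothetical, as are the `SiteEvidence` type and the function name. The sketch also assumes that mutation frequency is the number of supporting reads divided by the sequencing depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteEvidence:
    supporting_reads: int  # reads that carry the variant allele
    depth: int             # total reads covering the site

    @property
    def frequency(self) -> float:
        """Assumed definition: supporting reads divided by sequencing depth."""
        return self.supporting_reads / self.depth if self.depth else 0.0


def passes_evidence_filters(site: SiteEvidence, min_reads: int = 5,
                            min_depth: int = 500,
                            min_frequency: float = 0.005) -> bool:
    """Return True when the site meets every (hypothetical) cutoff."""
    return (
        site.supporting_reads >= min_reads
        and site.depth >= min_depth
        and site.frequency >= min_frequency
    )


# Hypothetical toy sites.
well_supported = passes_evidence_filters(SiteEvidence(8, 1000))
too_few_reads = passes_evidence_filters(SiteEvidence(2, 1000))
too_shallow = passes_evidence_filters(SiteEvidence(6, 300))
```

The source allows filtering on "at least one" of the three quantities; applying all three, as here, is the strictest variant and can be relaxed by lowering a cutoff to zero.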

By adopting the above means, the present disclosure can efficiently filter out false positive mutation sites.
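The filtering at block 306 can be pictured with the following minimal Python sketch. The `Site` record, the function name, and all three threshold values are hypothetical placeholders chosen for illustration; the disclosure does not state concrete cut-offs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Site:
    chrom: str
    pos: int
    alt_reads: int  # reads supporting the variant
    depth: int      # total reads covering the position

    @property
    def frequency(self) -> float:
        """Mutation frequency = supporting reads / sequencing depth."""
        return self.alt_reads / self.depth if self.depth else 0.0


def filter_by_support(sites: Iterable[Site],
                      min_alt_reads: int = 5,       # assumed value
                      min_depth: int = 100,         # assumed value
                      min_frequency: float = 0.01   # assumed value
                      ) -> List[Site]:
    """Keep only sites that pass all three evidence criteria."""
    return [s for s in sites
            if s.alt_reads >= min_alt_reads
            and s.depth >= min_depth
            and s.frequency >= min_frequency]
```

The sketch applies all three criteria together, although block 306 requires only at least one of them; dropping a condition from the comprehension gives the other variants.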

Fig. 4 shows a flow diagram of a method 400 for determining a predetermined mutation frequency threshold according to an embodiment of the present disclosure. It should be understood that method 400 may be performed, for example, at the electronic device 900 depicted in fig. 9, or at the computing device 130 depicted in fig. 1. It should be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 402, the computing device 130 determines whether the mutation type is a short insertion/deletion.

At block 404, if the computing device 130 determines that the mutation type is a short insertion/deletion, it is determined whether the corresponding mutation site occurs in the poly region.

At block 406, if the computing device 130 determines that the corresponding mutation site occurs in the poly region, it sets the predetermined mutation frequency threshold to be greater than or equal to a predetermined value. In other words, when a short insertion/deletion occurs in a poly region, the corresponding mutation frequency criterion needs to be raised.

By adopting the above means, the present disclosure can improve the accuracy of the determined positive mutation site.
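A minimal Python sketch of the threshold selection in blocks 402 to 406 is given below. The function name, the `"indel"` label, and both numeric values are assumptions for illustration only; the disclosure states only that the threshold is raised to at least a predetermined value for short insertions/deletions in a poly region.

```python
def mutation_frequency_threshold(mutation_type: str,
                                 in_poly_region: bool,
                                 base_threshold: float = 0.01,       # assumed
                                 poly_indel_minimum: float = 0.05    # assumed
                                 ) -> float:
    """Return the mutation frequency threshold to apply to one site.

    Short insertions/deletions in a poly region get a threshold that is
    at least `poly_indel_minimum`; every other site uses the base value.
    """
    if mutation_type == "indel" and in_poly_region:
        return max(base_threshold, poly_indel_minimum)
    return base_threshold
```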

Fig. 5 shows a flow diagram of a method 500 for generating a set of differential mutation sites, according to an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 900 depicted in fig. 9, or at the computing device 130 depicted in fig. 1. It should be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 502, the computing device 130 compares, for each mutation site of the test subject, the number of times the site occurs in the healthy subject mutation site dataset and the number of variant reads with which it occurs, so as to filter the mutation sites of the test subject. As described above, sequencing errors are modeled based on a Weibull distribution. Therefore, the computing device 130 may use the sets of mutation site information of a plurality (e.g., 41) of healthy subjects as a reference dataset and, for each mutation site of the test subject, compare the occurrence of that site in the test subject with its occurrence in the mutation site information of the 41 healthy subjects, so as to filter the mutation sites of the test subject.

At block 504, the computing device 130 generates first Weibull distribution data associated with the corresponding healthy subject mutation sites, based on the healthy subject mutation sites that correspond to the filtered mutation sites of the test subject. For example, the computing device 130 determines the data of the healthy subject mutation sites corresponding to the mutation sites of the test subject that remain after the filtering at block 502; a Weibull distribution calculation is then performed on this data to generate the first Weibull distribution data associated with the corresponding mutation sites in the healthy subjects.

At block 506, the computing device 130 generates second Weibull distribution data associated with the mutation sites of the test subject, based on the filtered mutation sites of the test subject.

At block 508, the computing device 130 determines whether the difference between the second Weibull distribution data and the first Weibull distribution data is greater than or equal to a predetermined difference threshold.

At block 510, if the computing device 130 determines that the difference between the second Weibull distribution data and the first Weibull distribution data is greater than or equal to the predetermined difference threshold, it retains the mutation site of the test subject, so as to generate the set of differential mutation sites based on all retained mutation sites of the test subject. For example, if the computing device 130 confirms that the difference between the second Weibull distribution data associated with a certain mutation site of the test subject after the filtering at block 502 and the first Weibull distribution data of the corresponding mutation site of the healthy subjects is greater than or equal to the predetermined difference threshold, this indicates that the mutation site of the test subject is not noise caused by sequencing errors, and the mutation site is therefore retained. The computing device 130 then forms the set of differential mutation sites, from which sequencing errors have been filtered out, based on all retained mutation sites of the test subject.

By adopting the means, the method can quickly and accurately filter out the variation data of the object to be detected caused by sequencing errors, so that the accuracy of the detected tumor mutation load is improved.
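The decision made at blocks 508 and 510 can be sketched in Python as follows. The sketch deliberately takes the first and second Weibull distribution data as already-computed per-site numbers, because the disclosure does not specify how those values are derived; the function name, the dictionary layout, and the choice to treat a site without healthy background as having a background value of 0.0 are all assumptions.

```python
from typing import Dict, Set, Tuple

SiteKey = Tuple[str, int]  # (chromosome, position); hypothetical key layout


def differential_sites(first: Dict[SiteKey, float],
                       second: Dict[SiteKey, float],
                       min_difference: float) -> Set[SiteKey]:
    """Retain sites whose test-subject value exceeds the healthy value.

    first  -- per-site value derived from the healthy subjects (block 504)
    second -- per-site value derived from the test subject (block 506)
    A site is kept when second - first >= min_difference (blocks 508, 510).
    """
    retained = set()
    for site, test_value in second.items():
        healthy_value = first.get(site, 0.0)  # assumption: no background -> 0
        if test_value - healthy_value >= min_difference:
            retained.add(site)
    return retained
```

Sites that fail the comparison are treated as sequencing noise and are simply absent from the returned set.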

A method for detecting tumor mutational burden according to an embodiment of the present disclosure will be described below in conjunction with fig. 6. Fig. 6 shows a flow diagram of a method 600 for detecting tumor mutational burden, in accordance with an embodiment of the present disclosure. It should be understood that method 600 may be performed, for example, at the electronic device 900 depicted in fig. 9, or at the computing device 130 depicted in fig. 1. It should be understood that method 600 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 602, the computing device 130 generates a plurality of variant result data for a plurality of healthy subjects, respectively, based on the obtained ctDNA sequencing results of blood samples of the plurality of healthy subjects, so as to generate a healthy subject mutation site dataset from the plurality of variant result data. For example, blood samples of a plurality (e.g., without limitation, 41) of healthy subjects are first collected and then sequenced to obtain a plurality of blood ctDNA sequencing results. Based on the obtained ctDNA sequencing results of the healthy subjects, the computing device 130 obtains the plurality of variant result data using the common software samtools, then merges the obtained variant result data and records the number of samples in which each variant occurs, to generate the healthy subject mutation site dataset. This dataset is used as the reference dataset for the subsequent exclusion of sequencing errors and sequencing artifacts.
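The merging and counting step of block 602 can be illustrated with the short Python sketch below. It assumes that each healthy subject's variant result data has already been parsed into a set of variant keys; the `Variant` tuple layout and the function name are hypothetical, and the upstream samtools processing is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); assumed layout


def build_healthy_dataset(
        per_sample_variants: Iterable[Set[Variant]]) -> Dict[Variant, int]:
    """Merge per-subject variant sets and count samples per variant."""
    counts: Counter = Counter()
    for variants in per_sample_variants:
        # set() guards against a variant being listed twice in one sample.
        counts.update(set(variants))
    return dict(counts)
```

The resulting mapping gives, for every variant, the number of healthy subjects in which it was observed.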

At block 604, a region file, i.e., a probe file, covering the genome for sequencing of the object to be tested is prepared, and a file of a coding sequence (CDS) region covered by the probe file is obtained. It should be understood that in the probe design process, the Tm of the probe, the length of the probe, the GC content, the secondary structure of the probe, the complexity, the direction of the probe, the number of the probes, and the like are comprehensively considered, so as to enhance the capture efficiency and uniformity of the probe.

At block 606, the driver variants contained in the probe region are selected. Driver variants are collected, previously reported variants that have a definite significance for the occurrence and development of cancer. For example, the computing device 130 may intersect the collected driver variants with the probe regions determined at block 604. Driver mutations are a main cause of carcinogenesis and are usually located in driver genes. For example, 724 cancer-associated genes, including driver genes that can undergo driver mutations or lead to new functions, can be found in some databases, such as the COSMIC database. The driver variants described above may be collected and input into the computing device 130 for processing at block 606. Known driver genes include, for example, EGFR, MET, BRAF, PIK3CA, NF1, KRAS, and the NOTCH family.

At block 608, the computing device 130 determines the mutation base type and the mutation frequency corresponding to each mutation site based on the ctDNA sequencing result of a blood sample of the test subject. For example, the computing device 130 obtains the mutation base type and mutation frequency corresponding to each mutation site of the test subject using the known variant detection software samtools.

At block 610, the computing device 130 compares the mutation site information of the test subject with the healthy subject mutation site dataset to generate a set of differential mutation sites. For example, the computing device 130 compares, for each mutation site of the test subject, the number of times the site occurs in the healthy subject mutation site dataset and the number of variant reads with which it occurs, so as to filter the mutation site information of the test subject. For example, if a mutation site of the test subject appears in the variant data of only one healthy subject, and that healthy subject has only 2 variant reads at a sequencing depth of 1000X, the mutation site of the test subject is retained. The computing device 130 determines the healthy subject mutation sites in the healthy subject mutation site dataset that correspond to the retained mutation sites of the test subject, and then performs a Weibull distribution calculation on the corresponding healthy subject mutation site data to generate first Weibull distribution data. The computing device 130 then performs a Weibull distribution calculation on the retained mutation site data of the test subject to generate second Weibull distribution data, and determines whether the difference between the second Weibull distribution data and the first Weibull distribution data is greater than or equal to a predetermined difference threshold. If the difference is greater than or equal to the predetermined difference threshold, the mutation site information of the test subject is retained; otherwise, it is regarded as noise caused by errors in the sequencing process and is filtered out. Thereafter, the computing device 130 generates the set of differential mutation sites based on all retained mutation sites of the test subject.
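The pre-filter described in the example above (a site seen in only one healthy subject, with 2 variant reads at 1000X, is retained) might be sketched in Python as follows. The record type, the function name, and the three cut-off values are assumptions chosen so that the example from the text passes; the disclosure does not give the actual criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthyObservation:
    """One healthy subject in which the site was observed."""
    alt_reads: int  # variant reads in that healthy subject
    depth: int      # sequencing depth at the site in that subject


def keep_site(observations: List[HealthyObservation],
              max_samples: int = 1,                  # assumed value
              max_alt_reads: int = 2,                # assumed value
              max_healthy_frequency: float = 0.005   # assumed value
              ) -> bool:
    """Retain a test-subject site if healthy-subject evidence is weak."""
    if len(observations) > max_samples:
        return False  # recurrent in healthy subjects: likely background
    return all(o.alt_reads <= max_alt_reads
               and o.alt_reads / o.depth <= max_healthy_frequency
               for o in observations)
```

A site that never appears in the healthy dataset has an empty observation list and is retained by this sketch.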

At block 612, the computing device 130 determines point mutations and short insertion/deletion variations of the subject based on ctDNA sequencing results of a blood sample of the subject. For example, computing device 130 captures point mutations and short insertion/deletion mutations of the test subject using the software Lianti and PINDEL, which are known to detect point mutations and short insertion/deletion mutations.

At block 614, the computing device 130 filters out point mutations and short insertion/deletion variants that do not belong to the set of differential mutation sites. For example, the computing device 130 determines whether the point mutations and short insertion/deletion variants of the test subject determined at block 612 belong to the set of differential mutation sites generated at block 610. If a point mutation or short insertion/deletion variant of the test subject belongs to the set of differential mutation sites, it is retained; otherwise, it is removed. It should be appreciated that the criteria for calling variants at block 612 are relatively loose, while the criteria for forming the variation information at block 610 are relatively strict. By retaining the point mutations and short insertion/deletion variants of the test subject that appear in the result files of both block 612 and block 610, the computing device 130 facilitates verification of the variant result data.
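The membership test of block 614 amounts to a set intersection, as in the minimal Python sketch below. The `Variant` key layout and the function name are hypothetical; returning the removed variants alongside the retained ones is an added convenience for inspection and is not described in the disclosure.

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); assumed layout


def split_by_differential_set(
        called: Iterable[Variant],
        differential: Set[Variant]) -> Tuple[List[Variant], List[Variant]]:
    """Split called variants into (retained, removed) lists.

    called       -- point mutations and short indels from block 612
    differential -- set of differential mutation sites from block 610
    """
    kept: List[Variant] = []
    removed: List[Variant] = []
    for variant in called:
        (kept if variant in differential else removed).append(variant)
    return kept, removed
```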

At block 616, the computing device 130 performs artifact filtering, common SNP database filtering, and filtering based on the number of reads supporting the mutation site, the sequencing depth, and the mutation frequency on the retained point mutations and short insertion/deletion variants of the test subject. For example, point mutations and short insertion/deletion variants with too low a mutation frequency may not be variants actually present in the test subject, but noise introduced, for example, by PCR during the sequencing workflow. Therefore, a predetermined mutation frequency threshold is required to filter out point mutations and short insertion/deletion variants with too low a mutation frequency, so as to improve the accuracy of the positive results.

At block 618, the computing device 130 selects the mutation sites located in the corresponding regions of the probe file. For example, the computing device 130 determines whether the position of a mutation site lies in a corresponding region of the probe file and, if not, removes the mutation site. It should be understood that during actual sequencing, sequences flanking both sides of a probe may also be captured, so that not all measured reads fall within the corresponding regions of the probe file; the computing device 130 therefore retains only the mutation sites in the corresponding regions of the probe file, based on the positions of the mutation sites.
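The position check of block 618 can be sketched in Python with a sorted interval index. The sketch assumes that the probe file has been parsed into `(chrom, start, end)` tuples using 0-based, half-open coordinates as in the BED format, and that regions on the same chromosome do not overlap; the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def build_index(regions: Iterable[Region]) -> Dict[str, List[Tuple[int, int]]]:
    """Group regions by chromosome and sort them by start coordinate."""
    index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in regions:
        index[chrom].append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_probe_region(index: Dict[str, List[Tuple[int, int]]],
                    chrom: str, pos: int) -> bool:
    """True if the 0-based position lies inside a probe region."""
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]
```

Mutation sites for which `in_probe_region` returns `False` would be removed before the positive mutation sites are determined.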

At block 620, the computing device 130 removes the driver genes from the selected mutation sites of the corresponding regions of the probe file. It is understood that driver genes have a clear impact on the occurrence and development of cancer. Therefore, excluding the driver genes when calculating TMB helps reduce the associated bias.

At block 622, the computing device 130 determines the positive mutation sites among the mutation sites of the corresponding regions from which the driver genes have been removed. For example, if the computing device 130 determines that the mutation type is a short insertion/deletion, it further determines whether the mutation site occurs in a poly region. If the mutation site is determined to occur in a poly region, the corresponding mutation frequency criterion is raised. In some embodiments, the computing device 130 determines the number of positive mutation sites using a plurality of the following items: the reads in which the mutation site is located, the alignment quality of the reads, the base quality at the mutation site, the ratio of forward to reverse mutation reads, the end orientation of the reads, multiple-mutation data for the same site, the coverage length of the mutation reads and their paired reads, and the ratio of the number of bases on the mutation reads that do not match the reference genome to the total number of bases of the reads. Determining the positive mutation sites by combining criteria on the variant itself, the sequencing conditions, the base quality, whether other variants exist around the variant, and the condition of the gene in which the variant is located (e.g., absence of repeated sequences) improves the accuracy with which positive mutation sites are identified.

At block 624, the computing device 130 generates a tumor mutation burden value based on the determined number of positive mutation sites and the length of the probe. For example, the number of positive mutation sites is divided by the length of the probe to obtain the tumor mutation burden value.
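The division at block 624 can be written out as the following minimal Python sketch. The function name is hypothetical, and scaling the result to mutations per megabase is an assumption that follows the usual definition of TMB given in the Background rather than the wording of block 624 itself.

```python
def tumor_mutation_burden(positive_sites: int, probe_length_bp: int) -> float:
    """Return TMB in mutations per megabase of probe-covered sequence."""
    if probe_length_bp <= 0:
        raise ValueError("probe length must be a positive number of base pairs")
    # Multiply first so that exact ratios stay exact in floating point.
    return positive_sites * 1_000_000 / probe_length_bp
```

For instance, 12 positive mutation sites over a probe region of 1,200,000 base pairs would give a value of 10 mutations per megabase.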

In the above scheme, the mutation sites of the test subject are screened based on the panel sequencing result (the sequenced BAM file) of cell-free tumor DNA (ctDNA) from the peripheral blood of the test subject, so as to obtain the mutation sites that can be used to calculate the tumor mutation burden, and the positive mutation sites are determined, thereby obtaining the TMB of the test subject. The method can accurately, conveniently and stably detect the tumor mutation burden.

FIG. 9 schematically illustrates a block diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure. The device 900 may be a device for executing the methods 200 to 600 illustrated in figs. 2 to 6, for example the computing device 130 illustrated in fig. 1. As shown in fig. 9, device 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 902 or loaded from a storage unit 907 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including an input unit 906, an output unit 907, and a storage unit 907. The processing unit 901 performs the respective methods and processes described above, for example the methods 200 to 600. For example, in some embodiments, the methods 200 to 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 907. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into the RAM 903 and executed by the CPU 901, one or more operations of the methods 200 to 600 described above may be performed. Alternatively, in other embodiments, CPU 901 may be configured to perform one or more acts of the methods 200 to 600 by any other suitable means (e.g., by way of firmware).

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium as used herein is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are merely alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for detecting tumor mutational burden comprising:

obtaining a circulating tumor DNA (ctDNA) sequencing result of a blood sample of an object to be tested;

determining a mutation base type and a mutation frequency associated with each site based on the ctDNA sequencing result so as to generate mutation site information of the object to be tested;

comparing the mutation site information of the subject to be tested with a healthy subject mutation site dataset generated based on ctDNA sequencing results of a plurality of healthy subjects to generate a set of differential mutation sites;

determining point mutation and short insertion/deletion mutation information of the object to be detected; and

filtering the determined point mutation and short insertion/deletion variation information using the set of differential mutation sites to determine the number of positive mutation sites for detecting tumor mutation burden,

wherein determining the number of positive mutation sites for detecting tumor mutational burden comprises:

for the mutation sites remaining after filtering, obtaining the mutation sites of the corresponding region of the probe file;

removing the driver gene from the mutation sites of the corresponding region; and

determining the number of positive mutation sites based on the mutation sites of the corresponding region from which the driver gene has been removed and a predetermined mutation frequency threshold.

2. The method of claim 1, wherein comparing the mutation site information of the test subject to a healthy subject mutation site dataset to generate a set of differential mutation sites comprises:

comparing, for each of the mutation sites of the object to be tested, the number of times the site occurs in the healthy subject mutation site dataset and the number of variant reads with which it occurs, so as to filter the mutation sites of the object to be tested;

generating first Weibull distribution data associated with the corresponding healthy subject mutation sites based on the healthy subject mutation sites corresponding to the filtered mutation sites of the object to be tested;

generating second Weibull distribution data associated with the mutation sites of the object to be tested based on the filtered mutation sites of the object to be tested; and

in response to determining that the difference between the second Weibull distribution data and the first Weibull distribution data is greater than or equal to a predetermined difference threshold, retaining the mutation sites of the object to be tested, so as to generate the set of differential mutation sites based on all of the retained mutation sites of the object to be tested.

3. The method of claim 1, wherein filtering for the determined point mutation and short insertion/deletion variation information comprises:

determining whether a mutation site associated with the point mutation and short insertion/deletion variation belongs to the set of differential mutation sites; and

leaving mutation sites associated with the point mutation and short insertion/deletion variant in response to determining that the mutation sites associated with the point mutation and short insertion/deletion variant belong to the set of differential mutation sites.

4. The method of claim 3, wherein filtering the determined point mutation and short insertion/deletion variation information further comprises:

determining whether the remaining mutation sites belong to a predetermined SNP data set;

in response to determining that the remaining mutation site belongs to the predetermined SNP data set, the mutation site is left.

5. The method of claim 4, wherein filtering the determined point mutation and short insertion/deletion variation information further comprises:

filtering the remaining mutation sites associated with the point mutation and the short insertion/deletion variation based on at least one of the number of reads supporting the mutation sites, the sequencing depth, and the mutation frequency.

6. The method of claim 1, wherein the predetermined mutation frequency threshold is determined via:

in response to determining that the mutant base type is a short insertion/deletion, determining whether a corresponding mutation site occurs in a poly region; and

in response to determining that the corresponding mutation site occurs in a poly region, causing the predetermined mutation frequency threshold to be greater than or equal to a predetermined value.

7. The method of claim 1, wherein determining the number of positive mutation sites for detecting tumor mutational burden comprises:

determining positive mutation sites, based on the mutation sites in the corresponding region, using a plurality of the following items: the reads in which the mutation site is located, the alignment quality of the reads, the base quality at the mutation site, the ratio of forward to reverse mutation reads, the end orientation of the reads, multiple-mutation data for the same site, the coverage length of the mutation reads and their paired reads, and the ratio of the number of bases on the mutation reads that do not match the reference genome to the total number of bases of the reads; and

determining a tumor mutation burden value based on the determined number of positive mutation sites and the length of the probe.

8. The method of claim 1, wherein generating the healthy subject mutation site dataset based on ctDNA sequencing results of a plurality of healthy subjects comprises:

generating a plurality of variant result data of a plurality of healthy subjects based on ctDNA sequencing results of the plurality of healthy subjects, respectively;

merging the plurality of variant result data; and

counting a sample number of each variation in the plurality of variation result data for generating a mutation site data set of the healthy subject.

9. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the computing device to perform the steps of the method of any of claims 1 to 8.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-8.