CN114446393A

CN114446393A - Method, electronic device and computer storage medium for predicting liver cancer feature type

Info

Publication number: CN114446393A
Application number: CN202210095636.3A
Authority: CN
Inventors: 尤冬; 张丽文; 刘阳
Original assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Current assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-05-06
Anticipated expiration: 2042-01-26
Also published as: CN114446393B

Abstract

The present disclosure relates to a method, a computing device, and a storage medium for predicting a liver cancer feature type. The method comprises: generating genomic variation data of a plurality of predetermined genes of a tumor sample of a subject to be tested, based on alignment result data of the tumor sample; acquiring clinical data about the subject to be tested; determining tumor mutation burden information about the subject to be tested; obtaining immune checkpoint molecule expression data about the subject to be tested; generating input data for a prediction model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data; and extracting features of the input data with a prediction model trained on multiple samples so as to predict the liver cancer feature type based on the extracted features, the prediction model being constructed based on a neural network model. The method can improve the reliability of predicting the liver cancer feature type and generalizes well in clinical application.

Description

Method, electronic device and computer storage medium for predicting liver cancer feature type

Technical Field

The present disclosure relates generally to bioinformatics processing, and in particular, to methods, electronic devices, and computer storage media for predicting liver cancer feature types.

Background

Research shows that the annual recurrence rate of patients with hepatocellular carcinoma (HCC) is as high as 50 percent, and that recurrence is a main factor affecting the long-term postoperative survival of patients with early-stage HCC. Accurately predicting the liver cancer feature type is therefore of great significance for assisting in determining the risk of liver cancer recurrence.

Conventional methods for predicting the risk of liver cancer recurrence include, for example, predicting the postoperative recurrence risk of early-stage HCC by constructing a cytosine-guanine dinucleotide (CpG) methylation signature, or constructing an early-stage HCC recurrence prediction model based on multidimensional information such as radiomics, visual analysis, and clinical pathology. Although these conventional methods show some predictive capability, genome-wide methylation profiling is not yet routinely used in clinical work and therefore lacks the support of clinical evidence, while interpreting imaging results such as radiomics requires extensive expert experience and is time-consuming and labor-intensive. The practical value of these methods for clinical translation is therefore limited.

In summary, the conventional methods for predicting the risk of liver cancer recurrence have the following disadvantage: they either depend on expert experience or lack the support of clinical evidence, so it is difficult to achieve both generalization in clinical application and reliability of the prediction result.

Disclosure of Invention

The present disclosure provides a method, an electronic device, and a computer storage medium for predicting a characteristic type of liver cancer, which can not only improve the reliability of predicting a characteristic type of liver cancer, but also have good generalization in clinical applications.

According to a first aspect of the present disclosure, a method for predicting a characteristic type of a liver cancer is provided. The method comprises the following steps: generating genomic variation data of a plurality of predetermined genes with respect to the tumor sample of the subject based on the comparison result data with respect to the tumor sample of the subject; acquiring clinical data about a subject to be tested; determining tumor mutation load information about the test subject; obtaining immune checkpoint molecule expression data about an object to be tested; generating input data for a predictive model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data; and extracting features of the input data based on a prediction model trained through multiple samples so as to predict the liver cancer feature type based on the extracted features, wherein the prediction model is constructed based on a neural network model.

According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

In some embodiments, generating input data for the predictive model comprises: determining tumor purity data about the tumor sample based on the proportion of tumor cells in tumor tissue of the tumor sample; and calculating a sequencing depth based on the alignment result data; and generating input data for a predictive model based on the genomic variation data, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth.

In some embodiments, generating input data for the predictive model based on the genomic variation data, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth comprises: generating candidate features based on the genomic variation data, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth; determining a contribution degree of each candidate feature to the classification of the liver cancer feature type; sorting the candidate features in descending order of contribution degree; and determining the candidate features whose rank is less than or equal to a predetermined order threshold as input data for the prediction model.

In some embodiments, the genomic variation data of the plurality of predetermined genes of the tumor sample of the subject to be tested comprises: single base substitution data, short and long insertion/deletion data, copy number variation data, and gene rearrangement data of the plurality of predetermined genes in the tumor sample of the subject.

In some embodiments, the clinical data about the subject to be tested includes at least: gender information, age information and tumor stage information about the subject to be tested.

In some embodiments, the plurality of predetermined genes belongs to a predetermined set of genes and the immune checkpoint molecule expression data is programmed death ligand 1 expression data.

In some embodiments, the predictive model is constructed based on a multi-layer feed-forward network trained by an error back-propagation algorithm.

In some embodiments, predicting the type of liver cancer feature based on the extracted features comprises: based on the extracted features, a prediction result about the liver cancer feature type of the object to be tested is determined, the prediction result indicating the liver cancer primary focus feature type or the liver cancer recurrence/metastasis feature type.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

Fig. 1 shows a schematic diagram of a system for implementing a method of predicting a characteristic type of a liver cancer, according to an embodiment of the present disclosure.

Fig. 2 shows a flow chart of a method for predicting a liver cancer signature type according to an embodiment of the present disclosure.

FIG. 3 shows a topological schematic of a predictive model according to an embodiment of the present disclosure.

Fig. 4 illustrates a ROC curve diagram for a predictive model according to some embodiments of the present disclosure.

FIG. 5 illustrates a ROC curve diagram for a predictive model according to further embodiments of the present disclosure.

FIG. 6 shows a flow diagram of a method for generating input data for a predictive model, according to an embodiment of the disclosure.

Fig. 7 shows a statistical result diagram of an evaluation in which the 30 selected candidate features are input to the prediction model, according to an embodiment of the disclosure.

Fig. 8 illustrates a ROC curve diagram for a predictive model according to some embodiments of the present disclosure where the predetermined order threshold is 30.

FIG. 9 schematically illustrates a block diagram of an electronic device suitable for use to implement embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the accompanying drawings, it is to be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As described above, the conventional methods for predicting the liver cancer feature type have the following disadvantage: they either depend on expert experience or lack the support of clinical evidence, so it is difficult to achieve both generalization in clinical application and reliability of the prediction result.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for predicting a liver cancer feature type. In this scheme, genomic variation data of a plurality of predetermined genes of a tumor sample of a subject to be tested, clinical data, tumor mutation burden information, and immune checkpoint molecule expression data are obtained, and input data are generated based at least on these data and fed to a prediction model constructed based on a neural network model in order to predict the liver cancer feature type. Because the multidimensional data used to generate the input data, such as the clinical parameters and the genomic mutation information, are supported by clinical evidence, the reliability of the prediction result is improved. In addition, the prediction model, which is built on a neural network algorithm and trained on samples, performs complex feature transformations on the reliable multidimensional data to obtain the prediction result for the liver cancer feature type. The scheme can therefore improve the reliability of predicting the liver cancer feature type and generalizes well in clinical application.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method for predicting a liver cancer feature type according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes a computing device 110, a server 140, a sequencing device 130, and a network 150. In some embodiments, the computing device 110, the server 140, and the sequencing device 130 exchange data via the network 150.

The sequencing device 130 is used, for example, to sequence a tumor sample of a subject to be tested, and to send the generated sequencing data of the tumor sample, and/or the alignment result data between the sequencing data of the tumor sample and the sequencing data of a reference genome, to the computing device 110.

The server 140 is used, for example, to send clinical data about the subject to be tested to the computing device 110. In some embodiments, the server 140 may also send sequencing data, alignment result data, and clinical data about the tumor sample of the subject to be tested to the computing device 110.

The computing device 110 is used, for example, to predict a liver cancer feature type. In particular, the computing device 110 may generate genomic variation data of a plurality of predetermined genes of a tumor sample of a subject to be tested, acquire clinical data about the subject to be tested, and determine tumor mutation burden information about the subject to be tested. The computing device 110 may also obtain immune checkpoint molecule expression data about the subject to be tested, and generate input data for a prediction model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data. The computing device 110 may also predict the liver cancer feature type based on a prediction model trained on multiple samples.

In some embodiments, the computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. The computing device 110 includes, for example: a genomic variation data generation unit 112, a clinical data acquisition unit 114, a tumor mutation burden information determination unit 116, an immune checkpoint molecule expression data acquisition unit 118, an input data generation unit 120, and a liver cancer feature type prediction unit 122. These units may be configured on one or more computing devices 110.

The genomic variation data generation unit 112 is used to generate genomic variation data of a plurality of predetermined genes of the tumor sample of the subject to be tested, based on the alignment result data of the tumor sample of the subject to be tested.

The clinical data acquisition unit 114 is used to acquire clinical data about the subject to be tested.

The tumor mutation burden information determination unit 116 is used to determine tumor mutation burden information about the subject to be tested.

The immune checkpoint molecule expression data acquisition unit 118 is used to obtain immune checkpoint molecule expression data about the subject to be tested.

The input data generation unit 120 is used to generate input data for the prediction model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data.

The liver cancer feature type prediction unit 122 is used to extract features of the input data based on a prediction model trained on multiple samples, so as to predict the liver cancer feature type based on the extracted features, the prediction model being constructed based on a neural network model.

A method for predicting a liver cancer feature type according to an embodiment of the present disclosure will be described below with reference to Fig. 2. Fig. 2 shows a flow diagram of a method 200 for predicting a liver cancer feature type in accordance with an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 900 depicted in Fig. 9, and may also be executed at the computing device 110 depicted in Fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 202, the computing device 110 generates genomic variation data for a plurality of predetermined genes for a tumor sample of a test subject based on the alignment result data for the tumor sample of the test subject.

The plurality of predetermined genes belong, for example, to a predetermined gene set. The plurality of predetermined genes include, for example: multiple genes involved in genetic variation associated with liver cancer, and multiple hotspot genes involved in genetic variation associated with multiple cancers other than liver cancer (for example, genes with a high mutation frequency, a high grade of clinical evidence, or an association with multiple cancers). In some embodiments, the predetermined gene set comprises 450 genes. In some embodiments, the genomic variation data of the plurality of predetermined genes in the tumor sample of the subject to be tested include, for example: single base substitution data, short and long insertion/deletion data, copy number variation data, and gene rearrangement data of the plurality of predetermined genes of the tumor sample of the subject.

At step 204, the computing device 110 obtains clinical data about the subject under test. For example, computing device 110 obtains clinical data about the subject to be tested from server 140.

The clinical data about the subject to be tested include, for example, at least: gender information, age information, and tumor stage information about the subject to be tested. It should be understood that the computing device 110 may also acquire other clinical data about the subject to be tested.

At step 206, the computing device 110 determines tumor mutation burden information about the subject to be tested. Tumor mutation burden (TMB) is defined as the number of somatic variations, including gene coding errors, base substitutions, and gene insertions or deletions, detected per million bases. It is understood that TMB may indirectly reflect the ability and extent of a tumor to produce neoantigens.

A method for determining the tumor mutation burden information about the subject to be tested includes, for example: capturing a target region using designed sample probes and sequencing the target region to obtain a sequencing result; aligning the sequencing result with a reference genome to obtain alignment result data corresponding to the target region; performing mutation detection (for example, detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs)) on the sequenced target region based on the alignment result data to locate all mutation sites; filtering the mutation sites using GATK software; annotating the genes at the mutation sites that pass the filtering, and filtering out unwanted mutations (germline mutations, mutations caused by driver genes, and unrelated mutations) based on the annotation results; and counting the retained mutations to calculate the tumor mutation burden information of the subject to be tested.
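The final filtering and counting steps described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `Variant` record, its annotation flags, the gene names, and the panel size are all hypothetical, and the upstream alignment, variant calling, and GATK filtering are assumed to have already produced the annotated list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (not a GATK data structure)."""
    gene: str
    kind: str           # "SNV" or "INDEL"
    is_germline: bool   # annotated as a germline mutation
    is_driver: bool     # annotated as a driver-gene mutation
    passed_qc: bool     # survived the upstream quality filtering


def tumor_mutation_burden(variants, panel_size_bp):
    """Return retained mutations per million bases of the sequenced target region."""
    if panel_size_bp <= 0:
        raise ValueError("panel_size_bp must be positive")
    retained = [
        v for v in variants
        if v.passed_qc and not v.is_germline and not v.is_driver
    ]
    return len(retained) / (panel_size_bp / 1_000_000)


# Illustrative input only; gene names and flags are made up for the example.
variants = [
    Variant("TP53", "SNV", False, False, True),
    Variant("CTNNB1", "SNV", False, True, True),     # driver: excluded
    Variant("ARID1A", "INDEL", False, False, True),
    Variant("BRCA2", "SNV", True, False, True),      # germline: excluded
    Variant("TERT", "SNV", False, False, False),     # failed QC: excluded
]
tmb = tumor_mutation_burden(variants, panel_size_bp=2_000_000)
```

With two retained mutations over an assumed 2 Mb target region, the sketch yields 1.0 mutation per million bases.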

At step 208, the computing device 110 obtains immune checkpoint molecule expression data about the subject to be tested. In some embodiments, the immune checkpoint molecule expression data is programmed death ligand 1 (PD-L1) expression data. Studies show that PD-L1 protein expression is related to the clinicopathological features of patients with hepatocellular carcinoma. Specifically, positive PD-L1 expression is related to tumor infiltration depth and TNM stage; PD-L1 expression in normal liver tissue is localized only in the cytoplasm, whereas cell-membrane expression of PD-L1 is found in liver cancer tissue; and the expression and localization of PD-L1 are related to the pathological stage of hepatocellular carcinoma. It is therefore necessary to use PD-L1 data for predicting the liver cancer feature type.

The immune checkpoint molecule expression data about the subject to be tested may be obtained, for example, by detecting PD-L1 protein expression with the immunohistochemical SP method.

At step 210, the computing device 110 generates input data for a predictive model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data.

A method for generating the input data of the prediction model includes, for example: determining tumor purity data about the tumor sample based on the proportion of tumor cells in the tumor tissue of the tumor sample; calculating a sequencing depth based on the alignment result data; and generating input data for the prediction model based on the genomic variation data, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth. The method 600 for generating input data for the prediction model is described in detail below in conjunction with Fig. 6 and is not elaborated here.

At step 212, the computing device 110 extracts features of the input data based on a prediction model trained on multiple samples, so as to predict the liver cancer feature type based on the extracted features, the prediction model being constructed based on a neural network model.

The prediction model is constructed, for example, based on a multi-layer feed-forward network trained by an error back-propagation algorithm. In some embodiments, the prediction model is constructed based on a BP (back propagation) neural network. The BP neural network is a multi-layer feed-forward network trained according to the error back-propagation algorithm. It can learn and store a large number of input-output pattern mappings without the mathematical equations describing those mappings being specified in advance. The learning rule of the BP neural network uses the steepest descent method: the weights and thresholds of the network are continuously adjusted through back propagation so that the sum of squared errors of the network is minimized.

Fig. 3 shows a topological schematic of a prediction model 300 according to an embodiment of the present disclosure. As shown in Fig. 3, the prediction model 300 includes an input layer 310, a hidden layer 320, and an output layer 330. The neurons of the input layer 310 receive the input data generated at step 210 and pass it to the neurons of the hidden layer 320. The hidden layer 320 is an internal information processing layer responsible for information transformation. It should be understood that the hidden layer 320 can be designed as a single hidden layer or as multiple hidden layers, depending on the required information transformation capability. The information passed by the last hidden layer to each neuron of the output layer is processed further, which completes one forward propagation pass of learning. The output layer 330 outputs the prediction result for the liver cancer feature type. When the prediction actually output by the output layer 330 does not match the expected output, the error back-propagation phase begins: the error is passed back from the output layer 330, layer by layer, to the hidden layer 320 and the input layer 310, and the weights of each layer are corrected by gradient descent on the error. This repeated cycle of forward propagation of information and backward propagation of error, in which the weights of each layer are continuously adjusted, is the learning and training process of the neural network; it continues until the error of the network output falls to an acceptable level or a preset number of learning iterations is reached.
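The forward and backward phases described above can be illustrated with a very small network. The following is a sketch under stated assumptions, not the disclosed model: it uses one hidden layer, sigmoid activations, a squared-error loss, per-sample gradient descent, and a toy data set, and the names `train`, `lr`, `max_epochs`, and `tol` are hypothetical. The two stopping conditions mirror the "acceptable error" and "preset number of learning iterations" criteria.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W1, b1, W2, b2):
    """Forward pass: input -> sigmoid hidden layer -> single sigmoid output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def sum_squared_error(data, W1, b1, W2, b2):
    return sum((forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data)


def train(data, n_hidden=2, lr=0.5, max_epochs=2000, tol=1e-3, seed=0):
    """Train by error back propagation; stop at tol or after max_epochs."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(max_epochs):
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)
            # Error signal at the output, then propagated back to the hidden layer.
            delta_out = (y - t) * y * (1.0 - y)
            delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # Gradient-descent updates of weights and biases, layer by layer.
            for j in range(n_hidden):
                W2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hid[j] * x[i]
                b1[j] -= lr * delta_hid[j]
            b2 -= lr * delta_out
        if sum_squared_error(data, W1, b1, W2, b2) <= tol:
            break
    return W1, b1, W2, b2


# Toy data (logical OR), used only to show that training reduces the error.
toy_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
error_before = sum_squared_error(toy_data, *train(toy_data, max_epochs=0))
error_after = sum_squared_error(toy_data, *train(toy_data))
```

Calling `train` with `max_epochs=0` returns the untrained initial weights for the same seed, which gives a baseline for comparing the error before and after training.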

The calculation method of the prediction model will be described below with reference to formulas (1) and (2).

$$z^{(l)} = W^{(l)} f_{l-1}\left(z^{(l-1)}\right) + b^{(l)} \tag{1}$$

$$a^{(l)} = f_{l}\left(W^{(l)} a^{(l-1)} + b^{(l)}\right) \tag{2}$$

In the above formulas (1) and (2), $b^{(l)}$ represents the bias from layer $l-1$ to layer $l$; $m^{(l)}$ represents the number of neurons in layer $l$; $f_{l}(\cdot)$ represents the activation function of the neurons in layer $l$; $W^{(l)}$ represents the weight matrix from layer $l-1$ to layer $l$; $z^{(l)}$ represents the net input (net activity value) of the neurons in layer $l$; and $a^{(l)}$ represents the output (activity value) of the neurons in layer $l$.
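Formulas (1) and (2) describe the same layer-by-layer recursion: the net input of layer l is computed from the previous layer's output, and the activation function turns it into the output of layer l. A minimal sketch of that recursion is shown below; the list-of-lists weight representation and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, a_prev, f):
    """Compute z = W * a_prev + b and a = f(z) for one layer."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + b_i for row, b_i in zip(W, b)]
    return z, [f(z_i) for z_i in z]


def network_forward(x, layers):
    """Apply formula (2) layer by layer; layers is a list of (W, b, f) tuples."""
    a = list(x)
    for W, b, f in layers:
        _, a = layer_forward(W, b, a, f)
    return a


# One layer with zero weights and zero bias: the net input is 0, so the
# sigmoid output is exactly 0.5 regardless of the input values.
output = network_forward([1.0, 2.0], [([[0.0, 0.0]], [0.0], sigmoid)])
```

Here the number of rows of `W` plays the role of the neuron count of the layer, and the length of each row matches the neuron count of the previous layer.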

It should be understood that not all features extracted as training samples contribute to classifying the liver cancer feature type, so relatively important features need to be selected by a feature selection method for training the classifier that classifies the liver cancer feature type. In some embodiments, features are screened using the feature importance attribute of a random forest model to obtain the correlation between the features of the training samples (or the input features) and the liver cancer feature type classification. Specifically, during feature selection the prediction model is trained multiple times; in each round, the intersection of a certain number of features from the current training and the features from the previous training is retained, and this is repeated a certain number of times, yielding the features that have an important influence on the classification task (that is, the candidate features that contribute to classifying the liver cancer feature type). After these candidate features are obtained, the present disclosure sorts them in descending order of contribution degree and retains the candidate features whose rank is less than or equal to a predetermined order threshold (for example, without limitation, 10) as input data of the prediction model.
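The repeated "train, take the top features, intersect with the previous round" loop can be sketched as follows. This is one possible reading of the procedure, not the disclosed implementation: `importance_fn` is a hypothetical callable standing in for one training run that returns a feature-to-importance mapping (for instance, from a random forest), and the toy scores below are invented so the example is deterministic.

```python
def stable_top_features(features, importance_fn, top_n, rounds):
    """Keep only features that rank in the top_n in every training round."""
    kept = None
    totals = {f: 0.0 for f in features}
    for round_index in range(rounds):
        scores = importance_fn(features, round_index)
        top = set(sorted(features, key=lambda f: scores[f], reverse=True)[:top_n])
        # Intersect this round's top features with those retained so far.
        kept = top if kept is None else kept & top
        for f in features:
            totals[f] += scores[f]
    # Order the surviving features by accumulated importance, highest first.
    return sorted(kept, key=lambda f: totals[f], reverse=True)


# Toy stand-in for a training run: fixed base scores, with one feature per
# round receiving a bump to mimic run-to-run variation.
BASE = {"tmb": 0.30, "pdl1": 0.25, "age": 0.20, "purity": 0.15, "depth": 0.10}


def toy_importance(features, round_index):
    bumped = features[round_index % len(features)]
    return {f: BASE[f] + (0.12 if f == bumped else 0.0) for f in features}


selected = stable_top_features(list(BASE), toy_importance, top_n=3, rounds=5)
```

In this toy run only `tmb` and `pdl1` stay in the top three in every round, so they are the features that survive the intersection.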

The output of the prediction model is, for example, a prediction result for the liver cancer feature type of the subject to be tested. For example, an output of "1" indicates that the liver cancer feature type of the subject to be tested belongs to the liver cancer recurrence/metastasis feature type (that is, the liver cancer recurrence or metastasis feature type), and an output of "0" indicates that the liver cancer feature type of the subject to be tested belongs to the liver cancer primary focus feature type.

For example, the present disclosure inputs the input data of 899 samples into the trained prediction model. Of the 899 samples, 69 were from patients with actual recurrence or metastasis; the output was "1" for 49 of them, that is, they were predicted to belong to the liver cancer recurrence/metastasis feature type, giving a positive rate of 71.0% for that type. The remaining 830 samples were from patients with an actual primary focus; the output was "0" for 787 of them, that is, they were predicted to belong to the liver cancer primary focus feature type, giving a positive rate of 94.8% for that type.
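The two percentages reported above are consistent with reading 49 and 787 as the counts of correctly classified samples within the 69 recurrence/metastasis and 830 primary focus samples, respectively; that reading is an assumption, and the short check below only reproduces the arithmetic.

```python
def percent(part, whole):
    """Return part / whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


recurrence_rate = percent(49, 69)    # recurrence/metastasis samples
primary_rate = percent(787, 830)     # primary focus samples
total_samples = 69 + 830
```

The two groups together account for all 899 samples.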

Fig. 4 illustrates a ROC curve diagram for a predictive model according to some embodiments of the present disclosure. The ROC curve is a graphical method for showing the trade-off between the true positive rate and the false positive rate of the prediction model. For the 899 samples, the AUC (area under the curve, defined as the area enclosed by the ROC curve and the coordinate axes) is 0.829. It should be appreciated that the closer the AUC is to 1.0, the more reliable the prediction method.
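As an illustration of how an AUC value can be obtained from model scores, the sketch below uses the pairwise-ranking formulation, which equals the area under the ROC curve: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores are invented toy values, not data from the disclosure.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: label 1 = recurrence/metastasis type, label 0 = primary focus type.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a model that ranks every positive above every negative would reach 1.0.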

In the above scheme, genomic variation data of a plurality of predetermined genes of a tumor sample of a subject to be tested, clinical data, tumor mutation burden information, and immune checkpoint molecule expression data are obtained, and input data are generated based at least on these data and fed to a prediction model constructed based on a neural network model in order to predict the liver cancer feature type. Because the multidimensional data used to generate the input data, such as the clinical parameters and the genomic mutation information, are supported by clinical evidence, the reliability of the prediction result is improved. In addition, the prediction model, which is built on a neural network algorithm and trained on samples, performs complex feature transformations on the reliable multidimensional data to obtain the prediction result for the liver cancer feature type. The scheme can therefore improve the reliability of predicting the liver cancer feature type and generalizes well in clinical application.

In some embodiments, the genomic variation data used to generate the input data at step 210 is gene variation data for a panel of 450 genes of a tumor sample of the test subject. In some embodiments, the 450 genes are, for example, the 450 genes shown in table 1 below.

TABLE 1

In some embodiments, the genomic variation data of the tumor sample of the subject to be tested are, for example, the single base substitution data, short and long insertion/deletion data, copy number variation data, and gene rearrangement data of the 450 genes shown in Table 1 above.

For example, input data for 899 samples, generated from the genomic variation data of the 450 genes, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth of each sample, were input into the trained prediction model. The output was "1" for 69 samples, indicating that the liver cancer feature type of these 69 samples was predicted to belong to the liver cancer recurrence/metastasis feature type, and "0" for 830 samples, indicating that the liver cancer feature type of these 830 samples was predicted to belong to the liver cancer primary focus feature type. Of the 899 samples, 69 were from patients with actual recurrence or metastasis, and the positive rate for the predicted recurrence/metastasis feature type among them was 100%; 830 were from patients with an actual primary focus, and the positive rate for the predicted primary focus feature type among them was 100%. Fig. 5 illustrates a ROC curve diagram for a predictive model according to further embodiments of the present disclosure. As shown in Fig. 5, the AUC for the 899 samples was 1.0. Generating the input data of the prediction model from the genomic variation data of the 450 predetermined genes can therefore markedly improve the reliability of the prediction result.

A method 600 for generating input data for a prediction model according to an embodiment of the disclosure will be described below in conjunction with FIG. 6. FIG. 6 shows a flow diagram of the method 600 for generating input data for a prediction model, according to an embodiment of the disclosure. It should be understood that the method 600 may be performed, for example, at the electronic device 900 depicted in FIG. 9, or at the computing device 110 depicted in FIG. 1. It should also be understood that the method 600 may include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 602, the computing device 110 determines tumor purity data for the tumor sample based on a proportion of tumor cells in tumor tissue of the tumor sample.

At step 604, the computing device 110 calculates a sequencing depth based on the alignment result data.
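Steps 602 and 604 can be illustrated with the minimal sketch below. It assumes that tumor purity is the fraction of tumor cells among all counted cells, and that sequencing depth is the mean depth, i.e. total aligned bases divided by the length of the targeted region. The disclosure does not give either formula, so both definitions and all function and parameter names are assumptions of this example.

```python
def tumor_purity(tumor_cell_count, total_cell_count):
    """Fraction of tumor cells in the tumor tissue (assumed definition)."""
    if total_cell_count <= 0:
        raise ValueError("total_cell_count must be positive")
    if not 0 <= tumor_cell_count <= total_cell_count:
        raise ValueError("tumor_cell_count must lie in [0, total_cell_count]")
    return tumor_cell_count / total_cell_count


def mean_sequencing_depth(aligned_read_lengths, target_region_length):
    """Mean depth = total aligned bases / target region length (assumed)."""
    if target_region_length <= 0:
        raise ValueError("target_region_length must be positive")
    return sum(aligned_read_lengths) / target_region_length


# Illustrative numbers only, not taken from the disclosure.
purity = tumor_purity(tumor_cell_count=600, total_cell_count=1000)
depth = mean_sequencing_depth([150] * 2000, target_region_length=1000)
```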

At step 606, the computing device 110 generates candidate features based on the genomic variation data, clinical data, tumor mutation burden information, immune checkpoint molecule expression data, tumor purity data, and calculated sequencing depth.
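As a hedged illustration of step 606, the six data sources could be merged into a single named feature map as follows. The key naming scheme, the numeric coding of the clinical fields, and the function name `build_candidate_features` are hypothetical and are not prescribed by the disclosure.

```python
def build_candidate_features(genomic, clinical, tmb, checkpoint_expression,
                             purity, depth):
    """Merge all data sources into one {feature_name: float} mapping."""
    features = {}
    # Prefix grouped features so names from different sources cannot collide.
    for group, values in (("genomic", genomic), ("clinical", clinical)):
        for name, value in values.items():
            features[f"{group}:{name}"] = float(value)
    features["tmb"] = float(tmb)
    features["checkpoint_expression"] = float(checkpoint_expression)
    features["tumor_purity"] = float(purity)
    features["sequencing_depth"] = float(depth)
    return features


# Illustrative values only; the clinical coding is an assumption.
candidates = build_candidate_features(
    genomic={"GENE_A:substitution": 1, "GENE_B:indel": 0},
    clinical={"age": 57, "gender": 1, "tumor_stage": 2},
    tmb=4.2,
    checkpoint_expression=0.15,
    purity=0.6,
    depth=300.0,
)
```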

At step 608, the computing device 110 determines the degree of contribution of each candidate feature to the classification of the liver cancer feature type. For example, the computing device 110 may train the prediction model over several cycles, each time retaining a certain number of candidate features that lie in the intersection of the current candidate features and the candidate features retained in the previous cycle, so as to obtain the candidate features that contribute to the classification task.
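The cycle-and-intersect idea of step 608 can be sketched as below. This is not the disclosed procedure: the contribution score here is a simple stand-in (absolute difference of per-class feature means) rather than a value derived from training the neural network, and the bootstrap resampling, function names, and parameters are all assumptions made for illustration.

```python
import random


def score_features(rows, labels):
    """Stand-in contribution score: |mean(class 1) - mean(class 0)|."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    scores = []
    for j in range(len(rows[0])):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    return scores


def stable_features(rows, labels, top_k, rounds, seed=0):
    """Keep the feature indices that rank in the top_k in every round."""
    rng = random.Random(seed)
    n = len(rows)
    retained = None
    for _ in range(rounds):
        # Bootstrap resample; redraw until both classes are present.
        while True:
            idx = [rng.randrange(n) for _ in range(n)]
            ys = [labels[i] for i in idx]
            if 0 in ys and 1 in ys:
                break
        scores = score_features([rows[i] for i in idx], ys)
        order = sorted(range(len(scores)), key=lambda j: scores[j],
                       reverse=True)
        top = set(order[:top_k])
        # Intersect with the features retained in previous rounds.
        retained = top if retained is None else retained & top
    return retained


# Toy data: feature 0 separates the classes, features 1 and 2 are constant.
rows = [[1.0, 0.5, 0.2]] * 4 + [[0.0, 0.5, 0.2]] * 4
labels = [1] * 4 + [0] * 4
kept = stable_features(rows, labels, top_k=1, rounds=5)
```

In a fuller implementation the stand-in score would be replaced by whatever contribution measure the trained prediction model provides.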

At step 610, the computing device 110 sorts the candidate features in descending order of contribution degree.

At step 612, the computing device 110 determines the candidate features whose rank is less than or equal to a predetermined order threshold as the input data for the prediction model. That is, after the candidate features are sorted in descending order of contribution degree, only those whose rank does not exceed the predetermined order threshold are retained.

The predetermined order threshold is, for example and without limitation, 10. In some embodiments, the predetermined order threshold is, for example, 30.
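Steps 610 and 612 amount to a sort followed by a rank cutoff, as in the following sketch. The feature names and contribution values are invented for illustration, and `select_top_features` is a hypothetical helper name.

```python
def select_top_features(contributions, order_threshold):
    """Return feature names whose 1-based rank is <= order_threshold."""
    # Step 610: sort in descending order of contribution degree.
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    # Step 612: keep features ranked at or below the order threshold.
    return [name for rank, (name, _) in enumerate(ranked, start=1)
            if rank <= order_threshold]


# Invented contribution degrees, for illustration only.
contributions = {
    "tmb": 0.31,
    "checkpoint_expression": 0.12,
    "tumor_purity": 0.05,
    "clinical:age": 0.22,
}
selected = select_top_features(contributions, order_threshold=2)
```

With a threshold of 10 or 30, as in the embodiments above, the same call would retain the 10 or 30 highest-ranked candidate features.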

For example, in the embodiment with a predetermined order threshold of 30, the input data of 899 samples are respectively input into the trained prediction model. Of the 69 samples from actual recurrence/metastasis patients, 51 yield an output of "1", indicating that the liver cancer feature type of these 51 samples is predicted to be the liver cancer recurrence/metastasis feature type; the positive rate of the predicted recurrence/metastasis feature type is therefore 73.9%. Of the 830 samples from actual primary focus patients, 808 yield an output of "0", indicating that the liver cancer feature type of these 808 samples is predicted to be the liver cancer primary focus feature type; the positive rate of the predicted primary focus feature type is therefore 97.3%. FIG. 7 shows the statistical results for 224 test samples evaluated by inputting the 30 selected candidate features into the prediction model, according to an embodiment of the present disclosure. FIG. 8 illustrates an ROC curve for a prediction model according to some embodiments of the present disclosure in which the predetermined order threshold is 30. On the 899 samples, the AUC is 0.856, which is better than the AUC of 0.829 obtained by the prediction model in the embodiment with a predetermined order threshold of 10.
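The two percentages reported for the threshold-30 embodiment are consistent with reading 51 and 808 as the correctly classified samples among the 69 actual recurrence/metastasis patients and the 830 actual primary focus patients, respectively. That reading is an interpretation of the reported figures, not an explicit statement of the disclosure. The short check below reproduces the arithmetic; the overall accuracy it also computes is derived here and is not a figure reported in the text.

```python
def percent(correct, total):
    """Percentage of correctly classified samples, rounded to one decimal."""
    return round(100.0 * correct / total, 1)


# Counts as interpreted from the threshold-30 embodiment (see lead-in).
recurrence_correct, recurrence_total = 51, 69
primary_correct, primary_total = 808, 830

recurrence_rate = percent(recurrence_correct, recurrence_total)
primary_rate = percent(primary_correct, primary_total)

# Derived figure, not reported in the disclosure.
overall_accuracy = percent(recurrence_correct + primary_correct,
                           recurrence_total + primary_total)
```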

By adopting the above means, the present disclosure selects, from the extracted features, those that have a significant influence on the classifier of the prediction model as the input features of the model. This can significantly improve the performance of the classifier and thereby the accuracy of the prediction result regarding the liver cancer feature type.

FIG. 9 schematically illustrates a block diagram of an electronic device 900 suitable for implementing embodiments of the present disclosure. The device 900 may be used to implement the methods 200 and 600 illustrated in FIGS. 2 and 6. As shown in FIG. 9, the device 900 includes a central processing unit (CPU) 901 that can perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The CPU 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

A number of components in the device 900 are connected to the I/O interface 905, including an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The processing unit 901 performs the respective methods and processes described above, e.g., the methods 200 and 600. For example, in some embodiments, the methods 200 and 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the CPU 901, one or more of the operations of the methods 200 and 600 described above may be performed. Alternatively, in other embodiments, the CPU 901 may be configured in any other suitable manner (e.g., by way of firmware) to perform one or more acts of the methods 200 and 600.

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are merely alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for predicting a characteristic type of liver cancer, comprising:

generating genomic variation data of a plurality of predetermined genes with respect to the tumor sample of the subject based on the comparison result data with respect to the tumor sample of the subject;

acquiring clinical data about the subject to be tested;

determining tumor mutation burden information about the subject to be tested;

obtaining immune checkpoint molecule expression data about the subject to be tested;

generating input data for a predictive model based at least on the genomic variation data, the clinical data, the tumor mutation burden information, and the immune checkpoint molecule expression data; and

extracting features of the input data based on a prediction model trained through multiple samples so as to predict a liver cancer feature type based on the extracted features, the prediction model being constructed based on a neural network model.

2. The method of claim 1, wherein generating input data for a predictive model comprises:

determining tumor purity data about the tumor sample based on the proportion of tumor cells in tumor tissue of the tumor sample;

calculating a sequencing depth based on the comparison result data; and

generating input data for a predictive model based on genomic variation data, clinical data, tumor mutation burden information, immune checkpoint molecule expression data, tumor purity data, and calculated sequencing depth.

3. The method of claim 2, wherein generating input data for the predictive model further comprises:

generating candidate features based on the genomic variation data, the clinical data, the tumor mutation burden information, the immune checkpoint molecule expression data, the tumor purity data, and the calculated sequencing depth;

determining the contribution degree of each candidate feature to the classification of the liver cancer feature type;

sorting the candidate features in descending order of contribution degree; and

determining candidate features having a ranking order less than or equal to a predetermined order threshold as the input data for the prediction model.

4. The method of claim 1, wherein the genomic variation data of the plurality of predetermined genes with respect to the tumor sample of the subject comprises: single-base substitution data, short and long insertion/deletion data, copy number variation data, and gene rearrangement data of the plurality of predetermined genes with respect to the tumor sample of the subject.

5. The method of claim 1, wherein the clinical data about the subject to be tested includes at least: gender information, age information and tumor stage information about the subject to be tested.

6. The method of claim 1, wherein the plurality of predetermined genes belongs to a predetermined set of genes, the immune checkpoint molecule expression data being programmed death ligand 1 expression data.

7. The method of claim 1, wherein the predictive model is constructed based on a multi-layer feed-forward network trained by an error back-propagation algorithm.

8. The method of claim 1, wherein predicting a liver cancer feature type based on the extracted features comprises:

determining a prediction result about the liver cancer feature type of the subject to be tested based on the extracted features, the prediction result indicating the liver cancer primary focus feature type or the liver cancer recurrence/metastasis feature type.

9. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the steps of the method of any of claims 1 to 8.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-8.