CN113724875A

CN113724875A - Method, device and equipment for predicting cancer recurrence rate

Info

Publication number: CN113724875A
Application number: CN202111059325.3A
Authority: CN
Inventors: 杨爱民; 滑电波
Original assignee: Beijing Sitairui Health Technology Co ltd
Current assignee: Beijing Sitairui Health Technology Co ltd
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2021-11-30

Abstract

The embodiment of the invention provides a method, a device and equipment for predicting recurrence rate of cancer, wherein the method comprises the following steps: acquiring health index data of a patient; carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate. The embodiment of the invention realizes rapid prediction analysis, obtains the prediction result with high accuracy, and enables the prediction model to be more effective and accurate.

Description

Method, device and equipment for predicting cancer recurrence rate

Technical Field

The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, and a device for predicting a cancer recurrence rate.

Background

Machine learning has achieved outstanding results in natural language processing, computer vision and speech recognition, and machine learning algorithms have gradually become more comprehensive and are now classified into corresponding categories. Meanwhile, in the medical field, a large amount of cancer data has been collected and made available to the machine learning research community. Unlike traditional cancer diagnosis, machine learning does not rely on explicit instructions; instead, it finds and identifies specific patterns in complex data sets by means of pattern recognition and reasoning, and can predict cancers effectively. However, predicting cancer recurrence remains one of the most interesting and challenging tasks in machine learning.

As the dimensionality of a data set increases, the number of samples required for an algorithm to learn increases exponentially. In some applications such large data sets are a disadvantage, because learning from them requires more memory and processing power. In addition, the sparsity of the data may increase as the dimensionality increases, and exploring a data set in a high-dimensional vector space is more difficult than exploring a similarly sparse data set of lower dimensionality.

Disclosure of Invention

The invention provides a method, a device and equipment for predicting cancer recurrence rate. The rapid prediction analysis is realized, and the prediction result with high accuracy is obtained, so that the prediction model is more effective and accurate.

To solve the above technical problem, an embodiment of the present invention provides the following solutions:

a method of predicting cancer recurrence rate, the method comprising:

acquiring health index data of a patient;

carrying out feature extraction processing on the health index data to obtain feature component data of the health index data;

and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.

Optionally, the cancer prediction model is trained by:

acquiring a training sequence data set for training;

performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;

and inputting the training characteristic components into a preset prediction model for processing to obtain a trained cancer prediction model.

Optionally, performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set, including:

carrying out data preprocessing on the training sequence data set to obtain a data preprocessing result;

and performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.

Optionally, performing data preprocessing on the training sequence data set to obtain a data preprocessing result, including:

cleaning the training sequence data set to obtain first data and second data; the first data comprise training sequence data that were not collected and/or training sequence data whose missing number is greater than or equal to a preset threshold, and the second data comprise training sequence data whose missing number is smaller than the preset threshold;

performing data clearing processing on the first data to obtain primary processing data;

performing data completion processing on the second data to obtain secondary processing data;

and performing character coding processing on character data in the secondary processing data to obtain a data preprocessing result.

Optionally, performing dimension reduction processing on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence dataset, including:

extracting at least one target linear combination from the data preprocessing result by utilizing a Principal Component Analysis (PCA) algorithm;

and performing dimensionality reduction processing on the target linear combination to obtain a training characteristic component of the sequence data set.

Optionally, the training feature component is input into a preset prediction model for processing, so as to obtain a trained cancer prediction model, including:

inputting the training characteristic components into a preset gradient boosting decision tree (GBDT) prediction model for processing to obtain an iterative gradient direction of each training characteristic component;

fitting the training characteristic components and the iterative gradient direction of each training characteristic component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and area data corresponding to each leaf node;

obtaining a target output value of a fitting leaf node according to the fitting leaf node information;

and carrying out model iteration on the target output value to obtain a trained cancer prediction model.

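Optionally, the trained cancer prediction model is:

$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$

wherein $I(x \in R_{m+1,j})$ is a 0/1 function; $f_{m+1}(x)$ is the fitting function obtained after the decision tree of the (m+1)-th round; $c_{m+1,j}$ is the target output value of the (m+1)-th round; $f_m(x)$ is the model after m rounds of learning; $J$ is the number of fitting leaf nodes; and $R_{m+1,j}$ is the area data corresponding to the j-th leaf node.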

The present invention also provides a device for predicting a recurrence rate of cancer, the device comprising:

the acquisition module is used for acquiring health index data of a patient;

the processing module is used for carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.

The present invention provides a computing device comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the method for predicting a recurrence rate of cancer as described above.

The present invention provides a computer readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method as described above.

The scheme of the invention at least comprises the following beneficial effects:

according to the scheme, the health index data of the patient is acquired; carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate; the characteristics of the data set can be analyzed and counted, learning adjustment can be performed by combining a large number of data sets, rapid prediction analysis is achieved, a prediction result with high accuracy is obtained, and the prediction model is more effective and accurate.

Drawings

FIG. 1 is a schematic flow chart of a method for predicting recurrence rate of cancer according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for predicting cancer recurrence rate according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

As shown in fig. 1, an embodiment of the present invention provides a method for predicting a recurrence rate of cancer, including:

step 11, acquiring health index data of a patient;

step 12, performing feature extraction processing on the health index data to obtain feature component data of the health index data;

and step 13, inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.

In the embodiment, health index data affecting cancer patients are collected, characteristic component data are extracted from the health index data, and the characteristic component data are input into a trained cancer prediction model for prediction processing to obtain a prediction result of cancer recurrence rate; the rapid prediction analysis is realized, and the prediction result with high accuracy is obtained, so that the prediction model is more effective and accurate.
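Steps 11 to 13 can be sketched as a short Python function chain. This is only an illustrative sketch, not the implementation of the embodiment: the function names, the stored mean and component directions, the sample values and the stand-in `toy_model` are all hypothetical, and feature extraction is assumed here to be a projection onto component directions learned during training.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def extract_features(record: Sequence[float],
                     mean: Sequence[float],
                     components: Sequence[Sequence[float]]) -> Vector:
    """Step 12 (assumed form): center the record, then project it onto
    each stored component direction."""
    centered = [x - m for x, m in zip(record, mean)]
    return [sum(w * c for w, c in zip(direction, centered))
            for direction in components]


def predict_recurrence(record: Sequence[float],
                       mean: Sequence[float],
                       components: Sequence[Sequence[float]],
                       model: Callable[[Vector], float]) -> float:
    """Steps 12 and 13: extract feature components, then apply the model."""
    features = extract_features(record, mean, components)
    return model(features)


def toy_model(features: Vector) -> float:
    """Hypothetical stand-in for the trained cancer prediction model."""
    score = 0.1 * features[0] + 0.05 * features[1]
    return max(0.0, min(1.0, score))  # clamp to [0, 1]


# Step 11 (hypothetical values): a 3-dimensional health index record.
patient_record = [2.0, 4.0, 6.0]
stored_mean = [1.0, 2.0, 3.0]
stored_components = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]

result = predict_recurrence(patient_record, stored_mean,
                            stored_components, toy_model)
```

Keeping the mean and the component directions from training and reusing them at prediction time ensures that a new patient's data is mapped into the same feature space the model was trained on.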

The health index data preferably includes: tumor module information data; basic module information data; microenvironment module information data; immunization module information data; nutrition module information data; lifestyle module information data; psychology module information data; exercise module information data; and advanced operation module information data.

The health index data thus comprises information data from a plurality of modules that represent the current physical condition of the cancer patient, namely a tumor module, a basic module, a microenvironment module, an immunization module, a nutrition module, a lifestyle module, a psychology module, an exercise module and an advanced operation module; each module is characterized by multi-dimensional secondary features.

In an alternative embodiment of the present invention, the cancer prediction model is trained by the following process:

step a, acquiring a training sequence data set for training;

step b, performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;

and step c, inputting the training characteristic components into a preset prediction model for processing to obtain a trained cancer prediction model.

In this embodiment, the training feature components are extracted from the training sequence data set and input into the preset prediction model for processing, so as to finally obtain the trained cancer prediction model.

In an optional embodiment of the present invention, step b includes:

step b1, performing data preprocessing on the training sequence data set to obtain a data preprocessing result;

and step b2, performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.

In this embodiment, the training sequence data set is preferably preprocessed by a data filtering technology to obtain a data preprocessing result; the preset feature extraction algorithm is preferably a Principal Component Analysis (PCA) algorithm, which is used to perform dimensionality reduction on the data preprocessing result to obtain the training feature components of the sequence data set.

The data filtering technology comprises data cleaning processing and data completion processing;

in an alternative embodiment of the present invention, step b1 includes:

step b11, performing cleaning processing on the training sequence data set to obtain first data and second data; the first data comprise training sequence data that were not collected and/or training sequence data whose missing number is greater than or equal to a preset threshold, and the second data comprise training sequence data whose missing number is smaller than the preset threshold;

step b12, performing data clearing processing on the first data to obtain primary processing data;

step b13, performing data completion processing on the second data to obtain secondary processing data;

and step b14, performing character coding processing on the character data in the secondary processing data to obtain a data preprocessing result.

In this embodiment, the training sequence data set is subjected to data cleaning processing, and first data and second data are obtained by screening. The first data are training sequence data that were not collected and/or training sequence data whose missing number is greater than or equal to a preset threshold; the second data are training sequence data whose missing number is smaller than the preset threshold. The missing number refers to the number of missing items of the whole training sequence data in the training sequence data set, and setting the preset threshold makes it possible to control the execution efficiency of subsequent data processing.

Data clearing processing is then performed on the first data, that is, the training sequence data that were not collected and/or whose missing number is greater than or equal to the preset threshold are removed from the training sequence data set. Next, data completion processing is performed on the second data: the training sequence data whose missing number is smaller than the preset threshold are completed, preferably by mean value interpolation, to obtain secondary processing data. The text data in the secondary processing data are preferably subjected to character coding processing through One-Hot encoding. This completes the preprocessing of the training sequence data set, whereby an N-dimensional training sequence data set can be arranged into an M-dimensional training sequence data set, where N is greater than M.

One-Hot encoding is a method of encoding the text data in the secondary processing data as computer data; the data preprocessing of the training sequence data set thus includes data clearing processing, data completion processing and text encoding processing.
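The cleaning and completion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: a record is a list in which `None` marks a missing value, the missing number is counted per record, a record with every value missing is treated as not collected, and the sample values and threshold are hypothetical. The source does not fix any of these details.

```python
from typing import List, Optional, Tuple

Record = List[Optional[float]]  # None marks a missing value (assumption)


def split_by_missing(records: List[Record],
                     threshold: int) -> Tuple[List[Record], List[Record]]:
    """Return (first_data, second_data) in the sense of step b11."""
    first, second = [], []
    for rec in records:
        missing = sum(v is None for v in rec)
        not_collected = missing == len(rec)
        if not_collected or missing >= threshold:
            first.append(rec)    # to be cleared
        else:
            second.append(rec)   # to be completed
    return first, second


def mean_impute(records: List[Record]) -> List[List[float]]:
    """Fill each missing value with the mean of the observed values
    in the same column (mean value interpolation)."""
    n_cols = len(records[0])
    means = []
    for j in range(n_cols):
        observed = [rec[j] for rec in records if rec[j] is not None]
        # Falling back to 0.0 for an entirely empty column is an assumption.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(rec)]
            for rec in records]


raw = [
    [1.0, 2.0, 3.0],
    [None, 4.0, 5.0],      # one missing value: kept and completed
    [None, None, 9.0],     # two missing values: cleared when threshold = 2
    [None, None, None],    # never collected: cleared
    [3.0, None, 7.0],
]
first_data, second_data = split_by_missing(raw, threshold=2)
completed = mean_impute(second_data)
```

A lower threshold discards more records and leaves less to impute, which is one way the threshold can trade data volume against the cost of subsequent processing.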

In a specific embodiment, the text data of the index for evaluating life status include "good", "general" and "poor"; through One-Hot encoding, "good" is encoded as (1,0,0), "general" as (0,1,0) and "poor" as (0,0,1).
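The encoding in this example can be reproduced with a few lines of Python. The helper name `one_hot_table` is hypothetical, and the category order is assumed to be the order in which the values are listed above.

```python
from typing import Dict, Sequence, Tuple


def one_hot_table(categories: Sequence[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each category to a vector with a single 1 at its position."""
    size = len(categories)
    return {cat: tuple(1 if i == k else 0 for i in range(size))
            for k, cat in enumerate(categories)}


table = one_hot_table(["good", "general", "poor"])
encoded = [table[value] for value in ["poor", "good", "general"]]
```

Each text value becomes a vector with as many positions as there are categories, so no artificial ordering is imposed on the categories.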

In an alternative embodiment of the present invention, step b2 includes:

step b21, extracting at least one target linear combination from the data preprocessing result by using a PCA algorithm;

and step b22, performing dimensionality reduction processing on the target linear combination to obtain a training characteristic component of the sequence data set.

In this embodiment, a target linear combination is preferably extracted from the data preprocessing result by a PCA algorithm; the target linear combination may be:

$$f_i = \beta' x_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$

wherein $\beta = (\beta_1, \beta_2, \dots, \beta_p)'$ and $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})'$; $\beta_j$ is the weight coefficient of the corresponding j-th variable; $\beta'$ is the set of weight coefficients corresponding to the training sequence data; $p$ is the dimension of the training sequence data set; $x_{i1}, x_{i2}, \dots, x_{ip}$ are the observed variables; in $x_{ij}$, $i$ denotes an observation object of the training sequence data set; the weight coefficients $\beta'$ of the training sequence data are used for comprehensively characterizing the recurrence condition of the cancer patient;

preferably, the dimensionality after dimensionality reduction can be preset, and the training feature components in the data preprocessing result are extracted according to the PCA algorithm, specifically through the variance formula:

$$\operatorname{Var}(f) = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\beta_j x_{ij} - \bar{f}\Big)^2 = \beta' V \beta$$

wherein $\operatorname{Var}(f)$ is the variance of the calculated data; $n$ is the number of samples; $\beta_j$ is the weight coefficient of the corresponding j-th training feature component; $x_{ij}$ is the set of observed variables; and $\bar{f}$ is the mean of the linear combination over the $n$ samples;

the condition that the 1st training feature component needs to satisfy can then be obtained:

$$\max_{\beta_1}\ \beta_1' V \beta_1 \quad \text{subject to} \quad \beta_1'\beta_1 = 1$$

wherein $V$ is shorthand for the variance above, i.e. $V$ is the variance (covariance matrix) of the calculated data; $\beta_1$ is the weight coefficient of the corresponding 1st training feature component, and $\beta_1'$ is the transpose of that weight coefficient;

the condition that the k-th training feature component needs to satisfy is:

$$\max_{\beta_k}\ \beta_k' V \beta_k \quad \text{subject to} \quad \beta_k'\beta_k = 1,\ \ \beta_k'\beta_j = 0\ \ (j = 1, \dots, k-1)$$

wherein $V$ is shorthand for the variance above; $\beta_k$ is the weight coefficient of the corresponding k-th training feature component, and $\beta_k'$ is the transpose of that weight coefficient;
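Why maximizing the variance of a weighted combination leads to an eigenvalue problem can be seen with a Lagrange multiplier. The derivation below is the standard PCA argument, added here for illustration; it assumes that $V$ is the symmetric covariance matrix of the preprocessed data and that the weight vector $\beta_1$ is normalized to unit length ($\beta_1'\beta_1 = 1$), which the source does not state explicitly.

```latex
\begin{aligned}
\mathcal{L}(\beta_1, \lambda) &= \beta_1' V \beta_1 - \lambda\,(\beta_1'\beta_1 - 1) \\
\frac{\partial \mathcal{L}}{\partial \beta_1} &= 2V\beta_1 - 2\lambda\beta_1 = 0
  \;\Longrightarrow\; V\beta_1 = \lambda\beta_1 \\
\operatorname{Var}(f) &= \beta_1' V \beta_1 = \lambda\,\beta_1'\beta_1 = \lambda
\end{aligned}
```

Under these assumptions, the weight vector of the 1st component is the eigenvector of $V$ with the largest eigenvalue, and that eigenvalue equals the variance of the component; the k-th component is correspondingly the eigenvector with the k-th largest eigenvalue.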

The PCA algorithm yields training feature components of the preset reduced dimensionality, which greatly reduces the calculation time of the algorithm; at the same time, the weight coefficients are determined so that the variance of the training feature components obtained by principal component analysis is maximized. Ordered by variance from large to small, the components are the 1st training feature component, the 2nd training feature component, …, and the k-th training feature component, the 1st having the largest variance. The training feature components obtained through the PCA algorithm are linear combinations of the data preprocessing result obtained from the training sequence data set; they are uncorrelated with each other and represent the important feature components of that result. For example, assume that each health index information module is characterized by multi-dimensional secondary features, that the secondary features have up to 125 dimensions, and that the dimension k after dimensionality reduction is set to 15; the 65 dimensions of the preprocessed data are then reduced to 15 dimensions through the PCA algorithm, so that the main feature components of the high-dimensional full-sequence patient data are extracted and the calculation time of the algorithm is greatly reduced.
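The extraction of k components ordered by decreasing variance can be sketched in plain Python. This is an illustrative sketch, not the embodiment's implementation: it assumes the covariance matrix is computed with a 1/n factor, uses power iteration with deflation as one possible way to obtain the leading eigenvectors, and runs on small hypothetical data (3 dimensions reduced to 2 rather than 65 to 15).

```python
from typing import List

Matrix = List[List[float]]


def covariance(data: Matrix) -> Matrix:
    """Covariance matrix V of row-wise observations (divides by n)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(p)]
            for a in range(p)]


def top_component(v: Matrix, iterations: int = 500) -> List[float]:
    """Power iteration: approaches the unit eigenvector of v with the
    largest eigenvalue. A zero matrix leaves the start vector unchanged."""
    p = len(v)
    beta = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(v[a][b] * beta[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break
        beta = [x / norm for x in nxt]
    return beta


def pca(data: Matrix, k: int) -> Matrix:
    """Return k weight vectors, ordered by decreasing variance."""
    v = covariance(data)
    p = len(v)
    components = []
    for _ in range(k):
        beta = top_component(v)
        lam = sum(beta[a] * v[a][b] * beta[b]
                  for a in range(p) for b in range(p))
        components.append(beta)
        # Deflation: remove the variance already explained by beta.
        v = [[v[a][b] - lam * beta[a] * beta[b] for b in range(p)]
             for a in range(p)]
    return components


def project(data: Matrix, components: Matrix) -> Matrix:
    """Center the data and compute one score per component for each row."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum(b[j] * (row[j] - means[j]) for j in range(p))
             for b in components] for row in data]


# Hypothetical preprocessed data: 6 observations, 3 variables.
data = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.9], [2.2, 2.9, 0.8],
        [1.9, 2.2, 1.1], [3.1, 3.0, 0.2], [2.3, 2.7, 1.0]]
components = pca(data, k=2)
scores = project(data, components)
```

The rows of `scores` play the role of the training feature components: each row has k values instead of the original p.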

In an optional embodiment of the present invention, step c includes:

step c1, inputting the training characteristic components into a preset GBDT prediction model for processing to obtain the iterative gradient direction of each training characteristic component;

step c2, fitting the training feature components and the iterative gradient direction of each training feature component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and area data corresponding to each leaf node;

step c3, obtaining target output values of the fitting leaf nodes according to the fitting leaf node information;

and c4, carrying out model iteration on the target output value to obtain a trained cancer prediction model.

In this embodiment, the set of training feature components is T = {(x1, y1), (x2, y2), …, (xn, yn)}. The GBDT (Gradient Boosting Decision Tree) algorithm reduces the value of the error function in a forward stagewise regression manner, continuously adding new decision trees without changing the parameters of the existing decision trees; this realizes model iteration on the target output values and yields the trained cancer prediction model;

optionally, the loss function corresponding to the set of training feature components is:

$$L(f) = \sum_{i=1}^{n} L\big(y_i, f(x_i)\big)$$

wherein $L(f)$ is the loss function corresponding to the training feature components, and $(x_i, y_i)$ is a training feature component.

When the model after m rounds of learning is $f_m(x)$ and the loss function is $L(y, f_m(x))$, the iterative gradient direction of the (m+1)-th round for each training feature component is obtained as:

$$r_{i,m+1} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x) = f_m(x)}$$

wherein $r_{i,m+1}$ is the iterative gradient direction of the (m+1)-th round for the i-th training feature component; $L(y, f(x_i))$ is the loss function corresponding to $f(x_i)$; $f(x_i)$ is the model output for the training feature component $x_i$; and $f_m(x_i)$ is the output of the model after m rounds of learning for the i-th component.

Note that the gradient direction is the partial derivative of the loss function with respect to the model output, taken with a negative sign.
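For a concrete loss, the negative gradient takes a simple form. The sketch below assumes the squared loss L(y, f) = 0.5 (y - f)^2, which the source does not specify; under that assumption the negative gradient is the residual y - f. A finite-difference check is included so the sign convention can be verified numerically. All function names and sample values are hypothetical.

```python
from typing import Callable, List


def squared_loss(y: float, f: float) -> float:
    """Assumed loss: L(y, f) = 0.5 * (y - f)^2."""
    return 0.5 * (y - f) ** 2


def negative_gradient_squared(y: List[float], pred: List[float]) -> List[float]:
    """Closed form for the squared loss: -dL/df = y - f."""
    return [yi - fi for yi, fi in zip(y, pred)]


def numeric_negative_gradient(loss: Callable[[float, float], float],
                              y: List[float], pred: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Central finite difference of the loss with respect to the
    prediction, with the sign flipped."""
    return [-(loss(yi, fi + eps) - loss(yi, fi - eps)) / (2 * eps)
            for yi, fi in zip(y, pred)]


y_true = [1.0, 0.0, 1.0]
f_m = [0.4, 0.3, 0.9]   # hypothetical outputs of the model after m rounds
r_next = negative_gradient_squared(y_true, f_m)
r_check = numeric_negative_gradient(squared_loss, y_true, f_m)
```

With a different loss function only `squared_loss` and the closed form change; the numerical check stays the same.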

The training feature components and the iterative gradient direction of each training feature component are fitted to obtain the number $J$ of fitting leaf nodes and the area data $R_{m+1,j}$ ($j = 1, 2, \dots, J$) corresponding to each leaf node.

From the number $J$ of fitting leaf nodes and the area data $R_{m+1,j}$ ($j = 1, 2, \dots, J$) corresponding to each leaf node, the target output value of each fitting leaf node in the fitting leaf node information can be obtained by the formula:

$$c_{m+1,j} = \arg\min_{c} \sum_{x_i \in R_{m+1,j}} L\big(y_i, f_m(x_i) + c\big)$$

wherein $c_{m+1,j}$ is the target output value; $c$ is a constant, specifically an initialization hyper-parameter; and $L(y, f_m(x_i))$ is the loss function corresponding to $f_m(x_i)$;

the target output value is the optimal output value of the fitting leaf node, that is, the value that minimizes the loss function is taken as the optimal output value of the fitting leaf node.
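As an illustration of the leaf output value, assume again the squared loss L(y, f) = 0.5 (y - f)^2 (an assumption, not stated in the source). The constant that minimizes the summed loss in a leaf region is then the mean residual of the samples in that region. The region labels and sample values below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def leaf_outputs(regions: List[int], y: List[float],
                 pred: List[float]) -> Dict[int, float]:
    """Optimal constant per leaf region under the squared loss:
    the mean of the residuals y - f_m(x) in that region."""
    residuals: Dict[int, List[float]] = defaultdict(list)
    for j, yi, fi in zip(regions, y, pred):
        residuals[j].append(yi - fi)
    return {j: sum(r) / len(r) for j, r in residuals.items()}


def region_loss(c: float, j: int, regions: List[int],
                y: List[float], pred: List[float]) -> float:
    """Summed squared loss in region j after adding the constant c."""
    return sum(0.5 * (yi - (fi + c)) ** 2
               for rj, yi, fi in zip(regions, y, pred) if rj == j)


regions = [1, 1, 2, 2, 2]            # hypothetical leaf index per sample
y_true = [1.0, 0.0, 1.0, 1.0, 0.0]
f_m = [0.5, 0.5, 0.5, 0.5, 0.5]      # hypothetical model after m rounds
c_next = leaf_outputs(regions, y_true, f_m)
```

Evaluating `region_loss` at values slightly above or below `c_next[j]` gives a larger loss, which is a quick way to confirm that the closed form is the minimizer for this loss.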

The fitting function of the decision tree of the (m+1)-th round can also be obtained as:

$$h_{m+1}(x) = \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$

wherein $I(x \in R_{m+1,j})$ is a 0/1 function used to determine whether $x$ belongs to the area of the j-th fitting leaf node, returning 1 if it does and 0 if it does not; $h_{m+1}(x)$ is the fitting function of the decision tree of the (m+1)-th round; and $c_{m+1,j}$ is the target output value of the (m+1)-th round;

in an alternative embodiment of the present invention, the trained cancer prediction model is:

$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$

In this embodiment, the prediction model $f_{m+1}(x)$ is generated by iterating continuously to reduce the value of the loss function.
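The iteration of steps c1 to c4 can be sketched end to end in plain Python. The following is a simplified illustration, not the embodiment's implementation: it assumes the squared loss, uses two-leaf trees (J = 2) fitted by least squares, and adds a `learning_rate` parameter that does not appear in the source (its default of 1.0 leaves each round's update as the model plus the new tree). All names and data are hypothetical.

```python
from typing import List, Tuple

# (feature index, threshold, left leaf value, right leaf value)
Stump = Tuple[int, float, float, float]


def fit_stump(x: List[List[float]], r: List[float]) -> Stump:
    """Fit a two-leaf regression tree to the residuals r by least squares."""
    best, best_err = None, float("inf")
    for j in range(len(x[0])):
        values = sorted(set(row[j] for row in x))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [ri for row, ri in zip(x, r) if row[j] <= t]
            right = [ri for row, ri in zip(x, r) if row[j] > t]
            cl, cr = sum(left) / len(left), sum(right) / len(right)
            err = (sum((ri - cl) ** 2 for ri in left)
                   + sum((ri - cr) ** 2 for ri in right))
            if err < best_err:
                best, best_err = (j, t, cl, cr), err
    if best is None:  # no split possible: use one constant for both leaves
        c = sum(r) / len(r)
        best = (0, float("inf"), c, c)
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    """Return the leaf value of the region that contains row."""
    j, t, cl, cr = stump
    return cl if row[j] <= t else cr


def fit_gbdt(x: List[List[float]], y: List[float],
             rounds: int = 20, learning_rate: float = 1.0):
    """Forward stagewise fitting: each round adds a tree fitted to the
    negative gradient (the residual, under the squared loss)."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump_predict(stump, row)
                for pi, row in zip(pred, x)]
    return f0, learning_rate, stumps


def predict(model, row: List[float]) -> float:
    f0, lr, stumps = model
    return f0 + lr * sum(stump_predict(s, row) for s in stumps)


# Hypothetical training feature components and 0/1 recurrence labels.
x_train = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y_train = [0.0, 0.0, 1.0, 1.0]
model = fit_gbdt(x_train, y_train, rounds=5)
fitted = [predict(model, row) for row in x_train]
```

Existing trees are never modified once added, which matches the forward stagewise behaviour described for the GBDT algorithm; only the running prediction changes from round to round.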

As shown in fig. 2, in a specific embodiment, historical health index data of a patient are obtained through data collection; these may specifically include the patient's systemic disease information, family genetic history information, unnecessary dependency information, age, degree of obesity, and the like. The health index data of the cancer patient are sorted to obtain a training sequence data set for training. The training sequence data set is then preprocessed, which specifically includes completing the missing values of the small part of the training sequence data that is incomplete, removing the training sequence data that were not collected, and encoding the text data in the training sequence data set by One-Hot encoding. Data dimensionality reduction is performed on the preprocessed full-sequence data set through the PCA algorithm to extract the training feature components of the training sequence data set, and the training feature components are learned and trained through the GBDT algorithm to generate the trained cancer prediction model.

Embodiments of the invention predict cancer-related indexes of a patient by combining PCA dimensionality reduction with the GBDT method. In the prediction model, hundreds of cancer-related parameter indexes of a patient are integrated; the large-scale data set consisting of hundreds of data items is subjected to data preprocessing using the PCA algorithm, and the processed data are then operated on by the GBDT algorithm, so that the required cancer prediction indexes are finally obtained.

The prediction model in the above embodiment of the present invention can not only analyze and compile statistics on the characteristics of the data set, but also, after learning and adjustment on a large number of patient data sets, perform rapid prediction analysis on a new patient (new sample data) and obtain a prediction result with an accuracy of 84%, which is more effective and accurate than the conventional method.

The prediction model in the embodiment of the invention was also compared with the prediction results of other similar algorithm models; the results show that the comprehensive index of this model's prediction results is the best, that is, the model is more effective.

As shown in fig. 3, the present invention also provides a device 30 for predicting a recurrence rate of cancer, the device 30 comprising:

an obtaining module 31, configured to obtain health index data of a patient;

the processing module 32 is configured to perform feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.

Optionally, the cancer prediction model is trained by:

acquiring a training sequence data set for training;

Optionally, the trained cancer prediction model is:

$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$

wherein $I(x \in R_{m+1,j})$ is a 0/1 function; $f_{m+1}(x)$ is the fitting function obtained after the decision tree of the (m+1)-th round; and $c_{m+1,j}$ is the target output value of the (m+1)-th round.

It should be noted that the device corresponds to the method for predicting a cancer recurrence rate described above; all the implementations in the above method embodiments are applicable to this device embodiment, and the same technical effects can be achieved.
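The division of the device 30 into an obtaining module and a processing module can be sketched as two small classes. The class names, the in-memory data source, and the stand-in feature extractor and model below are hypothetical and serve only to show how the modules could cooperate.

```python
from typing import Callable, Dict, List

Vector = List[float]


class AcquisitionModule:
    """Counterpart of the obtaining module 31: returns health index data."""

    def __init__(self, store: Dict[str, Vector]):
        self._store = store  # hypothetical in-memory data source

    def acquire(self, patient_id: str) -> Vector:
        return list(self._store[patient_id])


class ProcessingModule:
    """Counterpart of the processing module 32: feature extraction
    followed by prediction with a trained model."""

    def __init__(self, extract: Callable[[Vector], Vector],
                 model: Callable[[Vector], float]):
        self._extract = extract
        self._model = model

    def process(self, health_data: Vector) -> float:
        return self._model(self._extract(health_data))


class PredictionDevice:
    """Counterpart of the device 30, combining both modules."""

    def __init__(self, acquisition: AcquisitionModule,
                 processing: ProcessingModule):
        self._acquisition = acquisition
        self._processing = processing

    def predict(self, patient_id: str) -> float:
        return self._processing.process(self._acquisition.acquire(patient_id))


device = PredictionDevice(
    AcquisitionModule({"p1": [1.0, 2.0, 3.0]}),
    ProcessingModule(extract=lambda d: d[:2],          # stand-in extractor
                     model=lambda f: sum(f) / 10.0),   # stand-in model
)
prediction = device.predict("p1")
```

Passing the extractor and the model in as callables keeps the modules independent of any particular feature extraction algorithm or prediction model.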

The present invention also provides a computing device comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method as described above.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

Furthermore, it is to be noted that in the device and method of the invention, it is obvious that the individual components or steps can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of performing the series of processes described above may naturally be performed chronologically in the order described, but need not necessarily be performed chronologically, and some steps may be performed in parallel or independently of each other. It will be understood by those skilled in the art that all or any of the steps or elements of the method and apparatus of the present invention may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or any combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present invention.

Thus, the objects of the invention may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. The object of the invention is thus also achieved solely by providing a program product comprising program code for implementing the method or the apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future. It is further noted that in the apparatus and method of the present invention, it is apparent that each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for predicting the recurrence rate of cancer, comprising:

acquiring health index data of a patient;

2. The method of predicting cancer recurrence rate of claim 1, wherein said cancer prediction model is trained by the following process:

acquiring a training sequence data set for training;

3. The method of claim 2, wherein the feature extraction of the training sequence data set to obtain the training feature component of the training sequence data set comprises:

4. The method of claim 3, wherein the pre-processing the training sequence data set to obtain a pre-processing result comprises:

5. The method of claim 3, wherein the performing dimension reduction on the data preprocessing result by using a predetermined feature extraction algorithm to obtain the training feature component of the sequence data set comprises:

6. The method of claim 2, wherein the step of inputting the training feature components into a predetermined prediction model for processing to obtain a trained cancer prediction model comprises:

7. The method of claim 6, wherein the trained cancer prediction model is:

8. An apparatus for predicting a recurrence rate of cancer, the apparatus comprising:

the acquisition module is used for acquiring health index data of a patient;

9. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction which causes the processor to execute the operation corresponding to the prediction method of cancer recurrence rate as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.