CN113724875A - Method, device and equipment for predicting cancer recurrence rate - Google Patents

Publication number
CN113724875A
Authority
CN
China
Prior art keywords
data
training
processing
cancer
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111059325.3A
Other languages
Chinese (zh)
Inventor
杨爱民
滑电波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sitairui Health Technology Co ltd
Original Assignee
Beijing Sitairui Health Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sitairui Health Technology Co ltd
Priority to CN202111059325.3A
Publication of CN113724875A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The embodiment of the invention provides a method, a device and equipment for predicting recurrence rate of cancer, wherein the method comprises the following steps: acquiring health index data of a patient; carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate. The embodiment of the invention realizes rapid prediction analysis, obtains the prediction result with high accuracy, and enables the prediction model to be more effective and accurate.

Description

Method, device and equipment for predicting cancer recurrence rate
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, and a device for predicting a cancer recurrence rate.
Background
Machine learning has achieved outstanding results in natural language processing, computer vision and speech recognition, and machine learning algorithms have gradually become comprehensive and have been organized into corresponding categories. Meanwhile, in the medical field, a large amount of cancer data has been collected and made available to the machine learning research community. Compared with traditional cancer diagnosis, machine learning does not rely on explicit instructions; it finds and identifies specific patterns in complex data sets by means of pattern recognition and inference, and can effectively predict cancers. However, predicting cancer recurrence remains one of the most interesting and challenging tasks in the field of machine learning.
As the dimensionality of a data set increases, the number of samples required for algorithm learning increases exponentially. In some applications such large data sets are a disadvantage, because learning from them requires more memory and processing power. In addition, as the dimensionality increases, the sparsity of the data may increase, and exploring a sparse data set in a high-dimensional vector space is more difficult than exploring the same data set in a lower-dimensional space.
Disclosure of Invention
The invention provides a method, a device and equipment for predicting cancer recurrence rate. The rapid prediction analysis is realized, and the prediction result with high accuracy is obtained, so that the prediction model is more effective and accurate.
To solve the above technical problem, an embodiment of the present invention provides the following solutions:
a method of predicting cancer recurrence rate, the method comprising:
acquiring health index data of a patient;
carrying out feature extraction processing on the health index data to obtain feature component data of the health index data;
and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
Optionally, the cancer prediction model is trained by:
acquiring a training sequence data set for training;
performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;
and inputting the training characteristic components into a preset prediction model for processing to obtain a trained cancer prediction model.
Optionally, performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set, including:
carrying out data preprocessing on the training sequence data set to obtain a data preprocessing result;
and performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.
Optionally, performing data preprocessing on the training sequence data set to obtain a data preprocessing result, including:
cleaning the training sequence data set to obtain first data and second data; the first data comprise training sequence data that were not collected and/or training sequence data whose number of missing values is greater than or equal to a preset threshold, and the second data comprise training sequence data whose number of missing values is smaller than the preset threshold;
performing data clearing processing on the first data to obtain primary processing data;
performing data completion processing on the second data to obtain secondary processing data;
and performing character coding processing on character data in the secondary processing data to obtain a data preprocessing result.
Optionally, performing dimension reduction processing on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence dataset, including:
extracting at least one target linear combination from the data preprocessing result by utilizing a Principal Component Analysis (PCA) algorithm;
and performing dimensionality reduction processing on the target linear combination to obtain a training characteristic component of the sequence data set.
Optionally, the training feature component is input into a preset prediction model for processing, so as to obtain a trained cancer prediction model, including:
inputting the training characteristic components into a preset gradient boosting decision tree (GBDT) prediction model for processing to obtain an iterative gradient direction of each training characteristic component;
fitting the training characteristic components and the iterative gradient direction of each training characteristic component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and area data corresponding to each leaf node;
obtaining a target output value of a fitting leaf node according to the fitting leaf node information;
and carrying out model iteration on the target output value to obtain a trained cancer prediction model.
Optionally, the trained cancer prediction model is:
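$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$
wherein $I(x \in R_{m+1,j})$ is a 0/1 function; $f_m(x)$ is the model after m rounds of learning; $f_{m+1}(x)$ is the fitting function of the decision tree for round m+1; and $c_{m+1,j}$ is the target output value of round m+1.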
The present invention also provides a device for predicting a recurrence rate of cancer, the device comprising:
the acquisition module is used for acquiring health index data of a patient;
the processing module is used for carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
The present invention provides a computing device comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the method for predicting a recurrence rate of cancer as described above.
The present invention provides a computer readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method as described above.
The scheme of the invention at least comprises the following beneficial effects:
according to the scheme, the health index data of the patient is acquired; carrying out feature extraction processing on the health index data to obtain feature component data of the health index data; inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate; the characteristics of the data set can be analyzed and counted, learning adjustment can be performed by combining a large number of data sets, rapid prediction analysis is achieved, a prediction result with high accuracy is obtained, and the prediction model is more effective and accurate.
Drawings
FIG. 1 is a schematic flow chart of a method for predicting recurrence rate of cancer according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a device for predicting cancer recurrence rate according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a method for predicting a recurrence rate of cancer, including:
step 11, acquiring health index data of a patient;
step 12, performing feature extraction processing on the health index data to obtain feature component data of the health index data;
and step 13, inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
In the embodiment, health index data affecting cancer patients are collected, characteristic component data are extracted from the health index data, and the characteristic component data are input into a trained cancer prediction model for prediction processing to obtain a prediction result of cancer recurrence rate; the rapid prediction analysis is realized, and the prediction result with high accuracy is obtained, so that the prediction model is more effective and accurate.
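The data flow of steps 11 to 13 can be sketched in a few lines of Python. This is only an illustration of the order of operations: `predict_recurrence`, `toy_extract` and `toy_model` are hypothetical names, and the toy feature extractor and toy model are stand-ins, not the PCA and GBDT components of the embodiment.

```python
from typing import Callable, Sequence

def predict_recurrence(
    health_index: Sequence[float],
    extract_features: Callable[[Sequence[float]], Sequence[float]],
    model: Callable[[Sequence[float]], float],
) -> float:
    """Step 11 supplies health_index; steps 12 and 13 are applied in order."""
    features = extract_features(health_index)  # step 12: feature extraction
    return model(features)                     # step 13: prediction

# Hypothetical stand-ins, used only to make the sketch runnable.
def toy_extract(values: Sequence[float]) -> Sequence[float]:
    return [sum(values) / len(values)]

def toy_model(features: Sequence[float]) -> float:
    return min(1.0, max(0.0, features[0] / 10.0))

result = predict_recurrence([2.0, 4.0, 6.0], toy_extract, toy_model)
print(result)
```

Passing the extractor and the model as arguments keeps the prediction step independent of how either one was trained.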
The health index data preferably includes: tumor module information data; basic module information data; microenvironment module information data; immunization module information data; nutrition module information data; lifestyle module information data; psychology module information data; exercise module information data; and advanced operation module information data;
the health index data comprises information data of a plurality of modules that characterize the current physical condition of the cancer patient, namely a tumor module, a basic module, a microenvironment module, an immunization module, a nutrition module, a lifestyle module, a psychology module, an exercise module and an advanced operation module; each module is characterized by multi-dimensional secondary features, that is, each module comprises multi-dimensional secondary feature data.
In an alternative embodiment of the present invention, the cancer prediction model is trained by the following process:
step a, acquiring a training sequence data set for training;
b, performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;
and c, inputting the training characteristic components into a preset prediction model for processing to obtain a trained cancer prediction model.
In this embodiment, training feature components are extracted from the training sequence data set and input into the preset prediction model for processing, finally yielding the trained cancer prediction model.
In an optional embodiment of the present invention, step b includes:
b1, preprocessing the data of the training sequence data set to obtain a data preprocessing result;
and b2, performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.
In this embodiment, the training sequence data set is preferably preprocessed by a data filtering technology to obtain a data preprocessing result; the preset feature extraction algorithm is preferably a Principal Component Analysis (PCA) algorithm, which is used to perform dimensionality reduction on the data preprocessing result to obtain the training feature components of the sequence data set.
The data filtering technology comprises data cleaning processing and data completion processing;
in an alternative embodiment of the present invention, step b1 includes:
b11, performing cleaning processing on the training sequence data set to obtain first data and second data; the first data comprise training sequence data that were not collected and/or training sequence data whose number of missing values is greater than or equal to a preset threshold, and the second data comprise training sequence data whose number of missing values is smaller than the preset threshold;
b12, performing data clearing processing on the first data to obtain primary processing data;
b13, performing data completion processing on the second data to obtain secondary processing data;
and b14, performing character coding processing on the character data in the secondary processing data to obtain a data preprocessing result.
In this embodiment, the training sequence data set is subjected to data cleaning processing, and first data and second data are obtained by screening. The first data are training sequence data that were not acquired and/or training sequence data whose number of missing values (the missing number) is greater than or equal to a preset threshold; the second data are training sequence data whose missing number is smaller than the preset threshold. The missing number refers to the number of missing items in a whole piece of training sequence data in the training sequence data set, and setting the preset threshold controls the execution efficiency of subsequent data processing;
data clearing processing is performed on the first data, removing from the training sequence data set the training sequence data that were not acquired and/or whose missing number is greater than or equal to the preset threshold. Data completion processing is then performed on the second data, preferably by mean value interpolation, to obtain the secondary processing data. The text data in the secondary processing data are preferably encoded by One-Hot encoding. This completes the preprocessing of the training sequence data set, so that an N-dimensional training sequence data set can be arranged into an M-dimensional training sequence data set, wherein N is greater than M;
One-Hot encoding is a method of encoding the text data in the secondary processing data as computer data; the data preprocessing of the training sequence data set thus comprises data clearing processing, data completion processing, and text encoding processing.
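The clearing and completion steps can be illustrated with a minimal Python sketch. It assumes that each record is a list of numbers, that `None` marks a missing value, and that a record that was never collected has every field missing; the function name `clean_and_impute` and the parameter `max_missing` (the preset threshold) are illustrative, not taken from the source.

```python
from statistics import mean

def clean_and_impute(records, max_missing):
    """Remove records with too many gaps, then mean-fill the remaining gaps."""
    # "First data": records whose missing count is >= the threshold are cleared.
    kept = [r for r in records if sum(v is None for v in r) < max_missing]
    if not kept:
        return []
    # Column means computed from the observed values of the kept records.
    col_means = []
    for j in range(len(kept[0])):
        observed = [r[j] for r in kept if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    # "Second data": each remaining gap is completed with its column mean.
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in kept]

records = [
    [1.0, 2.0, None],
    [3.0, None, 6.0],
    [None, None, None],  # not collected: every field is missing
    [5.0, 4.0, 8.0],
]
cleaned = clean_and_impute(records, max_missing=2)
print(cleaned)  # [[1.0, 2.0, 7.0], [3.0, 3.0, 6.0], [5.0, 4.0, 8.0]]
```

A lower threshold discards more records and leaves fewer values to impute; a higher one keeps more records but relies more heavily on the column means.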
In a specific embodiment, the text data for the index for evaluating the life status includes "good", "general", and "poor", and the "good" is encoded as (1,0,0), "general" is encoded as (0,1,0), and "poor" is encoded as (0,0,1) by One-Hot encoding.
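The encoding in this example can be reproduced with a short Python sketch; the helper name `one_hot` and the constant `LIFE_STATUS` are illustrative.

```python
def one_hot(value, categories):
    """Return a tuple with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if c == value else 0 for c in categories)

LIFE_STATUS = ("good", "general", "poor")
encoded = [one_hot(v, LIFE_STATUS) for v in LIFE_STATUS]
print(encoded)  # [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
```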
In an alternative embodiment of the present invention, step b2 includes:
step b21, extracting at least one target linear combination from the data preprocessing result by using a PCA algorithm;
and b22, performing dimensionality reduction processing on the target linear combination to obtain a training characteristic component of the sequence data set.
In this embodiment, a target linear combination is preferably extracted from the data preprocessing result by a PCA algorithm; the target linear combination may be:
$$f_i = \beta' x_i = \sum_{j=1}^{p} \beta_j x_{ij}$$
wherein $\beta = (\beta_1, \beta_2, \ldots, \beta_p)'$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$; $\beta_j$ is the weight coefficient of the corresponding j-th variable; $\beta'$ is the set of weight coefficients of the corresponding training sequence data; p is the dimension of the training sequence data set; $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the observed variables; and, in $x_{ij}$, i denotes an observation object of the training sequence data set. The weight coefficients $\beta'$ of the training sequence data are used to comprehensively characterize the recurrence condition of the cancer patient;
preferably, the dimensionality after dimensionality reduction can be preset, training characteristic components in the data preprocessing result are extracted according to a PCA algorithm, and specifically, the training characteristic components are extracted through a variance formula:
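$$\mathrm{Var}(f) = \frac{1}{n}\sum_{i=1}^{n}\left(f_i - \bar{f}\right)^2, \qquad f_i = \sum_{j=1}^{p}\beta_j x_{ij}, \qquad \bar{f} = \frac{1}{n}\sum_{i=1}^{n} f_i$$
wherein $\mathrm{Var}(f)$ is the variance of the calculated data; n is the number of samples; $\beta_j$ is the weight coefficient of the corresponding j-th training feature component; and $x_{ij}$ is the set of observed variables;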
the condition that the 1 st training characteristic component needs to be satisfied can be obtained:
$$\max_{\beta_1}\; V(\beta_1' x) \quad \text{subject to} \quad \beta_1' \beta_1 = 1$$
where V is shorthand for the variance $\mathrm{Var}(f)$, i.e. V is the variance of the calculated data; $\beta_1$ is the weight coefficient of the corresponding 1st training feature component; and $\beta_1'$ is the transpose of the weight coefficients of the 1st training feature component;
the obtained k-th training feature component needs to satisfy the condition:
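$$\max_{\beta_k}\; V(\beta_k' x) \quad \text{subject to} \quad \beta_k' \beta_k = 1, \quad \mathrm{Cov}(\beta_k' x, \beta_j' x) = 0 \;\; (j = 1, \ldots, k-1)$$
wherein V is shorthand for the above variance $\mathrm{Var}(f)$; $\beta_k$ is the weight coefficient of the corresponding k-th training feature component; and $\beta_k'$ is the transpose of the weight coefficients of the corresponding k-th training feature component;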
the PCA algorithm yields training feature components reduced to the preset dimension, which greatly reduces the calculation time of the algorithm; at the same time, determining the weight coefficients maximizes the variance of the training feature components obtained by principal component analysis. Ordered by variance from largest to smallest, the components are the 1st training feature component, the 2nd training feature component, …, and the k-th training feature component, the 1st having the largest variance. The training feature components obtained by the PCA algorithm are linear combinations of the data preprocessing results obtained from the training sequence data set, are uncorrelated with each other, and represent the important training feature components in those results. For example, assuming that each health index information module is characterized by multi-dimensional secondary features whose dimension is up to 125, and that the dimension k after dimension reduction is set to 15, the 65 dimensions of preprocessed data are reduced to 15 dimensions by the PCA algorithm; the main feature components of the high-dimensional full-sequence patient data are extracted, and the calculation time of the algorithm is greatly reduced.
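The variance-maximizing property described above can be illustrated with a small, dependency-free Python sketch. It is not the embodiment's implementation: it assumes the 1/n covariance convention, finds each unit-length weight vector by power iteration, and removes the variance already explained by deflation; the function name `pca` and the toy data are illustrative.

```python
import math

def pca(rows, k, iterations=200):
    """Return (components, scores) for the top-k principal components."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Covariance matrix of the centered data (1/n convention, an assumption).
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
           for a in range(p)]
    components = []
    for _ in range(k):
        # Asymmetric unit start vector, unlikely to be orthogonal to the target.
        beta = [1.0 / (j + 1) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in beta))
        beta = [v / norm for v in beta]
        for _step in range(iterations):
            nxt = [sum(cov[a][b] * beta[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            beta = [v / norm for v in nxt]  # keeps beta' beta = 1
        # Variance captured along beta, then deflate so the next component
        # is uncorrelated with this one.
        lam = sum(beta[a] * cov[a][b] * beta[b]
                  for a in range(p) for b in range(p))
        components.append(beta)
        cov = [[cov[a][b] - lam * beta[a] * beta[b] for b in range(p)]
               for a in range(p)]
    scores = [[sum(c[j] * comp[j] for j in range(p)) for comp in components]
              for c in centered]
    return components, scores

# Toy data lying on the line x2 = 2 * x1, so one component explains everything.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
components, scores = pca(rows, k=1)
print(components[0])  # approximately (1, 2) / sqrt(5)
```

In practice a library routine based on singular value decomposition would normally be preferred; the loop above is only meant to show the constrained maximization.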
In an optional embodiment of the present invention, step c includes:
step c1, inputting the training characteristic components into a preset GBDT prediction model for processing to obtain the iterative gradient direction of each training characteristic component;
step c2, fitting the training feature components and the iterative gradient direction of each training feature component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and area data corresponding to each leaf node;
step c3, obtaining target output values of the fitting leaf nodes according to the fitting leaf node information;
and c4, carrying out model iteration on the target output value to obtain a trained cancer prediction model.
In this embodiment, the set of training feature components is T = {(x1, y1), (x2, y2), …, (xn, yn)}. The GBDT (Gradient Boosting Decision Tree) algorithm reduces the value of the error function in a forward stagewise regression manner, continuously adding new decision trees without changing the parameters of the existing decision trees; model iteration on the target output values is thereby realized, and the trained cancer prediction model is obtained;
optionally, the loss function corresponding to the set of training feature components is:
$$L(f) = \sum_{i=1}^{n} L(y_i, f(x_i))$$
wherein $L(f)$ is the loss function corresponding to the training feature components, and $(x_i, y_i)$ is a training feature component.
When the model after m rounds of learning is $f_m(x)$ and the loss function is $L(y, f_m(x))$, the iterative gradient direction of the training feature components for round m+1 is obtained as follows:
$$r_{i,m+1} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_m(x)}$$
wherein $r_{i,m+1}$ is the iterative gradient direction of the training feature component in round m+1; $L(y, f(x_i))$ is the loss function corresponding to $f(x_i)$; $f(x_i)$ is a training feature component; and $f_m(x_i)$ is the i-th component of the model after m rounds of learning.
Note that the gradient direction is a partial differential in the loss function calculation and takes a negative sign.
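To make the sign convention concrete, here is a minimal Python sketch. It assumes the squared loss L(y, f) = (y − f)²/2, which the source does not specify; the function names are illustrative. Under that assumption the negative gradient is simply the residual y − f, and a finite-difference check confirms the sign.

```python
def squared_loss(y, f):
    """Assumed loss: L(y, f) = 0.5 * (y - f)**2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f):
    """Analytic -dL/df for the squared loss, evaluated at prediction f."""
    return y - f

def numeric_negative_gradient(y, f, eps=1e-6):
    """Central finite difference of the loss, with the sign flipped."""
    return -(squared_loss(y, f + eps) - squared_loss(y, f - eps)) / (2 * eps)

y_true = [3.0, 5.0, 8.0]
f_m = [4.0, 4.0, 4.0]  # predictions of the model after m rounds
residuals = [negative_gradient(y, f) for y, f in zip(y_true, f_m)]
print(residuals)  # [-1.0, 1.0, 4.0]
```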
Fitting the training characteristic components and the iterative gradient direction of each training characteristic component yields the number J of fitting leaf nodes and the area data $R_{m+1,j}$ (j = 1, 2, …, J) corresponding to each leaf node;
Using the number J of fitting leaf nodes and the area data $R_{m+1,j}$ (j = 1, 2, …, J) corresponding to each leaf node, the target output value of each fitting leaf node in the fitting leaf node information is obtained by the formula:
$$c_{m+1,j} = \arg\min_{c} \sum_{x_i \in R_{m+1,j}} L(y_i, f_m(x_i) + c)$$
wherein $c_{m+1,j}$ is the target output value; c is a constant, specifically an initialization hyper-parameter; and $L(y, f_m(x_i))$ is the loss function corresponding to $f_m(x_i)$;
the target output value is the optimal output value of the fitting leaf node, namely the value that minimizes the loss function.
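As a worked special case, suppose the loss is the squared loss $L(y, f) = \tfrac{1}{2}(y - f)^2$. The source does not fix a loss function, so this choice is an assumption made only for illustration. The optimal leaf output then has a closed form: it is the mean of the residuals of the samples that fall in the leaf region, where $|R_{m+1,j}|$ denotes the number of such samples.

```latex
% Assumption: squared loss L(y, f) = \tfrac{1}{2}(y - f)^2
\begin{aligned}
c_{m+1,j} &= \arg\min_{c} \sum_{x_i \in R_{m+1,j}} \tfrac{1}{2}\bigl(y_i - f_m(x_i) - c\bigr)^2 \\
0 &= \frac{\partial}{\partial c} \sum_{x_i \in R_{m+1,j}} \tfrac{1}{2}\bigl(y_i - f_m(x_i) - c\bigr)^2
   = -\sum_{x_i \in R_{m+1,j}} \bigl(y_i - f_m(x_i) - c\bigr) \\
c_{m+1,j} &= \frac{1}{|R_{m+1,j}|} \sum_{x_i \in R_{m+1,j}} \bigl(y_i - f_m(x_i)\bigr)
\end{aligned}
```

For other loss functions the minimization generally has no closed form and is solved numerically.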
The fitting function of the decision tree of round m+1 can also be obtained as:
$$h_{m+1}(x) = \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$
wherein $I(x \in R_{m+1,j})$ is a 0/1 function used to determine membership of a fitting leaf node, returning 1 if x belongs to the fitting leaf node and 0 if it does not; $h_{m+1}(x)$ is the fitting function of the decision tree for round m+1; and $c_{m+1,j}$ is the target output value of round m+1;
in an alternative embodiment of the present invention, the trained cancer prediction model is:
$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j}\, I(x \in R_{m+1,j})$$
wherein $I(x \in R_{m+1,j})$ is a 0/1 function; $f_m(x)$ is the model after m rounds of learning; $f_{m+1}(x)$ is the fitting function of the decision tree for round m+1; and $c_{m+1,j}$ is the target output value of round m+1.
In this embodiment, the prediction model $f_{m+1}(x)$ is generated by iterating continuously to reduce the value of the loss function.
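The round-by-round procedure described above (fit a new tree to the negative gradient, compute the leaf outputs, add the tree to the existing model without changing earlier trees) can be sketched in plain Python. The sketch makes several assumptions that are not in the source: squared loss, depth-1 trees (two leaf regions per round), a 0/1 recurrence label treated as a regression target, and illustrative names such as `fit_stump` and `fit_gbdt`.

```python
def fit_stump(X, r):
    """Fit a two-leaf tree to residuals r; returns (feature, threshold, c_left, c_right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            # Under squared loss the optimal leaf output is the residual mean.
            c_l, c_r = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - c_l) ** 2 for ri in left)
                   + sum((ri - c_r) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, c_l, c_r)
    if best is None:  # no feature can be split: fall back to a constant leaf
        c = sum(r) / len(r)
        return (0, float("inf"), c, c)
    return best[1:]

def stump_predict(stump, row):
    j, t, c_l, c_r = stump
    return c_l if row[j] <= t else c_r

def fit_gbdt(X, y, rounds=5):
    """Forward stagewise fitting: earlier trees are never modified."""
    f0 = sum(y) / len(y)          # initial constant model
    preds = [f0] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + stump_predict(stump, row) for p, row in zip(preds, X)]
    return f0, stumps

def gbdt_predict(model, row):
    f0, stumps = model
    return f0 + sum(stump_predict(s, row) for s in stumps)

# Toy data: one feature component, label 1 = recurrence, 0 = no recurrence.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0.0, 0.0, 1.0, 1.0]
model = fit_gbdt(X, y)
print(gbdt_predict(model, [1.0]), gbdt_predict(model, [4.0]))  # 0.0 1.0
```

A production model would typically use deeper trees, a shrinkage (learning-rate) factor and a classification loss; those are omitted here to keep the update rule visible.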
As shown in fig. 2, in a specific embodiment, historical health index data of a patient are obtained through data collection; these may specifically include the patient's systemic disease information, family genetic history information, unnecessary dependency information, age, degree of obesity, and the like. The health index data of the cancer patient are sorted to obtain a training sequence data set for training. The training sequence data set is then preprocessed: missing values are completed for the small portion of training sequence data that is partially missing, non-acquired training sequence data are removed, and text data in the training sequence data set are encoded by One-Hot encoding. The preprocessed full sequence data set is reduced in dimension by the PCA algorithm to extract the training feature components, and the training feature components are learned and trained by the GBDT algorithm to generate the trained cancer prediction model.
Embodiments of the invention predict a cancer-associated index in a patient by combining PCA dimension reduction with the GBDT method. The prediction model integrates hundreds of cancer-related parameter indexes of a patient; the large-scale data set consisting of these hundreds of data items is processed with the PCA algorithm, and the processed result is then operated on by the GBDT algorithm, finally yielding the required result for the related indexes of cancer prediction.
The prediction model in the above embodiment of the present invention can not only analyze and compute statistics on the characteristics of the data set; combined with learning adjustment over a large number of patient data sets, it can also perform rapid prediction analysis on a new patient (new sample data) and obtain a prediction result with an accuracy of 84%, which is more effective and accurate than the conventional method.
The prediction model in the embodiment of the invention is also transversely compared with the prediction results of other similar algorithm models, and the result shows that the comprehensive index of the prediction result of the model is the best, namely the model is more effective.
As shown in fig. 3, the present invention also provides a device 30 for predicting a recurrence rate of cancer, the device 30 comprising:
an obtaining module 31, configured to obtain health index data of a patient;
the processing module 32 is configured to perform feature extraction processing on the health index data to obtain feature component data of the health index data; and inputting the characteristic component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
Optionally, the cancer prediction model is trained by:
acquiring a training sequence data set for training;
performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;
and inputting the training characteristic components into a preset prediction model for processing to obtain a trained cancer prediction model.
Optionally, performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set, including:
carrying out data preprocessing on the training sequence data set to obtain a data preprocessing result;
and performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.
Optionally, performing data preprocessing on the training sequence data set to obtain a data preprocessing result, including:
cleaning the training sequence data set to obtain first data and second data, wherein the first data comprise training sequence data that were not collected and/or training sequence data whose number of missing values is greater than or equal to a preset threshold, and the second data comprise training sequence data whose number of missing values is less than the preset threshold;
performing data clearing processing on the first data to obtain primary processing data;
performing data completion processing on the second data to obtain secondary processing data;
and performing character encoding on the character data in the secondary processing data to obtain the data preprocessing result.
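The preprocessing steps above can be sketched with the Python standard library alone. The text does not specify the completion or encoding strategy, so the choices here are assumptions: missing numeric values are filled with the column mean, missing text values with the most frequent label, and text labels are mapped to integer codes. The function name `preprocess`, the parameter `max_missing`, and the sample rows are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Union

Value = Optional[Union[float, str]]
Row = Dict[str, Value]


def preprocess(rows: List[Row], max_missing: int) -> List[Dict[str, float]]:
    # 1. Cleaning and clearing: drop uncollected (empty) rows and rows whose
    #    number of missing values reaches the preset threshold.
    kept = [dict(r) for r in rows
            if r and sum(v is None for v in r.values()) < max_missing]
    if not kept:
        return []
    columns = list(kept[0])

    # 2. Completion: fill the remaining gaps (strategy is an assumption).
    for col in columns:
        present = [r[col] for r in kept if r[col] is not None]
        if not present:
            fill = 0.0
        elif isinstance(present[0], str):
            # Most frequent label; ties resolved alphabetically.
            fill = max(sorted(set(present)), key=present.count)
        else:
            fill = mean(present)
        for r in kept:
            if r[col] is None:
                r[col] = fill

    # 3. Character encoding: map each text label to an integer code.
    codebooks = {}
    for col in columns:
        if isinstance(kept[0][col], str):
            labels = sorted({r[col] for r in kept})
            codebooks[col] = {label: float(i) for i, label in enumerate(labels)}

    return [{col: codebooks[col][r[col]] if col in codebooks else float(r[col])
             for col in columns} for r in kept]


rows = [
    {"age": 61.0, "marker": 3.2, "stage": "II"},
    {"age": None, "marker": 4.8, "stage": "III"},
    {"age": 55.0, "marker": None, "stage": None},  # two gaps: dropped
    {"age": 70.0, "marker": 4.0, "stage": None},
]
clean = preprocess(rows, max_missing=2)
```

With `max_missing=2`, the third row is discarded, the missing age in the second row is filled with the mean of the remaining ages, and the `stage` labels become numeric codes.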
Optionally, performing dimension reduction processing on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence dataset, including:
extracting at least one target linear combination from the data preprocessing result by utilizing a Principal Component Analysis (PCA) algorithm;
and performing dimensionality reduction processing on the target linear combination to obtain a training feature component of the sequence data set.
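As a rough illustration of the PCA step, the sketch below centers the data, builds the sample covariance matrix, extracts the leading principal directions (the "target linear combinations"), and projects the data onto them. To stay within the standard library, power iteration with deflation is used in place of a library eigen-solver; that substitution, the function name `pca`, and the toy data are assumptions made here, not details from the text.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def pca(data: Matrix, n_components: int, iters: int = 200) -> Tuple[Matrix, Matrix]:
    """Return (components, projected data) via power iteration with deflation."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    components: Matrix = []
    for _ in range(n_components):
        # Deterministic start vector; it must not be orthogonal to the
        # leading eigenvector for power iteration to converge to it.
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]

    # Each projected value is a linear combination of the centered features.
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, projected


# Two perfectly correlated features reduced to one component.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
components, projected = pca(data, n_components=1)
```

In this toy data the second feature is exactly twice the first, so a single component proportional to (1, 2) captures all of the variance.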
Optionally, the training feature component is input into a preset prediction model for processing, so as to obtain a trained cancer prediction model, including:
inputting the training feature components into a preset gradient boosting decision tree (GBDT) prediction model for processing to obtain an iterative gradient direction of each training feature component;
fitting the training feature components and the iterative gradient direction of each training feature component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and the region data corresponding to each leaf node;
obtaining a target output value of a fitting leaf node according to the fitting leaf node information;
and carrying out model iteration on the target output value to obtain a trained cancer prediction model.
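The training loop described above can be sketched as a small gradient boosting routine. The text does not state the loss function or tree depth, so this sketch assumes squared-error loss (for which the negative gradient is simply the residual and the leaf output is the mean residual in each leaf region) and depth-1 trees with two leaf regions. The names `fit_stump`, `train_gbdt`, `predict` and the toy data are hypothetical.

```python
from typing import List, Tuple

# (feature index, threshold, left leaf value, right leaf value)
Stump = Tuple[int, float, float, float]


def fit_stump(X: List[List[float]], r: List[float]) -> Stump:
    """Fit a depth-1 regression tree to the gradient directions r."""
    best = None
    best_err = float("inf")
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [ri for row, ri in zip(X, r) if row[f] <= t]
            right = [ri for row, ri in zip(X, r) if row[f] > t]
            # Leaf outputs: mean residual inside each leaf region.
            c_left, c_right = sum(left) / len(left), sum(right) / len(right)
            err = (sum((ri - c_left) ** 2 for ri in left)
                   + sum((ri - c_right) ** 2 for ri in right))
            if err < best_err:
                best_err, best = err, (f, t, c_left, c_right)
    if best is None:  # every feature is constant: a single leaf region
        c = sum(r) / len(r)
        return (0, float("inf"), c, c)
    return best


def train_gbdt(X: List[List[float]], y: List[float],
               rounds: int = 5) -> Tuple[float, List[Stump]]:
    f0 = sum(y) / len(y)  # initial constant model
    preds = [f0] * len(y)
    stumps: List[Stump] = []
    for _ in range(rounds):
        # Negative gradient of 1/2 * (y - f(x))^2 with respect to f(x).
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        f, t, c_left, c_right = fit_stump(X, residuals)
        stumps.append((f, t, c_left, c_right))
        # Model iteration: add this round's leaf outputs to the model.
        preds = [p + (c_left if row[f] <= t else c_right)
                 for p, row in zip(preds, X)]
    return f0, stumps


def predict(f0: float, stumps: List[Stump], x: List[float]) -> float:
    out = f0
    for f, t, c_left, c_right in stumps:
        out += c_left if x[f] <= t else c_right
    return out


# Toy data: label 1.0 marks recurrence (hypothetical values).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
f0, stumps = train_gbdt(X, y, rounds=5)
low, high = predict(f0, stumps, [2.0]), predict(f0, stumps, [5.0])
```

A production model for a 0/1 recurrence label would more typically use a logistic loss and deeper trees; squared error is used here only because it keeps the gradient and leaf-value steps easy to follow.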
Optionally, the trained cancer prediction model is:
Figure BDA0003255724790000121
wherein $I(x \in R_{m+1,j})$ is a 0/1 indicator function, $f_{m+1}(x)$ is the fitting function of the decision tree in round $m+1$, and $c_{m+1}$ is the target output value of round $m+1$.
This device embodiment corresponds to the method for predicting a cancer recurrence rate described above; all implementations in the above method embodiments apply to this device embodiment and achieve the same technical effects.
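The formula image for the trained model is not reproduced in this text. Assuming the model follows the standard GBDT additive update, which is consistent with the symbol definitions given for it, it could be written as below. The previous-round model $f_m(x)$, the leaf count $J$, and the per-leaf index $j$ on the output value $c_{m+1,j}$ are assumptions introduced here rather than symbols confirmed by the source.

```latex
f_{m+1}(x) = f_{m}(x) + \sum_{j=1}^{J} c_{m+1,j}\, I\left(x \in R_{m+1,j}\right)
```

Under this reading, each round adds the output value of whichever leaf region $R_{m+1,j}$ the sample $x$ falls into, since the indicator is 1 for exactly one region and 0 for the others.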
The present invention also provides a computing device comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the method for predicting a recurrence rate of cancer as described above.
Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method as described above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
Furthermore, it is to be noted that in the device and method of the invention, it is obvious that the individual components or steps can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of performing the series of processes described above may naturally be performed chronologically in the order described, but need not necessarily be performed chronologically, and some steps may be performed in parallel or independently of each other. It will be understood by those skilled in the art that all or any of the steps or elements of the method and apparatus of the present invention may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or any combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present invention.
Thus, the objects of the invention may also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The objects of the invention may therefore also be achieved merely by providing a program product containing program code that implements the method or the apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. The storage medium may be any known storage medium or any storage medium developed in the future.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method for predicting the recurrence rate of cancer, comprising:
acquiring health index data of a patient;
carrying out feature extraction processing on the health index data to obtain feature component data of the health index data;
and inputting the feature component data of the health index data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
2. The method of predicting cancer recurrence rate of claim 1, wherein said cancer prediction model is trained by the following process:
acquiring a training sequence data set for training;
performing feature extraction on the training sequence data set to obtain a training feature component of the training sequence data set;
and inputting the training feature components into a preset prediction model for processing to obtain a trained cancer prediction model.
3. The method of claim 2, wherein the feature extraction of the training sequence data set to obtain the training feature component of the training sequence data set comprises:
carrying out data preprocessing on the training sequence data set to obtain a data preprocessing result;
and performing dimensionality reduction on the data preprocessing result by using a preset feature extraction algorithm to obtain a training feature component of the sequence data set.
4. The method of claim 3, wherein the pre-processing the training sequence data set to obtain a pre-processing result comprises:
cleaning the training sequence data set to obtain first data and second data, wherein the first data comprise training sequence data that were not collected and/or training sequence data whose number of missing values is greater than or equal to a preset threshold, and the second data comprise training sequence data whose number of missing values is less than the preset threshold;
performing data clearing processing on the first data to obtain primary processing data;
performing data completion processing on the second data to obtain secondary processing data;
and performing character encoding on the character data in the secondary processing data to obtain the data preprocessing result.
5. The method of claim 3, wherein the performing dimension reduction on the data preprocessing result by using a predetermined feature extraction algorithm to obtain the training feature component of the sequence data set comprises:
extracting at least one target linear combination from the data preprocessing result by utilizing a Principal Component Analysis (PCA) algorithm;
and performing dimensionality reduction processing on the target linear combination to obtain a training feature component of the sequence data set.
6. The method of claim 2, wherein the step of inputting the training feature components into a predetermined prediction model for processing to obtain a trained cancer prediction model comprises:
inputting the training feature components into a preset gradient boosting decision tree (GBDT) prediction model for processing to obtain an iterative gradient direction of each training feature component;
fitting the training feature components and the iterative gradient direction of each training feature component to obtain fitting leaf node information; the fitting leaf node information comprises the number of fitting leaf nodes and the region data corresponding to each leaf node;
obtaining a target output value of a fitting leaf node according to the fitting leaf node information;
and carrying out model iteration on the target output value to obtain a trained cancer prediction model.
7. The method of claim 6, wherein the trained cancer prediction model is:
Figure FDA0003255724780000021
wherein $I(x \in R_{m+1,j})$ is a 0/1 indicator function, $f_{m+1}(x)$ is the fitting function of the decision tree in round $m+1$, and $c_{m+1}$ is the target output value of round $m+1$.
8. An apparatus for predicting a recurrence rate of cancer, the apparatus comprising:
an obtaining module, configured to obtain health index data of a patient;
a processing module, configured to perform feature extraction on the health index data to obtain feature component data of the health index data, and to input the feature component data into a trained cancer prediction model for prediction processing to obtain a prediction result of the cancer recurrence rate.
9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction which causes the processor to execute the operation corresponding to the prediction method of cancer recurrence rate as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202111059325.3A 2021-09-10 2021-09-10 Method, device and equipment for predicting cancer recurrence rate Pending CN113724875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111059325.3A CN113724875A (en) 2021-09-10 2021-09-10 Method, device and equipment for predicting cancer recurrence rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111059325.3A CN113724875A (en) 2021-09-10 2021-09-10 Method, device and equipment for predicting cancer recurrence rate

Publications (1)

Publication Number Publication Date
CN113724875A (en) 2021-11-30

Family

ID=78683086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111059325.3A Pending CN113724875A (en) 2021-09-10 2021-09-10 Method, device and equipment for predicting cancer recurrence rate

Country Status (1)

Country Link
CN (1) CN113724875A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818824A (en) * 2017-04-10 2018-03-20 平安科技(深圳)有限公司 A kind of health model construction method and terminal for health evaluating
CN110660481A (en) * 2019-09-27 2020-01-07 颐保医疗科技(上海)有限公司 Artificial intelligence technology-based primary liver cancer recurrence prediction method
CN110956303A (en) * 2019-10-12 2020-04-03 未鲲(上海)科技服务有限公司 Information prediction method, device, terminal and readable storage medium
CN111739642A (en) * 2020-06-23 2020-10-02 杭州和壹医学检验所有限公司 Colorectal cancer risk prediction method and system, computer equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
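刘军 et al.: "Gene Chip Preparation and Data Analysis Technology" (《基因芯片制备及数据分析技术》), 30 April 2015, Xidian University Press (西安电子科技大学出版社) *
刘建平PINARD: "Summary of the Principles of Gradient Boosting Decision Trees (GBDT)" (梯度提升树(GBDT)原理小结), 《HTTP://WWW.CNBLOGS.COM/PINARD/P/6140514.HTML》 *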

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20211130