CN111383766A

CN111383766A - Computer data processing method, device, medium and electronic equipment

Info

Publication number: CN111383766A
Application number: CN201811622785.0A
Authority: CN
Inventors: 孙颖; 姚季金; 刘水清; 郎超; 梁玮
Original assignee: Yidu Cloud Beijing Technology Co Ltd; Sun Yat Sen University Cancer Center
Current assignee: Yidu Cloud Beijing Technology Co Ltd; Sun Yat Sen University Cancer Center
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2020-07-07

Abstract

Embodiments of the invention provide a computer data processing method, device, medium and electronic equipment, belonging to the technical field of electric data processing. The computer data processing method comprises the following steps: acquiring a first preset number of current disease features of a target object related to a target disease; selecting a second preset number of target current disease features from the first preset number of current disease features; and obtaining a current endpoint feature of the target object according to the target current disease features by using a trained machine learning-based survival prediction model. With the technical scheme of the embodiments, a machine learning-based survival prediction model can be trained on historical data of the target disease so as to predict the endpoint feature of the target object for the target disease.

Description

Computer data processing method, device, medium and electronic equipment

Technical Field

The invention relates to the technical field of electrical data processing, in particular to a computer data processing method, a computer data processing device, a computer data processing medium and electronic equipment.

Background

According to the World Health Organization, 40% of nasopharyngeal carcinoma cases worldwide occur in China, most commonly in Guangdong, which is why the disease is also called the "Guangdong tumor". Because the primary site of nasopharyngeal carcinoma is concealed, more than 70 percent of patients already have invasion of the surrounding skull and cervical lymph node metastasis at their first hospital visit, which leads to a poor prognosis. How to improve prognosis evaluation for nasopharyngeal carcinoma patients has long been a research focus of scholars all over the world.

Therefore, a new computer data processing method, apparatus, computer readable medium and electronic device are needed.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

Embodiments of the present invention provide a computer data processing method, apparatus, medium, and electronic device, so as to overcome, at least to some extent, the problem in the related art that the prognosis effect of a related disease cannot be accurately predicted.

Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.

According to an aspect of the present disclosure, there is provided a computer data processing method including: acquiring a first preset number of current disease features of a target object related to a target disease; selecting a second preset number of target current disease features from the first preset number of current disease features; and obtaining a current endpoint feature of the target object according to the target current disease features by using a trained machine learning-based survival prediction model.

In an exemplary embodiment of the present disclosure, the method further comprises: acquiring historical medical record data meeting preset conditions as a sample; randomly distributing the samples into a training set and a testing set; training the survival prediction model by using the training set; validating the survival prediction model using the test set.

In one exemplary embodiment of the present disclosure, the target disease is nasopharyngeal carcinoma; wherein training the survival prediction model using the training set comprises: extracting the first preset number of historical disease features and historical endpoint features associated with nasopharyngeal carcinoma from the training set; selecting the second preset number of target historical disease features from the first preset number of historical disease features; and training the survival prediction model according to the second preset number of target historical disease features and the historical end point features thereof.

In an exemplary embodiment of the present disclosure, selecting the second preset number of target historical disease features from the first preset number of historical disease features comprises: judging whether the first preset number of historical disease features have significant influence on the historical endpoint features through a hypothesis testing process by utilizing multi-factor analysis of variance; selecting the second preset number of target historical disease features that have a significant impact on the historical endpoint features from the first preset number of historical disease features.

In an exemplary embodiment of the present disclosure, the first preset number of historical disease features includes the gender, age, WHO pathology type at the time of initial diagnosis, smoking history, drinking history, family history, T stage, N stage, hemoglobin value and albumin value of the nasopharyngeal carcinoma patient.

In an exemplary embodiment of the present disclosure, the preset conditions are that the historical medical record data has nasopharyngeal carcinoma as the discharge diagnosis, shows no metastasis at initial diagnosis, and includes follow-up data.

In an exemplary embodiment of the present disclosure, the follow-up data includes time to death, time to metastasis, site of metastasis, time to relapse, and site of relapse; the historical endpoint features include overall survival, metastasis-free survival, and relapse-free survival; and extracting historical endpoint features associated with nasopharyngeal carcinoma from the training set comprises: obtaining the historical endpoint features according to the follow-up data.

According to an aspect of the present disclosure, there is provided a computer data processing apparatus comprising: a current disease feature acquisition module configured to acquire a first preset number of current disease features of a target object related to a target disease; a target disease feature acquisition module configured to select a second preset number of target current disease features from the first preset number of current disease features; and an endpoint feature prediction module configured to obtain a current endpoint feature of the target object according to the target current disease features by using a trained machine learning-based survival prediction model.

According to an aspect of the present disclosure, there is provided a computer device comprising a memory and a computer program stored on the memory and executable on a processor, the processor implementing the method according to any of the embodiments described above when executing the program.

According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method according to any of the embodiments described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

in the technical solutions provided in some embodiments of the present invention, on one hand, a trained machine learning-based survival prediction model is used to automatically learn a complex relationship between a target disease characteristic and an endpoint characteristic of a target disease, so that the trained survival prediction model can be used to improve the accuracy of prognosis effect evaluation of the target disease; on the other hand, the second preset number of target current disease features can be selected from the first preset number of current disease features, so that the calculation amount in the model prediction process can be reduced, the complexity of the model is reduced, and the calculation speed is increased.

In other embodiments of the present invention, the method can be applied specifically to predicting the endpoint features of nasopharyngeal carcinoma patients. A nasopharyngeal carcinoma survival prediction model is constructed: real-world historical medical record data is cleaned and processed, and the model is then trained with a machine learning method so that it automatically learns the complex relationship between the disease features and the endpoint features of nasopharyngeal carcinoma patients. For a patient who needs a survival prediction, the model can give an evaluation of the survival rate at different years; for example, it can automatically give predictions of overall survival, metastasis-free survival and recurrence-free survival for nasopharyngeal carcinoma with different disease features. In other words, the prognostic endpoint can be predicted from a variety of nasopharyngeal-carcinoma-related factors of the patient. This can inform doctors' diagnosis and treatment, support individualized diagnosis and treatment of nasopharyngeal carcinoma, and allow different treatment schemes to be recommended according to different survival rates, thereby improving the diagnosis and treatment level and work efficiency of doctors as well as the treatment outcome of patients.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

FIG. 1 schematically shows a flow diagram of a computer data processing method according to an embodiment of the invention;

FIG. 2 schematically shows a flow diagram of a computer data processing method according to another embodiment of the invention;

FIG. 3 schematically shows a flowchart of one embodiment of step S230 in FIG. 2;

FIG. 4 schematically shows a flowchart of one embodiment of step S232 in FIG. 3;

FIG. 5 is a schematic illustration of a nomogram (alignment chart) relating the prognosis of a nasopharyngeal carcinoma patient to the disease features;

FIG. 6 is a schematic diagram showing the use of the nasopharyngeal carcinoma survival prediction model;

FIG. 7 schematically shows a block diagram of a computer data processing apparatus according to an embodiment of the present invention;

FIG. 8 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Fig. 1 schematically shows a flow chart of a computer data processing method according to an embodiment of the invention. The execution subject of the computer data processing method may be a device having a calculation processing function, such as a server and/or a mobile terminal.

As shown in fig. 1, a computer data processing method provided by an embodiment of the present invention may include the following steps.

In step S110, a first preset number of current disease features of the target object related to the target disease are obtained.

In the embodiments of the present invention, the target disease may be nasopharyngeal carcinoma (NPC), which is exemplified in the following description, but the present invention is not limited thereto, and when the target disease is different, the corresponding disease characteristics may be changed accordingly.

In an embodiment of the present invention, if the target disease is nasopharyngeal carcinoma, the target object may be a nasopharyngeal carcinoma patient, the first preset number may be 10, and the current disease features may be the patient's gender, age, WHO (World Health Organization) pathology type, smoking history, alcohol history, family history, T stage, N stage, hemoglobin, albumin, and the like, taken from the patient's data at the time of initial diagnosis.

It should be noted that the current disease features are not limited to the above examples, and may also include, for example, the time of the patient's admission for the first diagnosis of nasopharyngeal carcinoma, whether the patient has heart disease, hypertension, coronary heart disease or diabetes, and the M stage.

In TNM (Tumor, Node, Metastasis) staging, T (tumor) refers to the primary tumor and is denoted T1 to T4 in order of increasing tumor volume and increasing extent of involvement of adjacent tissues. N (node) refers to regional lymph node involvement: N0 denotes no lymph node involvement, and N1 to N4 denote increasing degree and extent of involvement. M (metastasis) refers to distant metastasis (usually hematogenous): M0 denotes no distant metastasis and M1 denotes distant metastasis. On this basis, the three TNM indexes are combined to define a specific stage.
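As an illustration of how the three TNM indexes combine into one stage label, the following is a minimal sketch. The class name `TNMStage` and its validation ranges are assumptions taken directly from the description above, not a data structure defined by the embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TNMStage:
    """Hypothetical container for a TNM descriptor (ranges as described above)."""
    t: int  # primary tumor, T1 to T4
    n: int  # regional lymph node involvement, N0 means not involved
    m: int  # distant metastasis, M0 absent, M1 present

    def __post_init__(self):
        # Validate each index against the range given in the text.
        if not 1 <= self.t <= 4:
            raise ValueError("T must be between 1 and 4")
        if not 0 <= self.n <= 4:
            raise ValueError("N must be between 0 and 4")
        if self.m not in (0, 1):
            raise ValueError("M must be 0 or 1")

    def label(self) -> str:
        """Combine the three indexes into a single label such as 'T2N1M0'."""
        return f"T{self.t}N{self.n}M{self.m}"
```

For example, `TNMStage(t=2, n=1, m=0).label()` yields the label for a T2 tumor with N1 nodal involvement and no distant metastasis.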

In step S120, a second preset number of target current disease features are selected from the first preset number of current disease features.

The specific selection method may be determined according to the method of the model training process described below.

In step S130, a current endpoint feature of the target object is obtained according to the target current disease feature and by using the trained machine learning-based survival prediction model.

For example, the current endpoint feature may include any one or more of OS (overall survival), DMFS (distant metastasis-free survival), RRFS (regional recurrence-free survival), LRFS (local recurrence-free survival), DFS (disease-free survival, i.e., no progression), and the like.

According to the computer data processing method provided by the embodiment of the invention, on one hand, the complex relation between the target disease characteristic and the end point characteristic of the target disease is automatically learned through the trained survival prediction model based on machine learning, so that the trained survival prediction model can be utilized to improve the accuracy of prognosis effect evaluation of the target disease; on the other hand, the second preset number of target current disease features can be selected from the first preset number of current disease features, so that the calculation amount in the model prediction process can be reduced, the complexity of the model is reduced, and the calculation speed is increased.
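Steps S110 to S130 can be sketched as a small pipeline. This is only an illustrative outline: the function names, the numeric encoding of features, and the `model` callable are hypothetical stand-ins, since the embodiments do not fix a programming interface for the trained survival prediction model.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


def select_target_features(current: Features, selected_names: Sequence[str]) -> List[float]:
    """Step S120: keep only the features chosen during training, in a fixed order."""
    missing = [name for name in selected_names if name not in current]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [current[name] for name in selected_names]


def predict_endpoint(
    current: Features,
    selected_names: Sequence[str],
    model: Callable[[List[float]], float],
) -> float:
    """Steps S120 and S130: select the target features, then apply the trained model."""
    return model(select_target_features(current, selected_names))
```

Here `current` plays the role of the first preset number of features acquired in step S110, and `selected_names` is the list of feature names retained by the selection procedure used during training.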

Fig. 2 schematically shows a flow chart of a computer data processing method according to another embodiment of the invention. The execution subject of the computer data processing method may be a device having a calculation processing function, such as a server and/or a mobile terminal.

As shown in fig. 2, the computer data processing method according to the embodiment of the present invention may further include the following steps.

In step S210, historical medical record data satisfying a preset condition is acquired as a sample.

In the embodiment of the invention, the preset condition may be that the historical medical record data has nasopharyngeal carcinoma as the discharge diagnosis, shows no metastasis at initial diagnosis, and includes follow-up data.

Specifically, the historical medical record data is cleaned, structured and normalized, and historical medical record data meeting the following conditions is extracted from the electronic medical record system of a hospital: the discharge diagnosis is nasopharyngeal carcinoma; there is no metastasis at initial diagnosis; and follow-up data is available.
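The three preset conditions amount to a simple record filter. The sketch below assumes a hypothetical `MedicalRecord` structure with already-structured fields; the real electronic medical record schema is not given in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MedicalRecord:
    """Hypothetical structured record; field names are illustrative only."""
    patient_id: str
    discharge_diagnosis: str
    metastasis_at_initial_diagnosis: bool
    follow_up: Optional[dict]  # None when no follow-up data exists


def meets_preset_conditions(record: MedicalRecord) -> bool:
    """True when all three preset conditions hold for the record."""
    return (
        record.discharge_diagnosis == "nasopharyngeal carcinoma"
        and not record.metastasis_at_initial_diagnosis
        and bool(record.follow_up)
    )


def select_samples(records: List[MedicalRecord]) -> List[MedicalRecord]:
    """Step S210: keep only the records that qualify as samples."""
    return [r for r in records if meets_preset_conditions(r)]
```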

In an embodiment of the present invention, the follow-up data may specifically include information of death time, metastasis site, recurrence time, recurrence site, and the like of the nasopharyngeal carcinoma patient. According to the follow-up data, historical endpoint characteristics or historical endpoint indexes such as OS, DMFS, LRFS, RRFS, DFS and the like of the nasopharyngeal carcinoma patients can be counted.
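One way to turn follow-up dates into endpoint values is to compute a duration and an event flag per endpoint. The sketch below is an assumption about how this might be done: the text does not define the start date, the censoring rule, or whether death counts as an event for metastasis-free survival, so those choices are illustrative.

```python
from datetime import date
from typing import Optional, Tuple


def earliest(*dates: Optional[date]) -> Optional[date]:
    """Return the earliest known date, or None if none is known."""
    known = [d for d in dates if d is not None]
    return min(known) if known else None


def time_to_event(start: date, event_date: Optional[date], last_follow_up: date) -> Tuple[int, bool]:
    """Return (days, event_observed); without an event, censor at last follow-up."""
    if event_date is not None:
        return (event_date - start).days, True
    return (last_follow_up - start).days, False


# Illustrative patient (made-up dates): metastasis after one year, still alive.
diagnosis = date(2015, 1, 1)
metastasis = date(2016, 1, 1)
death = None
last_seen = date(2018, 1, 1)

overall_survival = time_to_event(diagnosis, death, last_seen)
# Assumption: metastasis-free survival ends at metastasis or death, whichever is first.
metastasis_free = time_to_event(diagnosis, earliest(metastasis, death), last_seen)
```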

In step S220, the samples are randomly assigned as a training set and a test set.

In step S230, the survival prediction model is trained using the training set.

In step S240, the survival prediction model is validated using the test set.

In the embodiment of the invention, before the model training, the model training set and the test set are divided.

For example, the samples may be divided into two independent parts: a training set and a test set. The training set is used to fit the model, and the test set is used to check how well the finally selected optimal model performs. The division ratio may be 80% of the total samples for the training set and 20% for the test set, both drawn randomly from the samples.
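The random 80/20 division can be sketched as follows. The function name and the fixed seed are assumptions added for reproducibility; the text only specifies that both sets are drawn randomly.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_fraction: float = 0.8, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```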

In the embodiment of the invention, the models are evaluated on the test set, and the model with the best performance is selected as the survival prediction model. For example, the model may be evaluated by accuracy, recall and AUC (Area Under the ROC Curve); in this test the accuracy is 92%, the recall is 88%, and the AUC is 0.86. The ROC curve (receiver operating characteristic curve) is also called the sensitivity curve.
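For reference, the three evaluation measures can be computed as in the sketch below, assuming binary labels (1 for the event, 0 otherwise) and a numeric score per sample. The AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half. This only illustrates the definitions and does not reproduce the figures reported above.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for p in predictions_for_positives) / len(predictions_for_positives)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```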

Fig. 3 schematically shows a flowchart of one embodiment of step S230 in fig. 2.

In step S231, the first preset number of historical disease features and historical endpoint features associated with nasopharyngeal carcinoma are extracted from the training set.

In an embodiment of the present invention, the follow-up data may include death time, metastasis site, recurrence time, and recurrence site; the historical endpoint characteristics may include overall survival, metastasis-free survival, and relapse-free survival.

In an embodiment of the present invention, extracting the historical endpoint features related to nasopharyngeal carcinoma from the training set may include: obtaining the historical endpoint features according to the follow-up data.

Specifically, after the historical medical record data meeting the preset condition is acquired, data cleaning, structuring and normalization can further be performed on it in terms of demographic information, past medical history and diagnosis, and data set indexes including but not limited to the following are extracted: sex, age at first diagnosis, time of admission, whether heart disease is present, whether hypertension is present, whether coronary heart disease is present, whether diabetes is present, T stage, N stage, M stage, hemoglobin, albumin and other indexes.

In the embodiment of the invention, in the process of building the data set for the survival prediction model, the data source can be cleaned, structured and normalized using natural language processing techniques, including entity recognition, regularization and the like. The resulting data set can include a primary field, a secondary field, an index name, an English field definition of the index, an output format, a unit set, a data type, a processing method and the like.

In natural language processing, text cleaning depends heavily on the task at hand, for example on whether units, currency amounts, numeric symbols and numbers need to be replaced. A regularization tool may be used to replace the corresponding content with standard content.
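As a small illustration of replacing content with a standard form, the sketch below normalizes unit spellings in free text. It assumes that the "regularization tool" refers to regular-expression-based replacement, which the text does not state explicitly; the unit table and function name are hypothetical.

```python
import re

# Hypothetical mapping from lowercase unit spellings to a standard spelling.
UNIT_ALIASES = {"g/l": "g/L", "g/dl": "g/dL"}

# A number, optional whitespace, then one of the known unit spellings.
_MEASUREMENT = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>g/l|g/dl)", re.IGNORECASE)


def normalize_measurement(text: str) -> str:
    """Rewrite each '<number><unit>' occurrence as '<number> <standard unit>'."""
    def replace(match: re.Match) -> str:
        unit = UNIT_ALIASES[match.group("unit").lower()]
        return f"{match.group('value')} {unit}"

    return _MEASUREMENT.sub(replace, text)
```

For instance, `normalize_measurement("Hemoglobin 135G/L, albumin 42 g/l")` rewrites both measurements with the unit spelled `g/L`.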

In particular, the primary field may be a primary data source form, such as a demographics form or an examination form. The secondary field may be a sub-directory, for example an upper gastrointestinal contrast examination within the examination form. The index may be one of the following set of indexes: gender, age, WHO pathological type, smoking history, drinking history, family history, T stage, N stage, hemoglobin, albumin. The index name may include an index definition. The English field definition of the index may refer to an English translation of the index. The output format may be character or numeric. The unit set applies to values that carry units. The processing method may be structuring and logical judgment processing.

In step S232, the second preset number of target historical disease features is selected from the first preset number of historical disease features.

In an embodiment of the present invention, the first predetermined number of historical disease characteristics may include sex, age, WHO type of pathology at the time of initial diagnosis, smoking history, drinking history, family history, T stage, N stage, hemoglobin and albumin values, etc. of the nasopharyngeal carcinoma patient.

In step S233, the survival prediction model is trained according to the second preset number of target historical disease features and their historical endpoint features.

For example, the second predetermined number may be 5, but the present invention is not limited thereto.

In the embodiment of the invention, real-world nasopharyngeal carcinoma data can be used for disease modeling so as to predict survival in nasopharyngeal carcinoma. The electronic medical record data of nasopharyngeal carcinoma patients can be structured and normalized with respect to the nasopharyngeal carcinoma indexes. These indexes can include, but are not limited to, gender, age, WHO pathology type, smoking history, drinking history, family history, TNM staging, hemoglobin, albumin, etc., and the endpoint features or indexes can include, but are not limited to, overall survival, metastasis-free survival, relapse-free survival, etc. The indexes can be screened by a multi-factor regression method for a significant influence on the endpoint indexes, and the best indexes thus screened out are used to construct models for the different survival endpoints.

Fig. 4 schematically shows a flowchart of one embodiment of step S232 in fig. 3.

In step S2321, it is determined whether the first preset number of historical disease features have a significant impact on the historical endpoint features through a hypothesis testing process using multi-factor analysis of variance.

In step S2322, the second preset number of target historical disease features that have a significant impact on the historical endpoint features are selected from the first preset number of historical disease features.

In the embodiment of the invention, multi-factor analysis of variance is first carried out against the endpoint indexes, including overall survival, metastasis-free survival and relapse-free survival. Indexes whose probability value (P) is less than 0.01, for which the null hypothesis can be rejected, are screened out as evaluation indexes, and the survival prediction model is trained using the indexes with P less than 0.01.

Here, starting from 10 indexes initially suspected of being related to nasopharyngeal carcinoma, multi-factor analysis of variance (also called "multi-way ANOVA") is used to judge, through a hypothesis testing process (i.e., P < 0.01), whether the factors have a significant influence on the dependent variable:

Multi-factor formula: dependent variable = factor 1 main effect + factor 2 main effect + … + factor 10 main effect + factor interaction effect 1 + factor interaction effect 2 + … + factor interaction effect m + random error

Here, taking multi-factor analysis of variance as an example, parameter estimation is performed using a generalized linear model with m = 10, and finally 5 indexes with a significant influence are identified for modeling.
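The multi-factor formula can also be written compactly. The symbols below are notation assumed here for illustration and do not appear in the original: y is the dependent variable (an endpoint index), a_k is the main effect of factor k, b_j is the j-th factor interaction effect, and ε is the random error.

```latex
y = \sum_{k=1}^{10} a_k + \sum_{j=1}^{m} b_j + \varepsilon
```

Each of the 10 main effects corresponds to one candidate index, and the hypothesis test for factor k asks whether a_k differs significantly from zero.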

Specifically, the dependent variable of the above multi-factor formula may be one of the above endpoint indexes, such as OS, and factors 1 to 10 may be, for example, sex, age, WHO pathological type, smoking history, drinking history, family history, TNM stage, hemoglobin and albumin, respectively. The P value is calculated as follows:

In general, let X denote the test statistic. When H0 is true, the value C of the statistic can be calculated from the sample data, and the P value can be obtained from the specific distribution of the test statistic X. Specifically:

The P value of a left-tailed test is the probability that the test statistic X is less than the sample statistic C: P = P{X < C};

The P value of a right-tailed test is the probability that the test statistic X is greater than the sample statistic C: P = P{X > C};

The P value of a two-sided test is 2 times the probability that the test statistic X falls within the tail region bounded by the sample statistic C: P = 2P{X > C} (when C is located at the right end of the distribution curve) or P = 2P{X < C} (when C is located at the left end of the distribution curve).

If X follows a normal distribution or a t distribution, the distribution curve is symmetric about the vertical axis, so the P value can be expressed as P = P{|X| > |C|}.
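The three P value definitions can be illustrated numerically. The sketch below assumes a test statistic that follows the standard normal distribution under H0, purely for illustration; an analysis of variance would normally use an F distribution instead, which is not covered here.

```python
from statistics import NormalDist
from typing import Dict


def p_values(c: float, dist: NormalDist = NormalDist()) -> Dict[str, float]:
    """Left-tailed, right-tailed and two-sided P values for an observed statistic c."""
    left = dist.cdf(c)            # P{X < C}
    right = 1.0 - dist.cdf(c)     # P{X > C}
    two_sided = 2.0 * min(left, right)  # twice the smaller tail
    return {"left": left, "right": right, "two_sided": two_sided}
```

For a symmetric distribution the two-sided value equals P{|X| > |C|}; for example, an observed value of 1.96 gives a two-sided P value of roughly 0.05.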

After the P value is calculated, a test can be concluded by comparing a given significance level α with the P value:

if α > P value, the original hypothesis is rejected at a significance level of α.

If α ≦ P value, the original hypothesis is accepted at the level of significance α.

In practice, when α is equal to P, i.e. the value C of the statistic is just equal to the threshold value, the sample size is increased and the sampling test is performed again for the sake of caution.

In this embodiment the threshold is 0.01: when the P value of an index is less than 0.01, the original hypothesis is rejected at significance level α, and the index is considered to have a significant influence.
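The screening rule, keeping an index only when its P value is below the significance level, can be sketched as follows. The function name is hypothetical, and any P values passed to it in an example are made up for illustration and are not results from the embodiments.

```python
from typing import Dict, List


def select_significant(p_by_feature: Dict[str, float], alpha: float = 0.01) -> List[str]:
    """Return the names of features whose P value is strictly below alpha."""
    return [name for name, p in p_by_feature.items() if p < alpha]
```

A feature whose P value equals the threshold exactly is not selected, in line with the cautious handling of that boundary case described above.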

In the embodiment of the invention, the 5 indexes selected from the 10 indexes are those that are statistically positive. The 5 disease features of each patient, extracted from the historical electronic medical records, are used as the input of the model, and the patient's endpoint index obtained from the statistics is used as the label value. That is, multiple regression is performed on the 10 indexes, and the statistically significant indexes are extracted to build the survival prediction model.

FIG. 5 is a schematic illustration of a nomogram (alignment chart) relating the prognosis of nasopharyngeal carcinoma patients to the disease features.

Using the 5 screened indexes, a radial basis kernel function is selected, and the model parameters are determined by combining grid search with cross-validation. As shown in fig. 5, based on the multi-factor regression analysis, the 5 prediction indexes are integrated, and scaled line segments are then drawn on the same plane in a certain proportion, so as to express the relationships among the variables in the prediction model.
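The combination of a radial basis kernel, grid search and cross-validation can be sketched in a few lines. The text does not name the model that uses the kernel, so the sketch substitutes a simple kernel-weighted vote classifier as a stand-in; it is not the model of the embodiments, and all function names are hypothetical. Only the kernel width `gamma` is searched.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, z: Vector, gamma: float) -> float:
    """Radial basis kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def predict(train_x: List[Vector], train_y: List[int], x: Vector, gamma: float) -> int:
    """Kernel-weighted vote over the training labels (stand-in model)."""
    weights = [rbf_kernel(x, z, gamma) for z in train_x]
    positive = sum(w for w, y in zip(weights, train_y) if y == 1)
    return 1 if positive * 2 >= sum(weights) else 0


def cross_val_accuracy(xs: List[Vector], ys: List[int], gamma: float, k: int = 5) -> float:
    """Mean accuracy over k folds; fold i holds samples i, i + k, i + 2k, ..."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        hits = sum(predict(train_x, train_y, xs[i], gamma) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def grid_search(xs: List[Vector], ys: List[int], gammas: Sequence[float], k: int = 5) -> float:
    """Return the gamma from the grid with the best cross-validated accuracy."""
    return max(gammas, key=lambda g: cross_val_accuracy(xs, ys, g, k))
```

In practice the grid would typically also cover the other hyperparameters of whichever kernel model is chosen, and each candidate would be scored by the same cross-validation loop.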

Fig. 6 schematically shows a schematic diagram of the use of the nasopharyngeal carcinoma survival prediction model.

As shown in fig. 6, taking OS as an example, a newly diagnosed nasopharyngeal carcinoma patient is identified on a big data intelligent database platform.

The computer data processing method provided by the embodiment of the invention can be applied specifically to predicting the endpoint features of nasopharyngeal carcinoma patients. A nasopharyngeal carcinoma survival prediction model is constructed: real-world historical medical record data is cleaned and processed, and the model is then trained with a machine learning method so that it automatically learns the complex relationship between the disease features and the endpoint features of nasopharyngeal carcinoma patients. For a patient who needs a survival prediction, the model can give an evaluation of the survival rate at different years; for example, it can automatically give predictions of overall survival, metastasis-free survival and recurrence-free survival for nasopharyngeal carcinoma with different disease features. In other words, the prognostic endpoint can be predicted from a variety of nasopharyngeal-carcinoma-related factors of the patient. This can inform doctors' diagnosis and treatment, support individualized diagnosis and treatment of nasopharyngeal carcinoma, and allow different treatment schemes to be recommended according to different survival rates, thereby improving the diagnosis and treatment level and work efficiency of doctors as well as the treatment outcome of patients.

Embodiments of the apparatus of the present invention are described below, which can be used to perform the above-described computer data processing method of the present invention.

Fig. 7 schematically shows a block diagram of a computer data processing apparatus according to an embodiment of the present invention.

As shown in fig. 7, the computer data processing apparatus 700 according to the embodiment of the present invention may include a current disease characteristic obtaining module 710, a target disease characteristic obtaining module 720, and an endpoint characteristic predicting module 730.

The current disease characteristic obtaining module 710 may be configured to obtain a first preset number of current disease characteristics of the target object related to the target disease.

The target disease feature acquisition module 720 may be configured to select a second preset number of target current disease features from the first preset number of current disease features.

The endpoint feature prediction module 730 may be configured to obtain a current endpoint feature of the target object based on the target current disease feature and using a trained machine learning-based survival prediction model.
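The cooperation of modules 710, 720 and 730 can be sketched as a small pipeline. This is only an illustrative arrangement under assumptions: the class names, the dictionary-based feature representation, the numeric encoding of features, and `toy_model` (a placeholder, not a trained survival model) are all hypothetical and do not come from the original text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


@dataclass
class CurrentDiseaseFeatureModule:
    """Module 710: obtains the first preset number of current disease features."""
    feature_names: Sequence[str]

    def acquire(self, record: Features) -> Features:
        missing = [name for name in self.feature_names if name not in record]
        if missing:
            raise KeyError(f"record lacks features: {missing}")
        return {name: record[name] for name in self.feature_names}


@dataclass
class TargetFeatureModule:
    """Module 720: keeps the second preset number of target features."""
    selected_names: Sequence[str]

    def select(self, features: Features) -> Features:
        return {name: features[name] for name in self.selected_names}


@dataclass
class EndpointPredictionModule:
    """Module 730: applies an already trained survival prediction model."""
    model: Callable[[List[float]], float]

    def predict(self, target_features: Features) -> float:
        return self.model(list(target_features.values()))


@dataclass
class DataProcessingApparatus:
    """Apparatus 700: chains modules 710, 720 and 730."""
    acquirer: CurrentDiseaseFeatureModule
    selector: TargetFeatureModule
    predictor: EndpointPredictionModule

    def run(self, record: Features) -> float:
        current = self.acquirer.acquire(record)
        target = self.selector.select(current)
        return self.predictor.predict(target)


def toy_model(values: List[float]) -> float:
    """Placeholder for a trained model; NOT a real survival predictor."""
    return max(0.0, 1.0 - 0.1 * sum(values))
```

Keeping the model behind a plain callable means the prediction module does not depend on how the survival prediction model was trained.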

In an exemplary embodiment, the computer data processing apparatus 700 may further include: a sample acquisition module, which may be configured to acquire historical medical record data meeting preset conditions as samples; a data set partitioning module, which may be configured to randomly allocate the samples into a training set and a test set; a model training module, which may be configured to train the survival prediction model using the training set; and a model validation module, which may be configured to validate the survival prediction model using the test set.
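The random allocation performed by the data set partitioning module might look like the sketch below. The function name `split_samples`, the default `test_fraction` of 0.3, and the use of a fixed seed are assumptions made for illustration; the text here does not state a split ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], test_fraction: float = 0.3,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly allocate samples into (training set, test set).

    test_fraction and seed are hypothetical parameters; a fixed seed
    only makes the random allocation reproducible.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample in the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Every sample ends up in exactly one of the two sets, so the test set stays unseen during training and can be used for validation.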

In an exemplary embodiment, the target disease is nasopharyngeal carcinoma, and the model training module may include: a feature extraction unit, which may be configured to extract the first preset number of historical disease features and the historical endpoint features associated with nasopharyngeal carcinoma from the training set; a feature selection unit, which may be configured to select the second preset number of target historical disease features from the first preset number of historical disease features; and a model training unit, which may be configured to train the survival prediction model according to the second preset number of target historical disease features and their historical endpoint features.

In an exemplary embodiment, the feature selection unit may include: a determining subunit, which may be configured to determine whether the first preset number of historical disease features have a significant impact on the historical endpoint features through a hypothesis testing process using multi-factor analysis of variance; a feature selection subunit, which may be configured to select the second preset number of target historical disease features that have a significant impact on the historical endpoint features from the first preset number of historical disease features.
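The idea of screening features by a variance-based hypothesis test can be illustrated with a deliberately simplified sketch. It computes a one-way ANOVA F statistic for one feature at a time, which is a simplification of the multi-factor analysis of variance named in the text. It also assumes a plain numeric outcome per sample (for example, a survival time without censoring) and expects the caller to supply the critical F value for the chosen significance level. The function names and the `critical_f` mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def f_statistic(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    if ms_within == 0.0:
        return float("inf") if ms_between > 0.0 else 0.0
    return ms_between / ms_within


def screen_features(rows: Sequence[Dict[str, Hashable]],
                    outcomes: Sequence[float],
                    critical_f: Dict[str, float]) -> List[str]:
    """Keep features whose F statistic exceeds the caller-supplied critical value.

    critical_f maps a feature name to the F-table value for its degrees of
    freedom and chosen alpha (an assumption of this sketch).
    """
    selected = []
    for name, threshold in critical_f.items():
        groups = defaultdict(list)
        for row, outcome in zip(rows, outcomes):
            groups[row[name]].append(outcome)  # group outcomes by feature level
        if f_statistic(list(groups.values())) > threshold:
            selected.append(name)
    return selected
```

For a two-level feature and 10 samples the degrees of freedom are (1, 8), for which the tabulated critical value at a 0.05 significance level is approximately 5.32. A full multi-factor analysis would additionally account for the other features when testing each one.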

In an exemplary embodiment, the first preset number of historical disease features may include the nasopharyngeal carcinoma patient's gender, age, WHO type at initial diagnosis, smoking history, drinking history, family history, T stage, N stage, hemoglobin value, and albumin value.
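The ten listed features can be held in a typed record such as the one sketched here. The field names, the Python types, and the choice of strings for staged or categorical values are assumptions for illustration; the text does not specify encodings or units for hemoglobin and albumin.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HistoricalDiseaseFeatures:
    """One sample's disease features; all types are hypothetical encodings."""
    gender: str
    age: int
    who_type_at_initial_diagnosis: str
    smoking_history: bool
    drinking_history: bool
    family_history: bool
    t_stage: str
    n_stage: str
    hemoglobin: float  # unit not specified in the text
    albumin: float     # unit not specified in the text


# Names of the first preset number of features, in declaration order.
FEATURE_NAMES = [f.name for f in fields(HistoricalDiseaseFeatures)]
```

A fixed record type makes a missing or misspelled feature fail at construction time rather than during model training.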

In an exemplary embodiment, the preset conditions may be that the historical medical record data has a discharge diagnosis of nasopharyngeal carcinoma, shows no metastasis at initial diagnosis, and includes follow-up data.

In an exemplary embodiment, the follow-up data may include time to death, time to metastasis, site of metastasis, time to relapse, and site of relapse; the historical endpoint characteristics may include overall survival, metastasis-free survival, and relapse-free survival. Wherein the feature extraction unit may include: an endpoint feature obtaining subunit may be configured to obtain the historical endpoint feature from the follow-up data.
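Deriving the historical endpoint features from follow-up data can be sketched as below. The sketch rests on several assumptions that are not stated in the text: time is measured in days from a diagnosis date, a patient without a recorded event is censored at the last follow-up date, and each endpoint counts only its own event (death, metastasis, or relapse). The `FollowUp` record and function names are hypothetical, and the metastasis and relapse sites are not used.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass
class FollowUp:
    """Hypothetical follow-up record for one patient."""
    diagnosis_date: date                   # assumed start of observation
    last_follow_up_date: date              # assumed censoring date
    death_date: Optional[date] = None
    metastasis_date: Optional[date] = None
    relapse_date: Optional[date] = None


def _time_to_event(start: date, event: Optional[date],
                   last: date) -> Tuple[int, bool]:
    """Return (days observed, whether the event was observed)."""
    if event is not None:
        return (event - start).days, True
    return (last - start).days, False


def endpoint_features(fu: FollowUp) -> Dict[str, Tuple[int, bool]]:
    """Map follow-up data to the three endpoint features named in the text."""
    return {
        "overall_survival": _time_to_event(
            fu.diagnosis_date, fu.death_date, fu.last_follow_up_date),
        "metastasis_free_survival": _time_to_event(
            fu.diagnosis_date, fu.metastasis_date, fu.last_follow_up_date),
        "relapse_free_survival": _time_to_event(
            fu.diagnosis_date, fu.relapse_date, fu.last_follow_up_date),
    }
```

Returning the observed duration together with an event flag keeps censored patients usable for survival modeling instead of discarding them.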

Because the functional modules of the computer data processing apparatus of the exemplary embodiments of the present invention correspond to the steps of the above-described exemplary embodiments of the computer data processing method, for details that are not disclosed in the apparatus embodiments, reference is made to the above-described embodiments of the computer data processing method of the present invention.

Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system 800 of the electronic device shown in fig. 8 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for system operation are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.

In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present application when executed by the Central Processing Unit (CPU) 801.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules and/or units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described modules and/or units may also be disposed in a processor. Wherein the names of such modules and/or units do not in some way constitute a limitation on the modules and/or units themselves.

As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the computer data processing method as described in the above embodiments.

For example, the electronic device may implement the following steps shown in fig. 1: step S110, acquiring a first preset number of current disease features of a target object related to a target disease; step S120, selecting a second preset number of target current disease features from the first preset number of current disease features; and step S130, obtaining the current endpoint feature of the target object according to the target current disease features by using a trained survival prediction model based on machine learning.

As another example, the electronic device may implement the steps shown in fig. 2 to 4.

It should be noted that although several modules and/or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functions of two or more of the modules and/or units described above may be embodied in one module and/or unit according to embodiments of the invention. Conversely, the features and functions of one module and/or unit described above may be further divided so as to be embodied by a plurality of modules and/or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiment of the present invention.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A computer data processing method, comprising:

acquiring a first preset number of current disease characteristics of a target object related to a target disease;

selecting a second preset number of target current disease features from the first preset number of current disease features;

and obtaining the current endpoint feature of the target object according to the target current disease features and by utilizing a trained survival prediction model based on machine learning.

2. The method of claim 1, further comprising:

acquiring historical medical record data meeting preset conditions as a sample;

randomly distributing the samples into a training set and a testing set;

training the survival prediction model by using the training set;

validating the survival prediction model using the test set.

3. The method of claim 2, wherein the target disease is nasopharyngeal carcinoma; wherein training the survival prediction model using the training set comprises:

extracting the first preset number of historical disease features and historical endpoint features associated with nasopharyngeal carcinoma from the training set;

selecting the second preset number of target historical disease features from the first preset number of historical disease features;

and training the survival prediction model according to the second preset number of target historical disease features and the historical end point features thereof.

4. The method of claim 3, wherein selecting the second preset number of target historical disease features from the first preset number of historical disease features comprises:

judging whether the first preset number of historical disease features have significant influence on the historical endpoint features through a hypothesis testing process by utilizing multi-factor analysis of variance;

selecting the second preset number of target historical disease features that have a significant impact on the historical endpoint features from the first preset number of historical disease features.

5. The method of claim 3 or 4, wherein the first preset number of historical disease features comprises the nasopharyngeal carcinoma patient's gender, age, WHO pathology type at initial diagnosis, smoking history, drinking history, family history, T stage, N stage, hemoglobin value, and albumin value.

6. The method of claim 3, wherein the preset conditions are that the historical medical record data has a discharge diagnosis of nasopharyngeal carcinoma, shows no metastasis at initial diagnosis, and includes follow-up data.

7. The method of claim 6, wherein the follow-up data comprises time to death, time to metastasis, site of metastasis, time to relapse, and site of relapse; the historical endpoint characteristics include overall survival, metastasis-free survival, and relapse-free survival;

extracting historical endpoint features associated with nasopharyngeal carcinoma from the training set, comprising:

and obtaining the historical endpoint characteristics according to the follow-up data.

8. A computer data processing apparatus, comprising:

the current disease characteristic acquisition module is configured to acquire a first preset number of current disease characteristics of the target object related to the target disease;

a target disease feature acquisition module configured to select a second preset number of target current disease features from the first preset number of current disease features;

and an endpoint feature prediction module configured to obtain the current endpoint feature of the target object according to the target current disease features and by utilizing a trained survival prediction model based on machine learning.

9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.