CN114021732B - Proportional risk regression model training method, device and system and storage medium - Google Patents

Info

Publication number
CN114021732B
Authority
CN
China
Prior art keywords
local
global
data provider
regression model
survival
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111156675.1A
Other languages
Chinese (zh)
Other versions
CN114021732A (en)
Inventor
徐松
刘兵
包仁义
张凯
蒋锦鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yidu Cloud Beijing Technology Co Ltd
Original Assignee
Yidu Cloud Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yidu Cloud Beijing Technology Co Ltd filed Critical Yidu Cloud Beijing Technology Co Ltd
Priority to CN202111156675.1A priority Critical patent/CN114021732B/en
Publication of CN114021732A publication Critical patent/CN114021732A/en
Application granted granted Critical
Publication of CN114021732B publication Critical patent/CN114021732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Epidemiology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Pathology (AREA)

Abstract

The embodiments of the application disclose a proportional risk regression model training method, device, and system, and a computer-readable storage medium. The method comprises: determining a global survival time increasing sequence according to the maximum survival time and the minimum survival time provided by each data provider; then, each data provider uses the global survival time increasing sequence as an extended time dimension to generate a survival analysis intermediate result for each survival time in the sequence; and then performing secure multi-party computation on the survival analysis intermediate results provided by the data providers to determine a global survival analysis intermediate result. Thus, no data provider needs to expose all of its survival times; moreover, every data provider can compute its intermediate results on the same time dimension (the global survival time increasing sequence) without multiple rounds of communication, which greatly saves communication resources.

Description

Proportional risk regression model training method, device and system and storage medium
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a proportional risk regression model training method, device, and system, and a computer-readable storage medium.
Background
In the medical field, privacy protection requirements for medical data are high, yet the medical data owned by any single hospital is very limited. As a result, the medical data available for training machine learning and artificial intelligence models is limited in quantity and poor in quality, which restricts the adoption of artificial intelligence technology in the medical field.
Horizontal federated learning can address these problems. A server sets an initial model and initial model parameters; each data provider downloads the model from the server, trains it on its private data, and returns the parameters to be updated to the server; the server aggregates the model parameters returned by the data providers, updates the model, and issues the latest model to each data provider. This iteration continues until the model reaches the expected accuracy.
However, when the existing horizontal federated learning method is used to analyze multi-party medical data, for example in survival analysis, frequent data exchange and secure multi-party computation between a master controller and a plurality of data providers are often required. Particularly when network delay is high and the number of data providers is large, this leads to long running time, slow response, and heavy consumption of computing resources.
Disclosure of Invention
In order to solve the above problems, embodiments of the present application provide a method, an apparatus, a system, and a computer-readable storage medium for training a proportional risk regression model.
According to a first aspect of the embodiments of the present application, there is provided a proportional risk regression model training method applied to a master controller, the method including: sending a first proportional risk regression model to each data provider, wherein the proportional risk regression model comprises at least one explanatory variable; determining a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider; sending the global survival time increasing sequence to each data provider, so that each data provider performs survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in its local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result; and updating the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
According to an embodiment of the present application, determining the global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider includes: determining a global maximum survival time and a global minimum survival time according to the local maximum survival times and local minimum survival times provided by the data providers; and determining the global survival time increasing sequence according to a survival time step, the global maximum survival time, and the global minimum survival time.
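The two steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `global_survival_time_sequence`, the representation of each provider's report as a `(local_min, local_max)` pair, and the choice to extend the last grid point so that it covers the global maximum are all assumptions. The values are handled in plaintext here; any encryption of the reported extremes is omitted.

```python
import math


def global_survival_time_sequence(local_ranges, step):
    """Build an increasing time grid from per-provider (local_min, local_max) pairs.

    Hypothetical helper: the grid starts at the global minimum and advances by
    `step` until it reaches or passes the global maximum.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    global_min = min(lo for lo, _ in local_ranges)
    global_max = max(hi for _, hi in local_ranges)
    n_steps = math.ceil((global_max - global_min) / step)
    return [global_min + i * step for i in range(n_steps + 1)]


# Two providers report only their extreme survival times (e.g. in months).
grid = global_survival_time_sequence([(3, 10), (1, 7)], step=3)
```

Only two numbers per provider leave the site, which matches the stated goal of not exposing every survival time.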
According to an embodiment of the present application, updating the parameters in the first proportional risk regression model according to the global survival analysis intermediate result includes: calculating the gradient of the model loss function with respect to each explanatory variable according to the global survival analysis intermediate result; and updating the parameters in the first proportional risk regression model according to the gradient so that the value of the model loss function continues to converge.
According to an embodiment of the present application, the local survival analysis intermediate result is an encrypted value, and accordingly, performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain the global survival analysis intermediate result includes: decrypting the local survival analysis intermediate result provided by each data provider to obtain a decrypted local survival analysis intermediate result; and performing secure multi-party computation according to the decrypted local survival analysis intermediate results to obtain the global survival analysis intermediate result.
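The text does not name a specific encryption scheme or secure computation protocol. Purely as an illustration of how several parties can obtain a sum without revealing their individual inputs, the sketch below uses additive secret sharing, which is a stand-in technique chosen here and not something the source confirms. The names `MODULUS`, `make_shares`, and `secure_sum` are hypothetical, and the inputs are assumed to be non-negative integers whose total is smaller than the modulus.

```python
import secrets

# Hypothetical modulus; inputs and their total must stay below it.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split `value` into n_parties random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate additive secret sharing among the parties in a single process.

    Party i splits its value into shares and hands share j to party j; each
    party publishes only the sum of the shares it received.
    """
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS


# Example: three providers each hold a local death count for one time point.
total_deaths = secure_sum([3, 5, 7])
```

Each published partial sum is uniformly random on its own, so only the total is learned.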
According to an embodiment of the present application, the method further includes: performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain a global explanatory variable reference value; and sending the global explanatory variable reference value to each data provider so that each data provider can standardize the sample values of the explanatory variable in its local medical data.
According to an embodiment of the present application, the local explanatory variable reference value is an encrypted value, and accordingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value includes: decrypting the local explanatory variable reference value provided by each data provider to obtain a decrypted local explanatory variable reference value; and performing secure multi-party computation according to the decrypted local explanatory variable reference values to obtain the global explanatory variable reference value.
According to an embodiment of the present application, the standardization is z-score standardization, and accordingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value includes: calculating a global explanatory variable mean according to the number of local samples and the sum of the explanatory variable values of the local samples provided by each data provider; and sending the global explanatory variable mean to each data provider so that each data provider can calculate its local explanatory variable sample variance. Accordingly, sending the global explanatory variable reference value to each data provider so that each data provider standardizes the sample values of the explanatory variable in its local medical data includes: performing secure multi-party computation according to the local explanatory variable sample variances provided by the data providers to obtain a global explanatory variable sample variance and a global explanatory variable standard deviation; and returning the global explanatory variable standard deviation to each data provider so that each data provider can z-score standardize the sample values of the explanatory variable in its local medical data.
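The two-round exchange for z-score standardization can be sketched as below. This is an illustrative reading under stated assumptions: each provider is assumed to report its sum of squared deviations from the global mean (rather than an already-divided variance), the master controller is assumed to divide by N - 1, and the aggregation is done in plaintext with the secure computation step omitted. All function names are hypothetical.

```python
import math


def local_count_and_sum(values):
    """Round 1, provider side: report the sample count and the sum of values."""
    return len(values), sum(values)


def global_mean(local_stats):
    """Round 1, master side: combine (count, sum) pairs into (total count, mean)."""
    total_n = sum(n for n, _ in local_stats)
    total_sum = sum(s for _, s in local_stats)
    return total_n, total_sum / total_n


def local_squared_deviation(values, mean):
    """Round 2, provider side: squared deviations from the *global* mean."""
    return sum((v - mean) ** 2 for v in values)


def global_std(local_sq_devs, total_n):
    """Round 2, master side: global sample standard deviation (N - 1 assumed)."""
    return math.sqrt(sum(local_sq_devs) / (total_n - 1))


def z_score(values, mean, std):
    """Provider side: standardize local values with the global mean and std."""
    return [(v - mean) / std for v in values]


provider_a = [1.0, 2.0, 3.0]
provider_b = [4.0, 5.0]
n, mean = global_mean([local_count_and_sum(provider_a),
                       local_count_and_sum(provider_b)])
std = global_std([local_squared_deviation(provider_a, mean),
                  local_squared_deviation(provider_b, mean)], n)
standardized_b = z_score(provider_b, mean, std)
```

Because the deviations are taken from the global mean, the result equals the standard deviation that would be computed on the pooled data, without pooling it.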
According to an embodiment of the present application, the standardization is maximum-minimum standardization, and accordingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value includes: calculating a global explanatory variable maximum and a global explanatory variable minimum according to the local explanatory variable maximum and the local explanatory variable minimum provided by each data provider. Accordingly, sending the global explanatory variable reference value to each data provider so that each data provider standardizes the sample values of the explanatory variable in its local medical data includes: sending the global explanatory variable maximum and minimum to each data provider so that each data provider can apply maximum-minimum standardization to the sample values of the explanatory variable in its local medical data.
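A minimal sketch of the maximum-minimum variant follows. The function names and the handling of a zero range (all values mapped to 0.0) are assumptions made for the example; the aggregation is shown in plaintext.

```python
def global_min_max(local_extremes):
    """Master side: combine per-provider (local_min, local_max) pairs."""
    g_min = min(lo for lo, _ in local_extremes)
    g_max = max(hi for _, hi in local_extremes)
    return g_min, g_max


def min_max_standardize(values, g_min, g_max):
    """Provider side: scale local values into [0, 1] using the global extremes."""
    value_range = g_max - g_min
    if value_range == 0:
        # Assumed convention for a constant column.
        return [0.0 for _ in values]
    return [(v - g_min) / value_range for v in values]


g_min, g_max = global_min_max([(2.0, 8.0), (0.0, 5.0)])
scaled = min_max_standardize([2.0, 8.0], g_min, g_max)
```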
According to a second aspect of the embodiments of the present application, there is provided a proportional risk regression model training method applied to a data provider, the method including: receiving a first proportional risk regression model sent by a master controller, wherein the proportional risk regression model comprises at least one explanatory variable; providing a local maximum survival time and a local minimum survival time to the master controller; receiving a global survival time increasing sequence sent by the master controller; performing survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; and providing the local survival analysis intermediate result to the master controller.
According to an embodiment of the present application, before providing the local maximum survival time and the local minimum survival time to the master controller, the method further includes: encrypting the local maximum survival time and the local minimum survival time to obtain an encrypted local maximum survival time and an encrypted local minimum survival time; correspondingly, providing the local maximum survival time and the local minimum survival time to the master controller includes: providing the encrypted local maximum survival time and the encrypted local minimum survival time to the master controller.
According to an embodiment of the present application, before providing the local survival analysis intermediate result to the master controller, the method further includes: encrypting the local survival analysis intermediate result to obtain an encrypted local survival analysis intermediate result; correspondingly, providing the local survival analysis intermediate result to the master controller includes: providing the encrypted local survival analysis intermediate result to the master controller.
According to an embodiment of the present application, the method further includes: providing the reference values of the explanatory variable in the local medical data to the master controller so that the master controller can calculate a global explanatory variable reference value; and standardizing the sample values of the explanatory variable in the local medical data according to the global explanatory variable reference value provided by the master controller.
According to an embodiment of the present application, the standardization is z-score standardization, and accordingly, standardizing the sample values of the explanatory variable in the local medical data according to the global explanatory variable reference value provided by the master controller includes: calculating a local explanatory variable sample variance according to the global explanatory variable mean sent by the master controller; providing the local explanatory variable sample variance to the master controller; and z-score standardizing the sample values of the explanatory variable in the local medical data according to the global explanatory variable standard deviation sent by the master controller.
According to an embodiment of the present application, the standardization is maximum-minimum standardization, and accordingly, standardizing the sample values of the explanatory variable in the local medical data according to the global explanatory variable reference value provided by the master controller includes: applying maximum-minimum standardization to the sample values of the explanatory variable in the local medical data according to the global explanatory variable maximum and minimum sent by the master controller.
According to a third aspect of the embodiments of the present application, there is further provided a proportional risk regression model training method, including: the master controller sends a first proportional risk regression model to each data provider, the proportional risk regression model comprising at least one explanatory variable; each data provider provides its local maximum survival time, its local minimum survival time, the number of local samples, and the sum of the explanatory variable values of the local samples to the master controller; the master controller determines a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider, and sends the global survival time increasing sequence to each data provider; each data provider performs survival analysis according to the global survival time increasing sequence sent by the master controller, the first proportional risk regression model, and the sample values of the explanatory variable in its local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence, and provides the local survival analysis intermediate result to the master controller; and the master controller performs secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result, and updates the parameters in the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
According to a fourth aspect of the embodiments of the present application, there is provided a proportional risk regression model training device applied to a master controller, the training device including: a model distribution module, used for sending a first proportional risk regression model to each data provider, wherein the model parameters of the proportional risk regression model comprise at least one explanatory variable; a global survival time increasing sequence determining module, used for determining a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider; a global survival time increasing sequence issuing module, used for sending the global survival time increasing sequence to each data provider so that each data provider can perform survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in its local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; a global survival analysis intermediate result calculation module, used for performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result; and a model updating module, used for updating the first proportional risk regression model according to the global survival analysis intermediate result so as to determine a second proportional risk regression model.
According to a fifth aspect of the embodiments of the present application, there is provided a proportional risk regression model training device applied to a data provider, the training device including: a model receiving module, used for receiving a first proportional risk regression model sent by a master controller, the proportional risk regression model comprising at least one explanatory variable; a local survival time providing module, used for providing a local maximum survival time and a local minimum survival time to the master controller; a global survival time data receiving module, used for receiving a global survival time increasing sequence sent by the master controller; a local survival analysis module, used for performing survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; and a survival analysis intermediate result providing module, used for providing the local survival analysis intermediate result to the master controller.
According to a sixth aspect of embodiments herein, there is provided a computer-readable storage medium comprising a set of computer-executable instructions, which when executed, perform any one of the above-mentioned proportional-risk regression model training methods.
The embodiments of the application provide a proportional risk regression model training method that uses federated learning to train a distributed proportional risk regression model on multi-party medical data. Specifically, in the training process: first, a global survival time increasing sequence is determined according to the maximum survival time and the minimum survival time provided by each data provider; then, each data provider uses the global survival time increasing sequence as an extended time dimension to generate a survival analysis intermediate result for each survival time in the sequence; and then secure multi-party computation is performed on the survival analysis intermediate results provided by the data providers to determine a global survival analysis intermediate result.
In this federated learning process, each data provider only needs to provide its maximum and minimum survival times and does not expose all of its survival times, which further protects the privacy of each data provider's private data. Moreover, because the global survival time increasing sequence serves as the extended time dimension, every data provider can compute its intermediate results on the same time dimension without multiple rounds of communication, which greatly saves communication bandwidth, reduces the number of computations, and shortens the response time.
It is to be understood that the teachings of the embodiments of the present application do not necessarily require that all of the above advantages be achieved, but that certain technical solutions may achieve certain technical effects, and that other implementations of the embodiments of the present application may also achieve other advantages not mentioned above.
Drawings
The foregoing and other objects, features and advantages of exemplary embodiments of the present application will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic diagram illustrating a system structure and an application scenario of a proportional risk regression model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an implementation of the proportional risk regression model training method at the master controller according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating an implementation of the proportional risk regression model training method at the data provider according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an interaction flow between a master controller and a data provider in a proportional risk regression model training method according to another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the structure of a proportional risk regression model training device at the master controller according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating the structure of a proportional risk regression model training device at the data provider according to an embodiment of the present application.
Detailed Description
The principles and spirit of embodiments of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and to implement the embodiments of the present application, and are not intended to limit the scope of the embodiments of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the embodiments to those skilled in the art.
In the description of the following examples:
Federated learning refers to building a joint learning model based on distributed data. During model training, the parties may exchange model-related information, but raw data is not exchanged directly, which effectively protects the privacy of users and data.
Horizontal federated learning, also known as sample-partitioned federated learning, refers to the case where the participants' data sets share the same feature space but have different sample spaces.
Vertical federated learning, also known as feature-partitioned federated learning, refers to the case where the participants' data sets share the same sample space but have different feature spaces.
Survival analysis refers to making statistical inferences about one or more non-negative random variables in order to study survival phenomena, response times, and their statistical regularities. It is a statistical method that considers both the outcome and the survival time, and it can make full use of the incomplete information provided by censored data to analyze the main factors influencing survival time.
The initiating event (start event) refers to an event that marks the beginning of the survival time.
A failure event refers to the death observed in some subjects during the follow-up of a survival analysis, from which an exact survival time can be obtained; it is also called a death event or an end-point event.
The survival time refers to the time from the occurrence of a start event to the occurrence of an end-point event.
Explanatory variables, also called independent variables, act on the dependent variable in the model according to a certain rule.
The result variable refers to the variable whose change is directly caused by changes in the explanatory variables.
A semi-parametric model comprises both a parametric part and a non-parametric part.
The proportional risk regression model (the Cox proportional hazards model) is a semi-parametric regression model that can describe the influence of a plurality of time-invariant features on the mortality at a given moment, and is commonly used in survival analysis.
z-score standardization rescales data to have mean 0 and standard deviation 1, the location and scale of the standard normal distribution.
Maximum-minimum standardization rescales data using the maximum and minimum of the data column so that the standardized values lie between 0 and 1; each value has the column minimum subtracted from it and is then divided by the range (maximum minus minimum).
The gradient of a function at a point is the vector along whose direction the directional derivative at that point attains its maximum; that is, at that point the function changes fastest along that direction, and the rate of change is greatest.
The Hessian matrix is a square matrix formed by the second-order partial derivatives of a multivariate function and describes the local curvature of the function.
The technical solutions of the embodiments of the present application are further described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows a distributed proportional risk regression model training system applying the proportional risk regression model training method according to the embodiment of the present application.
In the embodiment of the present application, the distributed proportional risk regression model is a horizontal federated proportional risk regression model that uses at least one explanatory variable, such as the patient's own genes, medication, or treatment modality, as a model parameter. The inputs received are primarily patient data with values for the explanatory variables, and the outputs are whether a start event and a failure event occurred, and the patient's survival time.
The training system of the distributed proportional risk regression model is also distributed, and includes a master controller 101 and a plurality of data providers (e.g., data provider 102 and data provider 103; only two data providers are shown for brevity, but in practice there are two or more data providers).
The master controller 101 is mainly responsible for coordinating the data providers in federated learning, so as to implement distributed training of the proportional risk regression model 1011. Each data provider, for example the data provider 102, does not send medical data directly to the master controller 101 or to other data providers (for example, the data provider 103). Instead, it downloads the proportional risk regression model 1011 from the master controller 101 to obtain the local proportional risk regression model 1021, and inputs the local medical data 1022 into the local proportional risk regression model 1021 to obtain the local survival time. The data provider then computes statistics on the local survival time to obtain the local survival analysis intermediate result, such as the number of dead individuals and the sum of the explanatory variables corresponding to the at-risk individuals, and sends this intermediate result to the master controller 101. The master controller 101 performs multi-party security calculation on the local survival analysis intermediate results provided by the data providers to obtain the global survival analysis intermediate result. From the global survival analysis intermediate result, the master controller 101 calculates the gradient and the Hessian matrix of the model loss function with respect to each explanatory variable, updates the model parameters using Newton's method, and issues the adjusted proportional risk regression model 1011 and its model parameters to each data provider, which performs the next round of model training with the updated model. This is iterated until the proportional risk regression model 1011 converges, so as to achieve higher model accuracy.
In practical applications, besides the above model optimization method, any other suitable optimization method can be used, such as a gradient descent method, a quasi-newton method, or a conjugate gradient method.
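As a rough illustration of the Newton update, the following sketch fits a Cox-type partial likelihood with a single explanatory variable. It is a centralized simplification and not the federated procedure itself: the function name `cox_newton_step`, the toy data, and the absence of tie handling are all assumptions made for the example.

```python
import math

def cox_newton_step(beta, times, events, x):
    """One Newton update of a single coefficient of a Cox partial likelihood.

    times  : observed survival times
    events : 1 if the failure event was observed, 0 if censored
    x      : value of the (single) explanatory variable per individual
    Returns the updated coefficient and the gradient at the old coefficient.
    """
    grad, hess = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored individuals contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # individual j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        grad += x[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return beta - grad / hess, grad

# Hypothetical toy data: six individuals, all failure events observed.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
x = [1, 0, 1, 0, 0, 1]

beta, grad = 0.0, 1.0
for _ in range(30):
    beta, grad = cox_newton_step(beta, times, events, x)
```

In the federated setting described here, the sums `s0`, `s1` and `s2` play the role of intermediate results that would be aggregated across data providers before the gradient and Hessian are formed.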
Fig. 2 illustrates operations performed by the master 101 when the proportional risk regression model training method of the present application is applied to the distributed survival model training.
Referring to fig. 2, in the proportional risk regression model training method of the present application, operations performed by the master controller 101 mainly include: operation 210, sending a first proportional risk regression model to each data provider, the proportional risk regression model including at least one explanatory variable; operation 220, determining a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider; operation 230, sending the global survival time increasing sequence to each data provider, so that each data provider performs survival analysis according to the global survival time increasing sequence, the first proportional risk regression model and the sample values of the explanatory variables in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; operation 240, performing multi-party security calculation according to the local survival analysis intermediate result provided by each data provider to obtain a global survival analysis intermediate result; operation 250, updating the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
In this embodiment, the proportional hazards regression model is a distributed COX regression model, and in practical applications, the implementer may use any other suitable proportional hazards regression model, such as Weibull regression model, log logistic regression model, Gamma model, etc., according to the implementation.
In operation 210, the first proportional risk regression model is the initial model to be distributed to each data provider for the current round of distributed training.
The explanatory variables are various factors that do not change with time and that have an influence on the life time. For example, in the case of a patient survival model, patient's own genes, medication, treatment procedures, etc. are all major factors affecting the patient's survival, and can be used as explanatory variables for the patient survival model output (patient survival time), and the values of these variables are usually proportional values representing the degree of influence.
At least one explanatory variable is set as a model parameter of the proportional risk regression model. When the proportional risk regression model is initialized, initial values are given according to expert experience or domain knowledge, and the values are then adjusted continuously through distributed federated training so that the proportional risk regression model converges. After the proportional risk regression model converges, the explanatory variables can be used as a basis for studying the influencing factors and their degree of influence.
In operation 220, the survival time is a length of time set for performing the survival analysis, e.g., 1 year, 3 years, or 5 years. Each data provider can set different survival times according to its own requirements.
For example, the data provider 102 mainly studies survival rates of 1 year, 3 years, and 5 years, so the survival times are set to 1, 3, and 5 (years); and the data provider 103 mainly studies survival rates for 2 years, 4 years, and 6 years, so the survival time is set to 2, 4, or 6 (years).
In federated learning, statistics and multi-party calculations are typically performed on data corresponding to the same survival time. For this reason, a time dimension applicable to all data providers is required; that is, the survival time nodes in the time dimension must cover the survival times corresponding to the local medical data of all data providers. For example, for the data provider 102 and the data provider 103, the time dimension includes at least the survival time nodes 1, 2, 3, 4, 5, and 6 (years), so that each data provider performs statistics or multi-party security calculation on the same survival time nodes.
In existing federated learning schemes, obtaining this time dimension requires at least that the master controller communicate with each data provider; sometimes multiple rounds of communication between data providers are also needed to obtain the survival time nodes of other data providers, the statistics corresponding to a certain survival time node, and the like. On the one hand, the survival times used by a data provider therefore have to be exposed to the master controller or to other data providers, which is a potential security risk; on the other hand, frequent data exchange and multi-party security calculation between the master controller and multiple data providers are required, which leads to long running times, slow response, and heavy use of computing resources.
Therefore, the proportional risk regression model training method abandons the method, and can determine the global survival time increasing sequence only according to the local maximum survival time and the local minimum survival time provided by each data provider.
As such, the data provider 102 only needs to provide the local maximum survival time "5 years" and the local minimum survival time "1 year" to the master controller 101; the data provider 103 only needs to provide the local maximum survival time "6 years" and the local minimum survival time "2 years" to the master controller 101; upon receiving the local maximum and local minimum survival times, the master controller 101 determines a global maximum survival time "6 years" and a global minimum survival time "1 year"; then, with "1 year" as the step size, the following global survival time increasing sequence in years can be determined: {1,2,3,4,5,6}.
Besides using "1 year" as the step size, the global survival time increasing sequence can also be inferred from the distribution of the time points. Assuming that the failure time points follow a certain distribution function F(min, max), a global survival time increasing sequence can be derived from this function.
In this way, each data provider only needs to expose its local maximum and minimum survival times and does not need to expose all of its local survival times; the master controller 101 can determine the global survival time increasing sequence, which serves as a uniform time dimension, in only one round of communication. The risk of each data provider exposing data is greatly reduced, and the number of communications is reduced.
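The fixed-step construction of the sequence can be sketched as follows. The function name `global_survival_sequence` and its parameters are illustrative assumptions; only the minimum/maximum inputs and the step size come from the description above.

```python
def global_survival_sequence(local_ranges, step=1):
    """Build the global increasing sequence of survival time nodes.

    local_ranges : list of (local_min, local_max) pairs, one per data provider
    step         : survival time step (1 year in the example above)
    """
    global_min = min(lo for lo, _ in local_ranges)
    global_max = max(hi for _, hi in local_ranges)
    sequence = []
    t = global_min
    while t <= global_max:
        sequence.append(t)
        t += step
    return sequence

# Data provider 102 reports (1, 5); data provider 103 reports (2, 6).
sequence = global_survival_sequence([(1, 5), (2, 6)], step=1)
```

Only two numbers per data provider cross the network, which is the point of the scheme.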
In operation 230, the master controller 101 sends the global time-to-live increment sequence to each data provider, and each data provider performs a survival analysis according to the global time-to-live increment sequence, the first proportional risk regression model, and the sample value of the interpretation variable in the local data to obtain an intermediate result of the local survival analysis corresponding to the global time-to-live increment sequence.
For example, for a survival analysis system for calculating the survival rate of a patient, the number of dead individuals, the sum of interpretation variables corresponding to the risk individuals, and the like are all survival analysis intermediate results required for calculating the survival rate of the patient.
The survival analysis intermediate result generally comprises two parts, one part is a statistical value calculated by local medical data, for example, the sum of corresponding interpretation variables of risk individuals; and the other part is an output result obtained by inputting local medical data into a local proportional risk regression model.
After obtaining the intermediate result of the local survival analysis, the data provider only needs to fill the intermediate result of the local survival analysis into the data set corresponding to the local survival time in the global survival time increasing sequence.
Therefore, a data provider does not need to communicate with other data providers to find out which survival times they have set, and does not need to send the local survival analysis intermediate result for a given survival time to other data providers with the same survival time setting.
In operation 240, the master controller 101 collects the local survival analysis intermediate results that correspond to the global survival time increasing sequence and are provided by the data provider 102 and the data provider 103. The survival time nodes in the global survival time increasing sequence then serve as a unified time dimension, and multi-party security calculation is performed node by node to obtain the global survival analysis intermediate result.
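A minimal sketch of how intermediate results aligned to the global sequence might be filled in and combined is given below. The death counts, the function names, and the use of a plain element-wise sum in place of the multi-party security calculation are all assumptions for illustration.

```python
def local_death_counts(global_seq, failure_times):
    """Fill local death counts into slots of the global survival time sequence.

    Nodes that the data provider has no data for simply keep the value 0.
    """
    index = {t: i for i, t in enumerate(global_seq)}
    counts = [0] * len(global_seq)
    for t in failure_times:
        counts[index[t]] += 1
    return counts

def aggregate(local_vectors):
    """Node-by-node sum; a stand-in for the secure multi-party calculation."""
    return [sum(column) for column in zip(*local_vectors)]

global_seq = [1, 2, 3, 4, 5, 6]
counts_102 = local_death_counts(global_seq, [1, 3, 3, 5])  # hypothetical data
counts_103 = local_death_counts(global_seq, [2, 4, 6, 6])  # hypothetical data
global_counts = aggregate([counts_102, counts_103])
```

Because every vector has the same length and node order, the master controller can combine them without knowing which nodes each data provider actually studies.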
In operation 250, the first proportional risk regression model is updated and optimized according to the intermediate result of global survival analysis, which mainly means that a model loss function is calculated according to the intermediate result of global survival analysis, and model parameters are adjusted and updated according to values of the model loss function so that a model loss function value is continuously converged to achieve an expected model accuracy.
The second proportional risk regression model is a model with higher precision obtained after training, the used prediction method and prediction target of the second proportional risk regression model are the same as those of the first proportional risk regression model, but model parameters change after training, and some parameters used in the prediction process change correspondingly, so that the obtained result is different from that obtained by the first proportional risk regression model and is more accurate than that obtained by the first proportional risk regression model.
As can be seen from the above description, in the federal learning process, each data provider only needs to provide the maximum lifetime and the minimum lifetime without exposing all the lifetimes, so that the privacy of the private data of each data provider can be further protected; and the global survival time increasing sequence is used as the expanded time dimension, so that each data provider can calculate the intermediate result in the same time dimension without carrying out multiple communications, the communication bandwidth is greatly saved, the calculation times are reduced, and the response time is further shortened.
Fig. 3 shows operations performed by the data provider 102 or the data provider 103 when the distributed survival model training is performed by applying the proportional risk regression model training method of the present application.
Referring to fig. 3, in the proportional risk regression model training method of the present application, operations performed by the data provider 102 or the data provider 103 mainly include: operation 310, receiving a first proportional risk regression model sent by the master controller, wherein the proportional risk regression model comprises at least one explanatory variable; operation 320, providing the local maximum survival time and the local minimum survival time to the master controller; operation 330, receiving the global survival time increasing sequence sent by the master controller; operation 340, performing survival analysis according to the global survival time increasing sequence, the first proportional risk regression model and the sample values of the explanatory variables in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; operation 350, providing the local survival analysis intermediate result to the master controller.
Taking the data provider 102 as an example:
in operation 310, the data provider 102 receives the proportional risk regression model 1011 sent by the master 101 to obtain the local proportional risk regression model 1021. It should be noted that the local proportional risk regression model 1021 is identical to the proportional risk regression model 1011 (the first proportional risk regression model) sent by the master 101 and is only used for the current training, and the updated proportional risk regression model 1011 is obtained again from the master 101 when the next training is started.
In operation 320, the local maximum survival time and the local minimum survival time are the maximum and minimum values among the survival times that the data provider mainly studies. For example, the data provider 102 mainly studies survival rates at 1 year, 3 years, and 5 years, so the survival times are set to 1, 3, and 5 (years), where 1 year is the local minimum survival time and 5 years is the local maximum survival time.
The local maximum lifetime and the local minimum lifetime are provided to the master 101, and are mainly used for obtaining the global lifetime increasing sequence.
In the proportional risk regression model training method, the data provider 102 does not need to provide the whole local survival time, and only needs to provide the local maximum survival time of 5 years and the local minimum survival time of 1 year to the master controller 101, so that the risk of exposing data is greatly reduced.
In operation 330, the received global time-to-live increment sequence is determined by the master 101 according to the local maximum time-to-live and the local minimum time-to-live provided by each data provider. In this way, the data provider 102 does not need to communicate with other data providers, such as the data provider 103, and the number of times of communication with other data providers is greatly reduced.
Subsequently, in operation 340, the data provider 102 performs the calculation using the local proportional risk regression model 1021 to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence, and sends the local survival analysis intermediate result to the master controller 101, so that the master controller 101 can calculate the global survival analysis intermediate result from the local survival analysis intermediate results provided by each data provider.
It should be noted that the embodiment shown in fig. 2 and fig. 3 is only one of the most basic embodiments of the proportional risk regression model training method of the present application, and the implementer may further refine and expand the training method based on the embodiment.
FIG. 4 shows another embodiment of the proportional hazards regression model training method of the present application, which is based on the embodiments shown in FIGS. 2 and 3 and further optimizes the process of normalizing local medical data by each data provider; and the encryption and decryption operations are added to the data sent to the master controller by each data provider.
Specifically, in another embodiment shown in fig. 4, the operations performed by the master and the data provider and the interaction between them mainly include:
operation 4010, the master sends the initialized or updated first proportional risk regression model and the model parameters to each data provider, where the model parameters of the proportional risk regression model include at least one explanatory variable;
Operation 4020, each data provider provides to the master controller the local maximum survival time, the local minimum survival time, the number of local samples, and the sum of the explanatory variable values of the local samples, wherein each transmitted item is an encrypted value;
operation 4030, the master controller decrypts the received data, determines a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider, and calculates a global explanatory variable mean from the number of local samples and the sum of the explanatory variable values of the local samples provided by each data provider;
for determining the global time-to-live increment sequence, please refer to the description of operation 220 in the foregoing embodiment, which is not repeated herein.
The following example illustrates how the master controller calculates the global explanatory variable mean. Assume that the medical data of a local sample explanatory variable (e.g., patient age) of data provider A is [50, 20, 70], so the number of samples is 3 and the sum of the sample values is 140; the medical data of the same local sample explanatory variable of data provider B is [40, 40, 60, 70], so the number of samples is 4 and the sum of the sample values is 210.
In this case, the specific method of calculating the global explanatory variable mean from the number of local samples and the sum of the explanatory variable values of the local samples provided by each data provider is:
1) Acquire the sample numbers and the sums of the explanatory variable values of data providers A and B, and perform secure summation to obtain the global sample number 7 and the global sum of the explanatory variable values 350;
2) Divide the global sum of the explanatory variable values 350 by the global sample number 7 to obtain the global explanatory variable mean 50.
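The two steps can be sketched in a few lines. The sample lists below are illustrative values chosen so that the local sums are 140 and 210, the function names are hypothetical, and the plain `sum` stands in for the secure summation.

```python
def local_stats(samples):
    """What a data provider reports: (number of samples, sum of values)."""
    return len(samples), sum(samples)

def global_mean(reported):
    """Master-side mean from per-provider (count, sum) pairs."""
    total_count = sum(n for n, _ in reported)
    total_sum = sum(s for _, s in reported)
    return total_sum / total_count

provider_a = [50, 20, 70]      # illustrative ages, sum 140
provider_b = [40, 40, 60, 70]  # illustrative ages, sum 210
mean = global_mean([local_stats(provider_a), local_stats(provider_b)])
```

Individual sample values never leave a data provider; only the count and the sum do.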
Operation 4040, the master sends the global interpretation variable average to each data provider, so that each data provider calculates a local interpretation variable sample variance;
operation 4050, each data provider calculates a local explanatory variable sample variance from the global explanatory variable mean value provided by the master;
wherein the inputs of the variance formula are the local explanatory variable samples, the global explanatory variable mean, and n, the number of samples of the local explanatory variable; the result is the local explanatory variable sample variance.
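One possible reading of the variance step is sketched below. It assumes that the local variance is the mean squared deviation from the global mean (dividing by n rather than n - 1) and that the master controller pools the local variances weighted by sample count; neither detail is confirmed by the text, and all names and data are hypothetical.

```python
import math

def local_variance(samples, global_mean):
    """Mean squared deviation of local samples from the GLOBAL mean (assumed form)."""
    n = len(samples)
    return sum((x - global_mean) ** 2 for x in samples) / n

def pooled_variance_and_std(counts, variances):
    """Master-side pooling, weighted by sample count (assumed form)."""
    total = sum(counts)
    variance = sum(n * v for n, v in zip(counts, variances)) / total
    return variance, math.sqrt(variance)

provider_a = [50, 20, 70]      # illustrative values
provider_b = [40, 40, 60, 70]  # illustrative values
mu = 50.0                      # global mean received from the master controller

var_a = local_variance(provider_a, mu)
var_b = local_variance(provider_b, mu)
global_var, global_std = pooled_variance_and_std([3, 4], [var_a, var_b])
```

Under these assumptions the pooled value equals the variance that would be obtained from all seven samples together, because every deviation is taken from the same global mean.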
In operation 4060, each data provider provides a local interpretation variable sample variance to the master, where the local interpretation variable sample variance provided by each data provider is an encrypted value;
operation 4070, the master controller decrypts the received data, and performs multi-party security calculation according to the local explanatory variable sample variance provided by each data provider to obtain a global explanatory variable sample variance and a global explanatory variable standard deviation;
Operation 4080, the master controller returns the global explanatory variable standard deviation and the global survival time increasing sequence to each data provider;
operation 4090, each data provider performs z-score normalization on the sample values of the explanatory variables in the local medical data, and performs survival analysis according to the global survival time increasing sequence, the first proportional risk regression model and the normalized sample values of the explanatory variables in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence;
for example, after obtaining the global standard deviation $s_i$ of the explanatory variable, each data provider normalizes the samples of that explanatory variable in the local medical data by performing the following calculation:
1) obtain the mean $\bar{x}_i$ of the explanatory variable, that is, the global explanatory variable mean;
2) perform the normalization using the following formula:
$z_{ij} = (x_{ij} - \bar{x}_i) / s_i$
wherein $z_{ij}$ is the normalized variable value and $x_{ij}$ is the actual variable value.
3) Reverse the sign of the normalized values of inverse indicators.
The z-score normalized values fluctuate around 0; a value greater than 0 indicates an above-average level, and a value less than 0 indicates a below-average level.
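The normalization steps can be sketched as follows. The function name `z_score` and the `inverse` flag are hypothetical, and treating step 3 as a simple sign flip of the normalized value is an assumption about what is meant by inverse indicators.

```python
def z_score(samples, global_mean, global_std, inverse=False):
    """z-score normalize local samples using GLOBAL mean and standard deviation.

    inverse=True flips the sign, as assumed here for inverse indicators.
    """
    sign = -1.0 if inverse else 1.0
    return [sign * (x - global_mean) / global_std for x in samples]

# Illustrative values: global mean 50 and global standard deviation 10.
normalized = z_score([50, 20, 70], global_mean=50, global_std=10)
flipped = z_score([50, 20, 70], global_mean=50, global_std=10, inverse=True)
```

Using the global rather than the local mean and standard deviation keeps the normalized values of all data providers on the same scale.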
Operation 4100, each data provider provides a local survival analysis intermediate result to the master controller, where the local survival analysis intermediate result of each data provider is an encrypted value;
Any suitable encryption method may be used, for example, an encryption method satisfying additive homomorphism. An encryption method satisfying additive homomorphism mainly refers to an encryption method satisfying f(a) + f(b) = f(a + b), for example, the Paillier algorithm or the Gentry algorithm.
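To make the additive homomorphic property concrete, here is a toy Paillier sketch. It uses tiny fixed primes and is for illustration only, not a secure implementation; the function names are hypothetical. Note that in Paillier the ciphertext operation corresponding to plaintext addition is multiplication modulo n squared.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys from two distinct primes, with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: multiply ciphertexts modulo n^2."""
    return (c1 * c2) % (n * n)

n, private_key = keygen(1009, 1013)  # far too small for real use
c_sum = add_encrypted(n, encrypt(n, 140), encrypt(n, 210))
total = decrypt(n, private_key, c_sum)
```

Here the two encrypted local sums are combined without decrypting either one, and only the combined value is decrypted.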
In operation 4110, the master controller decrypts each data provided by each data provider, performs multi-party security calculation according to the intermediate local survival analysis result provided by each data provider to obtain an intermediate global survival analysis result, and updates the first proportional risk regression model according to the intermediate global survival analysis result to determine a second proportional risk regression model.
In the present embodiment, through operations 4020 to 4090, the master controller 101, in cooperation with the data provider 102 and the data provider 103, obtains the global explanatory variable mean from the number of local samples and the sum of the explanatory variable values of the local samples, and obtains the global explanatory variable standard deviation from the local explanatory variable sample variances; the sample values of the explanatory variables in the local medical data are then z-score normalized using the global explanatory variable mean and the global explanatory variable standard deviation.
In this way, to z-score normalize the sample values of the explanatory variables in the local medical data, a data provider only needs to provide the number of local samples, the sum of the explanatory variable values of the local samples, and the local explanatory variable sample variance, without exposing the raw medical data or other statistics.
Secondly, because the data sent by each data provider to the master controller, such as the local maximum survival time, the local minimum survival time, the number of local samples, the sum of the explanatory variable values of the local samples, and the survival analysis intermediate results, is encrypted before sending and decrypted on receipt, the risk of data exposure is further reduced.
Furthermore, in addition to the z-score normalization method, the following steps can be used for maximum-minimum normalization:
calculating to obtain a global interpretation variable maximum value and a global interpretation variable minimum value according to the local interpretation variable maximum value and the local interpretation variable minimum value provided by each data provider;
and sending the maximum value and the minimum value of the global interpretation variable to each data provider so as to enable each data provider to carry out maximum and minimum standardization on the sample value of the interpretation variable in the local medical data.
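The maximum-minimum alternative can be sketched in the same style. The function names and sample values are illustrative assumptions; mapping a constant column to 0 is a choice made here to avoid division by zero and is not taken from the text.

```python
def global_min_max(local_min_max):
    """Master-side: combine per-provider (min, max) pairs into global values."""
    global_min = min(lo for lo, _ in local_min_max)
    global_max = max(hi for _, hi in local_min_max)
    return global_min, global_max

def min_max_normalize(samples, global_min, global_max):
    """Provider-side: scale local samples into [0, 1] using global extremes."""
    span = global_max - global_min
    if span == 0:
        return [0.0 for _ in samples]  # constant column; assumed convention
    return [(x - global_min) / span for x in samples]

g_min, g_max = global_min_max([(20, 70), (40, 70)])
scaled = min_max_normalize([20, 45, 70], g_min, g_max)
```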
Similar technical effects can be obtained with the above alternative schemes, and an implementer can choose among them flexibly according to implementation requirements and conditions.
Based on the same inventive concept and on the above proportional risk regression model training method, the embodiment of the present application further provides a proportional risk regression model training apparatus, which is applied to a master controller. As shown in fig. 5, the apparatus 50 includes: a model distribution module 501, configured to send the first proportional risk regression model and the model parameters to each data provider, where the model parameters of the proportional risk regression model include at least one explanatory variable; a global survival time increasing sequence determining module 502, configured to determine a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider; a global survival time increasing sequence issuing module 503, configured to send the global survival time increasing sequence to each data provider, so that each data provider performs survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variables in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; a global survival analysis intermediate result calculation module 504, configured to perform multi-party security calculation according to the local survival analysis intermediate result provided by each data provider to obtain a global survival analysis intermediate result; and a model updating module 505, configured to update the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
According to an embodiment of the present application, the global time-to-live increment sequence determining module 502 includes: a global maximum survival time and global minimum survival time determining submodule for determining a global maximum survival time and a global minimum survival time according to the local maximum survival time and the local minimum survival time provided by each data provider; and the global survival time increasing sequence determining submodule is used for determining a global survival time increasing sequence according to the survival time step, the global maximum survival time and the global minimum survival time.
According to an embodiment of the present application, the model updating module 505 includes: a model loss function gradient calculation submodule, configured to calculate the gradient of the model loss function with respect to each explanatory variable according to the global survival analysis intermediate result; and a parameter updating submodule, configured to update parameters in the first proportional risk regression model according to the gradient so that the value of the model loss function continuously converges.
According to an embodiment of the present application, the local survival analysis intermediate result is an encrypted value obtained by an encryption method, and accordingly, the global survival analysis intermediate result calculation module 504 includes: a first decryption submodule, configured to decrypt the local survival analysis intermediate result provided by each data provider to obtain a decrypted local survival analysis intermediate result; and a global survival analysis intermediate result calculation submodule, configured to perform multi-party security calculation according to the decrypted local survival analysis intermediate result to obtain a global survival analysis intermediate result.
According to an embodiment of the present application, the apparatus 50 further includes: the global interpretation variable reference value calculating module is used for carrying out multi-party safety calculation according to the local interpretation variable reference values provided by each data provider to obtain a global interpretation variable reference value; and the global interpretation variable reference value sending module is used for sending the global interpretation variable reference value to each data provider so that each data provider can standardize the sample value of the interpretation variable in the local medical data.
According to an embodiment of the present application, the local interpretation variable reference value is an encrypted value, and accordingly, the global interpretation variable reference value operator module includes: the second decryption unit is used for decrypting the local interpretation variable reference value provided by each data provider to obtain a decrypted local interpretation variable reference value; and the global interpretation variable reference value calculating unit is used for carrying out multi-party safety calculation according to the decrypted local interpretation variable reference value to obtain a global interpretation variable reference value.
According to an embodiment of the present application, the standardization is z-score standardization. Accordingly, the global explanatory variable reference value calculation module includes: a global explanatory variable mean calculation unit, configured to calculate a global explanatory variable mean from the number of local samples and the sum of the explanatory variable values of the local samples provided by each data provider; a global explanatory variable mean sending unit, configured to send the global explanatory variable mean to each data provider so that each data provider can calculate a local explanatory variable sample variance; and a global explanatory variable standard deviation calculation unit, configured to perform secure multi-party computation on the local explanatory variable sample variances provided by the data providers to obtain a global explanatory variable sample variance and a global explanatory variable standard deviation. Correspondingly, the global explanatory variable reference value sending module is specifically configured to return the global explanatory variable standard deviation to each data provider, so that each data provider can perform z-score standardization on the sample values of the explanatory variable in its local medical data.
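The two-round exchange for z-score standardization can be illustrated with a minimal sketch. Several details are assumptions rather than statements from the source: each provider is assumed to report its sum of squared deviations about the global mean (one way to convey its local variance), the global sample variance is assumed to use an n - 1 denominator, plain sums replace the secure multi-party computation, and all function names and sample values are hypothetical.

```python
import math

def global_mean(local_counts, local_sums):
    """Round 1 (master): mean from per-provider sample counts and value sums."""
    return sum(local_sums) / sum(local_counts)

def local_squared_deviation(values, mean):
    """Round 2 (provider): squared deviations about the global mean."""
    return sum((v - mean) ** 2 for v in values)

def global_std(local_counts, local_sq_devs):
    """Round 2 (master): pooled sample standard deviation (n - 1 assumed)."""
    n = sum(local_counts)
    return math.sqrt(sum(local_sq_devs) / (n - 1))

def z_score(values, mean, std):
    """Final step (provider): standardize the local sample values."""
    return [(v - mean) / std for v in values]

# Two hypothetical providers holding one explanatory variable each.
site_a, site_b = [1.0, 2.0, 3.0], [4.0, 5.0]
counts = [len(site_a), len(site_b)]
mu = global_mean(counts, [sum(site_a), sum(site_b)])
sq_devs = [local_squared_deviation(s, mu) for s in (site_a, site_b)]
sigma = global_std(counts, sq_devs)
standardized_a = z_score(site_a, mu, sigma)
```

Computing deviations about the global mean, rather than each local mean, lets the master obtain the pooled variance by simple addition without a between-provider correction term.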
According to an embodiment of the present application, the standardization is maximum-minimum standardization. Accordingly, the global explanatory variable reference value calculation module is specifically configured to calculate a global explanatory variable maximum value and a global explanatory variable minimum value from the local explanatory variable maximum value and the local explanatory variable minimum value provided by each data provider. Correspondingly, the global explanatory variable reference value sending module is specifically configured to send the global explanatory variable maximum value and the global explanatory variable minimum value to each data provider, so that each data provider can perform maximum-minimum standardization on the sample values of the explanatory variable in its local medical data.
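A minimal sketch of maximum-minimum standardization with global extremes follows. It assumes the local extremes reach the master as plain values and that scaling maps to the interval [0, 1]; the function names and the handling of a constant variable are hypothetical choices.

```python
def global_min_max(local_mins, local_maxs):
    """Master: global extremes are the extremes of the local extremes."""
    return min(local_mins), max(local_maxs)

def min_max_scale(values, g_min, g_max):
    """Provider: scale local values to [0, 1] using the global extremes."""
    span = g_max - g_min
    if span == 0:
        # Constant variable: map every value to 0.0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - g_min) / span for v in values]

# Two hypothetical providers report their local extremes.
g_min, g_max = global_min_max([2.0, 6.0], [4.0, 10.0])
scaled = min_max_scale([2.0, 4.0], g_min, g_max)
```

Using the global rather than the local extremes keeps the scaled values comparable across providers.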
An embodiment of the present application further provides a training apparatus for a proportional risk regression model, which is applied to a data provider. As shown in fig. 6, the apparatus 60 includes: a model receiving module 601, configured to receive a first proportional risk regression model sent by a master, where the proportional risk regression model includes at least one explanatory variable; a local survival time providing module 602, configured to provide a local maximum survival time and a local minimum survival time to the master; a global survival data receiving module 603, configured to receive a global survival time increasing sequence sent by the master; a local survival analysis module 604, configured to perform survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in the local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence; and a local survival analysis intermediate result providing module 605, configured to provide the local survival analysis intermediate result to the master.
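What the local survival analysis module might compute can be sketched as follows. The form of the intermediate result is an assumption, since this passage does not define it: for each time point of the global sequence the sketch returns the risk-set sums of a Cox-style partial likelihood (`s0`, `s1`), the event count (`d`), and the covariate sum over events (`sx`). Assigning each event to the latest grid point not exceeding its survival time is likewise an assumed discretization, and the function name and dictionary keys are hypothetical.

```python
import numpy as np

def local_survival_stats(grid, times, events, x, beta):
    """Per-time-point statistics computed from one provider's local data.

    grid   : global survival time increasing sequence, shape (T,)
    times  : local survival times, shape (n,)
    events : 1 if the event was observed, 0 if censored, shape (n,)
    x      : explanatory variable values, shape (n, p)
    beta   : current model parameters, shape (p,)
    """
    grid = np.asarray(grid, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=float)
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)

    risk = np.exp(x @ beta)                                    # (n,)
    at_risk = (times[None, :] >= grid[:, None]).astype(float)  # (T, n)
    s0 = at_risk @ risk                                        # (T,)
    s1 = at_risk @ (risk[:, None] * x)                         # (T, p)

    # Assign each sample to the latest grid point not exceeding its time.
    idx = np.searchsorted(grid, times, side="right") - 1
    idx = np.clip(idx, 0, len(grid) - 1)
    d = np.zeros(len(grid))
    sx = np.zeros((len(grid), x.shape[1]))
    np.add.at(d, idx, events)                  # event counts per grid point
    np.add.at(sx, idx, events[:, None] * x)    # covariate sums over events
    return {"s0": s0, "s1": s1, "d": d, "sx": sx}
```

Because all providers index their statistics by the same global sequence, the resulting arrays have identical shapes and can be added position by position.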
According to an embodiment of the present application, the apparatus 60 further includes: an encryption module, configured to encrypt the local maximum survival time and the local minimum survival time to obtain an encrypted local maximum survival time and an encrypted local minimum survival time. Accordingly, the local survival time providing module 602 is specifically configured to provide the encrypted local maximum survival time and the encrypted local minimum survival time to the master.
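Once the master has recovered the local extremes, the global survival time increasing sequence can be derived from them and a survival time step length. The sketch below assumes plain (already decrypted) inputs, an evenly spaced grid starting at the global minimum, and that the global maximum is appended when the step does not land on it; the function name and these conventions are hypothetical.

```python
import math

def global_time_grid(local_mins, local_maxs, step):
    """Build an increasing sequence of time points covering all providers."""
    if step <= 0:
        raise ValueError("step must be positive")
    t_min, t_max = min(local_mins), max(local_maxs)
    # Number of full steps that fit between the global extremes.
    n_steps = int(math.floor((t_max - t_min) / step + 1e-9))
    grid = [t_min + k * step for k in range(n_steps + 1)]
    if grid[-1] < t_max:
        grid.append(t_max)  # assumed: always include the global maximum
    return grid

# Two hypothetical providers report (min, max) survival times.
grid = global_time_grid([2, 5], [10, 7], step=3)
```

Only the two extremes per provider are needed, so no individual survival time is disclosed to the master.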
According to an embodiment of the present application, the encryption module is further configured to encrypt the local survival analysis intermediate result to obtain an encrypted local survival analysis intermediate result. Accordingly, the local survival analysis intermediate result providing module 605 is specifically configured to provide the encrypted local survival analysis intermediate result to the master.
According to an embodiment of the present application, the apparatus 60 further includes: a local explanatory variable reference value providing module, configured to provide the reference value of the explanatory variable in the local medical data to the master, so that the master can calculate a global explanatory variable reference value; and a sample value standardization module, configured to standardize the sample values of the explanatory variable in the local medical data according to the global explanatory variable reference value provided by the master.
According to an embodiment of the present application, the standardization is z-score standardization. Accordingly, the sample value standardization module includes: a sample variance calculation submodule, configured to calculate the local explanatory variable sample variance according to the global explanatory variable mean sent by the master; a sample variance sending submodule, configured to provide the local explanatory variable sample variance to the master; and a z-score standardization submodule, configured to perform z-score standardization on the sample values of the explanatory variable in the local medical data according to the global explanatory variable standard deviation sent by the master.
According to an embodiment of the present application, the standardization is maximum-minimum standardization. Accordingly, the sample value standardization module is specifically configured to perform maximum-minimum standardization on the sample values of the explanatory variable in the local medical data according to the global explanatory variable maximum value and the global explanatory variable minimum value sent by the master.
In addition, an embodiment of the present application provides a computer-readable storage medium that includes a set of computer-executable instructions; when the instructions are executed, any one of the above training methods for a proportional risk regression model is performed.
It should be noted here that the above descriptions of the proportional risk regression model training apparatus embodiments, the training system embodiments, and the computer-readable storage medium embodiments are similar to the descriptions of the foregoing method embodiments and have similar beneficial effects, so they are not repeated. For technical details not disclosed in the apparatus, system, and storage medium embodiments of the present application, refer to the description of the foregoing method embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes various media that can store program code, such as a removable memory device, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated unit in the embodiments of the present application may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes a removable storage device, a ROM, a magnetic or optical disk, or any other medium that can store program code.
The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for training a proportional risk regression model, the method comprising:
sending a first proportional risk regression model to each data provider, the proportional risk regression model including at least one explanatory variable;
determining a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider;
sending the global survival time increasing sequence to each data provider, so that each data provider performs survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in its local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence;
performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result;
and updating parameters in the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
2. The method of claim 1, wherein determining the global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider comprises:
determining a global maximum survival time and a global minimum survival time according to the local maximum survival time and the local minimum survival time provided by each data provider;
and determining the global survival time increasing sequence according to a survival time step length, the global maximum survival time, and the global minimum survival time.
3. The method of claim 1, wherein updating parameters in the first proportional risk regression model according to the global survival analysis intermediate result comprises:
calculating the gradient of the model loss function with respect to the parameter corresponding to each explanatory variable according to the global survival analysis intermediate result;
and updating the parameters in the first proportional risk regression model according to the gradient, so that the loss function value of the first proportional risk regression model progressively converges.
4. The method of claim 1, wherein the local survival analysis intermediate result is an encrypted value,
and correspondingly, performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain the global survival analysis intermediate result comprises:
decrypting the local survival analysis intermediate result provided by each data provider to obtain a decrypted local survival analysis intermediate result;
and performing secure multi-party computation according to the decrypted local survival analysis intermediate results to obtain the global survival analysis intermediate result.
5. The method of claim 1, further comprising:
performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain a global explanatory variable reference value;
and sending the global explanatory variable reference value to each data provider so that each data provider can standardize the sample values of the explanatory variable in its local medical data.
6. The method of claim 5, wherein the local explanatory variable reference value is an encrypted value,
and correspondingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value comprises:
decrypting the local explanatory variable reference value provided by each data provider to obtain a decrypted local explanatory variable reference value;
and performing secure multi-party computation according to the decrypted local explanatory variable reference values to obtain the global explanatory variable reference value.
7. The method of claim 5, wherein the standardization is z-score standardization,
and correspondingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value comprises:
calculating a global explanatory variable mean from the number of local samples and the sum of the explanatory variable values of the local samples provided by each data provider;
sending the global explanatory variable mean to each data provider so that each data provider can calculate a local explanatory variable sample variance;
performing secure multi-party computation according to the local explanatory variable sample variances provided by the data providers to obtain a global explanatory variable sample variance and a global explanatory variable standard deviation;
and correspondingly, sending the global explanatory variable reference value to each data provider so that each data provider can standardize the sample values of the explanatory variable in its local medical data comprises:
returning the global explanatory variable standard deviation to each data provider so that each data provider can perform z-score standardization on the sample values of the explanatory variable in its local medical data.
8. The method of claim 5, wherein the standardization is maximum-minimum standardization,
and correspondingly, performing secure multi-party computation according to the local explanatory variable reference values provided by the data providers to obtain the global explanatory variable reference value comprises:
calculating a global explanatory variable maximum value and a global explanatory variable minimum value from the local explanatory variable maximum value and the local explanatory variable minimum value provided by each data provider;
and correspondingly, sending the global explanatory variable reference value to each data provider so that each data provider can standardize the sample values of the explanatory variable in its local medical data comprises:
sending the global explanatory variable maximum value and the global explanatory variable minimum value to each data provider so that each data provider can perform maximum-minimum standardization on the sample values of the explanatory variable in its local medical data.
9. A method for training a proportional risk regression model, the method comprising:
receiving a first proportional risk regression model sent by a master, wherein the proportional risk regression model comprises at least one explanatory variable;
providing a local maximum survival time and a local minimum survival time to the master;
receiving a global survival time increasing sequence sent by the master;
performing survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and sample values of the explanatory variable in local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence;
and providing the local survival analysis intermediate result to the master.
10. A method for training a proportional risk regression model, the method comprising:
a master sending a first proportional risk regression model to each data provider, wherein the proportional risk regression model comprises at least one explanatory variable;
each data provider providing a local maximum survival time, a local minimum survival time, a number of local samples, and a sum of the explanatory variable values of the local samples to the master;
the master determining a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider, and sending the global survival time increasing sequence to each data provider;
each data provider performing survival analysis according to the global survival time increasing sequence and the first proportional risk regression model sent by the master, and sample values of the explanatory variable in its local medical data, to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence, and providing the local survival analysis intermediate result to the master;
and the master performing secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result, and updating parameters in the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
11. A training device for a proportional risk regression model, the training device comprising:
a model distribution module, configured to send a first proportional risk regression model to each data provider, wherein the proportional risk regression model comprises at least one explanatory variable;
a global survival time increasing sequence determining module, configured to determine a global survival time increasing sequence according to the local maximum survival time and the local minimum survival time provided by each data provider;
a global survival time increasing sequence issuing module, configured to send the global survival time increasing sequence to each data provider so that each data provider can perform survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in its local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence;
a global survival analysis intermediate result calculation module, configured to perform secure multi-party computation according to the local survival analysis intermediate results provided by the data providers to obtain a global survival analysis intermediate result;
and a model updating module, configured to update the first proportional risk regression model according to the global survival analysis intermediate result to determine a second proportional risk regression model.
12. A training device for a proportional risk regression model, the training device comprising:
a model receiving module, configured to receive a first proportional risk regression model sent by a master, wherein the proportional risk regression model comprises at least one explanatory variable;
a local survival time providing module, configured to provide a local maximum survival time and a local minimum survival time to the master;
a global survival data receiving module, configured to receive a global survival time increasing sequence sent by the master;
a local survival analysis module, configured to perform survival analysis according to the global survival time increasing sequence, the first proportional risk regression model, and the sample values of the explanatory variable in local medical data to obtain a local survival analysis intermediate result corresponding to the global survival time increasing sequence;
and a local survival analysis intermediate result providing module, configured to provide the local survival analysis intermediate result to the master.
13. A system for training a proportional risk regression model, the system comprising:
a master for performing the proportional risk regression model training method of any one of claims 1-8;
and at least two data providers for performing the proportional risk regression model training method of claim 9.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN202111156675.1A 2021-09-30 2021-09-30 Proportional risk regression model training method, device and system and storage medium Active CN114021732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156675.1A CN114021732B (en) 2021-09-30 2021-09-30 Proportional risk regression model training method, device and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156675.1A CN114021732B (en) 2021-09-30 2021-09-30 Proportional risk regression model training method, device and system and storage medium

Publications (2)

Publication Number Publication Date
CN114021732A CN114021732A (en) 2022-02-08
CN114021732B true CN114021732B (en) 2022-07-29

Family

ID=80055215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156675.1A Active CN114021732B (en) 2021-09-30 2021-09-30 Proportional risk regression model training method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN114021732B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523081B (en) * 2023-04-07 2024-02-13 花瓣云科技有限公司 Data standardization method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401433A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 User information acquisition method and device, electronic equipment and storage medium
CN112418444A (en) * 2020-05-15 2021-02-26 支付宝(杭州)信息技术有限公司 Method and device for league learning and league learning system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065713A1 (en) * 2018-08-24 2020-02-27 Adobe Inc. Survival Analysis Based Classification Systems for Predicting User Actions

Also Published As

Publication number Publication date
CN114021732A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN112182595B (en) Model training method and device based on federal learning
CN110245510B (en) Method and apparatus for predicting information
CN110189192B (en) Information recommendation model generation method and device
US8990252B2 (en) Anonymity measuring device
CN110610093B (en) Methods, systems, and media for distributed training in parameter data sets
CN110990871A (en) Machine learning model training method, prediction method and device based on artificial intelligence
CN110999200B (en) Method and system for evaluating monitoring function to determine whether triggering condition is met
CN110245514B (en) Distributed computing method and system based on block chain
CN113221153B (en) Graph neural network training method and device, computing equipment and storage medium
EP3871127A1 (en) Privacy preserving server
US10515060B2 (en) Method and system for generating a master clinical database and uses thereof
CN112394974A (en) Code change comment generation method and device, electronic equipment and storage medium
CN114696990A (en) Multi-party computing method, system and related equipment based on fully homomorphic encryption
CN114021732B (en) Proportional risk regression model training method, device and system and storage medium
CN112801307B (en) Block chain-based federal learning method and device and computer equipment
EP3754550A1 (en) Method for providing an aggregate algorithm for processing medical data and method for processing medical data
CN113849828B (en) Anonymous generation and attestation of processed data
Mansourvar et al. An additive–multiplicative restricted mean residual life model
CN112053058A (en) Index model generation method and device
CN113989036B (en) Federal learning prediction method and system without exposure of model-entering variable
JP7076167B1 (en) Machine learning equipment, machine learning systems, machine learning methods, and machine learning programs
CN114547684A (en) Method and device for protecting multi-party joint training tree model of private data
CN112836767A (en) Federal modeling method, apparatus, device, storage medium, and program product
CN113228022A (en) Analysis query response system, analysis query execution device, analysis query verification device, analysis query response method, and program
US20230394303A1 (en) Machine learning system, client terminal, aggregated server device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant