CN110348241A

CN110348241A - A multi-center collaborative prognosis prediction system under a data sharing strategy

Info

Publication number: CN110348241A
Application number: CN201910629800.2A
Authority: CN
Inventors: 李劲松; 李谨; 田雨; 吴承凯; 池胜强
Original assignee: Zhijiang Laboratory
Current assignee: Zhijiang Laboratory
Priority date: 2019-07-12
Filing date: 2019-07-12
Publication date: 2019-10-18
Anticipated expiration: 2039-07-12
Also published as: WO2020233258A1; CN110348241B

Abstract

The invention discloses a multi-center collaborative prognosis prediction system under a data sharing strategy. The system enables privacy-preserving data sharing across multiple medical institution centers, thereby providing enough data for model construction. The system is built with an ensemble learning algorithm, which can obtain better prediction results than weak classifiers. It processes sensitive patient-level data at each center while constructing the sub-classifiers of the ensemble learning model, and exchanges only less sensitive intermediate results to build the complete ensemble learning model, thus ensuring that the proposed multi-center model achieves the same or even better results than the centralized model. The system protects the personal privacy of patients and does not need to run algorithm models on large centralized data sources. In actual clinical applications, it provides a reliable solution to the problem of a single medical institution having too few samples to construct a prediction model.

Description

A multi-center collaborative prognosis prediction system under a data sharing strategy

Technical field

The invention belongs to the fields of medicine and machine learning, and in particular relates to a multi-center collaborative prognosis prediction system under a data sharing strategy.

Background

Prognosis prediction plays an important role in clinical research and practice. Predictive models built on electronic health record (EHR) data from a single medical institution may lack sufficient statistical power and good generalization ability. Building a prognosis prediction model on the collaborative analysis of EHR data from multiple medical institution centers can therefore increase the number and coverage of patients used for model training, enrich the prognostic characteristics of patients, and ultimately improve the accuracy and generalization ability of the model's prognosis prediction.

Ensemble learning is widely used in clinical prognosis. Unlike linear models such as logistic regression and the Cox model, ensemble learning algorithms usually have better accuracy, can capture nonlinear relationships between variables, and can effectively avoid the overfitting problem common in machine learning. Using ensemble learning algorithms for model construction therefore provides an ideal solution for building a multi-center collaborative prognosis prediction system.

In addition, patient privacy must be protected while performing multi-center prognosis prediction. Most existing privacy-preserving ensemble learning training models for the multi-center setting are based on encryption, for example additive homomorphic encryption. Aslett et al. proposed an ensemble learning model based on fully homomorphic encryption. Magkos et al. used a protocol framework based on homomorphic encryption to build an encryption module and thereby train an ensemble learning classifier. Although these encryption methods can prevent information leakage and data exchange, they significantly reduce computing and storage efficiency, scale poorly, and are not suitable for processing large clinical data sets in the multi-center setting.

Summary of the invention

The purpose of the present invention is to address the deficiencies of the prior art by providing a multi-center collaborative prognosis prediction system under a novel data sharing strategy.

This purpose is achieved through the following technical solution: a multi-center collaborative prognosis prediction system under a data sharing strategy, comprising the following four modules:

(1) Data acquisition module: at each medical institution center, collect the data of each variable required for patient prognosis prediction as the source data set of that medical institution center.

(2) Data anonymization module: the source data set of each medical institution center is randomly sampled with percentage p; an anonymization algorithm is applied to the sampled data to generate anonymized data, and the remaining data serves as the local training set of that medical institution center. The anonymized data from every medical institution center is collected by the central server and combined into an augmented data set. The augmented data set is divided into two parts, an additional training set and a validation set. The additional training set is sent back and distributed to each medical institution center; the validation set is used to select the hyperparameters of the ensemble learning model.

(3) Model training module: each medical institution center trains sub-classifiers of the ensemble learning model locally. The training data comprises the local training set of that medical institution center and the additional training set returned to it by the central server. The training set for each center's sub-classifiers therefore comes not only from the center itself but also from the data sets of other centers, which increases the randomness of the data and improves the overall performance of the ensemble learning model. During training, the hyperparameters of the ensemble learning model are selected using the validation set created from the augmented data set.

(4) Prognosis model application module: the central server collects the sub-classifiers trained locally at each medical institution center to form the complete ensemble learning model; new patient data is input into this ensemble learning model to perform prognosis prediction.
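The data flow of modules (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names, the `validation_fraction` parameter, and the choice to send the full additional training set to every center are hypothetical, and the default `anonymize` argument is an identity placeholder that performs no anonymization.

```python
import random

def split_center(source, p, anonymize, rng):
    """Randomly sample a fraction p of one center's source data set.

    Returns (local_training_set, anonymized_sample).
    """
    indices = list(range(len(source)))
    rng.shuffle(indices)
    n_sampled = round(len(source) * p)
    sampled = [source[i] for i in indices[:n_sampled]]
    local_train = [source[i] for i in indices[n_sampled:]]
    return local_train, anonymize(sampled)

def build_augmented_sets(anonymized_per_center, validation_fraction, rng):
    """Central server: pool the anonymized data into the augmented data set,
    then split it into (additional_training_set, validation_set)."""
    pooled = [rec for center in anonymized_per_center for rec in center]
    rng.shuffle(pooled)
    n_val = round(len(pooled) * validation_fraction)
    return pooled[n_val:], pooled[:n_val]

def share_data(centers, p=0.5, validation_fraction=0.2,
               anonymize=lambda recs: recs, seed=0):
    """Run the sharing strategy over all centers.

    ASSUMPTIONS: validation_fraction is not given in the source; the default
    anonymize is an identity placeholder; each center receives the whole
    additional training set.
    """
    rng = random.Random(seed)
    local_sets, anonymized = [], []
    for source in centers:
        local, anon = split_center(source, p, anonymize, rng)
        local_sets.append(local)
        anonymized.append(anon)
    additional_train, validation = build_augmented_sets(
        anonymized, validation_fraction, rng)
    # Each center trains on its local set plus the returned additional set.
    training_sets = [local + additional_train for local in local_sets]
    return training_sets, validation
```

With five centers of 20 records each and p = 0.5, each center keeps 10 records locally, the server pools 50 anonymized records, and each center ends up training on its own 10 records plus the shared additional training set.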

Further, in the data anonymization module, the random sampling percentage p of each medical institution center's source data set is set to 50%. Fixing the proportion of anonymized data p at 50% improves the prediction performance of the ensemble learning model; neither directly integrating the sub-classifiers nor fully anonymizing the data and then training centrally achieves the best results. The value of p can be adjusted to suit complex decision support scenarios, so that the system can be used for patient prognosis prediction in clinical practice under different scenarios.

Further, the anonymization algorithm may be chosen from k-anonymity, l-diversity, t-closeness, differential privacy, and similar methods. Suppression may be chosen as the specific method for achieving k-anonymity; suppression means completely hiding certain information by not releasing certain data items.
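One simple way to obtain k-anonymity by suppression is record suppression: withhold every record whose combination of quasi-identifier values is shared by fewer than k records. The sketch below illustrates that idea only; the source does not specify which suppression variant was used, and the function name and field names are hypothetical.

```python
from collections import Counter

def suppress_to_k_anonymity(records, quasi_identifiers, k):
    """Return only the records whose quasi-identifier combination
    occurs at least k times; all other records are suppressed."""
    def key(rec):
        return tuple(rec[q] for q in quasi_identifiers)

    counts = Counter(key(rec) for rec in records)
    return [rec for rec in records if counts[key(rec)] >= k]

# Hypothetical records; field names are illustrative only.
records = [
    {"age_group": "60-69", "sex": "M", "t_stage": "II"},
    {"age_group": "60-69", "sex": "M", "t_stage": "III"},
    {"age_group": "70-79", "sex": "F", "t_stage": "II"},
]
released = suppress_to_k_anonymity(records, ["age_group", "sex"], k=2)
```

Here the single record in the ("70-79", "F") group is withheld, so every released record shares its quasi-identifier values with at least one other record.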

Further, the system considers horizontally partitioned data, that is, the source data sets of all medical institution centers contain the same kinds of variables.
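The horizontal partitioning requirement amounts to every center exposing the same set of variables. A minimal check might look like the following; the function name and the variable names in the usage line are hypothetical.

```python
def check_horizontal_partition(center_schemas):
    """Verify that every center lists the same variables (order ignored).

    Returns the shared variable names, sorted; raises ValueError otherwise.
    """
    reference = set(center_schemas[0])
    for idx, schema in enumerate(center_schemas[1:], start=1):
        if set(schema) != reference:
            differing = sorted(reference ^ set(schema))
            raise ValueError(
                f"center {idx} differs from center 0 in variables: {differing}")
    return sorted(reference)

shared = check_horizontal_partition([
    ["age", "sex", "tumor_size_mm"],
    ["sex", "age", "tumor_size_mm"],
])
```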

The beneficial effects of the present invention are as follows. The invention proposes a multi-center data sharing strategy that enables privacy-preserving data sharing across multiple medical institution centers, thereby providing enough data for model construction. The system is built with an ensemble learning algorithm (such as the random forest algorithm), which can obtain better prediction results than weak classifiers. The system processes sensitive patient-level data at each center while constructing the sub-classifiers of the ensemble learning model, and exchanges only less sensitive intermediate results to build the complete ensemble learning model, thus ensuring that the proposed multi-center model achieves the same or even better results than the centralized model. The multi-center collaborative prognosis prediction system protects the personal privacy of patients and does not need to run algorithm models on large centralized data sources. In actual clinical applications, it provides a reliable solution to the problem of a single medical institution having too few samples to construct a prediction model.

Brief description of the drawings

Figure 1 is a framework diagram of the multi-center collaborative prognosis prediction system under the data sharing strategy;

Figure 2 is a schematic diagram of the data sharing strategy;

Figure 3 is a schematic diagram of data transmission between the centers;

Figure 4 compares the prediction ability of the multi-center collaborative prognosis prediction system under the data sharing strategy of the present invention with that of a prognosis prediction system under centralized training.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

The multi-center collaborative prognosis prediction system under a novel data sharing strategy provided by the present invention, as shown in Figure 1, comprises the following four modules:

(1) Data acquisition module: at each medical institution center, collect the data of each variable required for patient prognosis prediction as the source data set of that medical institution center. This embodiment uses colorectal cancer data for experimental verification, with 5 medical institution centers. A sample of the electronic medical record data collected by each center through the data acquisition module is shown in Table 1; it covers six variables: age, gender, tumor size, T stage, N stage, and carcinoembryonic antigen index.

Table 1: Example of electronic medical record data collected at a single center for colorectal cancer patients

No.   Age   Gender   Tumor size (mm)   T stage   N stage   Carcinoembryonic antigen index
1     65    Male     4.8               I         III       Positive
2     74    Female   1.5               II        IV        Negative
…     …     …        …                 …         …         …

(2) Data anonymization module: as shown in Figure 2, the source data set of each medical institution center is randomly sampled with percentage p; an anonymization algorithm is applied to the sampled data to generate anonymized data, and the remaining data serves as the local training set of that medical institution center. The anonymized data from every medical institution center is collected by the central server and combined into an augmented data set. The augmented data set is divided into two parts, an additional training set and a validation set. The additional training set is sent back and distributed to each medical institution center; the validation set is used to select the hyperparameters of the ensemble learning model. In the experiment, the proportion of anonymized data p is set to 50%, and the anonymization algorithm is the suppression method of k-anonymity. Two hyperparameters are selected through the validation set: the maximum number of features used by a single decision tree and the number of sub-classifiers.

(3) Model training module: as shown in Figure 2, each medical institution center trains sub-classifiers of the ensemble learning model locally. The training data comprises the local training set of that medical institution center and the additional training set returned to it by the central server. The training set for each center's sub-classifiers therefore comes not only from the center itself but also from the data sets of other centers, which increases the randomness of the data and improves the overall performance of the ensemble learning model. During training, the hyperparameters of the ensemble learning model are selected using the validation set created from the augmented data set. This addresses the problem that, in the multi-center setting, the out-of-bag (OOB) error is not exactly the same as in a standard random forest and therefore no longer provides a valid unbiased estimate.

(4) Prognosis model application module: the central server collects the sub-classifiers trained locally at each medical institution center to form the complete ensemble learning model; new patient data is input into this ensemble learning model to perform prognosis prediction. The experimental results are shown in Figure 4, where the prediction ability of each prognosis prediction system is measured by AUC. The multi-center collaborative prognosis prediction system under the proposed data sharing strategy achieves better prediction results than the prognosis prediction system under centralized training.
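The validation-based hyperparameter selection, the pooling of sub-classifiers, and the AUC measurement described above can be sketched as follows. This is a library-free illustration under stated assumptions: sub-classifiers are represented as plain callables that return a probability, combining them by averaging is an assumption (the source only says the central server collects them into one model), and `train_fn`, `ensemble_predict`, and `select_hyperparameters` are hypothetical names.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble_predict(sub_classifiers, x):
    """ASSUMPTION: combine sub-classifiers by averaging their outputs."""
    return sum(clf(x) for clf in sub_classifiers) / len(sub_classifiers)

def select_hyperparameters(train_fn, training_sets, validation, grid):
    """Pick the hyperparameters with the highest validation AUC.

    train_fn(data, **params) is a hypothetical local trainer that returns
    a list of sub-classifiers for one center. validation is a list of
    (features, label) pairs drawn from the augmented data set.
    """
    labels = [y for _, y in validation]
    best_params, best_auc = None, -1.0
    for params in grid:
        # Each center trains locally; the server pools all sub-classifiers.
        ensemble = [clf for data in training_sets
                    for clf in train_fn(data, **params)]
        scores = [ensemble_predict(ensemble, x) for x, _ in validation]
        score = auc(labels, scores)
        if score > best_auc:
            best_params, best_auc = params, score
    return best_params, best_auc
```

In the embodiment, the grid would range over the two hyperparameters named above (maximum number of features per decision tree and number of sub-classifiers), and `train_fn` would wrap whatever tree learner each center runs locally.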

The above embodiment is intended to illustrate the present invention, not to limit it. Any modification or change made to the present invention within its spirit and the protection scope of the claims falls within the protection scope of the present invention.

Claims

1. A multi-center collaborative prognosis prediction system under a data sharing strategy, characterized by comprising:

(1) Data acquisition module: at each medical institution center, collect the data of each variable required for patient prognosis prediction as the source data set of that medical institution center.

(2) Data anonymization module: the source data set of each medical institution center is randomly sampled with percentage p; an anonymization algorithm is applied to the sampled data to generate anonymized data, and the remaining data serves as the local training set of that medical institution center; the anonymized data from every medical institution center is collected by the central server and combined into an augmented data set; the augmented data set is divided into two parts, an additional training set and a validation set; the additional training set is sent back and distributed to each medical institution center; the validation set is used to select the hyperparameters of the ensemble learning model.

(3) Model training module: each medical institution center trains sub-classifiers of the ensemble learning model locally; the training data comprises the local training set of that medical institution center and the additional training set returned to it by the central server; the training set for each center's sub-classifiers therefore comes not only from the center itself but also from the data sets of other centers, which increases the randomness of the data and improves the overall performance of the ensemble learning model. During training, the hyperparameters of the ensemble learning model are selected using the validation set created from the augmented data set.

(4) Prognosis model application module: the central server collects the sub-classifiers trained locally at each medical institution center to form the complete ensemble learning model; new patient data is input into this ensemble learning model to perform prognosis prediction.

2. The multi-center collaborative prognosis prediction system under a data sharing strategy according to claim 1, characterized in that, in the data anonymization module, the random sampling percentage p of each medical institution center's source data set is set to 50%.

3. The multi-center collaborative prognosis prediction system under a data sharing strategy according to claim 1, characterized in that the anonymization algorithm may be chosen from k-anonymity, l-diversity, t-closeness, differential privacy, and similar methods, wherein suppression may be chosen as the specific method for achieving k-anonymity, suppression meaning completely hiding certain information by not releasing certain data items.

4. The multi-center collaborative prognosis prediction system under a data sharing strategy according to claim 1, characterized in that the system considers horizontally partitioned data, that is, the source data sets of all medical institution centers contain the same kinds of variables.