CN112182636B - Method, device, equipment and medium for realizing joint modeling training

Info

Publication number
CN112182636B
CN112182636B
Authority
CN
China
Prior art keywords
data set
data
local
sample
remote
Prior art date
Legal status
Active
Application number
CN201910596082.3A
Other languages
Chinese (zh)
Other versions
CN112182636A (en)
Inventor
冯智
宋传园
熊昊一
浣军
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910596082.3A
Publication of CN112182636A
Application granted
Publication of CN112182636B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a method, a device, equipment and a medium for realizing joint modeling training. The method comprises the following steps: acquiring a joint modeling request; performing sample-row alignment and column-feature stitching based on the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine a local sample data set and a remote sample data set, wherein at least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model; and determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and the at least one remote sample data set to obtain a target joint model. By adopting the scheme of the embodiment, the speed of joint modeling analysis can be improved and the computation cost reduced while the data of all parties to the joint modeling is kept confidential.

Description

Method, device, equipment and medium for realizing joint modeling training
Technical Field
The embodiments of the invention relate to machine learning and deep learning technology, and in particular to a method, a device, equipment and a medium for realizing joint modeling training.
Background
Deep Learning (DL) is an important branch of machine learning: a class of algorithms that take artificial neural networks as a framework to perform representation learning on data.
With the development of algorithms and big data, algorithms and computing power are no longer the bottleneck impeding the development of AI; genuine, effective data sources in the various fields have become the most precious resource. At the same time, barriers that are difficult to break exist between data sources: in most industries, data exists in the form of islands. Owing to industry competition, privacy and security concerns, complicated administrative procedures and the like, even data integration between different departments of the same company faces significant resistance; in practice, integrating data scattered across different places and institutions is almost impossible, or the cost required is enormous.
However, because the quantity and quality of training sample data have a fundamental impact on the accuracy of the model obtained by training, there is a need for joint modeling training based on multiparty data.
In order to realize joint modeling training based on multiparty data while avoiding leakage of the original data among the parties, the main solutions provided in the prior art include: schemes implemented on a trusted computing environment, which have the disadvantage of imposing high requirements on the hardware environment; or schemes that encrypt the data to be interactively transmitted during training, typically using homomorphic encryption technology. However, the complex cryptography and the volume of transmitted data both increase the computational complexity of training and reduce model training efficiency.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for realizing joint modeling training, which improve the speed of joint modeling analysis and reduce the computation cost while keeping the data of all parties to the joint modeling confidential.
In a first aspect, an embodiment of the present invention provides a method for implementing joint modeling training, which is executed by any data provider in joint modeling, where the method includes:
acquiring a joint modeling request;
performing sample-row alignment and column-feature stitching based on the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine a local sample data set and a remote sample data set; at least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model, the feature extraction model is a deep learning model, and sample rows in at least one of the local data set and the remote data set carry supervised-training label values;
and determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to acquire a target joint model.
In a second aspect, an embodiment of the present invention further provides an apparatus for implementing joint modeling training, configured in any data provider of joint modeling, where the apparatus includes:
the modeling request acquisition module is used for acquiring a joint modeling request;
the sample data determining module is used for performing sample-row alignment and column-feature stitching according to the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine a local sample data set and a remote sample data set; at least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model, the feature extraction model is a deep learning model, and sample rows in at least one of the local data set and the remote data set carry supervised-training label values;
and the joint model acquisition module is used for determining an initial model and model training parameters according to the joint modeling request, and carrying out joint modeling training on the basis of the local sample data set and at least one remote sample data set so as to acquire a target joint model.
In a third aspect, an embodiment of the present invention further provides an apparatus, including:
one or more processing devices;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement a method of implementing joint modeling training as provided in any embodiment of the invention.
In a fourth aspect, there is also provided in an embodiment of the present invention a computer readable storage medium having stored thereon a computer program which, when executed by a processing device, implements a method of implementing joint modeling training as provided in any embodiment of the present invention.
The embodiment of the invention provides an implementation scheme for joint modeling training. The local data set and the remote data sets involved in the joint modeling are aligned by sample rows and stitched by column features to determine a local sample data set and remote sample data sets, and joint modeling training is then performed on that basis. This solves the problem of combining the data of the same group of users held by different institutions for machine learning modeling, promotes data circulation among institutions, breaks data islands, and fully exploits the data value of the institutions. In addition, since the same model is trained jointly on multiparty data, and the feature data obtained by processing the original data through the feature extraction model can neither be shared nor reverse-derived during the computation, data security during model training is ensured. Moreover, since the volume of the feature data is far smaller than that of the original data, the speed of joint modeling analysis is improved and the computation cost is reduced while the data of all parties to the joint modeling is kept confidential.
The foregoing summary is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the present invention may be more readily appreciated, specific embodiments of the invention are set forth below.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a method for implementing joint modeling training provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of aligning and stitching the data of different institutions by sample rows, provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of generating structured feature data through vectorization processing, provided in an embodiment of the present invention;
FIG. 4 is a schematic diagram of training to generate a feature extraction model and of using the feature extraction model, provided in an embodiment of the invention;
FIG. 5 is a flow chart of another method of implementing joint modeling training provided in an embodiment of the present invention;
FIG. 6 is a flow chart of yet another method of implementing joint modeling training provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of LR algorithm based joint modeling provided in an embodiment of the present invention;
FIG. 8 is a schematic diagram of an iterative computation process based on LR algorithm joint modeling provided in an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an implementation apparatus for joint modeling training provided in an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations (or steps) can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
FIG. 1 is a flow chart of a method for implementing joint modeling training provided in an embodiment of the present invention. The embodiment of the invention is applicable to scenarios of machine learning joint modeling across different data sets, in particular across large enterprise data sets. The method for implementing joint modeling training can be executed by an apparatus for implementing joint modeling training; the apparatus can be implemented in software and/or hardware and can be integrated on data provider equipment. As shown in fig. 1, the method for implementing joint modeling training provided in the embodiment of the present invention includes the following steps:
s110, acquiring a joint modeling request.
In this embodiment, when a data predictor initiates a joint modeling task, it may send a joint modeling request to the data providers, and the data provider may obtain the joint modeling request sent by the data predictor and respond to it.
S120, performing alignment by sample rows and stitching by column features according to the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine a local sample data set and a remote sample data set.
At least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model; the feature extraction model is a deep learning model; and sample rows in at least one of the local data set and the remote data set carry supervised-training label values.
In this embodiment, the local data set refers to the data set of the data provider currently performing the joint modeling operation, and a remote data set refers to the data set of another data provider involved in the joint modeling with the current data provider. Since the local data set and the remote data set come from different institutions, the user identifiers contained in the data sets of the different institutions partially overlap while the remaining user identifiers differ. For example, the local data set contains data corresponding to users ID_1, ID_2 and ID_3, while the remote data set contains data corresponding to users ID_1, ID_2 and ID_4. When the data sets of different institutions are used for joint modeling, however, only the data bearing the same user identifiers in those data sets can be jointly modeled; data with different user identifiers cannot be.
In view of the above, when joint modeling is performed using the local data set and the remote data set, the current data provider may align the data of each party to the joint modeling by sample rows, that is, align the data contained in the local data set and in the remote data set by sample rows, so that the data bearing the same user identifiers in each party's data set is treated as that party's sample data. In this way, a local sample data set may be determined from the local data set and a remote sample data set from the remote data set, where the data contained in the local sample data set and in the remote sample data set carry the same user identifiers.
Illustratively, FIG. 2 is a schematic diagram of aligning and stitching the data of different institutions by sample rows, provided in an embodiment of the present invention. Referring to fig. 2, taking the local sample data set provided by data provider A and the remote sample data set provided by data provider B as an example, the local sample data set contains data with user identifiers ID_1, ID_2, ID_3, ..., ID_n, and the remote sample data set also contains data with user identifiers ID_1, ID_2, ID_3, ..., ID_n. At this point, the data contained in the local sample data set and in the remote sample data set carry the same user identifiers, and the sets are stitched by column features after sample-row alignment; that is, the user ID, mobile phone number or other identifier of the data owned by each party is the same, and each party transversely fuses the data of the same user by columns.
It should be noted that, in the process of aligning by sample rows and stitching by column features, each party's data only has its stitching order determined; the data itself is not transmitted from the local node to the other parties.
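To make the alignment concrete, the following is a minimal sketch assuming Python with pandas and hypothetical column names; it illustrates the logical alignment only, since in the actual scheme each party merely reorders its own rows to the agreed identifier order and no feature values leave its node:
    import pandas as pd

    def align_to_order(local_df: pd.DataFrame, agreed_ids: list) -> pd.DataFrame:
        # Keep only the agreed sample rows, in the agreed order; runs locally.
        return local_df.set_index("user_id").loc[agreed_ids].reset_index()

    # Party A's and party B's local data sets (hypothetical feature columns)
    qa = pd.DataFrame({"user_id": ["ID_1", "ID_2", "ID_3"], "income": [5.0, 3.2, 7.1]})
    qb = pd.DataFrame({"user_id": ["ID_2", "ID_1", "ID_4"], "claims": [0, 2, 1]})

    # The agreed order comes from the privacy-preserving identifier intersection
    agreed = ["ID_1", "ID_2"]
    da, db = align_to_order(qa, agreed), align_to_order(qb, agreed)
    # Row i of da and row i of db now describe the same user, so the column
    # features are "stitched" logically without any physical merge of the data.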
In this embodiment, at least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through the feature extraction model. Optionally, the feature data extracted by the feature extraction model is structured feature data. The advantage of structured feature data is that its volume is far smaller than that of the original data; given that at least one of the local sample data set and the remote sample data set consists of structured feature data, the computation required for joint modeling with the local sample data set and the remote sample data set is greatly reduced, so the joint modeling computes faster and performs better.
In this embodiment, across the local sample data set and the remote sample data set, the data of one sample is in effect split longitudinally: the features of each sample are split among several parties, i.e., between the local sample data set and the remote sample data set, and each party holds only some of the features of the sample. For supervised learning, the sample rows in at least one of the local data set and the remote data set carry supervised-training label values, i.e., at least one party's data carries label values. Optionally, neither the supervised-training label values nor the feature data held by any party can be shared or reverse-derived during the computation.
In the present embodiment, assuming that the data of the local sample data set is feature data obtained by processing original data through the feature extraction model, the processing of the original data through the feature extraction model falls into two cases. In the first scheme, the original data is processed through the feature extraction model before the joint modeling to obtain the corresponding structured feature data, so that the obtained structured feature data can be used directly during the joint modeling; this can be taken as the preferred scheme. In the second scheme, the corresponding feature data is not obtained before the joint modeling; instead, after the local data set and the remote data set are aligned by sample rows, the original data is processed through the feature extraction model to obtain the corresponding feature data. Optionally, the local original data includes structured or unstructured data of at least one of image, video, audio and text.
For the first scheme, before the joint modeling, each data provider may perform vectorization processing or feature extraction on the original data it holds locally (structured or unstructured data such as images, video, audio and text), finally obtaining structured feature data. In an alternative example, FIG. 3 is a schematic diagram of generating structured feature data through vectorization processing, provided in an embodiment of the present invention. Referring to fig. 3, a data provider performs vectorization processing on locally held structured or unstructured data. The specific vectorization process is as follows: each layer of the vectorization model can be configured, and the granularity of the output structured data can be determined by setting the layers, the parameters of each layer and so on, so as to obtain structured features. The vectorization model itself may employ an existing model algorithm; for example, the vectorization model shown in fig. 3 for processing text includes a dense embedding layer, three hidden layers, an output aggregation unit and an output layer. In another alternative example, the data provider performs structured feature extraction on locally held structured or unstructured data (such as images, video, audio and text) using a feature extraction model; the deep learning algorithm employed by the feature extraction model includes: the LBP algorithm, the HOG feature extraction algorithm, the SIFT operator, CNN convolution methods, and the like.
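The Fig. 3-style vectorization model can be sketched as follows, assuming PyTorch; the vocabulary and layer sizes are hypothetical, since the description fixes only the shape (a dense embedding layer, three hidden layers, an output aggregation unit and an output layer):
    import torch
    import torch.nn as nn

    class TextVectorizer(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=64, hidden=128, out_dim=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)   # dense embedding layer
            self.hidden = nn.Sequential(                       # three hidden layers
                nn.Linear(embed_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.out = nn.Linear(hidden, out_dim)              # output layer

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.hidden(self.embed(token_ids))             # per-token features
            x = x.mean(dim=1)                                  # output aggregation unit
            return self.out(x)                                 # structured feature row

    # One tokenized text sample becomes a fixed-length structured feature vector
    features = TextVectorizer()(torch.randint(0, 10000, (1, 20)))   # shape (1, 32)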
For the second scheme, the data provider can perform sample-row alignment based on the local data set and at least one remote data set of other data providers involved in the joint modeling; perform feature extraction on the local data set according to the aligned sample rows using the feature extraction model, so as to obtain the local sample data set; and stitch the local sample data set and the remote sample data set by column features to determine the local sample data set and the remote sample data set. For the specific way of obtaining the structured feature data, reference may be made to the vectorization processing and feature extraction described above, which are not repeated here.
S130, determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to obtain a target joint model.
In this embodiment, the data predictor provides the initial model of the joint modeling to the data provider through the joint modeling request, and the data provider may use the local sample data set, the at least one remote sample data set, and the model training parameters determined by the joint modeling request to perform joint modeling training on the initial model determined by the joint modeling request, so as to obtain the target joint model after the joint modeling training.
In this embodiment, optionally, at least one of the local sample data set and the remote sample data set employs an original data set, the original data set comprising unstructured data. Optionally, the data content types in the local sample data set and the remote sample data set are the same or different. For example, the data content types in the local sample data set and the remote sample data set may each be bank data, insurance data, or hospital data; the data content type in the local sample data set may be insurance data and the data content type in the remote sample data set may be non-insurance data, such as bank data or hospital data.
The embodiment of the invention provides an implementation scheme for joint modeling training. The local data set and the remote data sets involved in the joint modeling are aligned by sample rows and stitched by column features to determine a local sample data set and remote sample data sets, and joint modeling training is then performed on that basis. This solves the problem of combining the data of the same group of users held by different institutions for machine learning modeling, promotes data circulation among institutions, breaks data islands, and fully exploits the data value of the institutions. In addition, since the same model is trained jointly on multiparty data, and the feature data obtained by processing the original data through the feature extraction model can neither be shared nor reverse-derived during the computation, data security during model training is ensured. Moreover, since the volume of the feature data is far smaller than that of the original data, the speed of joint modeling analysis is improved and the computation cost is reduced while the data of all parties to the joint modeling is kept confidential. In the embodiment of the invention, the feature extraction model can be implemented as a deep learning model, exploiting the strengths of deep learning: training and feature extraction are carried out locally at the data provider to obtain feature data that effectively reflects the characteristics of the original data. The feature data is then applied to the joint modeling training process, which preferably employs a machine learning model rather than a deep learning model; in other words, the valid features extracted by the deep learning model are migrated for use by a conventional machine learning model. This retains the feature-extraction advantage of the deep learning model while keeping the computation of the machine learning model adopted for the joint modeling low and the training speed high.
On the basis of the technical solution of the foregoing embodiment, optionally, the method for implementing joint modeling training provided in the embodiment of the present invention further includes: training an unsupervised deep learning feature extraction model according to the local original data, obtaining the trained feature extraction model and storing it locally.
In this embodiment, the data provider may locally construct a deep learning network model as the unsupervised deep learning feature extraction model, perform unsupervised training on it according to the local original data, thereby generating a trained feature extraction model, and store the trained feature extraction model locally at the data provider.
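A minimal sketch of this local unsupervised training, assuming PyTorch and an autoencoder-style reconstruction objective (the patent does not fix the unsupervised loss; reconstruction is one common choice); the trained encoder stays local as the feature extraction model:
    import torch
    import torch.nn as nn

    raw = torch.randn(256, 100)                    # stand-in for local raw data rows
    encoder = nn.Sequential(nn.Linear(100, 32), nn.ReLU())   # feature extraction model
    decoder = nn.Sequential(nn.Linear(32, 100))              # used only during training
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for epoch in range(10):                        # unsupervised: no labels involved
        recon = decoder(encoder(raw))
        loss = nn.functional.mse_loss(recon, raw)  # reconstruct the input
        opt.zero_grad(); loss.backward(); opt.step()

    torch.save(encoder.state_dict(), "feature_extractor.pt")   # stored locally only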
In this embodiment, fig. 4 is a schematic diagram of training to generate a feature extraction model and of using the feature extraction model, according to an embodiment of the present invention. Referring to fig. 4, training of the feature extraction model is described with multiple data providers: data provider A, data provider B and data provider C, which belong to different institutions and each hold a large amount of unlabeled data of different feature dimensions for the same user identifiers. Data providers A, B and C each construct a deep learning network model locally, perform unsupervised deep learning, respectively generate feature extraction models A, B and C for the same target, and store the feature extraction models locally at data providers A, B and C.
In this embodiment, with continued reference to fig. 4, taking the feature extraction model generated through training by data provider A as an example, there are two uses of the trained feature extraction model, described in detail below. First, as in steps S120 and S130 above, data provider A may process its local original data through the feature extraction model to obtain structured feature data, and then perform supervised joint modeling with the remote data of other data providers, so as to train a target joint model of high accuracy. Second, data provider A may provide the trained feature extraction model to similar small data providers so that they can structure feature data for joint modeling. For example, data provider D is the same type of institution as data provider A (e.g., both are banks or hospitals; institutions of the same type typically hold the same data dimensions, e.g., hospitals all hold users' medical records, B-mode ultrasound images, etc.), but data provider D holds only a small amount of data; e.g., data provider D is a small hospital with few patient visits and correspondingly little data. In this case, data provider D may perform supervised joint modeling using its own original data together with the structured feature data provided by data provider A, generating a joint model of high accuracy, which is finally returned to data provider D for its subsequent prediction.
It should be noted that steps S120 and S130 are performed in both of the above uses. The only difference lies in which data providers participate in the joint modeling: one use applies to data providers of a different type that are able to provide structured feature data, while the other applies to small data providers of the same type that are unable to provide structured feature data.
FIG. 5 is a flow chart of another method of implementing joint modeling training provided in an embodiment of the present invention. Embodiments of the present invention may be combined with each of the alternatives in one or more of the embodiments described above. As shown in fig. 5, the implementation method of the joint modeling training provided in the embodiment of the present invention includes the following steps:
s510, acquiring a joint modeling request.
S520, determining, according to the local data set, the processed security identifiers of the original identifiers of each sample row so as to generate a local security identifier list, wherein the security identifiers correspond one-to-one with the original identifiers.
In this embodiment, each sample row in the local data set has a corresponding original identifier that identifies the feature data of that sample row. Referring to fig. 2, taking the local data set of data provider A as an example, it contains sample rows identified by the original identifiers ID_1, ID_2, ..., ID_n. To ensure that the local original data is not exposed, the original identifiers of the sample rows of the local data set are not used directly when aligning by sample rows; instead, each original identifier needs to be encrypted to generate a security identifier, yielding the local security identifier list. The security identifiers correspond one-to-one with the original identifiers; the difference is that a security identifier is an encrypted identifier that does not expose the original content. For example, the original IDs are hashed, so the security ID list exposes no real data and can therefore be exchanged.
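As a concrete illustration, a one-way hash suffices for such a list; a minimal sketch assuming SHA-256 (the shared salt is an assumed hardening detail, not something the description specifies):
    import hashlib

    def security_id(original_id: str, salt: str = "shared-salt") -> str:
        # One-way hash: corresponds one-to-one with the original identifier
        # without exposing it. Both parties must hash identically (same salt)
        # for the later intersection to find matches.
        return hashlib.sha256((salt + original_id).encode()).hexdigest()

    local_security_list = [security_id(uid) for uid in ["ID_1", "ID_2", "ID_3"]]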
S530, acquiring the remote security identifier lists provided by the other data providers involved in the joint modeling.
In this embodiment, referring to the process of S520, the other data providers involved in the joint modeling determine, according to their remote data sets, the processed security identifiers of the original identifiers of each sample row, so as to generate remote security identifier lists. The current data provider can acquire the remote security identifier lists provided by the other data providers involved in the joint modeling; correspondingly, the current data provider can send its local security identifier list to those other data providers, thereby achieving the exchange of security identifiers between the current data provider and the other data providers involved in the joint modeling.
Illustratively, taking data provider A and data provider B as examples, assume that they hold sample data sets Qa and Qb respectively, and that data provider A holds the label values Y. The security identifiers of the rows in Qa and Qb are exchanged; for example, data provider A sends its security identifier list to data provider B, or data provider A requests the security identifier list of data provider B (the security identifiers correspond one-to-one with the original identifiers but do not expose the original content; e.g., the original identifiers are hashed).
S540, performing intersection processing on the remote security identifier list and the local security identifier list, taking the sample rows in the intersection as the local sample data set, and determining the remote sample data set.
At least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model; the feature extraction model is a deep learning model; and sample rows in at least one of the local data set and the remote data set carry supervised-training label values.
In this embodiment, when the data sets of different institutions are used for joint modeling, only the data bearing the same user identifiers in those data sets can be jointly modeled. Therefore, intersection processing is performed on the remote security identifier list and the local security identifier list to determine which identifiers appear in both lists; the sample rows in the intersection are exactly the data bearing the same user identifiers across the parties' data sets, so the data corresponding to the sample rows in the intersection can be taken as the local sample data set. Optionally, the intersection processing adopts a Diffie-Hellman secure and fast intersection algorithm.
Taking data provider A and data provider B as an example: in order to fuse the two parties' data transversely by columns and extend the value of the data to the greatest extent, the security identifier lists Pa and Pb of the data sets Qa and Qb (of feature dimensions da and db respectively) are subjected to intersection processing to obtain m sample rows; the m rows of sample data, namely Da and Db, are then taken from the Qa and Qb sets respectively, expanding the data to da+db dimensions. After the two sides intersect to obtain the m sample rows, they generate the new sets Da and Db from the intersection result, and the same data in Da and Db carry the same security identifiers.
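The Diffie-Hellman intersection idea can be sketched with commutative exponentiation: since (h^a)^b = (h^b)^a mod p, identifiers hashed into the group and masked by both parties' secrets collide exactly when the underlying identifiers are equal. This is a simplified illustration under assumed toy parameters, not the patent's prescribed protocol; a production implementation would use vetted group parameters and track element positions so each side can map matches back to its own identifiers:
    import hashlib
    import secrets

    P = 2**127 - 1   # a Mersenne prime; toy stand-in for a vetted large DH group prime

    def h2g(uid: str) -> int:
        # Hash an identifier into the multiplicative group mod P (avoiding 0 and 1)
        return int.from_bytes(hashlib.sha256(uid.encode()).digest(), "big") % (P - 2) + 2

    def mask(values, secret):
        # One side's exponentiation pass over a collection of group elements
        return {pow(v, secret, P) for v in values}

    ids_a, ids_b = {"ID_1", "ID_2", "ID_3"}, {"ID_1", "ID_2", "ID_4"}
    a, b = secrets.randbelow(P - 3) + 2, secrets.randbelow(P - 3) + 2  # private keys

    once_a = mask({h2g(u) for u in ids_a}, a)   # A sends h(ID)^a to B
    once_b = mask({h2g(u) for u in ids_b}, b)   # B sends h(ID)^b to A
    twice_a = mask(once_a, b)                   # B raises A's list to its secret b
    twice_b = mask(once_b, a)                   # A raises B's list to its secret a

    common = twice_a & twice_b                  # doubly-masked values agree iff the IDs do
    print(len(common))                          # 2 -> the m sample rows of the intersection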
S550, determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to obtain a target joint model.
The embodiment of the invention provides an implementation scheme for joint modeling training in which the local data set and the remote data sets involved in the joint modeling are aligned by sample rows and stitched by column features. The sample-row alignment is achieved through intersection processing of the local security identifier list and the remote security identifier lists; because security identifier lists are used for the alignment, no real data is exposed. This solves the problem of combining the data of the same group of users held by different institutions for machine learning modeling, promotes data circulation among institutions, breaks data islands, and fully exploits the data value of the institutions. In addition, since the same model is trained jointly on multiparty data, and the feature data obtained by processing the original data through the feature extraction model can neither be shared nor reverse-derived during the computation, data security during model training is ensured. Moreover, since the volume of the feature data is far smaller than that of the original data, the speed of joint modeling analysis is improved and the computation cost is reduced while the data of all parties to the joint modeling is kept confidential.
FIG. 6 is a flow chart of a method of implementing yet another joint modeling training provided in an embodiment of the present invention. Embodiments of the present invention may be combined with each of the alternatives in one or more of the embodiments described above. As shown in fig. 6, the implementation method of the joint modeling training provided in the embodiment of the present invention includes the following steps:
s610, acquiring a joint modeling request.
S620, performing sample-row alignment and column-feature stitching based on the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine the local sample data set and the remote sample data set.
At least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model; the feature extraction model is a deep learning model; and sample rows in at least one of the local data set and the remote data set carry supervised-training label values.
S630, determining an initial model and model training parameters according to the joint modeling request, and, based on the local sample data set and at least one remote sample data set, encrypting the data exchanged between the data providers using homomorphic encryption technology so as to perform the joint modeling training.
In this embodiment, homomorphic encryption is a cryptographic technique based on the computational-complexity theory of mathematical problems. Data that has been homomorphically encrypted can be processed to produce an output which, when decrypted, equals the output obtained by processing the unencrypted original data in the same way. To ensure that the original data or structured feature data of a data provider does not leave its local node during model training, thereby avoiding data leakage, the data exchanged between the data providers can be encrypted using homomorphic encryption technology.
It should be noted that during training all computation on original data happens locally; the data exchanged between data providers consists entirely of encrypted intermediate results, and these must be computed on as ciphertext. Owing to the properties of homomorphic encryption, the result of the ciphertext computation, once decrypted, is consistent with the result of computing directly on the plaintext.
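A short demonstration of this property, assuming the python-paillier library (phe) as one concrete additively homomorphic scheme; the patent does not name a specific cryptosystem:
    from phe import paillier

    pub_key, pri_key = paillier.generate_paillier_keypair(n_length=1024)

    enc_a, enc_b = pub_key.encrypt(3.5), pub_key.encrypt(1.5)
    enc_result = enc_a + enc_b * 2   # ciphertext addition, ciphertext-plaintext product

    # Decrypting the ciphertext result matches computing directly on the plaintexts
    assert abs(pri_key.decrypt(enc_result) - (3.5 + 1.5 * 2)) < 1e-9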
In this embodiment, optionally, the data predictor provides the initial model to be jointly modeled to each data provider, and the homomorphic-encryption public key is synchronized so that the data providers can encrypt with it. The current data provider can train based on the local sample data set; the intermediate data that needs to be exchanged is encrypted during training and then sent to the other data providers involved in the joint modeling for processing, and meanwhile the other data providers involved in the joint modeling likewise send their intermediate data, encrypted, to the current data provider. It can be seen that the current data provider can jointly train the initial model based on the local sample data and the homomorphically encrypted intermediate data transmitted by the other data providers involved in the joint modeling.
Taking data provider A and data provider B as an example, with the local sample data set of data provider A and the remote sample data set of data provider B aligned by sample rows, data provider A and data provider B can synchronize the homomorphic-encryption public key; during training based on the local sample data set, the parties involved in the joint modeling send the intermediate data that needs to be exchanged, encrypted, to each other so as to perform the joint modeling training. Owing to the properties of homomorphic encryption, the model that data provider A generates by training on the homomorphically encrypted exchanged data together with the local sample data set is consistent with the model that would be built by putting the two parties' original data together.
S640, exchanging the trained models with the other data providers so as to merge them and obtain the target joint model.
In this embodiment, taking data provider A and data provider B as examples, after the joint modeling training data provider A and data provider B obtain models Wa and Wb respectively. Considering that all or some of the parameters of the Wa and Wb models are generally different, the two parties need to exchange and share the respectively generated Wa and Wb models to form a complete usable model.
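As an illustration of why the halves must be merged, a prediction for one aligned sample combines both parties' partial scores; a minimal sketch with hypothetical weights, assuming the logistic-regression (LR) case described below:
    import numpy as np

    def joint_predict(xa, Wa, xb, Wb):
        score = xa @ Wa + xb @ Wb             # each half scores its own feature columns
        return 1.0 / (1.0 + np.exp(-score))   # sigmoid link for logistic regression

    # Hypothetical feature values and trained weight halves for one user
    p = joint_predict(np.array([0.2, 1.1]), np.array([0.5, -0.3]),
                      np.array([0.7]), np.array([0.9]))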
Compared with the technical scheme of the previous embodiment, this embodiment of the application has the advantage that the structured feature data never leaves the local node during the joint modeling computation; what is transmitted and exchanged is encrypted intermediate data. Since the volume of the structured feature data is far smaller than that of the original data, the computation required by the cryptographic algorithms and by the interaction is greatly reduced, so the joint modeling computes faster and performs better.
On the basis of the technical solution of the foregoing embodiment, optionally, a machine learning algorithm adopted by the joint modeling request includes at least one of the following: LR, GBDT and LightGBM. The LR algorithm will be described as an example, but the machine learning algorithm used in the joint modeling request in the joint modeling of the present application is not limited to the LR algorithm.
FIG. 7 is a schematic diagram of LR algorithm based joint modeling provided in an embodiment of the present application. Fig. 8 is a schematic diagram of an iterative calculation process based on LR algorithm joint modeling provided in the embodiment of the present application, and the iterative calculation process related to S740 to S792 in fig. 7 can be intuitively understood through fig. 8. Referring to fig. 7, taking the data provider a and the data provider B as an example, the specific process of the LR algorithm-based joint modeling specifically includes the following steps S710 to S792:
S710, the data provider A and the data provider B load the initialization model LR respectively, and randomly input initialization model parameters W1 and W2 respectively.
S720, data provider A and data provider B respectively acquire their own data, namely the aligned data Da and Db; let Da be labeled with the label values Y. In fig. 8, Da and Db are each denoted as the data set X of data provider A and of data provider B respectively.
S730, the data provider a generates a key, a public key (pub_key) and a private key (pri_key), and sends the public key (pub_key) to the data provider B, and the data provider B stores the received public key.
S740, according to the initialization model LR, data provider A and data provider B respectively calculate the intermediate values F1 and F2 of the residual based on their own data sets Da and Db, and data provider B sends F2 to data provider A. Referring to fig. 8, data provider A calculates f1[i] = Σ_j W[j]·X[i][j] based on the X[i][j] contained in its own data set X, where W is the initialization model parameter W1 of the initialization model LR loaded by data provider A; meanwhile, data provider B calculates f2[i] = Σ_j W[j]·X[i][j] based on the X[i][j] contained in its own data set X, where W is the initialization model parameter W2 of the initialization model LR loaded by data provider B.
S750, data provider A calculates the sum of F1 and F2 to obtain the value F, then calculates the residual T according to the activation function and the Y values, encrypts the residual T with the public key (pub_key) to obtain the ciphertext residual enc_T, and sends enc_T to data provider B. Referring to fig. 8, after obtaining F2 sent by data provider B, data provider A obtains F[i] = f1[i] + f2[i]. Data provider A then calculates the residual as t[i] = activation(F[i]) − Y[i], where Y[i] is the label value Y with which the data set X is labeled.
S760, data provider A calculates its own gradient from the residual T and the values of Da. Referring to fig. 8, with the residual t[i] and the X[i][j] of its data set X, data provider A calculates its gradient gradA[j] = (Σ_i t[i]·X[i][j]) / line_num.
S770, data provider B calculates side B's ciphertext gradient sum enc_sum_gradB from the received encrypted residual enc_T and the values of Db. Referring to fig. 8, data provider B receives the encrypted residual enc_T sent by data provider A; with enc_t[i] and the X[i][j] of its own data set X, it calculates the ciphertext gradient sum enc_sum_gradB[j] = Σ_i enc_t[i]·X[i][j].
S780, data provider B generates and stores a random value rand(), encrypts it with data provider A's public key (pub_key) to produce enc_rand(), adds it to the ciphertext gradient sum, and sends the result to side A. (The random value protects data provider B's gradient sum: after decryption, data provider A has no way to obtain side B's true gradient sum.)
S790, data provider A decrypts the ciphertext sum sent by data provider B using the private key (pri_key) and sends the decryption result back to data provider B. The ciphertext sent by data provider B is specifically enc_sum_gradB[j] + enc_rand(), and the decryption result is sum_gradB[j] + rand().
S791, data provider B takes the decryption result, subtracts the stored random value rand(), obtains its gradient sum, and calculates its gradient. Referring to fig. 8, data provider B calculates its gradient as gradB[j] = (sum_gradB[j] + rand() − rand()) / line_num = sum_gradB[j] / line_num.
S792, data provider A and data provider B update the model parameters according to their respective gradient values and the learning rate, obtaining a new round of the model, and determine whether to perform the next round of model training according to the initially set learning rate, number of training iterations, convergence targets and other factors. The specific update formula of data provider A is W1[j] = W1[j] + gradA[j] × alpha, and that of data provider B is W2[j] = W2[j] + gradB[j] × alpha.
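The S710-S792 flow can be condensed into a single-iteration sketch, assuming the python-paillier library (phe) and randomly generated toy data. The variable names (F1, F2, T, gradA, gradB, alpha, line_num) mirror the notation above; the array shapes, key length and the explicit descent sign in the final update are illustrative assumptions rather than the patent's prescribed implementation:
    import numpy as np
    from phe import paillier

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # S710-S730: toy aligned data; side A holds Da and the labels Y, side B holds Db
    rng = np.random.default_rng(0)
    Da, Db = rng.normal(size=(8, 3)), rng.normal(size=(8, 2))   # m = 8 aligned rows
    Y = rng.integers(0, 2, size=8).astype(float)
    W1, W2 = np.zeros(3), np.zeros(2)                           # each side's model half
    pub_key, pri_key = paillier.generate_paillier_keypair(n_length=1024)  # A's keypair
    alpha, line_num = 0.1, len(Y)

    # S740: each side computes its partial score on its own feature columns
    F1 = Da @ W1                  # side A: f1[i] = Σ_j W1[j]·Da[i][j]
    F2 = Db @ W2                  # side B: f2[i] = Σ_j W2[j]·Db[i][j]; F2 is sent to A

    # S750: side A forms the residual and encrypts it for side B
    T = sigmoid(F1 + F2) - Y      # t[i] = activation(F[i]) - Y[i]
    enc_T = [pub_key.encrypt(float(t)) for t in T]

    # S760: side A computes its own gradient in plaintext
    gradA = (T @ Da) / line_num   # gradA[j] = Σ_i t[i]·Da[i][j] / line_num

    # S770: side B accumulates its gradient sums under encryption
    enc_sum_gradB = [sum(enc_T[i] * float(Db[i, j]) for i in range(line_num))
                     for j in range(Db.shape[1])]

    # S780: side B masks the ciphertext sums with stored random values before sending
    rand_mask = [float(r) for r in rng.normal(size=Db.shape[1])]
    masked = [c + r for c, r in zip(enc_sum_gradB, rand_mask)]

    # S790-S791: side A decrypts the masked sums (learning only noise); side B
    # subtracts its stored mask to recover its true gradient
    decrypted = [pri_key.decrypt(c) for c in masked]
    gradB = np.array([(d - r) / line_num for d, r in zip(decrypted, rand_mask)])

    # S792: each side updates its half of the model; standard descent subtracts the
    # gradient (the patent writes W + alpha*grad, folding the sign into alpha)
    W1 -= alpha * gradA
    W2 -= alpha * gradB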
FIG. 9 is a schematic structural diagram of an apparatus for implementing joint modeling training according to an embodiment of the present invention. The embodiment of the invention is applicable to scenarios of machine learning joint modeling across different data sets, in particular across large enterprise data sets. The apparatus for implementing joint modeling training may be implemented in software and/or hardware and may be integrated on data provider equipment. As shown in fig. 9, the apparatus for implementing joint modeling training provided in the embodiment of the present invention specifically includes: a modeling request acquisition module 910, a sample data determination module 920, and a joint model acquisition module 930. Wherein:
a modeling request acquisition module 910, configured to acquire a joint modeling request;
a sample data determining module 920, configured to perform sample-row alignment and column-feature stitching according to the local data set and at least one remote data set of other data providers involved in the joint modeling, so as to determine a local sample data set and a remote sample data set; at least one of the local sample data set and the remote sample data set is feature data obtained by processing original data through a feature extraction model, the feature extraction model is a deep learning model, and sample rows in at least one of the local data set and the remote data set carry supervised-training label values;
The joint model obtaining module 930 is configured to determine an initial model and model training parameters according to the joint modeling request, and perform joint modeling training based on the local sample data set and at least one remote sample data set to obtain a target joint model.
On the basis of the technical solution of the foregoing embodiment, optionally, the apparatus further includes:
the feature extraction model determining module 940 is configured to train an unsupervised deep learning feature extraction model according to local original data, obtain the trained feature extraction model and store it locally, where the feature data extracted by the feature extraction model is structured feature data.
On the basis of the foregoing embodiment, optionally, the sample data determining module 920 includes:
a sample line alignment unit for performing sample line alignment according to the local data set and at least one remote data set of other data providers involved in joint modeling;
the feature extraction unit is used for carrying out feature extraction on the local data set according to the aligned sample rows according to the feature extraction model so as to obtain the local sample data set;
and the column characteristic stitching unit is used for stitching the column characteristics of the local sample data set and the remote sample data set.
On the basis of the technical solution of the foregoing embodiment, optionally, the raw data includes structured or unstructured data of at least one of an image, video, audio and text; the deep learning algorithm adopted by the feature extraction model comprises at least one of the following: LBP algorithm, HOG feature extraction algorithm, SIFT operator and CNN convolution method.
On the basis of the foregoing embodiment, optionally, the sample data determining module 920 includes:
a local security identifier list determining unit, used for determining, according to the local data set, the processed security identifiers of the original identifiers of each sample row so as to generate a local security identifier list, wherein the security identifiers correspond one-to-one with the original identifiers;
a remote security identifier list acquisition unit, used for acquiring the remote security identifier lists provided by the other data providers;
and a local sample data set determining unit, used for performing intersection processing on the remote security identifier list and the local security identifier list, and taking the sample rows in the intersection as the local sample data set.
On the basis of the foregoing embodiment, optionally, the joint model obtaining module 930 includes:
The encryption training unit is used for determining an initial model and model training parameters according to the joint modeling request, encrypting data interacted between data providers by adopting a homomorphic encryption technology based on the local sample data set and at least one remote sample data set so as to perform joint modeling training;
and the joint model acquisition unit is used for interacting the trained models with other data providers to combine and acquire the target joint model.
On the basis of the technical solution of the foregoing embodiment, optionally, the machine learning algorithm adopted by the joint modeling request includes at least one of the following: LR, GBDT and LightGBM.
On the basis of the technical solution of the foregoing embodiment, optionally, at least one of the local sample data set and the remote sample data set adopts an original data set, where the original data set includes unstructured data.
On the basis of the technical solutions of the foregoing embodiments, optionally, the types of data content in the local sample data set and the remote sample data set are the same or different.
The implementation device of the joint modeling training provided in the embodiment of the present application may execute the implementation method of the joint modeling training provided in any embodiment of the present application, and has the corresponding functions and beneficial effects of executing the implementation method of the joint modeling training, and technical details not described in detail in the foregoing embodiment may refer to the implementation method of the joint modeling training provided in any embodiment of the present application.
FIG. 10 is a schematic structural diagram of a device according to an embodiment of the present invention. FIG. 10 illustrates a block diagram of an exemplary device 1012 suitable for use in implementing embodiments of the present invention. The device 1012 shown in fig. 10 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 10, device 1012 is in the form of a general purpose computing device. Components of device 1012 may include, but are not limited to: one or more processors 1016, a system memory 1028, and a bus 1018 that connects the various system components, including the system memory 1028 and the processor 1016.
Bus 1018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Device 1012 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 1012 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 1028 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 1030 and/or cache memory 1032. The device 1012 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage device 1034 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 10, commonly referred to as a "hard disk drive"). Although not shown in Fig. 10, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to the bus 1018 via one or more data medium interfaces. The memory 1028 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 1040 having a set (at least one) of program modules 1042 may be stored in, for example, memory 1028, such program modules 1042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 1042 typically carry out the functions and/or methods of the embodiments described herein.
The device 1012 may also communicate with one or more external devices 1014 (e.g., keyboard, pointing device, display 1024, etc.), one or more devices that enable a user to interact with the device 1012, and/or any device (e.g., network card, modem, etc.) that enables the device 1012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1022. Also, device 1012 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 1020. As shown, the network adapter 1020 communicates with other modules of the device 1012 through the bus 1018. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with device 1012, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 1016 performs various functional applications and data processing by running the programs stored in the system memory 1028, for example, implementing the method for implementing joint modeling training provided in the embodiments of the present invention, the method comprising:
Acquiring a joint modeling request;
performing sample row alignment and column feature stitching based on the local data set and at least one remote data set of other data providers involved in joint modeling, so as to determine the local sample data set and the remote sample data set, wherein at least one of the local sample data set and the remote sample data set is feature data obtained by processing raw data through a feature extraction model, the feature extraction model is a deep learning model, and the sample rows in at least one of the local data set and the remote data set carry supervised training label values (the row alignment and column stitching geometry is sketched after this enumeration);
and determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to acquire a target joint model.
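For orientation only, the plaintext sketch below visualizes the row alignment and column stitching geometry with pandas and invented column names. In the claimed scheme the feature columns never leave their providers in plaintext and only encrypted intermediate results are exchanged, so this merge is purely didactic.

    # Sketch only: row alignment on a shared user identifier, then column
    # feature stitching. In the real scheme each side keeps its columns private;
    # the plaintext merge here only illustrates the row/column geometry.
    import pandas as pd

    local = pd.DataFrame({"uid": [1, 2, 3], "f_local": [0.1, 0.2, 0.3]})
    remote = pd.DataFrame({"uid": [2, 3, 4], "f_remote": [9.0, 8.0, 7.0]})

    # Keep only users present in both sets (row alignment), then place each
    # side's feature columns next to each other (column stitching).
    stitched = local.merge(remote, on="uid", how="inner")
    print(stitched)   # rows for uid 2 and 3, with both feature columns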
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the method for implementing joint modeling training provided in any embodiment of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for implementing joint modeling training provided in the embodiments of the present invention, the method comprising:
Acquiring a joint modeling request;
performing sample row alignment and column feature stitching based on the local data set and at least one remote data set of other data providers involved in joint modeling, so as to determine the local sample data set and the remote sample data set, wherein at least one of the local sample data set and the remote sample data set is feature data obtained by processing raw data through a feature extraction model, the feature extraction model is a deep learning model, and the sample rows in at least one of the local data set and the remote data set carry supervised training label values;
and determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to acquire a target joint model.
Of course, the computer-readable storage medium provided in the embodiments of the present invention, on which the computer program is stored, is not limited to performing the method operations described above, and may also perform related operations in the method for implementing joint modeling training provided in any embodiment of the present invention.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing are merely preferred embodiments of the present invention and the technical principles applied thereto. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, it is not limited to those embodiments, and may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (11)

1. A method for implementing joint modeling training, performed by any one of the data providers participating in joint modeling, the method comprising:
acquiring a joint modeling request;
performing sample row alignment and column feature stitching based on the local data set and at least one remote data set of other data providers involved in joint modeling, so as to determine the local sample data set and the remote sample data set, wherein at least one of the local sample data set and the remote sample data set is feature data obtained by processing raw data through a feature extraction model, the feature extraction model is a deep learning model, and the sample rows in at least one of the local data set and the remote data set carry supervised training label values;
determining an initial model and model training parameters according to the joint modeling request, and performing joint modeling training based on the local sample data set and at least one remote sample data set to obtain a target joint model;
wherein the performing sample row alignment and column feature stitching based on the local data set and at least one remote data set of other data providers involved in joint modeling comprises:
aligning the data contained in the local data set with the data contained in the remote data set by sample rows, so that the data with the same user identifier in each data set is taken as the respective sample data;
performing, according to the feature extraction model, feature extraction on the local data set by the aligned sample rows to obtain the local sample data set;
and performing column feature stitching on the local sample data set and the remote sample data set.
2. The method as recited in claim 1, further comprising:
training the feature extraction model by unsupervised deep learning on the local raw data, acquiring the trained feature extraction model, and storing the feature extraction model locally, wherein the feature data extracted by the feature extraction model is structured feature data.
3. The method of claim 1, wherein the raw data comprises structured or unstructured data of at least one of image, video, audio, and text; and the machine learning algorithm adopted by the feature extraction model comprises at least one of the following: the LBP algorithm, the HOG feature extraction algorithm, the SIFT operator, and the CNN convolution method.
4. The method of claim 1, wherein performing sample row alignment based on the local data set and at least one remote data set of other data providers involved in joint modeling comprises:
determining, according to the local data set, the security identifiers obtained by processing the original identifiers of each sample row, so as to generate a local security identifier list, wherein the security identifiers are in one-to-one correspondence with the original identifiers;
acquiring the remote security identifier lists provided by other data providers;
and intersecting the remote security identifier list with the local security identifier list, and taking the sample rows in the intersection as the local sample data set.
5. The method of claim 1, wherein determining initial models and model training parameters from the joint modeling request and performing joint modeling training based on the local sample data set and at least one remote sample data set comprises:
determining an initial model and model training parameters according to the joint modeling request, and encrypting the data exchanged between the data providers by adopting homomorphic encryption, based on the local sample data set and at least one remote sample data set, so as to perform joint modeling training;
and exchanging the trained models with other data providers and combining them to acquire the target joint model.
6. The method of claim 5, wherein the machine learning algorithm employed by the joint modeling request comprises at least one of: LR, GBDT and LightGBM.
7. The method of claim 1, wherein at least one of the local sample data set and the remote sample data set employs a raw data set, the raw data set comprising unstructured data.
8. The method of claim 1, wherein the data content types in the local sample data set and the remote sample data set are the same or different.
9. An apparatus for implementing joint modeling training, wherein the apparatus is configured for any one of the data providers participating in joint modeling, and the apparatus comprises:
the modeling request acquisition module is used for acquiring a joint modeling request;
the sample data determining module is used for performing sample row alignment and column feature stitching according to the local data set and at least one remote data set of other data providers involved in joint modeling, so as to determine the local sample data set and the remote sample data set, wherein at least one of the local sample data set and the remote sample data set is feature data obtained by processing raw data through a feature extraction model, the feature extraction model is a deep learning model, and the sample rows in at least one of the local data set and the remote data set carry supervised training label values;
The joint model acquisition module is used for determining an initial model and model training parameters according to the joint modeling request, and carrying out joint modeling training on the basis of the local sample data set and at least one remote sample data set so as to acquire a target joint model;
wherein the sample data determination module comprises:
the sample row alignment unit is used for aligning the data contained in the local data set with the data contained in the remote data set by sample rows, so that the data with the same user identifier in each data set is taken as the respective sample data;
the feature extraction unit is used for performing, according to the feature extraction model, feature extraction on the local data set by the aligned sample rows, so as to obtain the local sample data set;
and the column feature stitching unit is used for performing column feature stitching on the local sample data set and the remote sample data set.
10. An apparatus, comprising:
one or more processing devices;
a storage means for storing one or more programs;
when the one or more programs are executed by the one or more processing devices, the one or more processing devices are caused to implement the method of implementing joint modeling training of any of claims 1-8.
11. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processing means, implements a method of implementing joint modeling training according to any of claims 1-8.
CN201910596082.3A 2019-07-03 2019-07-03 Method, device, equipment and medium for realizing joint modeling training Active CN112182636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910596082.3A CN112182636B (en) 2019-07-03 2019-07-03 Method, device, equipment and medium for realizing joint modeling training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910596082.3A CN112182636B (en) 2019-07-03 2019-07-03 Method, device, equipment and medium for realizing joint modeling training

Publications (2)

Publication Number Publication Date
CN112182636A CN112182636A (en) 2021-01-05
CN112182636B true CN112182636B (en) 2023-08-15

Family

ID=73914469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910596082.3A Active CN112182636B (en) 2019-07-03 2019-07-03 Method, device, equipment and medium for realizing joint modeling training

Country Status (1)

Country Link
CN (1) CN112182636B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9972092B2 (en) * 2016-03-31 2018-05-15 Adobe Systems Incorporated Utilizing deep learning for boundary-aware image segmentation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962224A (en) * 2018-07-19 2018-12-07 苏州思必驰信息科技有限公司 Speech understanding and language model joint modeling method, dialogue method and system
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109635422A (en) * 2018-12-07 2019-04-16 深圳前海微众银行股份有限公司 Joint modeling method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mao Yiming; Wang Jianming; Yan Tao; Chen Lifang; Liu Yuan. Light field image stitching algorithm based on spatial plane segmentation and projection transformation. Laser & Optoelectronics Progress, 2018, (10), full text. *

Also Published As

Publication number Publication date
CN112182636A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN111460453B (en) Machine learning training method, controller, device, server, terminal and medium
CN103051664B (en) A kind of file management method of cloud storage system, device and this cloud storage system
CN113032840B (en) Data processing method, device, equipment and computer readable storage medium
CN105027107B (en) Migrate the computer implemented method and computing system of computing resource
CN111612167B (en) Combined training method, device, equipment and storage medium of machine learning model
CN111784001B (en) Model training method and device and computer readable storage medium
US20150149763A1 (en) Server-Aided Private Set Intersection (PSI) with Data Transfer
CN111062045B (en) Information encryption and decryption method and device, electronic equipment and storage medium
US20160261592A1 (en) Method and device for the secure authentication and execution of programs
CN111428887B (en) Model training control method, device and system based on multiple computing nodes
Zhang et al. Multi-server assisted data sharing supporting secure deduplication for metaverse healthcare systems
CN113537633B (en) Prediction method, device, equipment, medium and system based on longitudinal federal learning
CN105074720A (en) Discretionary policy management in cloud-based environment
JP7223067B2 (en) Methods, apparatus, electronics, computer readable storage media and computer programs for processing user requests
JP2023535040A (en) Master key escrow process
Liu et al. Lightning-fast and privacy-preserving outsourced computation in the cloud
CN112149174A (en) Model training method, device, equipment and medium
US20220414223A1 (en) Training data protection for artificial intelligence model in partitioned execution environment
Myneni et al. SCVS: On AI and edge clouds enabled privacy-preserved smart-city video surveillance services
CN103885725A (en) Virtual machine access control system and method based on cloud computing environment
CN112735566A (en) Medical image management method and device, computer equipment and storage medium
CN112182636B (en) Method, device, equipment and medium for realizing joint modeling training
CN116796860A (en) Federal learning method, federal learning device, electronic equipment and storage medium
US11985246B2 (en) Systems and methods for protecting identity metrics
CN112149140B (en) Prediction method, prediction device, prediction equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant