Disclosure of Invention
The present specification provides a modeling method based on shared data, including:
receiving user data transmitted by a plurality of data providers;
fuzzifying field values of key fields in the received user data respectively based on fuzzification processing degrees corresponding to trust levels of all data providers;
performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
Optionally, the field fusion of the fields in the user data transmitted by each data provider after the fuzzification processing includes:
summing field values of corresponding fields in the user data transmitted by each data provider after the fuzzification processing; and/or
adding fields in the user data transmitted by each data provider after the fuzzification processing to specified field intervals.
Optionally, the method further includes:
setting the trust level for each data provider based on the historical contribution degree of each data provider to the trained machine learning model.
Optionally, the method further includes:
constructing a training sample based on the user data after field fusion;
training a preset machine learning model based on the training sample to obtain the weight value of each field in the training sample; wherein the weight value represents the contribution degree of each field to the machine learning model;
generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field contained in user data transmitted by each data provider; wherein the weight values corresponding to the machine learning model characterize the degree of contribution of each data provider to the machine learning model.
Optionally, before generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field included in the user data transmitted by each data provider, the method further includes:
determining the field fusion number corresponding to each field in the training sample;
averaging the weight values corresponding to the fields in the training sample based on the field fusion number;
and setting the average value as the weight value of the corresponding field contained in the user data transmitted by each data provider.
Optionally, the generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field included in the user data transmitted by each data provider includes any one of the following manners:
carrying out weighted calculation on the weight value of each field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider;
setting the weight value of the key field contained in the user data transmitted by each data provider as the weight value corresponding to the machine learning model of each data provider;
and performing weighted calculation on the weight value of each key field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider.
This specification also proposes a modeling apparatus based on shared data, including:
a first receiving unit that receives user data transmitted from a plurality of data providers;
the processing unit is used for respectively carrying out fuzzification processing on field values of key fields in the received user data based on fuzzification processing degrees corresponding to the trust levels of all the data providers;
the first fusion unit is used for performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
Optionally, the first fusing unit further:
summing field values of corresponding fields in the user data transmitted by each data provider after the fuzzification processing; and/or
adding fields in the user data transmitted by each data provider after the fuzzification processing to specified field intervals.
Optionally, the method further includes:
the setting unit is used for setting the trust level for each data provider based on the historical contribution degree of each data provider to the trained machine learning model.
Optionally, the method further includes:
the construction unit is used for constructing a training sample based on the user data after field fusion;
the training unit is used for training a preset machine learning model based on the training sample to obtain the weight value of each field in the training sample; wherein the weight value represents the contribution degree of each field to the machine learning model;
a generation unit that generates a weight value corresponding to the machine learning model for each data provider based on a weight value of each field included in user data transmitted by each data provider; wherein the weight values corresponding to the machine learning model characterize the degree of contribution of each data provider to the machine learning model.
Optionally, the generating unit further:
determining a field fusion number corresponding to each field in the training sample before generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field included in user data transmitted by each data provider; averaging the weight values corresponding to the fields in the training sample based on the field fusion number; and setting the average value as the weight value of the corresponding field contained in the user data transmitted by each data provider.
Optionally, the generating unit further performs any one of the following manners:
carrying out weighted calculation on the weight value of each field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider;
setting the weight value of the key field contained in the user data transmitted by each data provider as the weight value corresponding to the machine learning model of each data provider;
and performing weighted calculation on the weight value of each key field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider.
The present specification further provides a modeling method based on shared data, including:
receiving user data transmitted by a plurality of data providers; the field values of the key fields in the user data transmitted by the data providers are fuzzified in advance based on fuzzification degrees corresponding to trust levels of the data providers;
performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
Optionally, the field fusion of the fields in the user data transmitted by each data provider after the fuzzification processing includes:
summing field values of corresponding fields in the user data transmitted by each data provider after the fuzzification processing; and/or
adding fields in the user data transmitted by each data provider after the fuzzification processing to specified field intervals.
This specification also proposes a modeling apparatus based on shared data, including:
a second receiving unit that receives user data transmitted from a plurality of data providers; the field values of the key fields in the user data transmitted by the data providers are fuzzified in advance based on fuzzification degrees corresponding to trust levels of the data providers;
the second fusion unit is used for performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
Optionally, the second fusion unit further:
summing field values of corresponding fields in the user data transmitted by each data provider after the fuzzification processing; and/or
adding fields in the user data transmitted by each data provider after the fuzzification processing to specified field intervals.
The present specification also provides a modeling system based on shared data, including:
a plurality of data provider servers, which transmit user data to a modeler server; wherein the field values of the key fields in the user data transmitted by the data providers are fuzzified, by the data providers or by the modeler, based on fuzzification degrees corresponding to the trust levels of the data providers;
the modeler server, which performs field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
This specification also proposes an electronic device including:
a processor;
a memory for storing machine executable instructions;
wherein, by reading and executing machine-executable instructions stored in the memory and corresponding to control logic for modeling based on shared data, the processor is caused to:
receiving user data transmitted by a plurality of data providers;
fuzzifying field values of key fields in the received user data respectively based on fuzzification processing degrees corresponding to trust levels of all data providers;
performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
This specification also proposes an electronic device including:
a processor;
a memory for storing machine executable instructions;
wherein, by reading and executing machine-executable instructions stored in the memory and corresponding to control logic for modeling based on shared data, the processor is caused to:
receiving user data transmitted by a plurality of data providers; the field values of the key fields in the user data transmitted by the data providers are fuzzified in advance based on fuzzification degrees corresponding to trust levels of the data providers;
performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
In this specification, on the one hand, because the fuzzification processing can be performed on the field values of the key fields in the user data transmitted by a plurality of data providers based on the fuzzification processing degree corresponding to the trust level of each data provider, some key fields in the user data of data providers with higher trust levels can be fuzzified only lightly or not at all, so that the data availability of the user data shared by each data provider can be ensured to the greatest extent while data privacy is still protected;
on the other hand, after the user data transmitted by the plurality of data providers is received, the fields in the user data transmitted by each data provider can be subjected to field fusion, so that the fields in the user data shared by the data providers complement one another and data fusion is performed to the greatest extent.
Detailed Description
In this specification, a technical solution is provided for, in a scenario where a plurality of data providers "share" user data that they respectively maintain with a modeler for machine learning model training, performing selective fuzzification processing on field values of key fields in the "shared" user data based on the trust levels of the respective data providers, so as to retain as much as possible the effective information carried by the key fields in the original user data.
In implementation, field values of some key fields in the user data that the plurality of data providers need to share with the modeler can be fuzzified based on the trust level of each data provider; the degree of fuzzification applied to the field value of each key field can be negatively correlated with the trust level of the data provider; that is, the higher the trust level, the lower the degree of fuzzification applied to the key fields; for example, for some data providers with higher trust levels, some key fields in the user data may be lightly fuzzified or not fuzzified at all.
For the user data transmitted by each data provider after the fuzzification processing, field fusion can be further performed on fields in the user data transmitted by each data provider;
for example, the field values of corresponding fields in the user data transmitted by each data provider after the fuzzification processing are summed; and/or fields in the user data transmitted by each data provider after the fuzzification processing are added to specified field intervals; for example, a standardized data structure for the fused user data may first be created, a plurality of field intervals may be planned within it, and the fields in the user data transmitted by each data provider may then be placed into the designated field intervals respectively.
When field fusion is completed, the modeler may train the machine learning model based on the field fused user data.
On the one hand, because the fuzzification processing can be performed on the field values of the key fields in the user data transmitted by the plurality of data providers based on the fuzzification processing degree corresponding to the trust level of each data provider, some key fields in the user data of data providers with higher trust levels can be fuzzified only lightly or not at all, so that the data availability of the user data shared by each data provider can be ensured to the greatest extent while data privacy is still protected;
on the other hand, after the user data transmitted by the plurality of data providers is received, the fields in the user data transmitted by each data provider can be subjected to field fusion, so that the fields in the user data shared by the data providers complement one another and data fusion is performed to the greatest extent.
The following is a detailed description through specific embodiments and with reference to specific application scenarios.
Referring to Fig. 1, Fig. 1 shows a modeling method based on shared data according to an embodiment of the present disclosure, which is applied to the server of a modeling party and includes the following steps:
Step 102: receiving user data transmitted by a plurality of data providers;
Step 104: fuzzifying field values of key fields in the received user data respectively based on fuzzification processing degrees corresponding to the trust levels of the respective data providers;
Step 106: performing field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; wherein the user data after field fusion is used for training a machine learning model.
The modeling party may specifically include a party that has a business requirement of training a machine learning model based on user data "shared" by, and maintained by, multiple data providers; a data provider may specifically include a party that has a cooperative relationship with the modeling party.
In practical applications, the data providers and the modeling party that works with the shared data may correspond to different operators; for example, the modeling party may be the data operation platform of Alipay, and a data provider may be a service platform interfacing with the data operation platform of Alipay, such as an e-commerce platform, a third-party bank, a courier company, another financial institution, or a telecom operator.
The machine learning model may specifically include a machine learning model trained based on any type of machine learning algorithm;
for example, in the illustrated embodiment, taking a credit issuance scenario as an example, the machine learning model may specifically be a user risk assessment model trained based on a supervised machine learning algorithm such as logistic regression, which may be used to decide whether a user is a risky user.
The user data can comprise a plurality of data fields, and the user data can be used for extracting modeling characteristics for training a machine learning model; in the present specification, the specific form and data type of the user data are not particularly limited;
for example, in practical applications, if it is desired to create a machine learning model for risk assessment of loan applications initiated by users, the user data may include the user's transaction data, shopping records, payment records, consumption records, financial product purchase records, bank statement records, call records from a telecom operator, and so on.
The key fields may specifically include data fields in the user data that record key information contributing significantly to the training of the machine learning model (i.e., fields of the original user data that are valuable for model training). In practical applications, the key fields can be specified based on actual modeling requirements. For example, when training a machine learning model for risk assessment of loan applications initiated by users, if the original user data contains a credit score field that records the user's credit score, this field may be designated as a key field, since the credit score is of great value for assessing the user's risk.
The fuzzification processing may specifically include any type of processing mode for encrypting field values corresponding to fields in the user data and performing privacy protection on information recorded in the user data; for example, by using some specific encryption algorithms, the field value of each field in the user data is encrypted to obtain a new field value, and then the original field value is replaced with the new field value obtained by the encryption calculation.
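As an illustrative sketch only, one possible fuzzification of the kind just described replaces a field value with an irreversible digest; the field names, the salt, and the choice of SHA-256 below are assumptions for illustration, not prescribed by this specification:

```python
import hashlib

def obfuscate_field(value: str, salt: str = "shared-data-salt") -> str:
    # Replace the original field value with an irreversible digest; the salt
    # and the hash algorithm here are illustrative assumptions.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

record = {"user_id": "u123", "credit_score": "712"}
record["credit_score"] = obfuscate_field(record["credit_score"])
```

The digest is deterministic, so identical field values from different providers remain comparable, while the original value cannot be recovered from the new field value.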
In this specification, in order to guarantee the data availability of the user data shared by each data provider to the greatest extent while still protecting data privacy, selective fuzzification processing may be performed, based on the trust level of each data provider, on the key fields in the user data that each data provider needs to transmit to the modeler.
Each data provider can correspond to one trust level. The trust level is specifically used to determine the degree of fuzzification applied to the key fields in the user data, and is negatively correlated with that degree; that is, for a data provider, the higher the trust level, the lower the degree of fuzzification applied to the key fields in the user data it transmits.
The fuzzification degree corresponding to each trust level can be controlled by the modeling party based on actual requirements, and is not particularly limited in this specification;
For example, in one implementation, the trust level may be divided into three levels: high, medium, and low. For a data provider with a "high" trust level, the field values of the key fields in the user data may not be fuzzified at all; for a data provider with a "medium" trust level, light fuzzification may be applied to the field values of the key fields; for a data provider with a "low" trust level, heavy fuzzification may be applied, or all of its user data may be discarded directly.
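The three-level scheme above can be sketched as follows; the concrete operations chosen for each level (rounding for "medium", discarding for "low") are illustrative assumptions rather than requirements of this specification:

```python
from typing import Optional

def fuzzify_by_trust(value: float, trust: str) -> Optional[float]:
    # The degree of fuzzification is negatively correlated with trust level.
    if trust == "high":
        return value              # no fuzzification
    if trust == "medium":
        return round(value, -2)   # light fuzzification: coarsen to the nearest hundred
    return None                   # low trust: heavy fuzzification, here modeled as discarding
```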
In this specification, the trust level corresponding to each data provider may specifically be an initialized trust level negotiated by each data provider and the modeler, or may be a trust level set by the modeler for each data provider based on a historical contribution degree of user data transmitted by each data provider to a trained machine learning model.
Alternatively, in practical applications, each data provider may use the negotiated trust level as the initialized trust level, and the modeler may use the trust level set for each data provider based on the historical contribution as the initialized trust level.
For example, in one embodiment shown, an initial level of trust may be negotiated for the data provider by negotiating with the modeler.
For the modeling party, a trust level can be set for each data provider based on the historical contribution degree of each data provider to the finally trained machine learning model. For example, after a cold start, the server of the modeler may read the training result of a previous machine learning model and, based on the trained weight value of each field, generate for each data provider a weight value corresponding to the trained machine learning model, as that data provider's historical contribution degree to the machine learning model. The trust level thus set can keep a positive correlation with the historical contribution degree; that is, the higher the historical contribution, the higher the corresponding trust level.
Of course, in practical applications, if the machine learning model is being trained for the first time, the historical contribution degree of each data provider to the machine learning model cannot be obtained; in this case, the modeler may directly set the trust level negotiated with each data provider as that data provider's trust level.
In this specification, the server of each data provider and the server of the modeler may both have a function of performing fuzzification processing on field values of key fields in user data transmitted by each data provider based on the trust level of each data provider.
When transmitting the original user data to the modeler, each data provider can fuzzify the field values of the key fields in the original user data based on the trust level negotiated with the modeler in advance and then transmit the fuzzified user data to the modeler, or it can transmit the original user data to the modeler directly and let the modeler complete the fuzzification processing shown above.
After the modeler receives the user data transmitted by each data provider, it can be determined whether the field values of the key fields in the received user data have already been fuzzified; for example, each data provider may actively notify the modeler of whether the field values of the key fields in the transmitted user data have been subjected to the above fuzzification processing and of the corresponding degree of fuzzification; alternatively, the modeler may determine, by analyzing the field values of the key fields in the user data transmitted by each data provider, whether those field values have been fuzzified and to what degree.
On the one hand, if the field values of the key fields in the user data received by the modeler have not been fuzzified, the fuzzification processing can be further performed on those field values based on the trust level set for each data provider.
On the other hand, if the field values of the key fields in the user data received by the modeler have already been fuzzified, it can be further determined whether the degree of fuzzification of the user data transmitted by each data provider is higher than the degree of fuzzification corresponding to the trust level set by the modeler for that data provider;
if the degree of fuzzification of the user data transmitted by any data provider is higher than the degree corresponding to the trust level set for that data provider by the modeler, then, because fuzzification processing is usually irreversible, no further fuzzification of the field values of the key fields in that data provider's user data is required;
if the degree of fuzzification of the user data transmitted by any data provider is lower than the degree corresponding to the trust level set for that data provider by the modeler, the field values of the key fields in that data provider's user data can be directly fuzzified further based on the trust level set for that data provider.
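Because fuzzification is irreversible, the decision in the two cases above reduces to a comparison of degrees; the three-degree scale below is a hypothetical example, not part of this specification:

```python
# Hypothetical ordering of fuzzification degrees, from none to heavy.
DEGREE = {"none": 0, "light": 1, "heavy": 2}

def needs_further_fuzzification(applied: str, required: str) -> bool:
    # Re-apply fuzzification only when the provider applied a lower degree
    # than the degree required by the trust level the modeler set for it;
    # a higher applied degree cannot be undone.
    return DEGREE[applied] < DEGREE[required]
```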
In this specification, after the modeler has fuzzified the field values of the key fields in the user data transmitted by each data provider, field fusion may further be performed on the fields in the user data transmitted by each data provider after the fuzzification processing is completed.
In this specification, field fusion may specifically refer to a process of integrating information recorded in fields in user data transmitted by each data provider.
In an embodiment shown, when the modeler performs field fusion on corresponding fields in the user data transmitted by each data provider, it may use either of the following two manners, or both of them together:
In one manner, the field values of corresponding fields (for example, identical fields) in the user data transmitted by each data provider after the fuzzification processing can be summed to complete the field fusion;
for example, if the plurality of data providers are banks having a cooperative relationship with the modeler, and the user data transmitted by the banks to the modeler includes a "deposit amount" field, in this case, when the modeler performs field fusion, the values of the fields of the "deposit amount" in the user data transmitted by the banks may be added to complete the fusion of the "deposit amount" field.
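The "deposit amount" example can be sketched as follows; the field name and the per-bank values are illustrative:

```python
def fuse_by_summation(records, field):
    # Sum the values of the same field across the records sent by different providers.
    return sum(r.get(field, 0) for r in records)

# One record per cooperating bank, each already fuzzified upstream.
bank_records = [
    {"deposit_amount": 5000},
    {"deposit_amount": 12000},
    {"deposit_amount": 800},
]
fused_deposit = fuse_by_summation(bank_records, "deposit_amount")  # 17800
```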
In another implementation manner, fields in the user data transmitted by each data provider after the fuzzification processing are added to specified field intervals to complete the field fusion. In this case, a standard data structure for the fused user data can be created and initialized, a plurality of field intervals can be planned within that data structure, and the fields in the user data transmitted by each data provider can then be placed into the designated field intervals respectively;
for example, assuming that the user data transmitted by each data provider includes a "housing provident fund" field and an "academic degree" field, fields 1 to 100 of the fused user data may be planned as the field interval for storing housing provident fund information, and fields 101 to 200 as the field interval for storing academic degree information. It should be noted that which field types in the user data transmitted by each data provider are fused in the first manner shown above and which in the second manner is not particularly limited in this specification; in practical applications, those skilled in the art may decide based on actual requirements;
for example, for fields such as "deposit amount", the field values may be added together in the first manner above; fields such as the "housing provident fund" field and the "academic degree" field may each be added to the corresponding field interval in the second manner above.
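The interval-based manner can be sketched as below; the interval layout and the field-type names are assumptions for illustration only:

```python
# Planned layout of the fused record: positions 0-99 hold housing provident
# fund information, positions 100-199 hold academic degree information.
INTERVALS = {"housing_fund": (0, 100), "degree": (100, 200)}

def place_into_intervals(provider_fields):
    # provider_fields: list of (field_type, value) pairs gathered from all providers.
    fused = [None] * 200
    next_slot = {field_type: start for field_type, (start, _end) in INTERVALS.items()}
    for field_type, value in provider_fields:
        start, end = INTERVALS[field_type]
        pos = next_slot[field_type]
        if pos < end:               # drop values that overflow the planned interval
            fused[pos] = value
            next_slot[field_type] = pos + 1
    return fused
```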
In this specification, after the modeling party performs field fusion on the user data transmitted by each data provider, the user data obtained after the field fusion can be used for training the machine learning model.
After the field fusion is completed, the information recorded in the fields of the user data transmitted by each data provider is integrated. Training the machine learning model on the user data obtained after field fusion therefore avoids the inaccuracy that would result from the same type of information being recorded at scattered positions in the user data, and can improve the modeling precision of the model.
In this specification, when the modeling party trains the machine learning model based on the field-fused user data, a plurality of training samples can be constructed from the field-fused user data. For example, in one implementation, the modeling party may extract numeric fields of several dimensions (e.g., key fields) from the user data as modeling features, and generate a data feature vector as a training sample based on the field values corresponding to the extracted features of each dimension; for example, if features of M dimensions are extracted from N data samples, the training sample set can be expressed as an N × M matrix. Alternatively, in another implementation, the modeling party may by default use all fields in the field-fused user data as modeling features to construct the training samples (that is, directly use the user data obtained after field fusion as training samples).
Then, a sample set can be generated from the constructed training samples; this sample set is the training sample set that finally participates in model training. The modeling party can train the machine learning model based on this training sample set to obtain the optimal weight value of each field in the training samples (i.e., the model parameters to be trained). When the optimal weight value of each field has been trained, the training of the machine learning model is complete, and the optimal weight values can be used to characterize the contribution degree of each field to the machine learning model.
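The construction of the N × M training sample set described above can be sketched as follows; the record layout and field names are hypothetical:

```python
def build_sample_matrix(fused_records, feature_fields):
    # N fused user records x M selected feature fields -> N x M matrix
    # (rows are training samples, columns are modeling features).
    return [[record.get(field, 0.0) for field in feature_fields]
            for record in fused_records]

records = [
    {"deposit_amount": 17800.0, "credit_score": 700.0},
    {"deposit_amount": 9200.0, "credit_score": 600.0},
]
matrix = build_sample_matrix(records, ["deposit_amount", "credit_score"])  # 2 x 2
```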
Note that the machine learning algorithm used by the modeling party in training the machine learning model is not particularly limited in this specification.
For example, the machine learning model may be an LR (Logistic Regression) model. Each data sample in the training sample set may carry a pre-calibrated sample label. The specific form of the sample label generally depends on the specific business scenario and modeling requirements, and is not particularly limited in this specification; for example, if it is desired to create a machine learning model for risk assessment of loan applications initiated by users, the sample label may be a user label indicating whether the user is a risky user.
In this case, when training the machine learning model based on the LR algorithm, the server of the modeler may typically use a Loss Function (Loss Function) to evaluate the fitting error between the training sample and the corresponding sample label.
In implementation, the training samples and the corresponding sample labels can be input into the loss function as input values, and iterative computation can be performed repeatedly with a gradient descent method until the algorithm converges; the values of the model parameters (i.e., the optimal weight values of the fields in the training samples) that minimize the fitting error between the training samples and the corresponding sample labels can then be solved for, and the solved parameter values used as the optimal parameters to construct the machine learning model.
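A minimal gradient-descent sketch of the LR training loop described above; the learning rate and epoch count are arbitrary choices, and this plain-Python version is illustrative rather than the specification's implementation:

```python
import math

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    # Minimize the log loss by batch gradient descent; the returned weights
    # are the per-field "optimal weight values" referred to above.
    n = len(X)
    m = len(X[0])
    w = [0.0] * m
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * m
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # derivative of log loss w.r.t. z
            for j in range(m):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

On a linearly separable toy set the trained weights separate the two classes; the magnitude of each trained weight then serves as the contribution degree of the corresponding field.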
In practical applications, the contribution of each field in the user data to the finally trained machine learning model is usually characterized by the weight value of that field in the machine learning model (i.e., a finally trained model parameter). Therefore, after the machine learning model has been trained, the modeler may further generate, for each data provider, a weight value corresponding to the machine learning model based on the weight values of the fields included in the user data transmitted by that data provider, so as to characterize the historical contribution degree of each data provider to the machine learning model.
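One of the manners described earlier for generating a provider-level weight value, a weighted calculation over the weights of the fields a provider supplied, can be sketched as below; using equal coefficients by default (reducing to a plain sum) is an illustrative assumption:

```python
def provider_contribution(field_weights, coeffs=None):
    # Weighted combination of the trained weights of the fields a provider
    # supplied; with no coefficients given, this degenerates to a plain sum.
    if coeffs is None:
        coeffs = [1.0] * len(field_weights)
    return sum(c * w for c, w in zip(coeffs, field_weights))
```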
It should be noted that, since the fields in the training samples used in training the machine learning model may be fused fields obtained by summing field values, the finally trained optimal weight values corresponding to such fused fields will generally not be equal to the weight values of the corresponding fields included in the user data transmitted by each data provider.
In this case, after obtaining the optimal weight values corresponding to the fields in the training sample through model training, the modeling party may further convert the optimal weight values to obtain the weight values of the corresponding fields in the original user data transmitted by each data provider.
In one embodiment shown, the modeler may determine a field fusion number corresponding to each field in the training samples. The field fusion number corresponds to the first implementation of field fusion shown above, and specifically refers to the number of original fields whose field values were summed to obtain a fused field in the training samples. For example, when a field in the training samples is a fused field obtained by summing the field values of N fields in the original user data, the field fusion number of that field is N.
Then, the weight value corresponding to each field in the training samples may be averaged based on the field fusion number, and the average value may be set as the weight value of the corresponding field included in the user data transmitted by each data provider. That is, the contribution degree to the machine learning model of each field contained in the original user data transmitted by each data provider is represented by the average of the weight value corresponding to that field in the training samples over the field fusion number.
For example, assuming that a field in the training sample is a fused field obtained by summing field values of N identical fields in the user data transmitted by each data provider, the weight value of the field may be averaged according to the N value, and then the average may be set as the weight value of the N fields in the original user data.
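This averaging conversion can be sketched as follows; the field names, trained weight values, and fusion numbers below are hypothetical:

```python
def recover_field_weights(trained_weights, fusion_counts):
    """Average each fused field's trained weight value over its field fusion
    number to obtain the weight of each corresponding original field."""
    return {field: w / fusion_counts[field]
            for field, w in trained_weights.items()}

# "deposit_amount" was fused by summing the field values from 3 providers,
# "loan_count" from 2; their trained weight values are 0.9 and 0.4.
weights = recover_field_weights(
    {"deposit_amount": 0.9, "loan_count": 0.4},
    {"deposit_amount": 3, "loan_count": 2},
)
# Each of the 3 original "deposit_amount" fields is assigned weight 0.3.
```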
In addition to the above conversion method of averaging the weight values corresponding to the fields in the training samples to obtain the weight values of the corresponding fields in the original user data transmitted by each data provider, in practical applications other conversion methods may also be used.
In another implementation shown, the machine learning model may first be trained on the fields in the original user data without field fusion, so as to obtain corresponding first weight values; then, pairwise fusion training may further be performed on the fields in the original user data to obtain corresponding second weight values; finally, the weight values of the corresponding fields contained in the original user data transmitted by each data provider can be determined by comparing the first weight values with the second weight values.
For example, assume that the data providers are banks having a cooperative relationship with the modeler, and that the user data transmitted by the banks to the modeler each includes a "deposit amount" field, respectively denoted as field A, field B and field C. In this scenario, the user data transmitted by the three banks can first be used as training samples to train the machine learning model separately, obtaining the weight values corresponding to field A, field B and field C; then, the user data transmitted by the three banks can be fused pairwise to construct training samples for training the machine learning model, obtaining the weight values corresponding to the fused fields formed from field A and field B, from field A and field C, and from field B and field C; finally, the difference between the weight value corresponding to the fused field of field A and field B and the weight value of field A can be calculated as the weight value of field B; the difference between the weight value corresponding to the fused field of field A and field C and the weight value of field A can be calculated as the weight value of field C; the difference between the weight value corresponding to the fused field of field B and field C and the weight value of field C can be calculated as the weight value of field B; and so on. For example, assuming that the weight value obtained by training the model on field A alone is 0.5 and the weight value obtained for the field fusing field A and field B is 0.6, fusing in field B increases the weight value by 0.1, so the weight value of field B may be set to 0.1.
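The differencing step at the end of this example can be sketched as follows, using the hypothetical weight values from the example:

```python
def attribute_by_difference(solo_weight, fused_weight):
    """Attribute to the newly fused-in field the increase in trained weight
    relative to the solo-trained weight of the known field."""
    return fused_weight - solo_weight

# Field A trained alone yields weight 0.5; the fused field of A and B trains
# to 0.6, so the incremental weight attributed to field B is about 0.1.
w_b = attribute_by_difference(0.5, 0.6)
```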
Of course, for a fused field obtained in the training samples in a manner other than summing field values, the finally trained optimal weight value corresponding to the fused field can be equal to the weight values of the corresponding fields contained in the user data transmitted by each data provider; therefore, in this case, the trained optimal weight value corresponding to such a fused field in the training samples may be directly set as the weight value of the corresponding field included in the user data transmitted by each data provider.
In this specification, after the modeler obtains the weight values of the fields included in the user data transmitted by the respective data providers through the modeling process described above, the modeler may generate, for each data provider, a weight value corresponding to the machine learning model for characterizing the degree of contribution of each data provider to the machine learning model based on the weight values of the fields included in the user data transmitted by each data provider.
In one embodiment, the weight values corresponding to the finally trained machine learning model may be generated for each data provider in any one of the following manners:
in a first manner, weighted calculation may be performed on the weight values of the fields included in the user data transmitted by each data provider, and the result of the weighted calculation may be set as the weight value of that data provider corresponding to the machine learning model;
in a second manner, the weight value of the key field included in the user data transmitted by each data provider may be set as the weight value of that data provider corresponding to the machine learning model;
in a third manner, weighted calculation may be performed on the weight values of the key fields included in the user data transmitted by each data provider, and the result of the weighted calculation may be set as the weight value of that data provider corresponding to the machine learning model.
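The three manners above can be sketched as follows; the field names, weight values, and weighting coefficients are hypothetical:

```python
def weighted_sum(field_weights, coeffs):
    """Weighted calculation over per-field weight values; fields without a
    coefficient (e.g. non-key fields in the third manner) contribute zero."""
    return sum(w * coeffs.get(f, 0.0) for f, w in field_weights.items())

# Weight values of the fields one provider transmitted:
field_weights = {"deposit_amount": 0.3, "loan_count": 0.2, "age": 0.05}

# First manner: weighted calculation over all fields (equal coefficients here).
mode1 = weighted_sum(field_weights, {f: 1 / 3 for f in field_weights})

# Second manner: the weight value of the key field, taken directly.
mode2 = field_weights["deposit_amount"]

# Third manner: weighted calculation over the key fields only.
mode3 = weighted_sum(field_weights, {"deposit_amount": 0.7, "loan_count": 0.3})
```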
After a weight value corresponding to the machine learning model is set for each data provider, a corresponding trust level can be set for each data provider based on the weight value;
for example, the weight value may be compared with thresholds corresponding to a plurality of preset trust levels to determine the corresponding trust level. For example, the trust levels may be divided into three levels of "high", "medium", and "low", with two thresholds: a first threshold, and a second threshold smaller than the first threshold. In this case, if the weight value of any data provider corresponding to the machine learning model is above the first threshold, its trust level may be determined to be "high"; if the weight value of any data provider corresponding to the machine learning model is between the first threshold and the second threshold, its trust level may be determined to be "medium"; and if the weight value of any data provider corresponding to the machine learning model is below the second threshold, its trust level may be determined to be "low".
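This thresholding can be sketched as follows; the threshold values are illustrative, and the treatment of a weight value exactly equal to a threshold is a choice this specification leaves open:

```python
def trust_level(provider_weight, first_threshold=0.5, second_threshold=0.2):
    """Map a provider's weight value corresponding to the machine learning
    model to a trust level using two preset thresholds (values illustrative)."""
    if provider_weight > first_threshold:
        return "high"
    if provider_weight >= second_threshold:  # between the two thresholds
        return "medium"
    return "low"
```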
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a method for training a machine learning model based on user data transmitted by a plurality of data providers.
Fig. 2 illustrates, by way of example, the case in which the server of the modeler has the function of fuzzifying, based on the trust level of each data provider, the field values of key fields in the user data transmitted by each data provider.
As shown in fig. 2, in the initial state, the plurality of data providers may correspond to a trust level, respectively.
When each data provider transmits local user data to the server of the modeler, the original user data can be transmitted directly in plaintext form; that is, the key fields in the user data are transmitted to the modeler in plaintext, and the modeler completes the fuzzification process.
Referring to fig. 2, after receiving the user data transmitted by each data provider, the modeler may fuzzify, based on the trust level of each data provider, field values of key fields in the user data transmitted by each data provider; wherein the fuzzification processing degree is inversely related to the trust level; for example, for a data provider with a low trust level, field values of key fields in user data transmitted by the data provider can be heavily fuzzified; for a data provider with high trust level, the field value of the key field in the user data can be slightly fuzzified or not fuzzified.
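One possible fuzzification scheme matching this description is coarsening a numeric field value to a granularity that grows as the trust level falls; the granularity values here are assumptions for illustration only:

```python
def fuzzify(value, trust_level):
    """Coarsen a key field value; the lower the trust level, the heavier
    the fuzzification (granularity 1 leaves the value unchanged)."""
    granularity = {"high": 1, "medium": 100, "low": 1000}[trust_level]
    return (value // granularity) * granularity

# A low-trust provider's deposit amount is heavily coarsened,
# while a high-trust provider's value passes through unchanged.
```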
Referring again to fig. 2, after the fuzzification processing is completed, field fusion may be performed on the corresponding fields in the user data transmitted by each data provider; for example, the field values of corresponding fields in the user data transmitted by each data provider after fuzzification processing are summed, and/or fields in the user data transmitted by each data provider after fuzzification processing are appended to the specified field interval.
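The two fusion manners can be sketched as follows; the record layout and values are hypothetical:

```python
def fuse_by_sum(values):
    """First manner: sum the field values of the corresponding field
    across the user data transmitted by each provider."""
    return sum(values)

def fuse_by_append(record, extra_fields):
    """Second manner: append fields from other providers' user data
    into a specified field interval of the record."""
    fused = dict(record)
    fused.update(extra_fields)
    return fused

# Fuzzified "deposit_amount" for the same user from three providers:
total_deposit = fuse_by_sum([12000, 8000, 5000])  # 25000

# Complement the record with a field only one provider holds:
fused_record = fuse_by_append({"deposit_amount": total_deposit},
                              {"loan_count": 2})
```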
Referring to fig. 2, when the field fusion process shown above is completed, the service end of the modeling party may train a preset machine learning model based on the user data after field fusion; for example, a training sample may be constructed based on the user data after field fusion, and a preset machine learning model is trained based on the training sample to obtain an optimal weight value of each field in the training sample, so as to complete training of the machine learning model.
In this specification, in practical applications the server of the modeler may alternatively not have the above function of fuzzifying, based on the trust level of each data provider, the field values of key fields in the user data transmitted by each data provider, and may instead only have the above function of performing field fusion on the corresponding fields in the user data transmitted by each data provider.
Referring to fig. 3, fig. 3 is a flowchart of another modeling method based on shared data according to an embodiment of the present disclosure, which is applied to the server of a modeler and includes the following steps:
step 302, receiving user data transmitted by a plurality of data providers; the field values of the key fields in the user data transmitted by the data providers are fuzzified in advance based on fuzzification degrees corresponding to trust levels of the data providers;
step 304, performing field fusion on fields in the user data transmitted by each data provider after fuzzification processing; and the user data after field fusion is used for training the machine learning model.
In this embodiment, in an initial state, each data provider may fuzzify the field values of key fields in the user data to be transmitted to the modeler based on the trust level negotiated with the modeler, and then transmit the fuzzified user data to the modeler respectively.
The detailed fuzzification processing process is not described in detail in this embodiment, and reference may be made to the previous description in this specification.
The modeler can perform field fusion on the corresponding fields in the user data transmitted by each data provider, construct training samples based on the user data after field fusion, and train the machine learning model. Based on the trained weight value of each field corresponding to the trained machine learning model, the modeler can generate, for each data provider, a weight value corresponding to the trained machine learning model, so as to characterize the historical contribution degree of each data provider to the machine learning model; set a trust level for each data provider based on the historical contribution degree; send the set trust level to each data provider; and thereby update the initial trust level maintained by each data provider. Subsequently, each data provider can fuzzify the field values of key fields in the user data to be transmitted using the fuzzification degree characterized by the updated trust level set based on the historical contribution degree.
The specific processes of field fusion, training of a machine learning model, and setting trust levels for each data provider based on historical contribution degrees described above are not described in detail in this embodiment, and refer to the previous description of this specification.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating another example of training a machine learning model based on user data transmitted by multiple data providers.
Fig. 4 illustrates, by way of example, the case in which the server of the modeler does not have the function of fuzzifying, based on the trust level of each data provider, the field values of key fields in the user data transmitted by each data provider.
As shown in fig. 4, in the initial state, the plurality of data providers may each correspond to a trust level. When each data provider transmits local user data to the server of the modeler, the data provider can fuzzify the field values of key fields in the original user data based on the trust level negotiated with the modeler in advance, and then transmit the fuzzified user data to the modeler.
Referring to fig. 4, after receiving the user data transmitted by each data provider, the modeler may perform field fusion on corresponding fields in the user data transmitted by each data provider, and then train a preset machine learning model based on the user data after the field fusion, to obtain weight values (contribution degrees) of each field corresponding to the machine learning model.
Further, after training of the machine learning model is completed, the modeler may generate, for each data provider, a weight value corresponding to the machine learning model based on the trained weight value of each field corresponding to the machine learning model, to characterize the contribution degree of each data provider to the machine learning model; then reset a trust level for each data provider based on the contribution degree, and return the set trust level to each data provider. Subsequently, each data provider may fuzzify the field values of key fields in the user data to be transmitted using the fuzzification degree characterized by the updated trust level set based on the historical contribution degree.
As can be seen from the foregoing embodiments, on one hand, because the fuzzification processing can be performed on the field values of the key fields in the user data transmitted by multiple data providers based on the fuzzification processing degree corresponding to the trust level of each data provider, for some data providers with higher trust levels, the fuzzification processing or the non-fuzzification processing can be performed on some key fields in the user data, so that the data availability of the user data shared by each data provider can be ensured to the maximum extent on the premise of considering data privacy protection;
on the other hand, after the user data transmitted by the plurality of data providers is received, field fusion can be performed on the fields in the user data transmitted by each data provider, so that the fields in the user data shared by the data providers can complement one another and data fusion is performed to the maximum extent.
Corresponding to the method embodiment shown in fig. 1, the present specification also provides an embodiment of a modeling apparatus based on shared data.
Embodiments of the modeling apparatus based on shared data in this specification can be applied to electronic devices. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical apparatus, the apparatus is formed by the processor of the electronic device where it is located reading corresponding computer program instructions from the nonvolatile memory into the memory for execution. In terms of hardware, fig. 5 is a hardware structure diagram of the electronic device where the modeling apparatus based on shared data according to this specification is located. In addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 5, the electronic device where the apparatus is located may also include other hardware according to the actual function of the electronic device, which is not described again.
FIG. 6 is a block diagram of a shared data based modeling apparatus, shown in an exemplary embodiment of the present description.
Referring to fig. 6, the modeling apparatus 60 based on shared data can be applied to the electronic device shown in fig. 5, and includes: a first receiving unit 601, a processing unit 602, and a first fusing unit 603.
The first receiving unit 601 receives user data transmitted by a plurality of data providers;
the processing unit 602 is configured to perform fuzzification processing on field values of key fields in the received user data based on fuzzification processing degrees corresponding to trust levels of the data providers;
a first fusion unit 603, which performs field fusion on fields in the user data transmitted by each data provider after the fuzzification processing; and the user data after field fusion is used for training the machine learning model.
In this embodiment, the first fusing unit 603 further:
summing field values of corresponding fields in user data transmitted by each data provider after fuzzification processing; and/or the presence of a gas in the gas,
and adding fields in the user data transmitted by each data provider after the fuzzification processing to the specified field interval.
In this embodiment, the apparatus 60 further includes:
the setting unit 604 (not shown in fig. 6) sets the trust level for each data provider based on the historical contribution of each data provider to the trained machine learning model.
In this embodiment, the method further includes:
a construction unit 605 (not shown in fig. 6) that constructs training samples based on the field-fused user data;
a training unit 606 (not shown in fig. 6) for training a preset machine learning model based on the training sample to obtain a weight value of each field in the training sample; wherein the weight value represents the contribution degree of each field to the machine learning model;
a generation unit 607 (not shown in fig. 6) that generates, for each data provider, a weight value corresponding to the machine learning model based on a weight value of each field contained in user data transmitted by each data provider; wherein the weight values corresponding to the machine learning model characterize the degree of contribution of each data provider to the machine learning model.
In this embodiment, the generating unit 607 further:
determining a field fusion number corresponding to each field in the training sample before generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field included in user data transmitted by each data provider; averaging the weight values corresponding to the fields in the training sample based on the field fusion number; and setting the average value as the weight value of the corresponding field contained in the user data transmitted by each data provider.
In this embodiment, the generation unit 607 further performs any one of the following ways:
carrying out weighted calculation on the weight value of each field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider;
setting the weight value of the key field contained in the user data transmitted by each data provider as the weight value corresponding to the machine learning model of each data provider;
and performing weighted calculation on the weight value of each key field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider.
The detailed implementation process of the functions and actions of each unit in the above-described apparatus is detailed in the implementation process of the corresponding step in the above-described method, and is not described herein again.
In correspondence with the method embodiment illustrated in fig. 3 described above, the present specification also provides another embodiment of a modeling apparatus based on shared data.
FIG. 7 is a block diagram of a shared data based modeling apparatus, shown in an exemplary embodiment of the present description.
Referring to fig. 7, the modeling apparatus 70 based on shared data can be applied to the electronic device shown in fig. 5, and includes: a second receiving unit 701 and a second fusing unit 702.
The second receiving unit 701 receives user data transmitted by a plurality of data providers; the field values of key fields in the user data transmitted by the data providers are fuzzified by the data providers in advance based on the negotiated trust level; the fuzzification processing degree of the field value is inversely proportional to the negotiated trust level;
a second fusion unit 702, which performs field fusion on corresponding fields in the user data transmitted by each data provider after the fuzzification processing; and the user data after field fusion is used for training the machine learning model.
In this embodiment, the second fusing unit 702 further:
summing field values of corresponding fields in user data transmitted by each data provider after fuzzification processing; and/or the presence of a gas in the gas,
and adding corresponding fields in the user data transmitted by each data provider after the fuzzification processing to the same field interval.
The detailed implementation process of the functions and actions of each unit in the above-described apparatus is detailed in the implementation process of the corresponding step in the above-described method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In correspondence with the method embodiment illustrated in fig. 3 described above, this specification also provides an embodiment of a shared data based modeling system.
The modeling system based on the shared data can comprise a plurality of data provider service terminals and modeling service terminals.
The plurality of data provider servers transmit user data to the modeler server; the field values of the key fields in the user data transmitted by the data providers are fuzzified by the data providers or the modeler based on fuzzification degrees corresponding to trust levels of the data providers;
the modeling party server side performs field fusion on fields in the user data transmitted by each data provider after fuzzification processing; and the user data after field fusion is used for training the machine learning model.
The implementation process of the function and the effect of each server in the system shown above is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
Corresponding to the method embodiment shown in fig. 1, the present specification also provides an embodiment of an electronic device. The electronic device includes: a processor and a memory for storing machine executable instructions; wherein the processor and the memory are typically interconnected by an internal bus. In other possible implementations, the device may also include an external interface to enable communication with other devices or components.
In this embodiment, by reading and executing machine-executable instructions stored by the memory that correspond to the shared data based modeling control logic described above and shown in fig. 1, the processor is caused to:
receiving user data transmitted by a plurality of data providers;
fuzzifying field values of key fields in the received user data respectively based on fuzzification processing degrees corresponding to trust levels of all data providers;
performing field fusion on fields in the user data transmitted by each data provider after fuzzification processing; and the user data after field fusion is used for training the machine learning model.
In this embodiment, the processor is further caused to:
summing field values of corresponding fields in user data transmitted by each data provider after fuzzification processing; and/or adding fields in the user data transmitted by each data provider after the fuzzification processing to the specified field intervals.
In this embodiment, the processor is further caused to:
and setting the trust level for each data provider based on the historical contribution degree of each data provider to the trained machine learning model.
In this embodiment, the processor is further caused to:
constructing a training sample based on the user data after field fusion;
training a preset machine learning model based on the training sample to obtain the weight value of each field in the training sample; wherein the weight value represents the contribution degree of each field to the machine learning model;
generating a weight value corresponding to the machine learning model for each data provider based on a weight value of each field contained in user data transmitted by each data provider; wherein the weight values corresponding to the machine learning model characterize the degree of contribution of each data provider to the machine learning model.
In this embodiment, the processor is further caused to:
determining the field fusion number corresponding to each field in the training sample;
averaging the weight values corresponding to the fields in the training sample based on the field fusion number;
and setting the average value as the weight value of the corresponding field contained in the user data transmitted by each data provider.
In this embodiment, by reading and executing machine-executable instructions stored by the memory that correspond to the shared data based modeling control logic described above and shown in fig. 1, the processor is further caused to perform any of the following instructions:
carrying out weighted calculation on the weight value of each field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider;
setting the weight value of the key field contained in the user data transmitted by each data provider as the weight value corresponding to the machine learning model of each data provider;
and performing weighted calculation on the weight value of each key field contained in the user data transmitted by each data provider, and setting the weighted calculation result as the weight value corresponding to the machine learning model of each data provider.
In this embodiment, the field values of the key fields in the user data transmitted by the multiple data providers are fuzzified by the multiple data providers in advance based on the negotiated trust level; wherein the fuzzification processing degree of the field values is inversely proportional to the negotiated trust level.
corresponding to the method embodiment shown in fig. 3, the present specification also provides an embodiment of an electronic device. The electronic device includes: a processor and a memory for storing machine executable instructions; wherein the processor and the memory are typically interconnected by an internal bus. In other possible implementations, the device may also include an external interface to enable communication with other devices or components.
In this embodiment, by reading and executing machine-executable instructions stored by the memory that correspond to the shared data based modeling control logic described above and shown in fig. 3, the processor is caused to:
receiving user data transmitted by a plurality of data providers; the field values of the key fields in the user data transmitted by the data providers are fuzzified in advance based on fuzzification degrees corresponding to trust levels of the data providers;
performing field fusion on fields in the user data transmitted by each data provider after fuzzification processing; and the user data after field fusion is used for training the machine learning model.
In this embodiment, the processor is further caused to:
summing field values of corresponding fields in user data transmitted by each data provider after fuzzification processing; and/or the presence of a gas in the gas,
and adding fields in the user data transmitted by each data provider after the fuzzification processing to the specified field interval.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.