CN113326900A - Data processing method and device of federal learning model and storage medium

Info

Publication number
CN113326900A
Authority
CN
China
Prior art keywords
feature
target
subset
subsets
learning model
Prior art date
Legal status
Pending
Application number
CN202110736203.7A
Other languages
Chinese (zh)
Inventor
陈伟敬
马国强
范涛
陈天健
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110736203.7A
Publication of CN113326900A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Abstract

The application provides a data processing method and device for a federated learning model, applied to a first participant device. The method comprises the following steps: obtaining a feature set of a training sample used for the federated learning model and a target prediction result of the federated learning model for that training sample, and combining the features in the feature set to obtain a plurality of feature subsets; obtaining a weight score for each feature subset, and sampling a plurality of target feature subsets from the feature subsets based on the weight scores; training a linear regression model based on the target feature subsets and the prediction results of the federated learning model for the target feature subsets, to obtain the model parameters when the linear regression model converges; and determining, based on the model parameters, the contribution information of each feature in the feature set to the target prediction result. By means of the method and device, the contribution information of each feature in a single sample can be measured, the computational cost of the model can be effectively reduced, and the computational efficiency can be improved.

Description

Data processing method and device of federal learning model and storage medium
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a data processing method and apparatus for a federated learning model, an electronic device, and a computer-readable storage medium.
Background
As data privacy protection is gradually strengthened across industries, federated learning, a technology that enables machine learning to be built collaboratively on multi-party data while protecting data privacy, has become one of the key areas of cooperation among enterprises and industries.
In the financial and risk-control fields, users of a federated machine learning model often want to know the positive and negative effects of each feature in a single model input on the model output, for example, which features of a particular sample (say, a defaulting customer), and which values of those features, have a significant impact on determining that the user is a defaulting user. In addition, it is also necessary to determine the positive and negative impact of the features provided by a partner on the model output. Therefore, the interpretability of the federated machine learning model is particularly important.
Related federated learning model interpretation schemes interpret the model as a whole by obtaining feature importance, and cannot interpret a single sample specifically. In addition, although feature importance reveals how many times a partner's features are used, the positive and negative effects of the partner's features on the model output remain unknown, and determining feature contribution information requires a very large amount of model computation, so actual deployment costs are high.
Disclosure of Invention
The embodiments of the application provide a data processing method and device for a federated learning model, an electronic device, a computer-readable storage medium, and a computer program product, which can measure the contribution information of each feature in a single sample as well as the contribution information of a second participant, while greatly reducing the computational cost of the model and improving computational efficiency.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data processing method of a federated learning model, which is applied to a first participant device and comprises the following steps:
obtaining a feature set of a training sample used as a federated learning model and a target prediction result of the training sample corresponding to the federated learning model, and combining features in the feature set to obtain a plurality of feature subsets, wherein the feature set comprises: a first party provided feature having tag information and at least one second party provided feature;
acquiring a weight score of each feature subset, and sampling from the plurality of feature subsets to obtain a plurality of target feature subsets based on the weight scores;
training a linear regression model based on a plurality of target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges;
and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters.
The embodiment of the application provides a data processing apparatus for a federated learning model, comprising:
an obtaining module, configured to obtain a feature set of a training sample used as a federated learning model and a target prediction result of the training sample corresponding to the federated learning model, and combine features in the feature set to obtain a plurality of feature subsets, where the feature set includes: a first party provided feature having tag information and at least one second party provided feature;
the sampling module is used for acquiring the weight score of each feature subset and sampling a plurality of target feature subsets from the plurality of feature subsets based on the weight scores;
the training module is used for training a linear regression model based on the target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges;
and the determining module is used for determining the contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters.
In the above scheme, the sampling module is further configured to sort the weight scores of the feature subsets in descending order to obtain a weight score sequence; and to sample, in sequence according to the weight score sequence and starting from the feature subset with the largest weight score, a first number of feature subsets as target feature subsets; wherein the first number is less than the total number of feature subsets corresponding to the feature set.
In the above scheme, the sampling module is further configured to perform regularization on the weight score of each feature subset to obtain a scaling coefficient corresponding to each weight score; and to sample a plurality of target feature subsets from the plurality of feature subsets based on the scaling coefficients corresponding to the weight scores.
In the above scheme, the sampling module is further configured to sort the scaling coefficients corresponding to the weight scores by their size to obtain a scaling coefficient sequence;
and, according to the order of the scaling coefficients in the scaling coefficient sequence, to perform the following processing on each scaling coefficient in turn until a target sampling number of target feature subsets is obtained:
acquiring the current sampling number, determining the product of the scaling coefficient and the current sampling number, and taking the product as the current capacity value;
acquiring the number of feature subsets corresponding to the scaling coefficient;
when the current capacity value is larger than the number, taking the feature subsets corresponding to the scaling coefficient as target feature subsets;
and when the current capacity value is smaller than the number, randomly selecting, from the feature subsets that have not been selected, a number of feature subsets equal to the current sampling number as target feature subsets.
In the foregoing solution, the sampling module is further configured to perform the following processing for each of the scaling coefficients:
acquiring the current sampling number, determining the product of the scaling coefficient and the current sampling number, and taking the product as the current capacity value;
acquiring the number of feature subsets corresponding to the scaling coefficient;
when the current capacity value is larger than or equal to the number, taking the feature subsets corresponding to the scaling coefficient as target feature subsets;
and when the current capacity value is smaller than the number, randomly selecting the corresponding number of feature subsets as target feature subsets from the feature subsets that have not been selected.
In the above scheme, the training module is further configured to obtain a conversion relationship between the target feature subset and the target training sample;
converting the features in each target feature subset based on the conversion relation to obtain a target training sample of the linear regression model;
and taking the prediction result of the target feature subset corresponding to the federated learning model as a sample label of a corresponding target training sample, and training the linear regression model to obtain model parameters when the linear regression model is converged.
In the foregoing solution, the training module is further configured to perform the following processing for each target feature subset respectively:
comparing the target feature subset with the feature set to obtain features which are different from the feature set and serve as missing features;
and respectively carrying out characteristic value assignment on each missing characteristic, and filling the missing characteristics of the target characteristic subset based on assignment results to obtain a target training sample of the linear regression model.
In the above scheme, the training module is further configured to determine a default value corresponding to each missing feature; assigning a feature value to the missing feature based on the default value.
In the foregoing scheme, the determining module is further configured to obtain a linear mapping relationship corresponding to the linear regression model, where the target prediction result in the linear mapping relationship is a dependent variable, each feature in the feature set is an independent variable, and the model parameter is a coefficient of the independent variable;
and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters and the linear mapping relation.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the data processing method of the federal learning model provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for processing data of the federal learning model provided in the embodiment of the present application.
The embodiment of the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the data processing method of the federal learning model provided in the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
compared with the technique of interpreting a machine learning model using the feature importance output by a federated tree model, in the embodiments of the application the first participant device scores the plurality of feature subsets corresponding to the feature set of a training sample of the federated learning model, samples a plurality of target feature subsets from the feature subsets based on the weight scores, and trains a linear regression model based on the target feature subsets. The trained linear regression model can thus effectively measure the contribution information of each feature provided by the first participant in the training sample used for the federated learning model as well as the contribution information of the second participant, while the computational cost of the model can be greatly reduced and the computational efficiency improved.
Drawings
FIG. 1 is an alternative architectural diagram of a data processing system of the federated learning model provided in an embodiment of the present application;
fig. 2 is an alternative structural schematic diagram of an electronic device provided in an embodiment of the present application;
FIG. 3 is an alternative flow chart of a data processing method of the federated learning model provided in the embodiments of the present application;
FIG. 4 is a schematic diagram of a target feature subset sampling process provided by an embodiment of the present application;
FIG. 5 is another alternative diagram of a sampling process of a target feature subset provided by an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a training process of a linear regression model provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a target feature subset conversion process provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a target training sample provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the contribution of features of a linear regression model according to an embodiment of the present application;
FIG. 10 is an alternative flow chart of a data processing method of the federated learning model provided in an embodiment of the present application;
FIG. 11 is a schematic view of a subset sampling process provided by an embodiment of the present application;
fig. 12 is a schematic flow chart of sample explanation provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where the terms "first", "second", or "third" appear in the specification, they are used merely to distinguish similar items and do not indicate a particular ordering of the items. It should be understood that "first", "second", and "third" may be interchanged, where appropriate, in a particular order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) SHAP (SHapley Additive exPlanation) value: a model-agnostic interpretable analysis approach based on cooperative game theory. Each prediction record has a corresponding SHAP value, and each feature also has a corresponding SHAP value. When the SHAP value is greater than 0, the current feature in the current sample pushes the model prediction result in the positive direction; otherwise it pushes it in the negative direction.
2) Median: a statistical term referring to the number at the center of a data set arranged in order; it represents the value in a sample, population, or probability distribution that divides the set of values into two equal halves. For a finite set of numbers, the median can be found by ranking all observations from highest to lowest; if there is an even number of observations, the median is usually taken as the average of the two middle values.
3) Mode: the number with a significant central-tendency point in a statistical distribution, representing the general level of the data. It is the value that appears most frequently in a group of data; a group of numbers sometimes has several modes. It is denoted by M.
The embodiments of the application provide a data processing method and device for a federated learning model, an electronic device, a computer-readable storage medium, and a computer program product, which can measure the contribution information of each feature in a single sample as well as the contribution information of a second participant, while greatly reducing the computational cost of the model and improving computational efficiency.
Based on the above explanations of the terms involved in the embodiments of the present application, the data processing system of the federated learning model provided in the embodiments is described first. Referring to fig. 1, fig. 1 is an alternative architecture diagram of the data processing system of the federated learning model provided in the embodiments of the present application. In the data processing system 100 of the federated learning model, a first participant device 400 and second participant devices 410 (2 second participant devices are shown by way of example, designated 410-1 and 410-2 for distinction) are connected to each other through a network 300 and are also connected to a parameter aggregation device 200 through the network 300. The network 300 may be a wide area network or a local area network, or a combination of the two, and data transmission is implemented using wireless links.
In some embodiments, the first participant device 400 and the second participant device 410 are interconnected via the network 300, while third party devices (collaborators, servers, etc.) that may be involved in the federal learning model may be connected via the network 300.
In some embodiments, the first participant device 400 and the second participant device 410 may be, but are not limited to, a laptop computer, a tablet computer, a desktop computer, a smart phone, a dedicated messaging device, a portable gaming device, a smart speaker, a smart watch, etc., and may also be client terminals of federal learning participants, such as participant devices storing user characteristic data at various banks or financial institutions, etc. The parameter aggregation device 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and is configured to assist each participant device in performing federal learning to obtain a federal learning model. The network 300 may be a wide area network or a local area network, or a combination of both. The first participant device 400 and the second participant device 410 may be directly or indirectly connected through wired or wireless communication, and the embodiments of the present application are not limited thereto.
The first participant device 400 is configured to perform federal learning model training with the second participant device 410-1 and the second participant device 410-2, respectively, to obtain a trained federal learning model.
The first participant device 400 is further configured to obtain a feature set of a training sample used as a federated learning model and a target prediction result of the federated learning model corresponding to the training sample, and combine features in the feature set to obtain a plurality of feature subsets, where the feature set includes: a first party provided feature having tag information and at least one second party provided feature; acquiring the weight fraction of each feature subset, and sampling from the feature subsets to obtain a plurality of target feature subsets based on the weight fraction; training the linear regression model based on the target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges; and determining contribution information of the target prediction result corresponding to each feature in the feature set based on the model parameters.
The second participant device 410 is configured to send the encrypted local features to the first participant device 400, and is configured to construct a training sample used as a federal learning model, and when receiving the notification message sent by the first participant device 400, fill the local features with default values or target values to obtain model inputs corresponding to the local features.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an optional electronic device provided in the embodiment of the present application. In practical applications, the electronic device 500 may be implemented as the first participant device 400 or the parameter aggregation device 200 in fig. 1. Taking the electronic device as the parameter aggregation device 200 shown in fig. 1 as an example, an electronic device implementing the data processing method of the federated learning model of the embodiments of the present application is described. The electronic device 500 shown in fig. 2 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It will be appreciated that the bus system 540 is used to enable communications among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the data processing apparatus of the federal learning model provided in this application may be implemented in software, and fig. 2 shows the data processing apparatus 555 of the federal learning model stored in the memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: the acquisition module 5551, the sampling module 5552, the training module 5553 and the determination module 5554 are logical and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the data processing Device of the federal learning model provided in this Application may be implemented in hardware, for example, the data processing Device of the federal learning model provided in this Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the data processing method of the federal learning model provided in this Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The data processing method of the federal learning model provided in the embodiment of the present application will be described with reference to exemplary applications and implementations of the terminal provided in the embodiment of the present application. The method is applied to a first participant device, and referring to fig. 3, fig. 3 is an optional flowchart of a data processing method of a federal learning model provided in an embodiment of the present application, and will be described with reference to steps shown in fig. 3.
In step 101, a first participant device obtains a feature set of a training sample used as a federated learning model and a target prediction result of the federated learning model corresponding to the training sample, and combines features in the feature set to obtain a plurality of feature subsets, where the feature set includes: a first party provided feature having tag information, and at least one second party provided feature.
Here, a federated learning model generally involves at least two participants: the first participant holds the label information, is also referred to as the master party, and is denoted Guest; the second participant is the feature provider, denoted Host. The method provided by the embodiments of the present application is applicable to a vertical federated learning model in which a Guest party and at least one Host party participate.
Taking one Guest party and one Host party as an example, in a single training sample the Guest party and the Host party each hold part of the features: the Guest party provides n features G_1, G_2, G_3, ..., G_n, where n is a positive integer greater than 0; the Host party provides m features H_1, H_2, H_3, ..., H_m, where m is a positive integer greater than 0. It can be understood that the feature set of a single training sample contains n + m features, {G_1, G_2, G_3, ..., G_n, H_1, H_2, H_3, ..., H_m}.
Combining these n + m features yields the corresponding feature subsets, in which the number of features ranges from 0 to n + m; when the number of features is 0 the feature subset is the empty set, and when the number of features is n + m the feature subset is the full set. The number N of feature subsets is calculated as follows:
N = C(n+m, 0) + C(n+m, 1) + ... + C(n+m, n+m) = 2^(n+m)   (1)
In the above formula (1), n + m is the number of features, C(n+m, 0) denotes the number of feature subsets containing 0 features (i.e. the number of empty sets, which is 1), C(n+m, 1) denotes the number of feature subsets containing 1 feature, and C(n+m, n+m) denotes the number of feature subsets containing n + m features (i.e. the number of full sets, which is 1).
It should be noted that, in the federated learning model, in order to protect the data privacy of each participant, the Host party's features are not transmitted directly to the Guest party; instead, the corresponding intermediate information (such as model parameters) is transmitted between the participants in encrypted form. Therefore, when calculating the influence of the Host party on the prediction result (the output of the federated learning model), the overall influence of the Host party is generally determined; that is, all the features provided by the Host party are regarded as one overall feature, which may be called a federated feature and is denoted Host_feat. Determining the influence of the federated feature on the prediction result, and whether that influence is positive or negative, determines the overall influence of the corresponding Host party and whether that overall influence is positive or negative.
When a single training sample of the federated learning model is interpreted, the features provided by each Host party are treated as one federated feature Host_feat, so the n + m features in the single sample can be regarded as n + s features, where n is the number of features provided by the Guest party and s is the number of federated features Host_feat corresponding to the s Host parties (i.e., one Host_feat represents one Host party); s is a positive integer greater than 0, s = 1 indicates that there is only one Host party, and s = 2 indicates that there are two Host parties.
In some embodiments, interpreting a single training sample actually determines the effect of each feature in the single sample on a target prediction result, where the target prediction result refers to a prediction result obtained by the training sample through a federated learning model.
Illustratively, suppose a Guest party and a Host party participate in the computation of the federated learning model, and consider a training sample containing n + m features, where n features are provided by the Guest party and m is the number of features of the Host party. For this training sample, in order to ensure data security, the m features provided by the Host party are replaced by one host_feat feature, so there are n + 1 features in total; these n + 1 features are combined according to the foregoing formula (1) to obtain 2^(n+1) feature subsets.
Illustratively, consider a training sample used for the federated learning model, S = {age: 28, education: bachelor, income: 10000, job: IT}, whose prediction by the federated learning model characterizes the client as a defaulting client. The features {age: 28, education: bachelor} are provided by the Guest party and the features {income: 10000, job: IT} are provided by the Host party, so the sample actually contains 4 features. In practice, to protect the Host party's data, the features {income: 10000, job: IT} are treated as one host_feat feature, so the training sample contains 2 features {age: 28, education: bachelor} plus one host_feat feature, 3 features in total. The feature subsets corresponding to this training sample contain 0 to 3 features, giving 8 (2^3) feature subsets in total, specifically: the empty set, {age: 28}, {education: bachelor}, {host_feat: {income: 10000, job: IT}}, {age: 28, education: bachelor}, {age: 28, host_feat: {income: 10000, job: IT}}, {education: bachelor, host_feat: {income: 10000, job: IT}}, and {age: 28, education: bachelor, host_feat: {income: 10000, job: IT}}.
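To make the subset enumeration concrete, the following Python sketch (the variable names and the use of itertools are illustrative choices, not prescribed by the application) lists every feature subset of the 3-feature example above and confirms that there are 2^3 = 8 of them, consistent with formula (1).

```python
from itertools import combinations

# Illustrative sketch (names are not from the application): enumerate every feature
# subset of the 3-feature example, i.e. 2 Guest features plus one bundled host_feat.
features = ["age", "education", "host_feat"]

subsets = []
for size in range(len(features) + 1):        # subset sizes 0 .. n+m
    subsets.extend(combinations(features, size))

print(len(subsets))                          # 8, matching N = 2**(n+m) from formula (1)
for s in subsets:
    print(set(s) if s else "empty set")
```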
In step 102, a weight score of each feature subset is obtained, and a plurality of target feature subsets are sampled from the plurality of feature subsets based on the magnitude of the weight score.
Here, the feature set in each training sample used as the federal learning model corresponds to a plurality of feature subsets having different subset sizes, where the subset size refers to the number of features included in the feature subsets, and the importance of each feature can be preliminarily estimated by scoring the subsets having different subset sizes.
To illustrate the manner in which feature subsets are scored, in some embodiments, the weight score for each feature subset may be determined using the following formula:
W(m)=(M-1)/(m*(M-m)) (2)
where W(m) denotes the weight score of a feature subset of size m, M is the number of features in a training sample used for the federated learning model, m is a positive integer greater than 0, and M > m.
Traverse the feature subsets of different sizes corresponding to the feature set and score each feature subset according to formula (2); finally, all subsets of sizes 1 to M-1 are scored, yielding a set of weight scores that can be regarded as a scoring vector.
In some embodiments, the number of elements in the resulting scoring vector may be M-1, i.e., the scoring is done per subset size. Ignoring the empty set and the full set, the subset sizes corresponding to a feature set containing M features range from 1 to M-1, so the number of elements in the resulting scoring vector is also M-1, i.e., weight = [w_1, ..., w_(M-1)], where w_1 denotes the weight score of subsets of size 1 and w_(M-1) denotes the weight score of subsets of size M-1.
In other embodiments, ignoring the empty set and the full set, the number of elements in the resulting scoring vector may equal the total number of feature subsets corresponding to the feature set containing M features, where the number of feature subsets corresponding to each subset size is given by the following combination function.
C(r,t) = t!/(r!(t-r)!)   (3)
In the above combination formula (3), r is a non-negative integer, t is a positive integer greater than or equal to 1, and t > r. C(r,t) is interpreted as the number of feature subsets of size r corresponding to a feature set containing t features; for example, C(2,5) = 10.
Illustratively, taking M = 5 as an example, the number of subsets with one element is C(1,5) = 5, with two elements C(2,5) = 10, with three elements C(3,5) = 10, and with four elements C(4,5) = 5. Ignoring the empty set and the full set, the total number of feature subsets is: total = C(1,5) + C(2,5) + C(3,5) + C(4,5) = 5 + 10 + 10 + 5 = 30. The feature subsets of all sizes from 1 to 4 are scored by the above formula (2), i.e.:
the weight of a subset of size 1 is: (5-1)/(1 × 4) = 1;
the weight of a subset of size 2 is: (5-1)/(2 × 3) = 4/6;
the weight of a subset of size 3 is: (5-1)/(3 × 2) = 4/6;
the weight of a subset of size 4 is: (5-1)/(4 × 1) = 1.
Finally, a scoring vector weight is obtained. The vector weight can take two forms: one per subset size, weight = [1, 4/6, 4/6, 1]; the other per individual subset, in which each of the 30 feature subsets carries the weight score of its size.
it should be noted that, from the property of w (m) in the formula (2), it can be seen that: the larger and smaller subsets are weighted more strongly and the complementary sets are weighted equally.
In the following example, taking M as 5, W (1) as W (4) has a weight fraction value of 1, and W (2) as W (3) has a weight fraction value of 4/6.
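The scoring step can be illustrated with a short sketch of formula (2); the function name subset_weight is an assumed, illustrative name, and the empty set and full set are ignored as in the example above.

```python
# A minimal sketch of formula (2), ignoring the empty set and the full set.
def subset_weight(m: int, M: int) -> float:
    """Weight score W(m) of a feature subset of size m when the sample has M features."""
    return (M - 1) / (m * (M - m))

M = 5
weight = [subset_weight(m, M) for m in range(1, M)]   # sizes 1 .. M-1
print(weight)   # [1.0, 0.666..., 0.666..., 1.0]: largest/smallest sizes score highest,
                # complementary sizes m and M-m score equally
```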
After the weight score corresponding to each feature subset is obtained, the manner of sampling the target feature subsets is described. In some embodiments, the target feature subsets may be obtained by sampling as follows: sort the weight scores of the feature subsets in descending order to obtain a weight score sequence; according to the weight score sequence, sample in sequence starting from the feature subset with the largest weight score until a first number of feature subsets is obtained as target feature subsets; wherein the first number is less than the total number of feature subsets corresponding to the feature set.
In practical implementation, the obtained scoring vector can be reordered directly to obtain a weight score sequence in which the weight scores are ordered from largest to smallest, and the first number of target feature subsets is obtained by sampling directly according to this weight score sequence. The first number may be a preset sampling number, denoted max_subset, and its value is less than the total number of feature subsets.
Taking M = 5 as an example, weight = [1, 4/6, 4/6, 1] is obtained; after reordering, weight' = [1, 1, 4/6, 4/6], from which it can be determined that W(1) = W(4) share the equal, higher weight score and W(2) = W(3) share the equal, lower weight score. If the preset sampling number is 10, the feature subsets of subset size 1 (C(1,5) = 5 of them) and of subset size 4 (C(4,5) = 5 of them), 10 in total, are taken as target feature subsets; this matches the preset sampling number, and sampling ends.
In some embodiments, the target feature subset may also be sampled by the size of the scaling factor. Referring to fig. 4, fig. 4 is a schematic diagram of a target feature subset sampling process provided in an embodiment of the present application. Based on fig. 3, step 102 may be implemented by the steps shown in fig. 4.
Step 1021, the first participant device performs regularization processing on the weight scores of the feature subsets to obtain a scaling coefficient corresponding to each weight score.
It should be noted that the scaling factor is actually obtained by performing a regularization (normalization) operation on the weight scores corresponding to the feature subsets obtained according to the formula (2) to obtain a scaling factor with a value range between 0 and 1, and the larger the scaling factor is, the higher the corresponding weight score is.
In some embodiments, the proportion of the weight of each feature subset in the total is calculated according to the scoring vector weight obtained by the foregoing calculation, and a proportion coefficient combination is obtained, which may be a proportion vector, denoted as p, and the calculation formula is as follows:
p = weight/(w_1 + ... + w_(M-1)) = [p_1, p_2, ..., p_(M-1)]   (4)
For example, taking M = 5, when the weight score weight = [1, 4/6, 4/6, 1] corresponding to each subset size is obtained through the above formula (2), the weights are normalized according to the above formula (4) to obtain the p value corresponding to each feature subset size:
the weight of the feature subset of size 1 corresponds to the value: p_1 = 1/(1 + 4/6 + 4/6 + 1) = 0.3;
the weight of the feature subset of size 2 corresponds to the value: p_2 = (4/6)/(1 + 4/6 + 4/6 + 1) = 0.2;
the weight of the feature subset of size 3 corresponds to the value: p_3 = (4/6)/(1 + 4/6 + 4/6 + 1) = 0.2;
the weight of the feature subset of size 4 corresponds to the value: p_4 = 1/(1 + 4/6 + 4/6 + 1) = 0.3;
finally, the scale vector p = [0.3, 0.2, 0.2, 0.3] is obtained. In addition, when the weight scores are given per individual feature subset, the resulting set of scaling coefficients p is likewise given per individual subset, each subset's weight score being divided by the sum of all weight scores.
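A minimal sketch of the regularization in formula (4), assuming the per-size form of the scoring vector; the variable names are illustrative only.

```python
# Sketch of formula (4): normalize the per-size weight scores into scaling
# coefficients p that sum to 1.
weight = [1.0, 4/6, 4/6, 1.0]          # weight scores for subset sizes 1..4 (M = 5)
total = sum(weight)
p = [w / total for w in weight]
print(p)                               # approximately [0.3, 0.2, 0.2, 0.3]
```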
step 1022, sampling from the plurality of feature subsets to obtain a plurality of target feature subsets based on the scaling coefficients corresponding to the weight scores.
After the weight scores corresponding to the feature subsets are obtained and regularized to obtain the corresponding scaling coefficients, in some embodiments the target feature subsets may be obtained by sampling as follows: the target feature subsets are sampled according to the scaling coefficients; the scaling coefficients corresponding to the weight scores are first ordered by size to obtain a scaling coefficient sequence, and then, following the order of the scaling coefficients in that sequence, the corresponding processing is performed on each scaling coefficient in turn. Referring to fig. 5, fig. 5 is another alternative schematic diagram of the sampling process of the target feature subsets provided by an embodiment of the present application; based on fig. 4, step 1022 may be implemented by the steps shown in fig. 5.
In step 201, the first participant device obtains the current sampling number, determines the product of the scaling factor and the current sampling number, and takes the product as the current capacity value.
Here, the current sampling number may be represented by i_subset, where i denotes the index, in the scaling coefficient sequence, of the scaling coefficient currently being processed, and can also be used to denote the current sampling round. If i starts from 1, then 1_subset indicates that the scaling coefficient currently being processed is the first scaling coefficient in the sequence, i.e., the first sampling round. When the scaling coefficient currently being processed is the first in the sequence, the current sampling number is the preset sampling threshold, denoted max_subset, that is, 1_subset = max_subset. In order to obtain a reasonable set of target feature subsets, the maximum sampling number of the current round is calculated in each sampling round.
In some embodiments, the maximum number of target feature subsets that can be sampled in the current sampling round, which may also be called the capacity of the current round, is determined from the product of the current scaling coefficient p_current and the current sampling number i_subset, i.e., capacity = i_subset × p_current.
Illustratively, taking M = 5 as an example, p is reordered as [p_1, p_(M-1), p_2, p_(M-2)], i.e., p = [p_1, p_4, p_2, p_3] = [0.3, 0.3, 0.2, 0.2]. In the first sampling round, p_current = p_1 = 0.3, the corresponding subset size is p_size = 1, and i_subset = max_subset (assumed to be 20), giving the capacity: capacity = 20 × 0.3 = 6.
Step 202, the number of feature subsets corresponding to the scaling factor is obtained.
As described above, in the first sampling round the number of feature subsets corresponding to subset size p_size = 1 is C(1,5) = 5, and when the condition for performing a second sampling round is satisfied, the number of feature subsets corresponding to subset size p_size = 4 is C(4,5) = 5.
And step 203, when the current capacity value is larger than the number, the feature subsets corresponding to the scaling coefficient are taken as target feature subsets.
Continuing the example, in the first sampling round the capacity is 6, that is, the number of feature subsets that can be accommodated in this round is 6. The number of feature subsets corresponding to subset size p_size = 1 is C(1,5) = 5, and 5 is less than 6, so all feature subsets with p_size = 1 are sampled as target feature subsets. The number of target feature subsets still to be sampled is then updated to max_subset - 5 = 15; that is, in the second sampling round, i_subset = 2_subset = 15.
And step 204, when the current capacity value is smaller than the number, a number of feature subsets equal to the current sampling number is randomly selected from the unselected feature subsets as target feature subsets.
Taking M = 5 as an example, after the first sampling round ends, 5 target feature subsets have been sampled and a second sampling round is needed, that is, steps 201 to 203 are executed again. Note that i_subset (i.e., 2_subset) is no longer max_subset (20) but is updated to 20 - 5 = 15, that is, 2_subset = 15; the capacity obtained in step 201 is then 15 × 0.3 = 4.5, which is rounded down to 4. In other words, in the second sampling round the number of feature subsets that can be accommodated is 4, while the number of feature subsets corresponding to subset size p_size = 4 is C(4,5) = 5, and 5 is greater than 4, so at the end of the second sampling round not all of the feature subsets with p_size = 4 can be sampled. In some embodiments, to ensure the fairness of the target feature subsets, the current traversal may be ended directly, and the remaining target feature subsets are obtained by random sampling from all the remaining feature subsets.
In the above example, not all of the feature subsets with p_size = 4 can be sampled as target feature subsets; the remaining 15 target feature subsets that need to be sampled can therefore be drawn at random from the remaining feature subsets (30 - 5 = 25), finally yielding the max_subset number of target feature subsets.
For each scaling coefficient in the scaling coefficient sequence, the scaling coefficients in p are thus traversed in turn, and steps 201 to 204 are executed in a loop until the target sampling number of target feature subsets is obtained.
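The sampling loop of steps 201 to 204 might be sketched as follows. This follows the variant described above in which, once the capacity of a round is too small, the traversal ends and the remaining target feature subsets are drawn at random from the unselected subsets; all function and variable names are assumptions for illustration.

```python
import random
from itertools import combinations
from math import comb

def sample_target_subsets(features, p_by_size, max_subset, seed=0):
    """Illustrative sketch of steps 201-204: p_by_size maps each subset size to its
    scaling coefficient; once a round's capacity is too small, the traversal ends and
    the remaining target subsets are drawn at random from the unselected subsets."""
    rng = random.Random(seed)
    M = len(features)
    pool = {size: list(combinations(features, size)) for size in range(1, M)}  # no empty/full set
    targets, remaining = [], max_subset

    # Traverse subset sizes from the largest scaling coefficient to the smallest.
    for size in sorted(p_by_size, key=p_by_size.get, reverse=True):
        if remaining <= 0:
            break
        capacity = int(remaining * p_by_size[size])   # step 201: current capacity value
        count = comb(M, size)                         # step 202: subsets of this size
        if capacity >= count:                         # step 203: take all subsets of this size
            targets.extend(pool.pop(size))
            remaining = max_subset - len(targets)
        else:                                         # step 204: end traversal, sample the rest randomly
            leftovers = [s for subsets in pool.values() for s in subsets]
            targets.extend(rng.sample(leftovers, remaining))
            remaining = 0
    return targets

# Usage for the M = 5 example: the 5 size-1 subsets are taken whole in round 1, and the
# other 15 target subsets are drawn at random from the 25 remaining subsets.
features = ["x1", "x2", "x3", "x4", "x5"]
p_by_size = {1: 0.3, 2: 0.2, 3: 0.2, 4: 0.3}
print(len(sample_target_subsets(features, p_by_size, max_subset=20)))   # 20
```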
In step 103, the linear regression model is trained based on the plurality of target feature subsets and the prediction results of the federated learning model for the target feature subsets, so as to obtain the model parameters when the linear regression model converges.
It should be noted that a target feature subset is missing some features (the missing features); since a feature subset with incomplete features cannot directly obtain a prediction result through the machine learning model (the federated learning model f in this embodiment), the missing features need to be filled in, and the model input (or training sample) corresponding to the federated learning model f needs to be constructed.
In some embodiments, the features missing from the target feature subset relative to the feature set of the training sample used for the federated learning model in step 101 may be determined as follows: determine the intersection of the target feature subset and the feature set in step 101; and determine the missing features of the target feature subset based on the obtained intersection, where the missing features are the features of the feature set that are not in the intersection.
Illustratively, consider a training sample used for the federated learning model, S = {age: 30, education: master, income: 15000, job: communications}, whose prediction by the federated learning model characterizes the client as a defaulting client. The features {age: 30, education: master} are provided by the Guest party and the features {income: 15000, job: communications} are provided by the Host party, so the sample actually contains 4 features. In practice, to ensure the security of the Host party's data, the features {income: 15000, job: communications} are treated as one host_feat feature, so the feature set corresponding to the training sample contains 2 features {age: 30, education: master} plus one host_feat feature, 3 features in total, corresponding to 8 feature subsets (including the empty set and the full set), specifically: the empty set, {age: 30}, {education: master}, {host_feat: {income: 15000, job: communications}}, {age: 30, education: master}, {age: 30, host_feat: {income: 15000, job: communications}}, {education: master, host_feat: {income: 15000, job: communications}}, and {age: 30, education: master, host_feat: {income: 15000, job: communications}}. Assuming the current target feature subset is {age: 30, host_feat: {income: 15000, job: communications}}, the missing feature {education} can be identified.
After the missing features in the target feature subset are determined, they need to be filled in to construct the model input for the federated learning model f, and the prediction result corresponding to the target feature subset is then obtained through the federated learning model f. In some embodiments, the missing features may be filled in as follows: determine a default value corresponding to each missing feature, and assign feature values to the missing features based on the default values.
In actual implementation, the default value of the missing feature may be set directly according to actual conditions, and a fixed value may be used as the default value of the missing feature.
Continuing the example, the missing feature in the target feature subset {age: 30, host_feat: {income: 15000, job: communications}} is education; the default value of the feature "education" can be preset to "bachelor", and the target feature subset is filled with {education: bachelor}.
In some embodiments, the default value corresponding to the missing feature may also be determined by: acquiring a training sample set used as a federal learning model; the following processing is performed for each feature subset: acquiring a characteristic value corresponding to the missing characteristic of the characteristic subset in each training sample of the training sample set; determining a target value corresponding to the missing feature based on the feature value corresponding to the missing feature in each training sample; and filling the target value corresponding to the corresponding missing feature in the target feature subset.
Illustratively, a training sample has features {income = 100, age = 20, height = 170}, and a feature subset is {age = 20, height = 170}; obviously the missing feature is income, and the machine learning model cannot make a prediction for the sample corresponding to this feature subset. The missing feature can be filled with the mean/mode/median of "income" (computed from the training data), and the prediction for the filled sample is used as an estimate of the model output. Suppose that, over the training sample set, the mean of the "income" feature is 50, the mean of the "height" feature is 150, and the mean of the "age" feature is 15. The feature subset {age = 20, height = 170} is then filled with {income = 50}, resulting in the filled sample {age = 20, height = 170, income = 50}.
After the missing features in the target feature subset are filled, the model input of the target feature subset for the federated learning model f is obtained. Continuing the example, the prediction result of the target feature subset {age = 20, height = 170} is determined by inputting {age = 20, height = 170, income = 50} into the federated learning model f to obtain the corresponding prediction (model output), i.e., f_x({age = 20, height = 170}) = f({age = 20, height = 170, income = 50}).
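The filling of missing features and the querying of the federated learning model f for the label of a target feature subset could look roughly like the following; federated_model_predict is a stand-in for the trained model f and is not an interface defined by the application, and the fill values are those of the example above.

```python
# Hedged sketch of the missing-feature filling; federated_model_predict stands in for
# the trained federated learning model f and is not an interface defined by the application.
def fill_and_predict(target_subset, full_feature_names, fill_values, federated_model_predict):
    """Fill every feature missing from target_subset with its preset default /
    mean / mode / median (fill_values), then query the federated model."""
    model_input = dict(target_subset)                 # features actually present in the subset
    for name in full_feature_names:
        if name not in model_input:                   # a missing feature
            model_input[name] = fill_values[name]     # assign its feature value
    return federated_model_predict(model_input)       # f_x(subset) = f(filled model input)

# Usage for the example above: f_x({age: 20, height: 170}) = f({age: 20, height: 170, income: 50}).
fill_values = {"income": 50, "height": 150, "age": 15}      # means computed from the training data
label = fill_and_predict({"age": 20, "height": 170},
                         ["income", "age", "height"],
                         fill_values,
                         federated_model_predict=lambda x: 0.7)   # stub prediction for illustration
print(label)
```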
The prediction result of the target feature subset under the federated learning model f is taken as the label data of the linear regression model h, the target feature subset is converted to obtain a corresponding target training sample used as a training sample of the linear regression model h, and the linear regression model h is trained according to the determined label information and training samples. Referring to fig. 6, fig. 6 is a schematic diagram of the training flow of the linear regression model provided in an embodiment of the present application; based on fig. 3, step 103 may be implemented by, and is described with reference to, the steps shown in fig. 6.
Step 301, the first participant device obtains a conversion relationship between the target feature subset and the target training sample.
It should be noted that the conversion relationship between the target feature subset and the target training sample is used to represent each feature in the target feature subset by 0 and 1 to obtain the target training sample, that is, each feature in the training sample for the linear regression model h is represented by 0 and 1, where 1 represents an actually existing feature in the target feature subset, and 0 represents a missing feature in the target feature subset.
And step 302, converting the features in each target feature subset based on the conversion relation to obtain a target training sample of the linear regression model.
The target feature subset is converted into a target training sample used as a linear regression model according to the conversion relationship obtained in step 301.
In some embodiments, reference may be made to fig. 7 for a specific manner of converting features in each target feature subset, where fig. 7 is a schematic diagram of a target feature subset conversion flow provided in an embodiment of the present application, and based on fig. 6, step 302 may be implemented by steps 3021 to 3022 shown in fig. 7.
Step 3021, the first participant device compares the target feature subset with the feature set; the features of the feature set that are absent from the target feature subset are used as missing features.
Here, the target feature subset is compared with the feature set in the training sample used as the federal learning model f in step 101, and the missing feature corresponding to the target feature subset is determined.
And step 3022, respectively performing feature value assignment on each missing feature, and filling the missing features of the target feature subset based on the assignment result to obtain a target training sample of the linear regression model.
Here, after the missing feature in the target feature subset is determined, 0 is used to represent the missing feature, and 1 is used to represent the normal feature in the target feature subset.
In some embodiments, referring to fig. 8, fig. 8 is a schematic diagram of target training samples provided in the embodiments of the present application; max_subset target training samples may form a matrix of max_subset rows, each row represents a target training sample used for the linear regression model, and the features of the target training samples are represented by 0 or 1. Taking the first row as an example, assume a federated learning model in which a Guest party and a Host party participate, where the Guest party provides p-1 features { X1, X2, …, Xp-1 } and the features provided by the Host party are regarded as one feature Xp; normal features in the target feature subset sampled from the p features, such as X1 and X2, are represented by 1, and missing features in the target feature subset, such as X3 and X4, are represented by 0.
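As a purely illustrative sketch of the conversion just described (the feature order and subset contents are assumptions of this example), the 0/1 target training samples can be built as follows:

    # Minimal sketch: map each sampled target feature subset to a 0/1 row vector.
    # Assumption: the full feature order is fixed as [X1, ..., Xp-1, host_feat].
    feature_order = ["X1", "X2", "X3", "X4", "host_feat"]

    def subset_to_row(target_subset):
        # 1 marks a feature present in the subset, 0 marks a missing feature.
        return [1 if name in target_subset else 0 for name in feature_order]

    sampled_subsets = [
        {"X1", "X2", "host_feat"},   # X3 and X4 missing -> 0
        {"X3", "X4"},
    ]
    training_matrix = [subset_to_row(s) for s in sampled_subsets]
    print(training_matrix)  # [[1, 1, 0, 0, 1], [0, 0, 1, 1, 0]]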
And 303, taking the prediction result of the target feature subset corresponding to the federal learning model as a sample label of a corresponding target training sample, and training the linear regression model to obtain model parameters when the linear regression model is converged.
Here, the linear regression model h is trained based on the target training samples determined in step 302 and the prediction results of the target feature subsets corresponding to the federal learning model f as label information (model output) of the linear regression model h. The training method of the model can refer to the existing model training method, and the embodiment of the application does not limit the training method of the linear regression model.
In some embodiments, the linear regression model h can be represented using the following function:
h(x) = w_1 x_1 + w_2 x_2 + … + w_n x_n + w_h_1 x_h_1 + … + w_h_k x_h_k
wherein the Guest party provides n features, the features provided by one Host are regarded as one feature, and k Hosts correspond to k features.
It should be noted that the feature values of the target training samples of the linear regression model are represented by 0 or 1; after all max_subset target training samples have participated in model training, the linear regression model reaches the convergence condition, and the parameters of the linear regression model are obtained as: { w_1, w_2, …, w_n, w_h_1, …, w_h_k }.
In step 104, based on the model parameters, contribution information of each feature in the feature set corresponding to the target prediction result is determined.
In some embodiments, the model parameters when the linear regression model converges are obtained, and the obtained model parameters include contribution information of the target prediction result corresponding to each feature in the feature set. For example, a linear mapping relation corresponding to a linear regression model is obtained, a target prediction result in the linear mapping relation is a dependent variable, each feature in a feature set is an independent variable, and a model parameter is a coefficient of the independent variable; and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters and the linear mapping relation.
Illustratively, for a 21-dimensional linear regression model,
h(x) = w_1 x_1 + w_2 x_2 + … + w_20 x_20 + w_h_1 x_h_1
when the values of x are all 1, namely:
h(1) = w_1 × 1 + w_2 × 1 + … + w_20 × 1 + w_h_1 × 1
The obtained linear regression model parameters are w_1, w_2, …, w_20, w_h_1. The first 20 weights may be directly approximated as the contributions of the feature values corresponding to the features provided by the Guest party in the feature set of the training sample used as the federated learning model f in step 101, and the last weight may be directly approximated as the overall contribution of the Host party.
Referring to fig. 9, fig. 9 is a schematic diagram of feature contributions of a linear regression model provided in this embodiment of the present application. The training sample used as the federated learning model is { age = 20, height = 170, income = 100 }, the prediction result of the federated learning model f is 1.2, and a linear regression model is trained such that h = 0.45 + 0.4 + 0.35 = 1.2. It can thus be determined that the feature age = 20 contributes 0.45 to the prediction result 1.2; the feature height = 170 contributes 0.4; and the feature income = 100 contributes 0.35.
From the foregoing description, in the estimation method provided in this embodiment of the present application, the Guest (leading) party in the federated learning model provides n features, and all features of the host party are treated as one federated feature host_feat, so that the Guest party has n + 1 features in total; the Guest party generates the 2^(n+1) subsets corresponding to the n + 1 features; a preset number of target feature subsets is sampled from the 2^(n+1) feature subsets by different sampling modes, and the Guest trains a linear regression model through the preset number of subsets (for example, from the original 2^21 = 2097152 subsets only 2048 subsets are sampled, which directly reduces the calculation amount exponentially) and outputs a weight vector; the first n terms of the weight vector are the estimated SHAP values of the features provided by the Guest party, and the (n+1)-th term is the estimated SHAP value of host_feat. The Guest party determines the feature importance of each feature and the total contribution value of the host party based on the SHAP values of the n feature values and the SHAP value of host_feat of each sample.
It should be noted that the estimated SHAP value of each feature in each sample, obtained through the trained linear regression model, is very close to the SHAP value calculated with a model trained on all 2^(n+1) subsets.
In the embodiment of the application, a first participant device obtains a plurality of feature subsets by obtaining a feature set of a training sample used as a federated learning model and a target prediction result of the federated learning model corresponding to the training sample, and combining features in the feature set, obtains weight scores of the feature subsets, and obtains a plurality of target feature subsets by sampling from the feature subsets based on the weight scores; training the linear regression model based on the target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges; and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters. By the method and the device, the contribution information of each feature in a single sample can be measured, the contribution information of the second participant can be measured, meanwhile, the calculated amount of the model can be greatly reduced, and the calculation efficiency is improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The feature importance scheme provided by the related longitudinal federated tree model can meet part of the requirements. Taking a longitudinal federated tree scenario in which two parties participate (a guest party and a host party) as an example, the feature importance scheme provided by the longitudinal federated tree model is briefly described as follows:
(1) a guest party initializes a table locally, the table comprises all local features of the guest and anonymous feature numbers sent by host, and the count value of each feature is 0;
(2) a decision tree is built; for the splitting feature used by each decision tree node, the corresponding count value in the table is increased by 1, or the splitting gain value (gain) is added;
(3) after the decision tree is built, a feature importance table is output. The feature importance table can be used to interpret the model: a feature with a high count value played a large role in the modeling process.
However, the feature importance scheme provided by the vertical federal tree model can satisfy the requirement (1), but cannot satisfy the requirements (2) and (3).
First, for requirement (2), the user cannot use the feature importance to interpret a single sample. For example, there is a sample S { age: 28, marital status: married, income: 10000, education: bachelor's degree, work: IT, native place: …, etc. }, which corresponds to a customer in default. The sample is predicted by the longitudinal federated learning model and the obtained score is 0.1. Business users hope to know how much each specific feature value (such as age = 28) in the sample S contributes to the final prediction score of 0.1, and whether the influence of the feature on the prediction score is positive or negative, so that business insights can be obtained by combining the model with real-life experience.
The influence of a specific feature value on the model output, and whether that influence is positive or negative, cannot be judged from the obtained feature importance alone, because feature importance only reflects how often a feature is used globally and therefore cannot be used to analyze a particular sample.
For requirement (3) above, although the feature importance reveals how many times the partner's features are used, the positive or negative effect of the partner's features on a single sample remains unknown. If the partner's features can be shown to have a large impact on many samples, this can be used to measure the value of the partner's features.
In order to solve the above problems in the related art, an embodiment of the present application provides a data processing method for a federated learning model, that is, a longitudinal federated machine learning model interpretation method based on the SHAP value. By combining the characteristics of the SHAP value, the method interprets the prediction results of the longitudinal federated learning model in a longitudinal federated learning scenario and measures the overall feature value of the partner. Meanwhile, a plurality of target subsets are screened from the feature subsets corresponding to the feature set of the training sample used as the federated learning model through a subset sampling strategy, the target subsets are mapped into vectors expressed by 0 and 1 and used as training samples of a linear regression model, the prediction result of the federated learning model for each target subset is used as the prediction target of the linear regression model, the linear regression model is trained, and the weight of each feature when the linear regression model converges is obtained as the estimated SHAP value of that feature, which can greatly reduce the calculation cost.
First, before describing a specific implementation process of a data processing method of the federated learning model, it is necessary to describe the SHAP-related knowledge first.
The SHAP value is a machine learning model interpretation scheme based on the Shapley value. To introduce the Shapley value: the Shapley value is a method in game theory for measuring the contribution of each participant (e.g., a company); the contribution of each party is calculated by considering the participant's marginal contribution to every subset of participants that does not include itself:
φ_j(val) = Σ_{S ⊆ {x_1, …, x_p} \ {x_j}} [ |S|! (p - |S| - 1)! / p! ] × ( val(S ∪ {x_j}) - val(S) )    (6)
where {x_1, …, x_p} is the set of all input features, p is the number of all input features, {x_1, …, x_p} \ {x_j} denotes the feature subsets not including x_j, val is a valuation function, val(S) is the prediction for the feature subset S, |S| is the number of features in the subset S, and φ_j(val) is the finally calculated Shapley value of the feature x_j.
Illustratively, assume that companies 1, 2, and 3 collaborate together to earn 120W, and a Shapley value is to be calculated for each company.
1) Enumerating all subsets: {1,2,3}, {1,2}, {2,3}, {1}, {2}, {3}, and null set.
2) Each company is added to a subset that does not contain itself, and the respective contribution margin is calculated, assuming:
N={1,2,3},
v({1})=0,v({2})=0,v({3})=0,
v({1,2})=90,v({1,3})=80,v({2,3})=70,
v({1,2,3})=120
v(empty set) = 0, where v is the valuation function val; v({1}) can be understood as the revenue being 0 when there is only company 1; v({2,3}) is understood to mean that the revenue is 70W when only companies 2 and 3 collaborate.
Then, for company 1, according to the above formula (6), company 1 is added to each subset that does not include itself, and the marginal contribution (also called marginal profit) to each subset is calculated as follows:
Company 1 joins the subset {2,3}, with a marginal contribution of: 2/6 × ( v({1,2,3}) - v({2,3}) ) = 100/6;
Company 1 joins the subset {3}, with a marginal contribution of: 1/6 × ( v({1,3}) - v({3}) ) = 80/6;
Company 1 joins the subset {2}, with a marginal contribution of: 1/6 × ( v({1,2}) - v({2}) ) = 90/6;
company 1 joins the empty set with marginal contributions: 0;
Therefore, the total contribution value (Shapley value) of company 1 to the joint profit of 120W is: 270/6 = 45W.
Similarly, the contributions of companies 2 and 3 can be calculated to be 40W and 35W by using the above formula.
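The three-company calculation above can be reproduced with a short enumeration following formula (6); the sketch below is illustrative only:

    from itertools import combinations
    from math import factorial

    # Coalition values v(S) from the example above (profits in units of 10,000).
    v = {frozenset(): 0, frozenset({1}): 0, frozenset({2}): 0, frozenset({3}): 0,
         frozenset({1, 2}): 90, frozenset({1, 3}): 80, frozenset({2, 3}): 70,
         frozenset({1, 2, 3}): 120}
    players = {1, 2, 3}

    def shapley(j):
        # Sum the weighted marginal contributions of player j over all subsets without j.
        p = len(players)
        others = players - {j}
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
                total += weight * (v[S | {j}] - v[S])
        return total

    print([round(shapley(j), 6) for j in (1, 2, 3)])  # [45.0, 40.0, 35.0]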
The SHAP value in the embodiment of the present application may be understood as a variant of the Shapley value, used to measure the contribution of a certain feature value to the prediction result (model output), specifically:
Assume a machine learning model f and an input sample X containing features {1, 2, 3, …, M}, where features are denoted by the subscript i and there are M features in total; N is the full feature set, S is a feature subset not containing i, and f_x is an estimation function that returns the average output of the model f under a given feature subset. The SHAP value calculation formula is consistent with the Shapley value formula, differing only in notation, as follows:
φ_i = Σ_{S ⊆ N \ {i}} [ |S|! (M - |S| - 1)! / M! ] × ( f_x(S ∪ {i}) - f_x(S) )    (7)
Continuing with the description of the calculation of the SHAP value, assume, for example, that there is a sample x { age = 20, height = 170, income = 100 } and a machine learning model f, and that f obtains a prediction score of 1.2 for the sample;
1) enumerating all feature subsets: the term "age" is 20, the term "height" is 170, the term "income" is 100, the term "age" is 20, the term "height" is 170, the term "age" is 20, the term "income" is 100, the term "height" is 170, the term "income" is 100, the term "age" is 20, the term "height" is 170, and the term "height" is 170.
2) Traversing each feature in the sample x, adding each feature value into a subset which does not contain the feature value, calculating respective marginal contribution, and calculating a prediction result corresponding to each feature subset by using an evaluation function f _ x, wherein the assumption is that:
f_x({ age = 20, height = 170, income = 100 }) = 1.2;
f_x({ age = 20, height = 170 }) = 0.9;
f_x({ age = 20, income = 100 }) = 0.8;
f_x({ height = 170, income = 100 }) = 0.7;
f_x({ income = 100 }) = 0;
f_x({ age = 20 }) = 0;
f_x({ height = 170 }) = 0;
f_x({ }) = 0;
Note that when the feature subset is the full set of features in the sample x, the prediction result calculated with the estimation function f_x is equal to the prediction score of the model f for the sample x, i.e., 1.2.
Next, the contribution of the feature value { age = 20 } to the prediction result 1.2 of the sample is calculated; { age = 20 } is sequentially added to the feature subsets that do not include it, and the marginal contribution values are calculated according to the above equation (6), as follows:
{ age = 20 } is added to the feature subset { height = 170, income = 100 }, with a marginal contribution of: 2/6 × (1.2 - 0.7) = 1/6;
{ age = 20 } is added to the feature subset { income = 100 }, with a marginal contribution of: 1/6 × (0.8 - 0) = 0.8/6;
{ age = 20 } is added to the feature subset { height = 170 }, with a marginal contribution of: 1/6 × (0.9 - 0) = 0.9/6;
{ age = 20 } is added to the empty set, with a marginal contribution of: 0;
The total contribution is thus 2.7/6 = 0.45;
Therefore, the feature value { age = 20 } contributes 0.45 to the final prediction score of 1.2.
Continuing with the estimation function f_x: when a sample is given and the SHAP value of each feature value is to be evaluated, an estimation method f_x must be determined that correctly reflects the average model output (prediction score) under a given feature subset. The following general schemes are provided:
Taking the sample { age = 20, height = 170, income = 100 } as an example, assume a machine learning model f, which in the embodiment of the present application is the longitudinal federated learning model:
in scheme (1), if a feature is missing in the subset, the mean/mode/median of the feature value of the feature is used instead.
Assume a feature subset { age = 20, height = 170 }; obviously, the machine learning model f cannot predict the sample in the absence of the "income" feature. In practical applications, the feature subset { age = 20, height = 170 } may be padded with the mean/mode/median corresponding to the feature "income" (which may be calculated from training data), obtaining a padded sample { age = 20, height = 170, income = T }, where T is the mean/mode/median of the feature "income"; the predicted value of the padded sample is then used as an estimate of the model output mean.
Assume that, calculated from the training sample set, the average value of the "income" feature is 50, the average value of the "height" feature is 150, and the average value of the "age" feature is 15. Then, after the feature subset { age = 20, height = 170 } is padded, the obtained padded sample is { age = 20, height = 170, income = 50 }, and
f_x({ age = 20, height = 170 }) = f({ age = 20, height = 170, income = 50 })
For another example, assume that there is a feature subset { age = 20 }; the feature subset is padded to obtain a padded sample { age = 20, height = 150, income = 50 }, and
f_x({ age = 20 }) = f({ age = 20, height = 150, income = 50 }),
In addition, if the feature subset is empty, the f_x output is the average of the prediction scores over all training samples.
Scheme (2), if a feature is missing in the subset, 0 is used instead directly.
Illustratively, assume that there is a feature subset { age = 20 }; the sample is then padded with 0, resulting in a padded sample { age = 20, height = 0, income = 0 }, and
f_x({ age = 20 }) = f({ age = 20, height = 0, income = 0 }).
In the scheme (3), if a certain feature is missing in the feature subset, multiple times of sampling are performed in the range of possible values of the missing feature to obtain a plurality of synthesized samples, and the average value of the predicted scores of the plurality of synthesized samples by the machine learning model is the f _ x output value.
Suppose that age is within the range of 10-30 and height is within a range starting at 150, and a predicted value f_x({ income = 100 }) of the feature subset { income = 100 } is to be calculated.
Assume that 5 samples are constructed: the missing features age and height in the feature subset are randomly sampled, obtaining the constructed samples s1 = { age = 15, height = 160, income = 100 }, s2 = { age = 21, height = 164, income = 100 }, s3 = { age = 29, height = 175, income = 100 }, s4 = { age = 25, height = 155, income = 100 }, s5 = { age = 28, height = 180, income = 100 }, and
f_x({ income = 100 }) = [ f(s1) + f(s2) + f(s3) + f(s4) + f(s5) ] / 5.
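A minimal sketch of scheme (3), assuming value ranges for the missing features and a stand-in model; all names and numbers below are assumptions introduced only for this example:

    import random

    value_ranges = {"age": (10, 30), "height": (150, 180)}  # assumed ranges for this sketch

    def toy_model(sample):
        # Stand-in for the federated learning model f.
        return 0.004 * sample["age"] + 0.002 * sample["height"] + 0.01 * sample["income"]

    def f_x_sampling(subset, missing, sample_n=5):
        # Build sample_n synthesized samples by randomly sampling each missing feature,
        # then average the model's prediction scores.
        scores = []
        for _ in range(sample_n):
            synthesized = dict(subset)
            for name in missing:
                low, high = value_ranges[name]
                synthesized[name] = random.uniform(low, high)
            scores.append(toy_model(synthesized))
        return sum(scores) / len(scores)

    print(f_x_sampling({"income": 100}, missing=["age", "height"]))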
Having covered the SHAP-related background above: in an actual federated learning scenario, calculating the SHAP value of a feature through the above formula (6) or formula (7) often requires a very large amount of calculation and places high requirements on the computing capability of the electronic device (server or terminal), thereby increasing the complexity of actual deployment.
In general, in an actual federated learning scenario, the guest may have features of one to several tens of dimensions and the host may have features of one to several thousand dimensions. According to the foregoing scheme, all features of the host are regarded as one feature, and then all subsets are enumerated according to the SHAP/Shapley calculation process; the number of subsets can still be enormous. For example, if the guest side provides 20-dimensional features and one host side has all of its features regarded as one feature, then to interpret any single sample the number of subsets that need to be completely enumerated is 2^(20+1) = 2097152; if 100 samples need to be interpreted, the number of communicated/predicted samples is 209715200, which is very computationally expensive.
Based on this, the subsets are sampled according to a certain strategy, and weighted linear regression is used to estimate the SHAP value, so as to greatly reduce the calculation cost.
Illustratively, the feature subset is represented by a simple 0,1 vector, h is the mapping function:
h({ age = 20, height = 170, income = 100 }) = [1,1,1],
h({ age = 20, height = 170 }) = [1,1,0],
h({ age = 20, income = 100 }) = [1,0,1],
h({ height = 170, income = 100 }) = [0,1,1],
h({ income = 100 }) = [0,0,1],
h({ age = 20 }) = [1,0,0],
h({ height = 170 }) = [0,1,0],
h({ }) = [0,0,0],
as can be seen from the foregoing description, each subset will have a corresponding evaluation result:
f_x({ age = 20, height = 170, income = 100 }) = 1.2,
f_x({ age = 20, height = 170 }) = 0.9,
f_x({ age = 20, income = 100 }) = 0.8,
f_x({ height = 170, income = 100 }) = 0.7,
f_x({ income = 100 }) = 0,
f_x({ age = 20 }) = 0,
f_x({ height = 170 }) = 0,
f_x({ }) = 0,
In practical implementation, a linear regression model is trained with the 0,1 vectors as X and the outputs of the f_x function as y, and each vector is given a certain weight according to the size of its subset (generally, the smallest and largest subsets receive high weights); the weights of the trained linear regression model are the estimated SHAP values of the respective features.
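For illustration only, the weighted linear regression step can be sketched with scikit-learn on the toy values listed above; the subset weighting function is an assumption of this sketch (very large weights on the empty and full subsets, higher weights on small and large subsets), not a value fixed by the application:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # 0/1 vectors for the subsets of {age, height, income} and their f_x outputs from above.
    X = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
                  [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]])
    y = np.array([1.2, 0.9, 0.8, 0.7, 0.0, 0.0, 0.0, 0.0])

    M = X.shape[1]
    def subset_weight(size):
        # Assumed weighting: the end points are pinned, small and large subsets count most.
        if size == 0 or size == M:
            return 1e6
        return (M - 1) / (size * (M - size))
    w = np.array([subset_weight(int(row.sum())) for row in X])

    reg = LinearRegression(fit_intercept=False).fit(X, y, sample_weight=w)
    print(reg.coef_)  # approximately [0.45, 0.40, 0.35], matching the contributions above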
In the federated learning scenario, the feature importance of the guest features and of the participating hosts can be estimated according to this method, and the contribution values are finally aggregated.
Next, a specific process of estimating the feature SHAP value by combining the features of the SHAP in the longitudinal federal learning scenario and training a linear regression model will be described.
To execute the data processing method of the federated learning model (the SHAP-value-based longitudinal federated machine learning model interpretation method) provided in the embodiment of the present application, the input and output information of the longitudinal federated learning model f needs to be set as follows:
the method comprises the following steps of training a data set (training samples), samples to be interpreted X0, X1, … and Xn, wherein N pieces are provided, Guest and host respectively hold partial features, a certain feature (such as age) provided by a Guest party is represented by capital G _ i (i is an integer), and a certain feature provided by a host party is represented by capital H _ i (i is an integer). For example, the guest party provides G _0, G _1,. and G _ n, n features; host provides m features, H _0, H _1, …, H _ m; the characteristic value (e.g., age-20) corresponding to the characteristic provided by the guest party is represented by a lower case g _ j (j is an integer), e.g., g _0, g _1, …, g _ n; the characteristic values corresponding to the characteristics provided by the host party, such as h _0, … and h _ m, are represented by lower case h _ j (j is an integer). If a plurality of host exists, the tail adding mark distinguishes each host, such as h _0_0, h _0_1, and host _0_2 respectively represent the first feature of host _0, the first feature of host _1, and the first feature of host _ 2.
Set values of the estimation function f _ x method: one of the three methods described above is used.
Sampling number sample _ n: if the f _ x method uses the third method described above, the parameter is active, representing the number of times the feature value is randomly sampled.
Subset sampling number max_subset: a fixed number, typically 2048, of subsets to sample when using linear regression to estimate the SHAP value.
The participant of the embodiment of the application: guest (leading) party (first party), host party (second party), and in addition, there may be a plurality of host parties.
Based on the aforementioned input information, the functional modules of the system implementing the SHAP-value-based longitudinal federated machine learning model interpretation method provided by the embodiment of the application are as follows: a subset sampling module, a sample interpretation module, and a prediction module. Referring to fig. 10, fig. 10 is an alternative flowchart of the data processing method of the federated learning model provided in an embodiment of the present application; the overall flow of the data processing method of the federated learning model is described with reference to fig. 10.
In step 401, the first party counts the number of features of the first party and the number of second parties participating in federated modeling.
Here, the first party (Guest) carries tag information, and the second party (Host) is a feature provider, assuming that k (k is an integer of 1 or more) hosts participate in the federal modeling.
The Guest party treats all features (H _0, … …, H _ m) of the Host party as a federal feature Host _ feat. Thus, the guest side has n + k features G _0, G _1, … …, G _ n, host _ feat _1, … …, host _ feat _ k.
In step 402, the first participant inputs feature subsets formed by features in the sample to be interpreted to the subset sampling module, and obtains a result of sampling the subset with a preset number according to a preset subset sampling strategy.
Here, the sample to be interpreted (i.e., a single training sample used as the federated learning model) can be regarded as containing n + k features in total, and the total number of subsets to sample is preset and recorded as max_subset.
The result of the subset sampling can be regarded as a matrix of max_subset rows and (n + k) columns. The matrix is kept by the guest as the training matrix; the guest copies the front n columns of the training matrix as an extra copy, called the subset matrix; the back k columns are sent to the corresponding hosts, and after receiving its column, each host stores it as its subset column. The guest also saves the corresponding weights, called the training weights.
max_subset is a number representing the number of subset samples, which can be understood as the number of training samples used for the linear regression model. Exemplarily, max_subset = 10, and one training sample of the machine learning model f is to be interpreted, where the guest party provides 4 features, i.e., n = 4, and only one host party participates, i.e., k = 1. The subset sampling result is then a matrix of max_subset × (n + k), i.e., 10 × 5 (10 rows and 5 columns), where the last column corresponds to the host side and the first 4 columns correspond to the guest side.
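Purely as an illustration of this split, with n = 4 guest columns and k = 1 host column as in the example above (the matrix contents are random placeholders):

    import numpy as np

    max_subset, n, k = 10, 4, 1
    rng = np.random.default_rng(0)
    # Placeholder sampling result: a 10 x 5 matrix of 0/1 entries.
    sampling_result = rng.integers(0, 2, size=(max_subset, n + k))

    training_matrix = sampling_result              # kept by the guest for the regression
    subset_matrix = sampling_result[:, :n].copy()  # extra copy of the front n guest columns
    host_subset_columns = sampling_result[:, n:]   # back k columns, sent to the host(s)

    print(training_matrix.shape, subset_matrix.shape, host_subset_columns.shape)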
In step 403, the first participant inputs each sample to be interpreted to the sample interpretation module, and obtains a contribution value to each feature in the sample.
The guest side obtains the SHAP value of each feature value of each sample and the SHAP value of host_feat from the feature contribution calculation module. The host party does not obtain any results.
The guest side sums, over all samples to be interpreted X0, X1, …, Xn, the absolute values of the SHAP values of the features [G_0, G_1, …, G_n] and the absolute value of the SHAP value of host_feat of each sample to be interpreted, to obtain the feature importance of each feature and the total contribution value of the host side.
The guest side sums, over all samples to be interpreted X0, X1, …, Xn, the positive SHAP values of the features [G_0, G_1, …, G_n] and the positive SHAP value of host_feat of each sample to be interpreted, to obtain the positive contribution degree of each feature.
The guest side sums, over all samples to be interpreted X0, X1, …, Xn, the negative SHAP values of the features [G_0, G_1, …, G_n] and the negative SHAP value of host_feat of each sample to be interpreted, to obtain the negative contribution degree of each feature.
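The aggregation of per-sample SHAP values into feature importance and positive/negative contribution degrees, as described above, might look like the following sketch; the SHAP matrix is invented for illustration:

    import numpy as np

    # Rows: samples to be interpreted; columns: SHAP values of [G_0, ..., G_n, host_feat].
    shap_values = np.array([[ 0.20, -0.05,  0.10],
                            [-0.15,  0.30,  0.05],
                            [ 0.05, -0.10, -0.20]])

    feature_importance = np.abs(shap_values).sum(axis=0)               # sums of absolute values
    positive_contribution = np.clip(shap_values, 0, None).sum(axis=0)  # sums of positive values
    negative_contribution = np.clip(shap_values, None, 0).sum(axis=0)  # sums of negative values

    print(feature_importance, positive_contribution, negative_contribution)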
And finally, the guest party outputs the result, and the host party does not output the result.
Next, the specific functions implemented by the subset sampling module are described in detail. The subset sampling module samples the subsets according to a certain weight policy and stores the results. The module takes as input the total number of features, and the subset sampling number max_subset is preset; assume that the total number of features is M, where M is a positive integer greater than 1. Referring to fig. 11, fig. 11 is a schematic view of the subset sampling process provided in an embodiment of the present application; based on fig. 10, step 402 may be implemented by the specific steps shown in fig. 11. The subset sampling process is described in connection with the steps shown in fig. 11.
At step 501, the first participant device enumerates the size of all possible feature subsets.
The sizes range from 1 to M-1 (the empty set and the full set are ignored). For example, taking M = 5, the number of subsets with one element is C(1,5) = 5, the number of subsets with two elements is C(2,5) = 10, the number of subsets with three elements is C(3,5) = 10, and the number of subsets with four elements is C(4,5) = 5, so the total number of subsets is: total = C(1,5) + C(2,5) + C(3,5) + C(4,5) = 5 + 10 + 10 + 5 = 30.
Step 502, performing weight scoring on the subsets with different sizes to obtain a scoring vector, wherein the different sizes refer to the number of the elements of the subsets.
The subsets are weight-scored through the weight scoring formula (2); finally, the feature subsets of all sizes from 1 to M-1 are scored to obtain a scoring vector weight:
weight = [w_1, …, w_(M-1)], where w_1 represents the weight score of subsets whose size is 1.
Taking M = 5 as an example, the subsets of all sizes from 1 to 4 are scored by the above equation (2), and the scoring vector weight = [1, 4/6, 4/6, 1] is obtained.
Step 503, calculating the proportion of the weight of each feature subset in the total according to the scoring vector to obtain a proportion vector.
Here, the scale vector is denoted as p, the scale vector corresponding to the weight vector is obtained by using the foregoing formula (4), and the value range of each element in the scale vector is 0 to 1.
For example, in the case of M = 5, weight = [1, 4/6, 4/6, 1] is obtained in step 502, so the p values corresponding to weight are: p_1 = 1/(1 + 4/6 + 4/6 + 1) = 0.3, p_2 = (4/6)/(1 + 4/6 + 4/6 + 1) = 0.2, p_3 = 0.2, and p_4 = 0.3, yielding the proportion vector p = [0.3, 0.2, 0.2, 0.3].
And step 504, sorting the proportion vectors from large to small to obtain sorted proportion vectors.
From the property of w (m) in equation (2): the larger and smaller subsets are weighted more strongly and the complementary sets are weighted equally.
For example, taking M = 5, W(1) = W(4) and W(2) = W(3), so the sorted order is p = [p_1, p_(M-1), p_2, p_(M-2)], that is, p = [p_1, p_4, p_2, p_3] = [0.3, 0.3, 0.2, 0.2].
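Steps 501 to 504 can be sketched as follows; the exact form of the weight scoring formula (2) appears in an earlier part of the description, so the weight_score function below is an assumption chosen only because it reproduces the example scoring vector [1, 4/6, 4/6, 1] for M = 5:

    from math import comb

    def weight_score(m, M):
        # Assumed form of formula (2): smaller and larger subset sizes score higher,
        # and complementary sizes score equally.
        return (M - 1) / (m * (M - m))

    M = 5
    total_subsets = sum(comb(M, m) for m in range(1, M))   # step 501: 30 for M = 5
    weight = [weight_score(m, M) for m in range(1, M)]     # step 502: [1, 4/6, 4/6, 1]
    total = sum(weight)
    p = [w / total for w in weight]                        # step 503: [0.3, 0.2, 0.2, 0.3]
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    p_sorted = [round(p[i], 4) for i in order]             # step 504: [0.3, 0.3, 0.2, 0.2]
    print(total_subsets, weight, p_sorted)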
And 505, determining the capacity number corresponding to the current sampling turn.
Note that the proportion coefficient corresponding to the current round is extracted and recorded as p_max, and the corresponding subset size is p_size. p_max is multiplied by the total number i_subset of samples that still need to be sampled in the current round, where i represents the sampling round; when i = 1, i.e., the first round of sampling, i_subset = max_subset. The capacity number is obtained as: capacity = i_subset × p_max.
For example, taking M = 5 and presetting max_subset = 20, given the sorted proportion vector and taking the first round of sampling as an example, p_max = p_1, p_size = 1, and the capacity number capacity = 0.3 × 20 = 6.
Step 506, comparing whether the current capacity number is larger than or equal to the number of the feature subsets corresponding to the current round, if so, going to step 507, otherwise, going to step 509.
Here, the capacity number obtained in step 505 is compared with C(p_size, M), and different operations are performed according to the comparison result.
Step 507, adding the vector representation corresponding to the feature subset with the size corresponding to the current sampling turn into the vector result.
For example, taking M = 5, in the first round of sampling with p_size = 1, the vectors [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], [0,0,0,0,1] are added to the vector result; the weight of each vector in the linear regression model is calculated as: W(p_size)/C(p_size, M).
For vectors whose subset size is 1, with M = 5 and p_size = 1, the weight in the linear regression model is w_1/C(1,5) = 1/5; the corresponding weight of each vector is added to the weight list w_list. The value of i_subset is then updated as i_subset = i_subset - C(p_size, M) for the next round of subset selection.
Step 508, the scoring vector is updated, and it is determined whether the scoring vector is empty; if so, step 509 is executed, otherwise step 503 is executed.
Here, w_(p_size) is removed from the above scoring vector weight; if weight finally becomes empty, go to step 509, otherwise go back to step 503, i.e., execute the loop and start the next round of subset selection.
In step 509, when the remaining current sampling number max_subset cannot accommodate the next complete group of subsets, random sampling is performed for the remaining number.
Here, the current sampling number is the updated max_subset, i.e., i_subset. The proportion p is calculated for the remaining weights in weight as the probability that a subset of each size is selected; subset sizes are then randomly sampled max_subset times, a 0,1 vector is randomly generated according to each sampled subset size and added to the vector result, and the corresponding weight is sum(weight)/max_subset, i.e., the remaining weight divided by the number of remaining samples; step 507 is then executed.
Step 510, all subset sampling is completed, and the vector result is returned along with the corresponding weight list.
Illustratively, referring to the vector result shown in fig. 8, it should be noted that the vector result may be regarded as a vector matrix composed of 0s and 1s; each row is mapped from a sampled feature subset, the mapping rule being that missing features in the subset are represented by 0 and normal features in the subset are represented by 1, and the total number of rows of the vector matrix represents the total number of sampled subsets.
Next, the function of the sample interpretation module will be described, and the sample interpretation module takes a sample X to be interpreted as an input. Referring to fig. 12, fig. 12 is a schematic diagram of a sample explanation flow provided in an embodiment of the present application, and based on fig. 10, step 403 may be implemented by the specific steps shown in fig. 12. The sample interpretation process is explained in conjunction with the steps shown in fig. 12.
Step 601, the first participant and the second participant respectively input the characteristic part of the held sample to be explained into a prediction module to obtain a prediction value as a reference value.
Here, the sample X to be interpreted is a sample used as the federated learning model. The set value of the f_x method is looked up: if it is method 1 described above, the guest/host calculates the mean, median, or mode of each feature from the training data as the filling value; if it is method 2, 0 is used as the padding value; if it is method 3, the guest and the host count the possible values of each feature and use them for random sampling.
Step 602, the first party loads its own subset matrix, the second party loads its own subset column, and the same number of samples to be predicted by the federated learning model are generated.
Here, the guest generates a corresponding subset according to the 0,1 feature vector of each row in the subset matrix, and then generates a sample to be predicted by f_x according to the subset and the guest features of X. The host likewise generates, from each row in its subset column, samples to be predicted by f_x.
The rule for generating a sample to be predicted used as the federated learning model is as follows: if the entry is 0, the host generates a new sample using the corresponding f_x scheme; otherwise, if the entry is 1, the host simply copies the host features of X as the new sample.
Thus, the guest and the host will generate the same number of corresponding samples to be predicted by f_x.
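Purely as an illustration of the generation rule just described, using the scheme (1) mean-filling for missing features on the guest side; all feature names and values below are assumptions of this sketch:

    # Guest-side sketch: build one f_x prediction sample from a 0/1 row of the subset matrix.
    guest_features = ["age", "height", "income", "education"]
    guest_values   = {"age": 28, "height": 172, "income": 10000, "education": 3}
    guest_means    = {"age": 15, "height": 150, "income": 50, "education": 1}

    def build_guest_sample(row):
        # 1 -> copy the real feature value of X; 0 -> fill with the f_x scheme (here: the mean).
        return {name: (guest_values[name] if bit == 1 else guest_means[name])
                for name, bit in zip(guest_features, row)}

    subset_matrix_row = [1, 0, 1, 0]
    print(build_guest_sample(subset_matrix_row))
    # The host applies the same rule to its own subset column and its own feature values.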
and 603, the first participant and the second participant respectively input the samples to be predicted generated by the first participant into the prediction module, and the federal prediction is executed to obtain a prediction result of each sample to be predicted corresponding to the federal learning model.
Here, the prediction results of the samples from the federated learning model may also be referred to as the score vector of the samples; each item of the score vector, minus the reference value, is used as the corresponding item of y.
And step 604, the first participant trains a linear regression model by using the training matrix as a sample, the training weight as a weight and the predicted value of the corresponding federated learning model of the sample to be predicted as a label.
Here, the predicted values y of the samples to be predicted from the federated learning model are used as the labels of the linear regression model. The trained linear regression model may have no intercept (i.e., bias) term. It should be noted that the first n terms of the linear regression model correspond to the contributions of the guest features and the last k terms correspond to the contribution of each host; the weights are extracted and returned as the interpretation result.
Finally, the prediction module is explained: the prediction module takes a specific sample as input, the longitudinal federated machine learning model f is loaded by the guest and the host, the input sample is predicted according to f, and the predicted value is returned.
The optimized longitudinal federated learning model interpretation method provided by the embodiment of the application can determine the feature contribution degree of each feature provided by the Guest party in a single sample and measure the feature contribution of the Host party; the same holds in the multi-Host case, which supplements what was missing in the federated scenario. In addition, the measurement of global feature importance provided by the traditional method is also retained. The optimization of the subset sampling greatly reduces communication and computational complexity, so that interpreting multiple samples and measuring contributions become practical in a federated scenario.
Continuing with the exemplary structure of the data processing device 555 of the federal learning model provided in this application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the data processing device 555 of the federal learning model in the memory 540 may include:
an obtaining module 5551, configured to obtain a feature set of a training sample used as a federated learning model and a target prediction result of the federated learning model corresponding to the training sample, and combine features in the feature set to obtain a plurality of feature subsets, where the feature set includes: a first party provided feature having tag information and at least one second party provided feature;
a sampling module 5552, configured to obtain a weight score of each feature subset, and sample a plurality of target feature subsets from the plurality of feature subsets based on the size of the weight score;
the training module 5553 is configured to train the linear regression model based on the multiple target feature subsets and the prediction results of the federate learning model corresponding to the target feature subsets to obtain model parameters when the linear regression model converges;
a determining module 5554, configured to determine, based on the model parameters, contribution information of each feature in the feature set corresponding to the target prediction result.
In some embodiments, the sampling module 5552 is further configured to sort the weight scores of the feature subsets in an order from large to small, so as to obtain a weight score sequence; according to the weight fraction sequence, sequentially sampling from the feature subset with the largest weight fraction to obtain a first number of feature subsets as target feature subsets; wherein the first number is less than the total number of feature subsets corresponding to the feature set.
In some embodiments, the sampling module 5552 is further configured to perform regularization on the weight scores of the feature subsets to obtain a scaling coefficient corresponding to each weight score; and sampling a plurality of target feature subsets from the plurality of feature subsets based on the scaling coefficients corresponding to the weight scores.
In some embodiments, the sampling module 5552 is further configured to sort the scaling coefficients corresponding to the weight fractions according to the size of the scaling coefficients, so as to obtain a scaling coefficient sequence; according to the sequence of the scale coefficients in the scale coefficient sequence, the following processing is sequentially executed on each scale coefficient until a target feature subset of the target sampling number is obtained: acquiring the current sampling number, determining the product of the proportional coefficient and the current sampling number, and taking the product as the current capacity value; acquiring the number of feature subsets corresponding to the proportionality coefficient; when the current capacity value is larger than the number, the feature subset corresponding to the proportionality coefficient is used as a target feature subset; and when the current capacity value is smaller than the quantity, randomly selecting the feature subset with the quantity same as the current sampling quantity from the unselected feature subsets as a target feature subset.
In some embodiments, the sampling module 5552 is further configured to perform the following for each scaling factor: acquiring the current sampling number, determining the product of the proportional coefficient and the current sampling number, and taking the product as the current capacity value; acquiring the number of feature subsets corresponding to the proportionality coefficient; and when the current capacity value is larger than or equal to the quantity, taking the characteristic subset corresponding to the scaling coefficient as a target characteristic subset. And when the current capacity value is smaller than the quantity, randomly selecting the quantity of feature subsets from the feature subsets which are not selected as target feature subsets.
In some embodiments, the training module 5553 is further configured to obtain a transformation relationship between the target feature subset and the target training sample; converting the features in the target feature subset based on the conversion relation to obtain a target training sample of a linear regression model; and taking the prediction result of the target characteristic subset corresponding to the federal learning model as a sample label of a corresponding target training sample, and training the linear regression model to obtain model parameters when the linear regression model is converged.
In some embodiments, the training module 5553 is further configured to perform the following for each target feature subset: comparing the target characteristic subset with the characteristic set to obtain the characteristic of the target characteristic subset which is different from the characteristic set and used as the missing characteristic; and respectively carrying out characteristic value assignment on each missing characteristic, and filling the missing characteristics of the target characteristic subset based on the assignment result to obtain a target training sample of the linear regression model.
In some embodiments, the training module 5553 is further configured to determine a default value for each missing feature; assigning a feature value to the missing feature based on a default value.
In some embodiments, the determining module 5554 is further configured to obtain a linear mapping relationship corresponding to the linear regression model, where a target prediction result in the linear mapping relationship is a dependent variable, each feature in the feature set is an independent variable, and the model parameter is a coefficient of the independent variable; and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters and the linear mapping relation.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated.
The embodiment of the present application provides a computer program product, which includes a computer program, and is characterized in that when being executed by a processor, the computer program implements the data processing method of the federal learning model provided in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, a data processing method of the federal learning model as shown in fig. 3.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the application range in a federal learning scene can be expanded through the embodiment of the application, the contribution information of each feature in a single training sample and the contribution information of a second participant can be accurately measured, meanwhile, the calculation amount of the model can be greatly reduced, and the calculation efficiency is improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A data processing method of a federated learning model is applied to a first participant device, and comprises the following steps:
obtaining a feature set of a training sample used as a federated learning model and a target prediction result of the training sample corresponding to the federated learning model, and combining features in the feature set to obtain a plurality of feature subsets, wherein the feature set comprises: a first party provided feature having tag information and at least one second party provided feature;
acquiring a weight score of each feature subset, and sampling from the plurality of feature subsets to obtain a plurality of target feature subsets based on the weight scores;
training a linear regression model based on a plurality of target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges;
and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters.
2. The method of claim 1, wherein sampling a plurality of target feature subsets from the plurality of feature subsets based on the magnitude of the weight scores comprises:
sequencing the weight scores of the feature subsets according to the descending order of the weight scores to obtain a weight score sequence;
according to the weight fraction sequence, sequentially sampling from the feature subset with the largest weight fraction to obtain a first number of feature subsets serving as target feature subsets;
wherein the first number is less than a total number of feature subsets corresponding to the feature set.
3. The method of claim 1, wherein sampling a plurality of target feature subsets from the plurality of feature subsets based on the magnitude of the weight scores comprises:
regularizing the weight fraction of each feature subset to obtain a proportional coefficient corresponding to each weight fraction;
and sampling a plurality of target feature subsets from the plurality of feature subset samples based on the scaling coefficients corresponding to the weight scores.
4. The method of claim 3, wherein sampling a plurality of target feature subsets from the plurality of feature subset samples based on the scaling factors corresponding to each of the weight scores comprises:
sorting the proportional coefficients corresponding to the weight scores according to the size of the proportional coefficients to obtain a proportional coefficient sequence;
according to the sequence of the scale coefficients in the scale coefficient sequence, sequentially performing the following processing on each scale coefficient until a target feature subset of a target sampling number is obtained:
acquiring the current sampling number, determining the product of the proportional coefficient and the current sampling number, and taking the product as the current capacity value;
acquiring the number of feature subsets corresponding to the proportionality coefficient;
when the current capacity value is larger than the quantity, taking the feature subset corresponding to the proportionality coefficient as a target feature subset;
and when the current capacity value is smaller than the quantity, randomly selecting the feature subset with the quantity same as the current sampling quantity from the feature subsets which are not selected as target feature subsets.
5. The method of claim 3, wherein sampling a plurality of target feature subsets from the plurality of feature subset samples based on the scaling factors corresponding to each of the weight scores comprises:
performing the following processing for each of the scale coefficients:
acquiring the current sampling number, determining the product of the proportional coefficient and the current sampling number, and taking the product as the current capacity value;
acquiring the number of feature subsets corresponding to the proportionality coefficient;
when the current capacity value is larger than or equal to the quantity, taking the feature subset corresponding to the proportionality coefficient as a target feature subset
And when the current capacity value is smaller than the quantity, randomly selecting the quantity of the feature subsets as target feature subsets from the feature subsets which are not selected.
6. The method according to claim 1, wherein the training a linear regression model based on the plurality of target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model to obtain model parameters when the linear regression model converges comprises:
obtaining a conversion relation between a target feature subset and a target training sample;
converting the features in each target feature subset based on the conversion relation to obtain a target training sample of the linear regression model;
and taking the prediction result of the target feature subset corresponding to the federated learning model as a sample label of a corresponding target training sample, and training the linear regression model to obtain model parameters when the linear regression model is converged.
7. The method of claim 6, wherein the transforming the features in each of the target feature subsets based on the transformation relationship to obtain target training samples of the linear regression model comprises:
performing the following processing respectively for each target feature subset:
comparing the target feature subset with the feature set to obtain features which are different from the feature set and serve as missing features;
and respectively carrying out characteristic value assignment on each missing characteristic, and filling the missing characteristics of the target characteristic subset based on assignment results to obtain a target training sample of the linear regression model.
8. The method of claim 7, wherein the assigning a feature value to each of the missing features comprises:
determining a default value corresponding to each missing feature;
assigning a feature value to the missing feature based on the default value.
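A sketch of the conversion in claims 7 and 8: the features of the feature set that a target feature subset lacks are treated as missing features and filled with default values, so that every converted row has the same length as the full feature set. The dict-based representation and the helper name fill_missing are assumptions for illustration, not taken from the patent.

```python
def fill_missing(feature_set, sample_values, subset, default_values):
    """Convert one target feature subset into a full-length target training sample.

    feature_set:    ordered list of all feature names.
    sample_values:  the original training sample's value for every feature.
    subset:         the target feature subset (features retained from the sample).
    default_values: the default value assigned to each missing feature (claim 8).
    """
    return [
        sample_values[f] if f in subset else default_values[f]
        for f in feature_set
    ]

# Illustrative call with made-up data:
feature_set = ["age", "income", "balance"]
sample_values = {"age": 35, "income": 8000, "balance": 1200}
defaults = {f: 0 for f in feature_set}
row = fill_missing(feature_set, sample_values, {"age", "balance"}, defaults)
# row == [35, 0, 1200]
```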
9. The method according to claim 1, wherein the determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters comprises:
acquiring a linear mapping relation corresponding to the linear regression model, wherein in the linear mapping relation the target prediction result is the dependent variable, each feature in the feature set is an independent variable, and the model parameters are the coefficients of the independent variables;
and determining contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters and the linear mapping relation.
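Claim 9 reads the contributions off the linear mapping in which the target prediction result is the dependent variable, the features are the independent variables, and the model parameters are their coefficients. One common realisation (an assumption here, not a quotation of the patent) credits each feature with its coefficient times its value:

```python
def feature_contributions(feature_set, sample_values, coef, intercept):
    """Attribute the surrogate output to individual features.

    Under the linear mapping  y ≈ intercept + sum_i coef[i] * x_i,
    the term coef[i] * x_i is taken as the contribution of feature i to the
    target prediction result, and the intercept is the feature-free baseline.
    """
    contributions = {
        name: coef[i] * sample_values[name]
        for i, name in enumerate(feature_set)
    }
    return intercept, contributions
```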
10. A data processing apparatus of a federated learning model, comprising:
an obtaining module, configured to obtain a feature set of a training sample used as a federated learning model and a target prediction result of the training sample corresponding to the federated learning model, and combine features in the feature set to obtain a plurality of feature subsets, where the feature set includes: a first party provided feature having tag information and at least one second party provided feature;
a sampling module, configured to acquire the weight score of each feature subset, and sample a plurality of target feature subsets from the plurality of feature subsets based on the weight scores;
a training module, configured to train a linear regression model based on the plurality of target feature subsets and the prediction results of the target feature subsets corresponding to the federated learning model, to obtain model parameters when the linear regression model converges;
and a determining module, configured to determine contribution information of each feature in the feature set corresponding to the target prediction result based on the model parameters.
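For orientation only, a skeleton of how the four modules of the apparatus in claim 10 might be organised; every class and method name is hypothetical, and the bodies are placeholders for the sketches given after claims 4, 6, 8 and 9.

```python
class FederatedExplanationApparatus:
    """Obtaining / sampling / training / determining modules of claim 10 (names illustrative)."""

    def obtain(self, training_sample, federated_model):
        """Obtaining module: feature set, target prediction result and all feature subsets."""
        raise NotImplementedError

    def sample(self, feature_subsets, weight_scores, target_count):
        """Sampling module: weight scores -> scaling coefficients -> target feature subsets."""
        raise NotImplementedError

    def train(self, target_feature_subsets, federated_model):
        """Training module: fit the surrogate linear regression until convergence."""
        raise NotImplementedError

    def determine(self, model_parameters, feature_set):
        """Determining module: per-feature contribution to the target prediction result."""
        raise NotImplementedError
```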
11. An electronic device, comprising:
a memory for storing executable instructions;
a processor configured to implement the data processing method of the federated learning model of any one of claims 1 to 9 when executing the executable instructions stored in the memory.
12. A computer-readable storage medium having executable instructions stored thereon, wherein the executable instructions, when executed by a processor, implement the data processing method of the federated learning model according to any one of claims 1 to 9.
13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method of the federated learning model according to any one of claims 1 to 9.
CN202110736203.7A 2021-06-30 2021-06-30 Data processing method and device of federal learning model and storage medium Pending CN113326900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110736203.7A CN113326900A (en) 2021-06-30 2021-06-30 Data processing method and device of federal learning model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110736203.7A CN113326900A (en) 2021-06-30 2021-06-30 Data processing method and device of federal learning model and storage medium

Publications (1)

Publication Number Publication Date
CN113326900A true CN113326900A (en) 2021-08-31

Family

ID=77423524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110736203.7A Pending CN113326900A (en) 2021-06-30 2021-06-30 Data processing method and device of federal learning model and storage medium

Country Status (1)

Country Link
CN (1) CN113326900A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240101A (en) * 2021-12-02 2022-03-25 支付宝(杭州)信息技术有限公司 Risk identification model verification method, device and equipment
CN114565030A (en) * 2022-02-17 2022-05-31 北京百度网讯科技有限公司 Feature screening method and device, electronic equipment and storage medium
CN114565030B (en) * 2022-02-17 2022-12-20 北京百度网讯科技有限公司 Feature screening method and device, electronic equipment and storage medium
CN115034400A (en) * 2022-04-21 2022-09-09 建信金融科技有限责任公司 Business data processing method and device, electronic equipment and storage medium
CN115034400B (en) * 2022-04-21 2024-05-14 建信金融科技有限责任公司 Service data processing method and device, electronic equipment and storage medium
CN115277454A (en) * 2022-07-28 2022-11-01 中国人民解放军国防科技大学 Aggregation communication method for distributed deep learning training
CN115277454B (en) * 2022-07-28 2023-10-24 中国人民解放军国防科技大学 Aggregation communication method for distributed deep learning training

Similar Documents

Publication Publication Date Title
JP7000341B2 (en) Machine learning-based web interface generation and testing system
CN113326900A (en) Data processing method and device of federal learning model and storage medium
WO2021027256A1 (en) Method and apparatus for processing interactive sequence data
WO2015062209A1 (en) Visualized optimization processing method and device for random forest classification model
Helwig Adding bias to reduce variance in psychological results: A tutorial on penalized regression
JP2014130408A (en) Graph preparation program, information processing device, and graph preparation method
CN108280104A (en) The characteristics information extraction method and device of target object
KR20190084866A (en) Collaborative filtering method, device, server and storage medium combined with time factor
CN111242310A (en) Feature validity evaluation method and device, electronic equipment and storage medium
Weber et al. Predicting default probabilities in emerging markets by new conic generalized partial linear models and their optimization
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN112561031A (en) Model searching method and device based on artificial intelligence and electronic equipment
CN112070310A (en) Loss user prediction method and device based on artificial intelligence and electronic equipment
CN112487199A (en) User characteristic prediction method based on user purchasing behavior
CN110415103A (en) The method, apparatus and electronic equipment that tenant group mentions volume are carried out based on variable disturbance degree index
CN108255706A (en) Edit methods, device, terminal device and the storage medium of automatic test script
CN110349007A (en) The method, apparatus and electronic equipment that tenant group mentions volume are carried out based on variable discrimination index
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
Dutta et al. An overview of computational tools for preparing, constructing and using resistance surfaces in connectivity research
CN110245310A (en) A kind of behavior analysis method of object, device and storage medium
CN113326948A (en) Data processing method, device, equipment and storage medium of federal learning model
Kozlova et al. Development of the toolkit to process the internet memes meant for the modeling, analysis, monitoring and management of social processes
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN114298327A (en) Data processing method and device of federal learning model and storage medium
CN114418120A (en) Data processing method, device, equipment and storage medium of federal tree model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination