CN111325353A - Method, device, equipment and storage medium for calculating contribution of training data set - Google Patents
Method, device, equipment and storage medium for calculating contribution of training data set
- Publication number
- CN111325353A (application number CN202010123970.6A)
- Authority
- CN
- China
- Prior art keywords
- training data
- data set
- shap
- calculating
- contribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method, a device, equipment and a storage medium for calculating the contribution of a training data set, relating to the field of financial technology. The method comprises the following steps: acquiring each training data set used to train a machine learning model; calculating Shapley additive explanations (SHAP) target values of each feature in the training data set; and calculating the contribution degree of the training data set according to the SHAP target values of the features in the training data set. By deriving the contribution degree of each training data set from the SHAP target values of its features, the method and device make it possible to evaluate how important each training data set is to training the machine learning model, to select the training data sets for training more accurately, and thereby to improve the prediction accuracy of the trained machine learning model.
Description
Technical Field
The invention relates to the technical field of data processing of financial technology (Fintech), in particular to a method, a device, equipment and a storage medium for calculating contribution of a training data set.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Data processing technology is no exception; however, the financial industry's demands on security and real-time performance also place higher requirements on this technology.
Data is essential to machine learning modeling: a set of high-quality data helps improve the performance of a machine learning model, which in turn raises business revenue or lowers operating costs. As machine learning is deployed in practice at scale, the value of data becomes increasingly prominent, and data gradually evolves into an asset. With the development and popularization of information technology, the Internet, big data and related technologies, all industries have accumulated large amounts of data, giving machine learning modeling a wider range of data to choose from. In actual modeling, business personnel often use data from different suppliers and of different categories; the data sets of the various parties contain different features and contribute differently to the performance of the machine learning model. The contribution degree of a data set not only helps in understanding the model, but can also serve as a reference index for pricing when data sets are traded: party A is willing to spend more on data sets with a high contribution, and party B will ask a higher price for such data sets.
Currently there are methods of calculating the importance of a single feature in a training dataset, but there is no method of evaluating the importance of each training dataset when modeling multiparty data. Therefore, how to calculate the contribution degree of each training data set is an urgent problem to be solved.
Disclosure of Invention
The invention mainly aims to provide a method, a device, equipment and a storage medium for calculating contribution of a training data set, and aims to solve the technical problem of how to calculate the contribution of each training data set in the prior art.
In order to achieve the above object, the present invention provides a method for calculating a contribution of a training data set, including the steps of:
acquiring each training data set of a training machine learning model;
calculating Shapley additive explanations (SHAP) target values of each feature in the training data set;
and calculating the contribution degree of the training data set according to the SHAP target value of each feature in the training data set.
Preferably, the step of calculating the SHAP target values of the respective features in the training data set comprises:
calculating SHAP values corresponding to the features in the training data set, and calculating absolute values of the SHAP values corresponding to the features to obtain SHAP absolute values corresponding to the features;
determining each feature in the training data set as a target feature, and obtaining a SHAP target value corresponding to the target feature according to SHAP absolute values of the target feature in different training data sets.
Preferably, the step of obtaining the target value of the SHAP corresponding to the target feature according to the SHAP absolute values of the target feature in different training data sets includes:
determining SHAP absolute values of the target features in different training data sets, and calculating SHAP average values corresponding to the SHAP absolute values of the target features in the different training data sets;
and determining the SHAP average value as a SHAP target value corresponding to the target feature.
Preferably, the step of calculating the SHAP value corresponding to each feature in the training data set includes:
calculating marginal profit expectation corresponding to each feature in the training data set;
and calculating the SHAP value of each characteristic corresponding to the marginal profit expectation according to the marginal profit expectation to obtain the SHAP value corresponding to each characteristic in the training data set.
Preferably, the step of calculating the contribution degree of the training data set according to the SHAP target value of each feature in the training data set comprises:
determining SHAP target values of all the characteristics in the training data set, and determining the number of data sets of the training data set where all the characteristics are located;
and calculating the contribution degree of the training data set according to the SHAP target value corresponding to each feature in the training data set and the number of the data sets.
Preferably, the step of calculating the contribution degree of the training data set according to the SHAP target value and the number of the data sets corresponding to each feature in the training data set includes:
calculating a quotient value between the SHAP target value corresponding to each feature in the training data set and the number of the data sets;
and adding the quotient values corresponding to the features in the training data set to obtain the contribution degree of the training data set.
Preferably, after the step of calculating the contribution degree of the training data set according to the SHAP target value of each feature in the training data set, the method further includes:
selecting a target training data set for training the machine learning model according to the contribution degree of each training data set;
inputting the target training data set into the machine learning model to train the machine learning model.
Further, to achieve the above object, the present invention provides a contribution calculation apparatus for a training data set, including:
the acquisition module is used for acquiring each training data set of the training machine learning model;
the calculation module is used for calculating Shapley additive explanations (SHAP) target values of each feature in the training data set, and calculating the contribution degree of the training data set according to the SHAP target values of the features in the training data set.
Further, to achieve the above object, the present invention provides a contribution calculation apparatus for a training data set, including a memory, a processor, and a contribution calculation program of a training data set stored on the memory and executable on the processor, wherein the contribution calculation program, when executed by the processor, implements the steps of the contribution calculation method of a training data set corresponding to a federated learning server.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a contribution degree calculation program of a training data set, which when executed by a processor, realizes the steps of the contribution degree calculation method of the training data set as described above.
According to the method, after the training data sets used to train the machine learning model are obtained, the SHAP target values of the features in each training data set are calculated, and the contribution degree of each training data set is calculated from those SHAP target values. Because the contribution degree of each training data set is obtained from the SHAP target values of its features, the importance of each training data set in training the machine learning model can be evaluated through its contribution degree, the training data sets used for training can be selected conveniently and accurately, and the prediction accuracy of the trained machine learning model is thereby improved.
Drawings
FIG. 1 is a schematic flow chart diagram of a first embodiment of a method for calculating contribution of a training data set according to the present invention;
FIG. 2 is a flow chart illustrating a second embodiment of a method for calculating contribution of a training data set according to the present invention;
FIG. 3 is a block diagram illustrating the functions of a preferred embodiment of the apparatus for calculating contribution of training data set according to the present invention;
fig. 4 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a method for calculating contribution of a training data set, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the method for calculating contribution of the training data set according to the invention.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown.
The contribution calculation method of the training data set is applied to a server or a terminal; the terminal may be a mobile terminal such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer or a Personal Digital Assistant (PDA), or a fixed terminal such as a digital TV or a desktop computer. In the following embodiments of the contribution calculation method, the executing entity is omitted for convenience of description. The contribution calculation method of the training data set comprises the following steps:
In step S10, training data sets for training the machine learning model are acquired.
Each training data set for training the machine learning model is obtained. In this embodiment, each training data set may be acquired when a training instruction for training the machine learning model is received, where the training instruction may be triggered by a user as needed, or may be triggered at fixed times by a preset timing task, whose period can be set according to specific needs. The machine learning models include, but are not limited to, linear regression models, logistic regression models, tree models and random forest models, and the user can select which machine learning model to use according to specific needs. Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative relationship of interdependence between two or more variables. Linear regression describes the relationship between data with a straight line, so that when new data appears, a value can be predicted simply. A random forest is a classifier that uses multiple decision trees to train on and predict samples.
In this embodiment, the training data sets are obtained from different terminals, which may be of the same type or of different types, such as a bank terminal or a shopping terminal. The number of acquired training data sets is not limited in this embodiment; it may be, for example, 5, 10 or 16. One training data set may correspond to one type of terminal, or multiple training data sets may correspond to one type of terminal. The training data sets may contain the same or different numbers of sample data, and each piece of sample data corresponds to at least one feature; for example, a piece of sample data may comprise three features: age, gender and income. The same feature may exist in multiple training data sets or in only one training data set.
Further, each training data set to be acquired may be stored in a database in advance, and may be acquired from the database when the acquisition is required. Or may be obtained from the terminal generating the training data set when the acquisition is needed.
Step S20: calculating Shapley additive explanations (SHAP) target values of each feature in the training data set.
After each training data set for training the machine learning model is obtained, the SHAP (SHapley Additive exPlanations) target values of each feature in the training data set are calculated. It should be noted that each feature in the training data set has a corresponding SHAP target value.
Further, step S20 includes:
step a, calculating SHAP values corresponding to all the features in the training data set, and calculating absolute values of the SHAP values corresponding to all the features to obtain SHAP absolute values corresponding to all the features.
After each training data set is obtained, SHAP values corresponding to each feature in the training data sets are calculated. It should be noted that, because the SHAP values corresponding to some features are negative values, in order to reduce the influence of the negative values on the contribution degree of each training data set in the subsequent calculation, after the SHAP value corresponding to each feature is obtained, the absolute value of the SHAP value corresponding to each feature is calculated, and the SHAP absolute value corresponding to each feature is obtained, that is, the absolute value of the SHAP value corresponding to each feature is determined as the SHAP absolute value.
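For illustration only, the per-feature SHAP values and their absolute values for one training data set could be obtained along the following lines; the use of the open-source Python shap package, the tree model, the toy data, and the aggregation over samples by the mean are assumptions of this sketch, not requirements of the disclosure.

```python
# Hypothetical sketch: per-feature SHAP values and their absolute values
# for one training data set, using the open-source "shap" package.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Assumed toy training data set with three features (age, gender, income).
X = np.random.rand(200, 3)
y = X[:, 0] * 2.0 + X[:, 2] * 0.5          # illustrative target

model = RandomForestRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)       # explainer for tree-based models
shap_values = explainer.shap_values(X)      # shape: (n_samples, n_features)

# Absolute values of the SHAP values; averaging over samples (an assumed
# aggregation) yields one SHAP absolute value per feature for this data set.
shap_abs = np.abs(shap_values).mean(axis=0)
print(shap_abs)
```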
Further, the step of calculating the SHAP value corresponding to each feature in the training data set includes:
step a1, calculating the marginal profit expectation corresponding to each feature in the training data set.
Further, the process of calculating the SHAP value corresponding to each feature in the training data set is as follows: the marginal benefit expectation corresponding to each feature in the training data set is calculated, noting that the marginal benefit expectation is calculated with one training data set as the unit. Specifically, the formula for calculating the marginal benefit expectation is as follows:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\bigr]$$

wherein $\phi_i$ is the marginal benefit expectation of the $i$-th feature; $F$ is the set of all features in a given training data set; $F \setminus \{i\}$ is the set remaining after the $i$-th feature is removed from $F$; $S$ is any subset of that remaining set; $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ is the output of the machine learning model on the set formed by combining $S$ with feature $i$, and $f_S(x_S)$ is the output of the machine learning model on the set $S$ alone. Subtracting $f_S(x_S)$ from $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ gives the marginal benefit of feature $i$ under the current condition, and the fractional coefficient with factorial signs is the probability of the current condition occurring among all possible conditions, i.e., the weight with which each condition contributes to the SHAP value of the feature.
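As a non-authoritative sketch of the formula above, the Shapley value of one feature can be computed exactly by enumerating all subsets S of the remaining features; the value_of placeholder, the feature names and the scores below are illustrative assumptions and not part of the disclosure.

```python
# Sketch: exact Shapley value (marginal benefit expectation) of feature i,
# phi_i = sum over S of |S|!(|F|-|S|-1)!/|F|! * (v(S u {i}) - v(S)).
# "value_of" is an assumed placeholder returning the model output when only
# the features in the given set participate.
from itertools import combinations
from math import factorial

def shapley_value(i, features, value_of):
    others = [f for f in features if f != i]
    n = len(features)
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = set(subset)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            phi += weight * (value_of(s | {i}) - value_of(s))
    return phi

# Illustrative use with a toy additive "model": v(S) = sum of fixed scores.
scores = {"age": 1.0, "gender": 0.2, "income": 0.5}
value_of = lambda s: sum(scores[f] for f in s)
print(shapley_value("income", list(scores), value_of))  # -> 0.5 for this additive game
```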
Step a2, calculating the SHAP value of each feature corresponding to the marginal profit expectation according to the marginal profit expectation to obtain the SHAP value corresponding to each feature in the training data set.
After the marginal profit expectation corresponding to each feature in the training data set is obtained, the SHAP value corresponding to each feature is calculated from its marginal profit expectation, so that the SHAP value corresponding to each feature in the training data set is obtained. It should be noted that the objective of SHAP is to explain the prediction for an instance x by calculating the contribution of each feature to that prediction. Its theoretical basis is the Shapley value from coalitional game theory: each feature of the sample instance acts as a player in a coalition, and the Shapley value tells us how to fairly distribute the "payout" (i.e., the contribution to the final prediction result) among the features. What participates in the distribution may be a single feature value of the sample, such as one feature in the sample data, or a group of feature values; for example, when interpreting an image, the SHAP value of a single pixel may not explain much, but the SHAP values of the whole group of pixels forming an eye may explain the model output. In the SHAP algorithm, the Shapley value explanation is represented as an additive feature attribution method, i.e., a linear model.
Specifically, the formula for calculating the SHAP value is:
$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$$

wherein $z'_j \in \{0,1\}$ indicates whether the $j$-th sample feature participates in the modeling of the machine learning model; $M$ is the number of features in the training data set (for example, if a certain training data set has 15 features, then $M = 15$); $\phi_0$ is the machine learning model bias; $\phi_j$ is the contribution of the $j$-th feature to the prediction of the machine learning model, i.e., its marginal benefit expectation; and $g(z')$ is the output of the SHAP explanation model.
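A minimal numeric sketch of this additive explanation model follows; the base value and per-feature SHAP values are made-up numbers for illustration only.

```python
# Sketch of g(z') = phi_0 + sum_j phi_j * z'_j with illustrative numbers.
import numpy as np

phi_0 = 0.8                          # assumed model bias / expected output E[f(x)]
phi = np.array([0.30, -0.05, 0.12])  # assumed SHAP values for three features
z = np.array([1, 1, 1])              # 1 = the feature participates, 0 = it does not

g = phi_0 + float(np.dot(phi, z))    # explanation model output for this sample
print(g)                             # 0.8 + 0.30 - 0.05 + 0.12 = 1.17
```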
And b, determining each feature in the training data set as a target feature, and obtaining a SHAP target value corresponding to the target feature according to SHAP absolute values of the target feature in different training data sets.
And after the SHAP absolute value corresponding to each feature in each training data set is obtained, sequentially determining each feature in the training data set as a target feature. It should be noted that a feature may exist in a plurality of training data sets, for example, if 10 training data sets are acquired in the present embodiment, the feature may exist in 6 training data sets. After the target features are determined, the SHAP target values corresponding to the target features are obtained according to SHAP absolute values of the target features in different training data sets. It should be noted that, if there are 7 training data sets for a certain target feature, there will be corresponding absolute values of the SHAP for all of the 7 training data sets for the target feature.
Further, the step of obtaining the target value of the SHAP corresponding to the target feature according to the SHAP absolute values of the target feature in different training data sets includes:
and c, determining the SHAP absolute values of the target features in different training data sets, and calculating SHAP average values corresponding to the SHAP absolute values of the target features in different training data sets.
And d, determining the SHAP average value as a SHAP target value corresponding to the target feature.
After the target feature is determined, the SHAP absolute values of the target feature in different training data sets are determined. It should be noted that, for the same target feature, the absolute values of the SHAP in different training data sets may be the same or different. In this embodiment, when a target feature exists in a certain training data set, the SHAP absolute value of the target feature in the training data set is not zero; when a target feature does not exist in a training data set, the SHAP absolute value of the target feature in the training data set is zero.
After the SHAP absolute values of the target feature in the different training data sets are determined, these SHAP absolute values are added to obtain a SHAP sum value corresponding to the target feature; the number of non-zero SHAP absolute values of the target feature is then counted, and the SHAP sum value is divided by this number to obtain the SHAP average value corresponding to the SHAP absolute values of the target feature in the different training data sets. The SHAP average value is determined as the SHAP target value corresponding to the target feature.
Further, in the process of calculating the average value of the SHAP, the maximum value and the minimum value of the absolute values of the SHAP can be removed, the average value of the absolute values of the SHAP with the maximum value and the minimum value removed is calculated, and the average value of the SHAP corresponding to the absolute values of the SHAP of the target feature in different training data sets is obtained.
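A small sketch of this averaging step appears below; the dictionary layout of the per-data-set SHAP absolute values is an assumption for illustration, and the trim_extremes flag shows the optional variant that drops the maximum and minimum before averaging.

```python
# Sketch: SHAP target value of one feature as the average of its non-zero
# SHAP absolute values over the training data sets that contain it.
def shap_target_value(feature, shap_abs_per_dataset, trim_extremes=False):
    # shap_abs_per_dataset: one dict per training data set,
    # {feature name: SHAP absolute value}; a missing key or 0.0 means
    # the feature is absent from that data set.
    values = [d[feature] for d in shap_abs_per_dataset if d.get(feature, 0.0) != 0.0]
    if not values:
        return 0.0
    if trim_extremes and len(values) > 2:
        values = sorted(values)[1:-1]  # drop the maximum and minimum
    return sum(values) / len(values)

# Illustrative use: the feature "income" appears in two of three data sets.
datasets_abs = [{"income": 0.20, "age": 0.30},
                {"income": 0.10, "gender": 0.05},
                {"age": 0.25}]
print(shap_target_value("income", datasets_abs))  # -> (0.20 + 0.10) / 2 = 0.15
```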
Step S30, calculating the contribution of the training data set according to the SHAP target values of the features in the training data set.
And after the SHAP target values corresponding to the target features are obtained, calculating the contribution degree of the training data set according to the SHAP target values of the features in the training data set.
Further, step S30 includes:
and e, determining the SHAP target value of each feature in the training data set, and determining the number of the data sets of the training data set in which each feature is located.
And f, calculating the contribution degree of the training data set according to the SHAP target value corresponding to each feature in the training data set and the number of the data sets.
Further, the SHAP target value of each feature in the training data set is determined, and the number of data sets of the training data set where each feature is located is determined. If feature A exists in 8 training data sets, the number of data sets corresponding to feature A is 8. After the number of data sets corresponding to each feature is obtained, the contribution degree of each feature to the training data set is calculated according to the SHAP target value and the number of data sets corresponding to each feature, and the sum of the contribution degrees of the features to the training data set is then determined as the contribution degree of the training data set. For example, if a training data set has 5 features, and the contribution degrees calculated from the SHAP target values of those 5 features and their numbers of data sets are A1, A2, A3, A4 and A5 respectively, then the contribution degree of the training data set is: A1 + A2 + A3 + A4 + A5.
Further, step f comprises:
step f1, calculating the quotient between the SHAP target value corresponding to each feature in the training data set and the number of the data sets.
And f2, adding the quotient values corresponding to the features in the training data set to obtain the contribution degree of the training data set.
Specifically, quotient values between SHAP target values corresponding to the features in the training data set and the number of the data sets are calculated, and the quotient values corresponding to the features in the training data set are added to obtain the contribution degree of the training data set. Specifically, the formula for calculating the contribution of the training data set is as follows:
$$Q_{\mathrm{Set}\text{-}j} = \sum_{i} I(f_i \in \mathrm{Set}\text{-}j)\,\frac{Q_{f_i}}{M_{f_i}}$$

wherein $Q_{\mathrm{Set}\text{-}j}$ represents the contribution degree of a certain training data set Set-$j$, i.e., the overall contribution of that training data set; $Q_{f_i}$ is the SHAP target value corresponding to feature $f_i$; $M_{f_i}$ is the number of training data sets in which feature $f_i$ appears, i.e., the number of data sets corresponding to that feature; and $I(f_i \in \mathrm{Set}\text{-}j)$ indicates whether feature $f_i$ is present in training data set Set-$j$: its value is 1 if the feature is present in Set-$j$ and 0 otherwise.
It should be noted that, when a certain feature is only present in one training data set, the training data set exclusively shares the SHAP target value of the feature; when a feature appears in at least two training data sets, then the at least two training data sets need to share the SHAP target value for the feature.
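A brief sketch of this contribution formula follows; the data-set names, feature sets and SHAP target values are illustrative assumptions.

```python
# Sketch: contribution degree of each training data set, where each feature's
# SHAP target value is shared equally among the data sets that contain it.
def dataset_contributions(datasets, shap_target):
    # datasets: {data set name: set of feature names}
    # shap_target: {feature name: SHAP target value Q_fi}
    counts = {f: sum(f in feats for feats in datasets.values()) for f in shap_target}
    return {
        name: sum(shap_target[f] / counts[f] for f in feats if f in shap_target)
        for name, feats in datasets.items()
    }

datasets = {"Set-1": {"age", "income"}, "Set-2": {"income", "gender"}}
shap_target = {"age": 0.30, "income": 0.20, "gender": 0.05}
print(dataset_contributions(datasets, shap_target))
# Set-1: 0.30/1 + 0.20/2 = 0.40 ; Set-2: 0.20/2 + 0.05/1 = 0.15
```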
In this embodiment, after each training data set of the machine learning model to be trained is obtained, the SHAP target value of each feature in the training data set is calculated, and the contribution degree of each training data set is calculated from the SHAP target values of its features. Because the contribution degree of each training data set is obtained from the SHAP target values of its features, the importance of each training data set in training the machine learning model can be evaluated through its contribution degree, so that the training data sets used for training are selected more accurately, and the accuracy of the trained machine learning model on data prediction is improved.
Further, a second embodiment of the inventive method for calculating a contribution of a training data set is proposed. The second embodiment of the method for calculating the contribution of the training data set differs from the first embodiment of the method for calculating the contribution of the training data set in that, with reference to fig. 2, the method for calculating the contribution of the training data set further includes:
step S40, selecting a target training data set for training the machine learning model according to the contribution degree of each training data set.
After the contribution degrees of the training data sets are obtained, a target training data set for the machine learning model is selected according to those contribution degrees. Specifically, the contribution degrees of the training data sets may be compared and the training data set with the largest contribution degree selected as the target training data set; alternatively, the 3 training data sets with the top 3 contribution degrees may be selected as the target training data set, or part of the data from the training data set with the largest contribution degree and the training data set with the second-largest contribution degree may be combined to form the target training data set. How the selection is made from the contribution degrees of the training data sets can be set by the user according to specific needs.
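As a sketch, one simple realization (an assumption, and only one of the options described above) is to rank the data sets by contribution degree and keep the top k.

```python
# Sketch: select the k training data sets with the highest contribution degree.
# k and the contribution numbers are illustrative assumptions.
def select_target_datasets(contributions, k):
    # contributions: {data set name: contribution degree}
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

print(select_target_datasets({"Set-1": 0.40, "Set-2": 0.15, "Set-3": 0.27}, k=2))
# -> ['Set-1', 'Set-3']
```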
Step S50, inputting the target training data set into the machine learning model to train the machine learning model.
After the target training data set is determined, it is input into the machine learning model to train the machine learning model. Further, after the contribution degree of each training data set is determined, the contribution degrees may also be sent to the terminal used by the business personnel. After receiving them, the terminal outputs the contribution degrees of the training data sets, so that the business personnel can select suppliers of training data sets according to the contribution degrees and can evaluate the benefit that each training data set brings in a data transaction. For example, a training data set with a large contribution degree is preferentially selected and priced higher, while a training data set with a small contribution degree is priced lower.
In this embodiment, the target training data set for training the machine learning model is selected according to the contribution degrees of the training data sets, and the target training data set is input into the machine learning model for training, so that the accuracy of the trained machine learning model on data prediction is improved.
Further, the present invention provides a device for calculating the contribution of a training data set, which, with reference to fig. 3, includes:
an obtaining module 10, configured to obtain each training data set of a training machine learning model;
a calculating module 20, configured to calculate Shapley additive explanations (SHAP) target values of each feature in the training data set, and to calculate the contribution degree of the training data set according to the SHAP target values of the features in the training data set.
Further, the calculation module 20 includes:
the first calculation unit is used for calculating SHAP values corresponding to all the features in the training data set, calculating absolute values of the SHAP values corresponding to all the features, and obtaining the SHAP absolute values corresponding to all the features;
and the first determining unit is used for determining each feature in the training data set as a target feature, and obtaining a SHAP target value corresponding to the target feature according to SHAP absolute values of the target feature in different training data sets.
Further, the first determination unit includes:
the determining subunit is used for determining SHAP absolute values of the target features in different training data sets;
the calculating subunit is used for calculating SHAP average values corresponding to SHAP absolute values of the target features in different training data sets;
the determining subunit is further configured to determine the average SHAP value as a SHAP target value corresponding to the target feature.
Further, the first calculating unit is further configured to calculate a marginal profit expectation corresponding to each feature in the training data set; and calculating the SHAP value of each characteristic corresponding to the marginal profit expectation according to the marginal profit expectation to obtain the SHAP value corresponding to each characteristic in the training data set.
Further, the calculation module 20 further includes:
a second determining unit, configured to determine a SHAP target value of each feature in the training data set, and determine the number of data sets of the training data set in which each feature is located;
and the second calculating unit is used for calculating the contribution degree of the training data set according to the SHAP target value corresponding to each feature in the training data set and the number of the data sets.
Further, the second calculating unit is further configured to calculate a quotient between the SHAP target value corresponding to each feature in the training data set and the number of the data sets; and adding the quotient values corresponding to the features in the training data set to obtain the contribution degree of the training data set.
Further, the contribution calculating means of the training data set further includes:
the selection module is used for selecting a target training data set for training the machine learning model according to the contribution degree of each training data set;
an input module to input the target training data set into the machine learning model to train the machine learning model.
The specific implementation of the apparatus for calculating contribution of training data set of the present invention is substantially the same as the embodiments of the method for calculating contribution of training data set, and will not be described herein again.
In addition, the invention also provides a contribution calculation device of the training data set. As shown in fig. 4, fig. 4 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 4 is a schematic structural diagram of a hardware operating environment of a computing device that can calculate the contribution of a training data set. The contribution degree calculation device of the training data set in the embodiment of the invention can be a terminal device such as a PC (personal computer) or a portable computer.
As shown in fig. 4, the contribution degree calculation device of the training data set may include: a processor 1001, such as a CPU, a memory 1005, a user interface 1003, a network interface 1004, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the contribution calculating device for the training data set shown in fig. 4 does not constitute a limitation of the contribution calculating device for the training data set, and may include more or less components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 4, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a contribution calculation program of the training data set. The operating system is a program that manages and controls the hardware and software resources of the contribution calculation device of the training data set, and supports the running of the contribution calculation program of the training data set and other software or programs.
In the contribution calculation device of the training data set shown in fig. 4, the user interface 1003 is mainly used for connecting to other terminals and performing data communication with them, for example acquiring the training data sets from other terminals; the network interface 1004 is mainly used for connecting to a background server and performing data communication with it; and the processor 1001 may be configured to call the contribution calculation program of the training data set stored in the memory 1005 and to perform the steps of the contribution calculation method of the training data set as described above.
The specific implementation of the device for calculating the contribution of the training data set of the present invention is substantially the same as the embodiments of the method for calculating the contribution of the training data set, and will not be described herein again.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which a contribution calculating program of a training data set is stored, which, when executed by a processor, implements the steps of the contribution calculating method of the training data set as described above.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the method for calculating contribution of the training data set, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A method of calculating the contribution of a training data set, the method comprising the steps of:
acquiring each training data set of a training machine learning model;
calculating Shapley additive explanations (SHAP) target values of each feature in the training data set;
and calculating the contribution degree of the training data set according to the SHAP target value of each feature in the training data set.
2. The method of calculating the contribution of a training data set of claim 1, wherein the step of calculating the SHAP target value for each feature in the training data set comprises:
calculating SHAP values corresponding to the features in the training data set, and calculating absolute values of the SHAP values corresponding to the features to obtain SHAP absolute values corresponding to the features;
determining each feature in the training data set as a target feature, and obtaining a SHAP target value corresponding to the target feature according to SHAP absolute values of the target feature in different training data sets.
3. The method of calculating the contribution of the training data set according to claim 2, wherein the step of obtaining the target value of the SHAP corresponding to the target feature according to the SHAP absolute values of the target feature in different training data sets comprises:
determining SHAP absolute values of the target features in different training data sets, and calculating SHAP average values corresponding to the SHAP absolute values of the target features in the different training data sets;
and determining the SHAP average value as a SHAP target value corresponding to the target feature.
4. The method of calculating a contribution of a training data set according to claim 2, wherein the step of calculating the SHAP value for each feature in the training data set comprises:
calculating marginal profit expectation corresponding to each feature in the training data set;
and calculating the SHAP value of each characteristic corresponding to the marginal profit expectation according to the marginal profit expectation to obtain the SHAP value corresponding to each characteristic in the training data set.
5. The method of calculating the contribution of the training data set according to claim 1, wherein the step of calculating the contribution of the training data set based on the SHAP target values of the respective features in the training data set comprises:
determining SHAP target values of all the characteristics in the training data set, and determining the number of data sets of the training data set where all the characteristics are located;
and calculating the contribution degree of the training data set according to the SHAP target value corresponding to each feature in the training data set and the number of the data sets.
6. The method of claim 5, wherein the step of calculating the contribution of the training dataset based on the SHAP target value and the number of the datasets corresponding to each feature in the training dataset comprises:
calculating a quotient value between the SHAP target value corresponding to each feature in the training data set and the number of the data sets;
and adding the quotient values corresponding to the features in the training data set to obtain the contribution degree of the training data set.
7. The method of calculating the contribution of the training data set according to any of claims 1 to 6, wherein the step of calculating the contribution of the training data set according to the SHAP target value of each feature in the training data set further comprises:
selecting a target training data set for training the machine learning model according to the contribution degree of each training data set;
inputting the target training data set into the machine learning model to train the machine learning model.
8. An apparatus for calculating a contribution of a training data set, the apparatus comprising:
the acquisition module is used for acquiring each training data set of the training machine learning model;
the calculation module is used for calculating Shapley additive explanations (SHAP) target values of each feature in the training data set; and calculating the contribution degree of the training data set according to the SHAP target value of each feature in the training data set.
9. A contribution calculation apparatus for a training data set, characterized in that the contribution calculation apparatus for a training data set comprises a memory, a processor and a contribution calculation program for a training data set stored on the memory and executable on the processor, which when executed by the processor implements the steps of the contribution calculation method for a training data set as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a contribution calculation program of a training data set, which when executed by a processor implements the steps of the contribution calculation method of the training data set according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010123970.6A CN111325353A (en) | 2020-02-28 | 2020-02-28 | Method, device, equipment and storage medium for calculating contribution of training data set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010123970.6A CN111325353A (en) | 2020-02-28 | 2020-02-28 | Method, device, equipment and storage medium for calculating contribution of training data set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111325353A true CN111325353A (en) | 2020-06-23 |
Family
ID=71172957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010123970.6A Pending CN111325353A (en) | 2020-02-28 | 2020-02-28 | Method, device, equipment and storage medium for calculating contribution of training data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111325353A (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111784506A (en) * | 2020-07-01 | 2020-10-16 | 深圳前海微众银行股份有限公司 | Overdue risk control method, device and readable storage medium |
CN111932018A (en) * | 2020-08-13 | 2020-11-13 | 中国工商银行股份有限公司 | Bank business performance contribution information prediction method and device |
CN111959518A (en) * | 2020-08-14 | 2020-11-20 | 北京嘀嘀无限科技发展有限公司 | Data processing method, device and equipment |
CN112101528A (en) * | 2020-09-17 | 2020-12-18 | 上海交通大学 | Terminal contribution measurement method based on back propagation |
CN112784986A (en) * | 2021-02-08 | 2021-05-11 | 中国工商银行股份有限公司 | Feature interpretation method, device, equipment and medium for deep learning calculation result |
CN113111977A (en) * | 2021-05-20 | 2021-07-13 | 润联软件系统(深圳)有限公司 | Method and device for evaluating contribution degree of training sample and related equipment |
CN113240527A (en) * | 2021-06-03 | 2021-08-10 | 厦门太也网络科技有限公司 | Bond market default risk early warning method based on interpretable machine learning |
CN113297593A (en) * | 2021-05-14 | 2021-08-24 | 同盾控股有限公司 | Method, device, equipment and medium for calculating contribution degree based on privacy calculation |
CN113592557A (en) * | 2021-08-03 | 2021-11-02 | 北京有竹居网络技术有限公司 | Attribution method and device of advertisement putting result, storage medium and electronic equipment |
CN113657996A (en) * | 2021-08-26 | 2021-11-16 | 深圳市洞见智慧科技有限公司 | Method and device for determining feature contribution degree in federated learning and electronic equipment |
CN114021918A (en) * | 2021-10-26 | 2022-02-08 | 江苏苏宁银行股份有限公司 | Reason code real-time model interpretation method based on model |
CN114418132A (en) * | 2021-12-27 | 2022-04-29 | 海信集团控股股份有限公司 | Use and training method of family health management model |
CN114664382A (en) * | 2022-04-28 | 2022-06-24 | 中国人民解放军总医院 | Multi-group association analysis method and device and computing equipment |
CN115099540A (en) * | 2022-08-25 | 2022-09-23 | 中国工业互联网研究院 | Carbon neutralization treatment method based on artificial intelligence |
CN115358348A (en) * | 2022-10-19 | 2022-11-18 | 成都数之联科技股份有限公司 | Vehicle straight-through rate influence characteristic determination method, device, equipment and medium |
WO2023082969A1 (en) * | 2021-11-11 | 2023-05-19 | 重庆邮电大学 | Data feature combination pricing method and system based on shapley value and electronic device |
CN117273670A (en) * | 2023-11-23 | 2023-12-22 | 深圳市云图华祥科技有限公司 | Engineering data management system with learning function |
CN117435580A (en) * | 2023-12-21 | 2024-01-23 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Database parameter screening method and related equipment |
- 2020-02-28: CN CN202010123970.6A patent/CN111325353A/en active Pending
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111784506A (en) * | 2020-07-01 | 2020-10-16 | 深圳前海微众银行股份有限公司 | Overdue risk control method, device and readable storage medium |
CN111932018B (en) * | 2020-08-13 | 2023-09-19 | 中国工商银行股份有限公司 | Bank business performance contribution information prediction method and device |
CN111932018A (en) * | 2020-08-13 | 2020-11-13 | 中国工商银行股份有限公司 | Bank business performance contribution information prediction method and device |
CN111959518A (en) * | 2020-08-14 | 2020-11-20 | 北京嘀嘀无限科技发展有限公司 | Data processing method, device and equipment |
CN112101528A (en) * | 2020-09-17 | 2020-12-18 | 上海交通大学 | Terminal contribution measurement method based on back propagation |
CN112101528B (en) * | 2020-09-17 | 2023-10-24 | 上海交通大学 | Terminal contribution measurement method based on back propagation |
CN112784986A (en) * | 2021-02-08 | 2021-05-11 | 中国工商银行股份有限公司 | Feature interpretation method, device, equipment and medium for deep learning calculation result |
CN113297593A (en) * | 2021-05-14 | 2021-08-24 | 同盾控股有限公司 | Method, device, equipment and medium for calculating contribution degree based on privacy calculation |
CN113111977A (en) * | 2021-05-20 | 2021-07-13 | 润联软件系统(深圳)有限公司 | Method and device for evaluating contribution degree of training sample and related equipment |
CN113240527A (en) * | 2021-06-03 | 2021-08-10 | 厦门太也网络科技有限公司 | Bond market default risk early warning method based on interpretable machine learning |
CN113592557A (en) * | 2021-08-03 | 2021-11-02 | 北京有竹居网络技术有限公司 | Attribution method and device of advertisement putting result, storage medium and electronic equipment |
CN113657996A (en) * | 2021-08-26 | 2021-11-16 | 深圳市洞见智慧科技有限公司 | Method and device for determining feature contribution degree in federated learning and electronic equipment |
CN114021918A (en) * | 2021-10-26 | 2022-02-08 | 江苏苏宁银行股份有限公司 | Reason code real-time model interpretation method based on model |
WO2023082969A1 (en) * | 2021-11-11 | 2023-05-19 | 重庆邮电大学 | Data feature combination pricing method and system based on shapley value and electronic device |
CN114418132A (en) * | 2021-12-27 | 2022-04-29 | 海信集团控股股份有限公司 | Use and training method of family health management model |
CN114664382B (en) * | 2022-04-28 | 2023-01-31 | 中国人民解放军总医院 | Multi-group association analysis method and device and computing equipment |
CN114664382A (en) * | 2022-04-28 | 2022-06-24 | 中国人民解放军总医院 | Multi-group association analysis method and device and computing equipment |
CN115099540B (en) * | 2022-08-25 | 2022-11-08 | 中国工业互联网研究院 | Carbon neutralization treatment method based on artificial intelligence |
CN115099540A (en) * | 2022-08-25 | 2022-09-23 | 中国工业互联网研究院 | Carbon neutralization treatment method based on artificial intelligence |
CN115358348A (en) * | 2022-10-19 | 2022-11-18 | 成都数之联科技股份有限公司 | Vehicle straight-through rate influence characteristic determination method, device, equipment and medium |
CN117273670A (en) * | 2023-11-23 | 2023-12-22 | 深圳市云图华祥科技有限公司 | Engineering data management system with learning function |
CN117273670B (en) * | 2023-11-23 | 2024-03-12 | 深圳市云图华祥科技有限公司 | Engineering data management system with learning function |
CN117435580A (en) * | 2023-12-21 | 2024-01-23 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Database parameter screening method and related equipment |
CN117435580B (en) * | 2023-12-21 | 2024-03-22 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Database parameter screening method and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111325353A (en) | Method, device, equipment and storage medium for calculating contribution of training data set | |
US12067060B2 (en) | Automatic document negotiation | |
WO2020253357A1 (en) | Data product recommendation method and apparatus, computer device and storage medium | |
KR20200030252A (en) | Apparatus and method for providing artwork | |
CN111552870A (en) | Object recommendation method, electronic device and storage medium | |
US20150178134A1 (en) | Hybrid Crowdsourcing Platform | |
US11748452B2 (en) | Method for data processing by performing different non-linear combination processing | |
WO2018184548A1 (en) | Method and device for providing proposed quote for insurance policy, terminal apparatus, and medium | |
CN113947336A (en) | Method, device, storage medium and computer equipment for evaluating risks of bidding enterprises | |
CN115311676A (en) | Picture examination method and device, computer equipment and storage medium | |
CN112785418B (en) | Credit risk modeling method, apparatus, device and computer readable storage medium | |
CN108876422B (en) | Method and device for information popularization, electronic equipment and computer readable medium | |
CN113327132A (en) | Multimedia recommendation method, device, equipment and storage medium | |
CN113052246A (en) | Method and related device for training classification model and image classification | |
JP6489340B1 (en) | Comparison target company selection system | |
CN116128607A (en) | Product recommendation method, device, equipment and storage medium | |
CN114897607A (en) | Data processing method and device for product resources, electronic equipment and storage medium | |
CN113515701A (en) | Information recommendation method and device | |
CN113610385A (en) | Energy enterprise commodity evaluation result obtaining method and system and computer equipment | |
CN112307334A (en) | Information recommendation method, information recommendation device, storage medium and electronic equipment | |
CN111343265A (en) | Information pushing method, device, equipment and readable storage medium | |
CN111897910A (en) | Information pushing method and device | |
US11972358B1 (en) | Contextually relevant content sharing in high-dimensional conceptual content mapping | |
US20240220902A1 (en) | Systems and methods for automatic handling of score revision requests | |
US20230042156A1 (en) | Systems and methods for valuation of a vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||