WO2023082969A1

WO2023082969A1 - Data feature combination pricing method and system based on shapley value and electronic device

Info

Publication number: WO2023082969A1
Application number: PCT/CN2022/126712
Authority: WO
Inventors: 余海燕; 刘珂; 缪红霞
Original assignee: 重庆邮电大学
Priority date: 2021-11-11
Filing date: 2022-10-21
Publication date: 2023-05-19
Also published as: CN113919886A

Abstract

The present invention relates to machine learning, and in particular to a data feature combination pricing method and system based on a Shapley value and an electronic device. The method comprises: collecting feature variables of a feature data set provided by a seller, and preprocessing the feature variables; constructing a learning model based on machine learning, and selecting an optimal feature classification variable from feature classification variables; estimating feature Shapley values constructed on the basis of ghost data instances so as to calculate marginal contribution and an average Shapley value of selected feature variables; and according to the marginal contribution and the average Shapley value of the feature variables, determining whether the feature variables can be subjected to transaction, and if yes, performing the transaction. In the embodiments of the present invention, long-term benefit maximization of a data provider can be realized, risk assessment of the data seller on the data buyer company is met, and the risk loss is reduced.

Description

Data characteristic combination pricing method, system and electronic equipment based on Shapley value

Technical field

The present invention relates to machine learning, in particular to a data feature combination pricing method, system and electronic equipment based on Shapley values.

Background art

The advances in data analysis brought about by machine learning and data mining technology have made the value of big data immeasurable, so data has become a new type of asset. An enterprise generates a large amount of data during its operation, and the collected data can also be traded to increase the enterprise's income. Data differs from traditional commodities: it is characterized by large volume, variety, high velocity, and replicability. In addition, data is extremely dependent on its timeliness, and a lack of timeliness has a significant impact on data prices. The value of data is also marked by significant uncertainty, diversity, and sparsity, so pricing data remains a relatively new problem.

For example, a bank uses financial technology to analyze purchased feature data with machine learning and forecasting, which provides an important tool for solving the problem of information asymmetry. When lending to an enterprise, the bank uses not only the data about the enterprise held within the banking system but also valuable external data about the enterprise's operating ability. It obtains data about the enterprise through purchases and other means and uses machine learning technology to analyze the enterprise's operating capabilities in order to reduce loan risk. Capturing the trajectory of the enterprise's production and operation provides financial institutions with reliable "credit data", which not only improves the likelihood of a successful loan but also reduces transaction costs and credit service thresholds.

This data transaction process is realized through a third-party data transaction platform, which can not only guarantee the privacy and security of the buyer's data to a certain extent, but also ensure that the price of the data buyer is reasonable through dynamic market pricing. The third-party trading platform needs to price the purchased data in the market, and provide the data and payment fees required by both parties to the transaction. In order to ensure the interests of data sellers and third-party data trading platforms, companies that successfully purchase data need to sign a confidentiality agreement with the platform. The data is limited to the company's own business use and cannot be disseminated or re-sold.

The third-party data trading platform builds a data feature selection model and a feature value distribution algorithm that approximates the Shapley value, and can judge which feature variables have the greatest impact on the results and which feature variables have less impact on the results based on the obtained results. The buyer pays attention to the set of features with greater influence, and controls risks and reduces losses through machine learning results to a certain extent. For banks, the purchase of this data can obtain specific information of the corresponding industry, which provides support for the loan evaluation and analysis of the industry, and can also reduce loan risks. At the same time, data sellers can also get a profit.

The third-party trading platform provides a dynamic data pricing method and system. To handle the large number of data features and their redundancy, a feature selection algorithm based on increasing prediction accuracy is used. With a random forest prediction algorithm, the combination of recursive feature elimination, cross-validation, and feature combination selects the data features effectively, and information mining analysis is then carried out on the selected features. Since different data features contribute differently to the prediction, the present invention proposes a data feature contribution distribution method based on the Shapley value, which calculates the effect of each feature (its marginal contribution to prediction accuracy). Finally, the monitored feature data offered for transaction is priced dynamically by means of an auction and a multiplicative weight update algorithm. Based on the payment function of Myerson's optimal auction, the improved multiplicative weight update algorithm realizes dynamic pricing of data features, which helps to fully realize the value of data and bring additional income to enterprises.

Summary of the invention

In order to enable the third-party data transaction platform to make full use of the features obtained from monitoring the enterprise's products to auction data, so that the buyer can extract key information from the purchased data and also obtain information about the industry to which the data seller's enterprise belongs, the present invention proposes a data feature combination pricing method, system and electronic equipment based on the Shapley value. The method includes:

Collect the characteristic variables of the characteristic data set provided by the seller and preprocess it;

Construct a learning model based on machine learning, and select the optimal feature classification variable from the feature classification variables;

When selecting the optimal variable, estimate the characteristic Shapley value based on the ghost data instance to calculate the marginal contribution and average Shapley value of the selected characteristic variable;

When selecting the optimal variable, also use the Shapley value to allocate the value of each feature according to its marginal contribution, quantify the impact of different input features on the output prediction results of the trained model, and keep the features that meet the set marginal contribution;

Detect whether the data can be used for machine learning and trading. If it can be used for machine learning and trading, the data buyer and seller construct a transaction, and obtain the predicted value of the current data through the constructed learning model as the payment price of the data.

Further, the process of selecting the optimal feature classification variables from the feature classification variables includes the following steps:

Use all feature subvariable data to train the learning model based on machine learning;

Sort the importance of the feature variables, and select the top k features with the largest importance values;

Evaluate the model on the validation set, recalculate and rank the importance of each feature variable;

Split the training set into a new training set and a new validation set, use the new training set and all feature variables to train the model, use the validation set to evaluate the model, calculate and rank the importance of all feature variables.

Further, estimating the Shapley value of a feature variable based on ghost data instances includes randomly selecting an instance from the feature variables, constructing an instance with a given feature and an instance without that feature, and taking the two instances together as the ghost data instances.

Further, the marginal contribution of a feature variable is expressed as:

$$\phi_j^m = \hat{f}\left(x_{+j}^m\right) - \hat{f}\left(x_{-j}^m\right)$$

where $\phi_j^m$ is the marginal contribution of the j-th feature of instance x at the m-th iteration;

$\hat{f}(x_{+j}^m)$ is the prediction for instance x at the m-th iteration obtained using the instance with feature j, where $x_{+j}^m$ is the feature vector obtained when the features after the j-th feature in instance x are randomly replaced by the features of instance z at the m-th iteration;

$\hat{f}(x_{-j}^m)$ is the prediction for instance x at the m-th iteration obtained using the instance without feature j, where $x_{-j}^m$ is the feature vector obtained when the j-th feature in instance x and the features after it are randomly replaced by the features of instance z at the m-th iteration.
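The sampling procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `shapley_estimate`, `predict`, and `data`, the use of a random feature ordering in each iteration, and the toy linear model are all hypothetical choices made for the example.

```python
import random

def shapley_estimate(predict, x, data, j, iterations=1000, rng=None):
    """Approximate the Shapley value of feature j for instance x.

    predict: hypothetical model, maps a feature list to a number.
    data:    list of instances from which the random instance z is drawn.
    """
    rng = rng or random.Random(0)
    n_features = len(x)
    total = 0.0
    for _ in range(iterations):
        z = rng.choice(data)               # randomly sampled instance z
        order = list(range(n_features))
        rng.shuffle(order)                 # assumed: random feature ordering
        pos = order.index(j)
        x_plus = list(z)                   # instance WITH feature j
        x_minus = list(z)                  # instance WITHOUT feature j
        for f in order[:pos]:              # features before j come from x
            x_plus[f] = x[f]
            x_minus[f] = x[f]
        x_plus[j] = x[j]                   # j from x; features after j stay from z
        total += predict(x_plus) - predict(x_minus)   # marginal contribution
    return total / iterations              # average over all iterations

# Toy usage with a hypothetical linear model
model = lambda v: 2.0 * v[0] + 3.0 * v[1]
background = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]]
phi_1 = shapley_estimate(model, [1.0, 4.0], background, j=1)
```

For a linear model each iteration contributes the weight of feature j times the difference between x and z in that feature, so the estimate converges to the weight times the distance of x from the background mean.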

Further, the process of pricing the characteristic variable includes the following steps:

S41. Before trading with the data buyer, the data seller first sets the price p_n of the transaction data, the number of buyers and the buyer's quotation, and calculates the data buyer's income function;

S42. Calculate the final payment of the data buyer according to the buyer's income function; the data buyer pays the fee and trades the selected feature variables;

S43. The seller updates the data price based on the multiplicative weight update algorithm, returns to S41, and starts the next round of pricing.

Further, the fee R_n paid by the data buyer is expressed as:

where G(b_n, p_n) is the buyer's profit function when the seller sets the price of the transaction data to p_n and the buyer's quotation is b_n.

Further, the buyer's income function is determined by the price of the transaction data set by the seller and the quotation of the buyer. With the seller's price fixed, when the quotation b_n is smaller than the price p_n set by the seller, the buyer's profit increases as b_n increases, reaching its maximum when b_n equals p_n; when b_n is greater than p_n, the buyer's utility remains at the maximum value and the fee paid by the buyer also remains at the same maximum value.
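The shape of the income function described above, and the resulting cap on the buyer's fee, can be illustrated with a short Python sketch. Both functions are assumptions made for illustration: `gain` is a hypothetical piecewise-linear income function with the stated shape, and `payment` applies the standard Myerson payment rule (bid times gain, minus the integral of the gain over lower bids), which the text names only as the basis of its payment function.

```python
def gain(bid, price):
    """Hypothetical buyer income G: grows with the bid up to the price, then stays at 1."""
    return min(bid, price) / price

def payment(bid, price, steps=10_000):
    """Assumed Myerson-style fee: bid * G(bid) minus the integral of G(z) for z in [0, bid]."""
    width = bid / steps
    # Midpoint rule for the integral of the gain over lower bids
    integral = sum(gain((i + 0.5) * width, price) for i in range(steps)) * width
    return bid * gain(bid, price) - integral

# With the price fixed at 1.0, the fee rises with the bid and then stays flat
fees = [round(payment(b, 1.0), 4) for b in (0.25, 0.5, 1.0, 2.0, 3.0)]
```

Under these assumptions the fee equals b²/(2p) for bids below the price and p/2 for bids above it, so both the buyer's income and the fee stop growing once the quotation exceeds the seller's price.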

Furthermore, each time the price is determined, when the same data is sold to multiple users, the data is priced according to the data copy price. If the data is copied into i samples, the selling price S_n of each sample is:

where S is the selling price when there is only one copy of the data, and e is the penalty factor.

The present invention proposes a data feature combination pricing system based on the Shapley value, including a feature selection subsystem and a pricing subsystem. The feature selection subsystem screens features, and the pricing subsystem performs pricing auctions on the screened features;

The feature selection subsystem includes a machine learning model and a Shapley analysis model. The machine learning model performs training and prediction on the data, ranks the features using the predicted values as their importance, and sends the K most important features to the Shapley analysis model for analysis; the Shapley analysis model calculates the marginal contribution and the average Shapley value of the feature variables;

In the pricing subsystem, data buyers make their quotations on the basis of the price set by the data seller.

The present invention also proposes an electronic device for data feature combination pricing based on the Shapley value, including a processor and a memory. The memory stores any one of the aforementioned data feature combination pricing methods based on the Shapley value, and the processor is capable of running the method stored in the memory.

The present invention has the following advantages:

1. For the feature selection problem in the prediction process, a feature selection method is designed that combines recursive feature elimination based on cross-validation with feature permutation and combination, while taking prediction accuracy into account.

2. The method adapts to feature selection under different prediction costs and selects the most valuable input features for the prediction model.

3. For the problem of distributing the prediction contribution among data features, the approximate Shapley value method is used to explain the data features both globally and locally.

4. The designed data transaction model and real-time dynamic pricing algorithm can maximize the long-term profit of the enterprise; at the same time, the feature data obtained from the auction gives data buyers such as banks or insurance companies decision support for loan evaluation, reducing losses from loans and compensation.

5. The auction data obtained by the third-party trading platform can be visualized through the transaction control panel to quickly extract key information.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall architecture of the dynamic pricing based on the combination of Shapley value data features disclosed in the embodiment of the present invention;

Fig. 2 is a schematic diagram of data feature selection and sorting based on the Shapley value disclosed in the embodiment of the present invention;

Fig. 3 is a schematic diagram of a characteristic Shapley value based on machine learning disclosed in an embodiment of the present invention;

Fig. 4 is a schematic diagram of auction pricing based on data characteristics disclosed in the embodiment of the present invention;

Fig. 5 is a schematic diagram of a data dynamic pricing control panel disclosed in an embodiment of the present invention;

Fig. 6 is a schematic diagram of information interaction of a Shapley value-based data feature combination dynamic pricing method disclosed in an embodiment of the present invention;

Fig. 7 is a schematic diagram of introducing a penalty function based on the Shapley value based on data replicability disclosed by the embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a Shapley value data feature combination dynamic pricing method device disclosed in an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of another Shapley value data feature combination dynamic pricing method device disclosed in an embodiment of the present invention.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

The present invention proposes a data feature combination pricing method based on the Shapley value, which specifically includes the following steps:

Construct a learning model based on machine learning, and select the optimal feature classification variables from the feature classification variables;

Estimation of the characteristic Shapley value constructed based on the ghost data instance to calculate the marginal contribution and average Shapley value of the selected characteristic variables;

The quality inspector judges whether the data can be traded, and if it can, the transaction is carried out.

There are already mature methods for data quality inspection in the field. In the embodiment of the present invention, a sample-assisted real-time inspection method for data quality inspection is described for illustration, including:

When the transaction platform labeler completes the data labeling task, the quality inspector first deletes the missing data: columns (attributes) with a missing rate higher than 10% are deleted first, and then rows (tuples) with missing data are deleted. The labeled data then undergoes multiple rounds of manual inspection;

In the first round of manual inspection, the quality inspector samples 50% of the labeled data for inspection, using random sampling or stratified sampling. If all the labeled data inspected in the first round are qualified, then 25% of the labeled data is inspected in the second round;

If there are more than 50% unqualified labeling data in the first round, the quality inspector needs to conduct a full sample inspection of the labeler's data labeling in the second round of inspection;

If there are more than 10% and less than 50% unqualified labeled data in the first round, the amount of labeled data inspected in the second round of sampling inspection will be doubled compared with the first round;

If the unqualified labeled data in the first round is less than 10%, the amount of labeled data inspected in the second round of sampling inspection will increase by 30% compared with the first round;

If the unqualified labeled data in the first round is less than 1%, data transactions can be carried out;

Repeat the above inspection process until the conditions for data transactions are met.
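The inspection rules above can be written down as a small decision function. The Python sketch below is one possible reading, with several assumptions labeled in the comments: the function names are hypothetical, a rate of exactly 10% or 50% is assigned to the doubling rule, the 30% increase is taken relative to the previous sample size, and the sampling fraction is capped at a full inspection.

```python
def can_trade(defect_rate):
    """Data may be traded once fewer than 1% of the inspected labels are unqualified."""
    return defect_rate < 0.01

def next_sampling_fraction(current_fraction, defect_rate):
    """Fraction of labeled data to inspect in the next round (assumed reading of the rules)."""
    if defect_rate == 0.0:
        return 0.25                               # everything qualified: inspect 25%
    if defect_rate > 0.5:
        return 1.0                                # more than 50% unqualified: full inspection
    if defect_rate >= 0.1:
        return min(1.0, 2.0 * current_fraction)   # 10% to 50%: double the sample (boundaries assumed)
    return min(1.0, 1.3 * current_fraction)       # below 10%: 30% larger sample (assumed relative)

# First round inspects 50% of the labeled data
second_round = next_sampling_fraction(0.5, defect_rate=0.05)
```

The source rules overlap for very low defect rates (below 1% the data is already tradable), so a caller would check `can_trade` before scheduling another round.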

In this embodiment, a learning model based on machine learning is constructed to predict a single instance. The prediction process is the "payment", the "revenue" is the actual prediction for the instance minus the average predicted value over all samples, and the Shapley value of a feature is its average marginal contribution over all feature orderings, so that the contribution of each feature to the prediction result is divided fairly.

In the present invention, the estimation of the constructed feature Shapley value solves the problem of distributing feature value (contribution) based on the Shapley value. Specifically, the difference between the prediction for a specific instance and the average prediction over the data set is taken as the feature Shapley value (revenue) of that instance; two random instances are used to simulate the presence or absence of a feature, the marginal value of the feature in the specific instance is calculated, and the mean of the absolute values of its Shapley values is used as the global value of the feature in the data set. The data features in the experimental data set work together in a machine learning algorithm to produce a predicted value. The Shapley value is used to allocate the value of each feature according to its marginal value (contribution) and to quantify the impact of different input features on the output prediction results of the trained model; the distribution of feature values balances prediction accuracy against prediction cost and determines whether a given feature is selected.
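The global value of a feature, defined above as the mean of the absolute values of its Shapley values, can be computed as in the following Python sketch. The function name `global_importance` and the sample numbers are illustrative assumptions, not values from the invention.

```python
def global_importance(shapley_rows):
    """Mean absolute Shapley value per feature.

    shapley_rows[i][j] is the Shapley value of feature j for instance i.
    """
    n_instances = len(shapley_rows)
    n_features = len(shapley_rows[0])
    return [
        sum(abs(row[j]) for row in shapley_rows) / n_instances
        for j in range(n_features)
    ]

# Hypothetical per-instance Shapley values for three features
rows = [
    [0.40, -0.10, 0.05],
    [-0.30, 0.20, 0.00],
    [0.50, -0.15, -0.05],
]
importance = global_importance(rows)
# Rank feature indices from most to least important
ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
```

Taking absolute values before averaging keeps features whose positive and negative contributions would otherwise cancel out from being ranked as unimportant.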

Example 1

This embodiment proposes a dynamic pricing method based on the combination of data features of the Shapley value. The schematic diagram of the general architecture is shown in Figure 1, which includes the following steps:

101. The sensor data of the third-party data trading platform is associated with relevant historical files to obtain a characteristic data set. Sensor data comes from data sets collected by data sellers using sensors in all aspects of production and operation.

102. Perform feature selection and sorting on the collected feature data. First, recursive feature elimination based on cross-validation (CV-RFE) is carried out: corresponding weights are assigned to all S features, and a random forest prediction model is trained on these original data features, which yields a weight value for each input feature. The absolute value of each weight is then taken, and the feature with the smallest absolute weight is removed, which eliminates the features with the smallest weight coefficients. The next round of training is then performed on the new feature set. These steps form one round of training, and the rounds recurse until the number of remaining features reaches the required number, so that features are selected by recursively shrinking the feature set under investigation. Cross-validation is used to determine the final k features that increase prediction accuracy. The k features produced by CV-RFE and the accuracy-increment screening are then combined and arranged to output all subsets, giving 2^k - 1 feature subsets (the empty set is removed). Each feature subset is split into a training set and a validation set and used to train the model, the accuracy of each subset is calculated as the mean over multiple rounds of iterative experiments, and finally the feature subsets are output together with their corresponding accuracies.
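The elimination loop and the enumeration of the 2^k - 1 subsets in step 102 can be sketched in plain Python. This is a simplified illustration: `weight_fn` is a hypothetical stand-in for retraining the random forest and reading its feature weights, the feature names are invented, and cross-validation is left out.

```python
from itertools import combinations

def recursive_elimination(features, weight_fn, k):
    """Repeatedly drop the feature with the smallest absolute weight until k remain."""
    remaining = list(features)
    while len(remaining) > k:
        weights = weight_fn(remaining)     # stand-in for retraining the model
        weakest = min(remaining, key=lambda f: abs(weights[f]))
        remaining.remove(weakest)
    return remaining

def nonempty_subsets(features):
    """All 2^k - 1 non-empty subsets of the selected features."""
    return [
        subset
        for size in range(1, len(features) + 1)
        for subset in combinations(features, size)
    ]

# Hypothetical features and fixed weights in place of a trained model
fixed_weights = {"temperature": 0.5, "pressure": -0.1, "speed": 0.3, "humidity": -0.4}
selected = recursive_elimination(list(fixed_weights), lambda feats: fixed_weights, k=3)
subsets = nonempty_subsets(selected)       # 7 subsets for k = 3
```

Each subset would then be trained and scored separately; because the number of subsets doubles with every added feature, the prior reduction to k features keeps this enumeration tractable.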

103. Estimation of the feature Shapley value constructed on the basis of ghost data instances. The Shapley value is the feature's contribution to the prediction; the value function is the payoff function of the coalition of participants (features). To calculate the exact Shapley value of the j-th feature, it is necessary to evaluate the predictions of all coalitions of feature values with and without feature j. The number of coalitions grows exponentially as the number of features increases. The approximate Shapley value calculation based on Monte Carlo sampling solves this problem:

$$\hat{\phi}_j = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{f}\left(x_{+j}^m\right) - \hat{f}\left(x_{-j}^m\right)\right)$$

where M is the number of iterations and $\hat{f}(x_{+j}^m)$ is the prediction for instance x at the m-th iteration obtained using the instance with feature j, in which the feature values after feature j are replaced by the feature values of a randomly sampled instance z. The vector $x_{-j}^m$ is almost identical to $x_{+j}^m$, but its value for feature j is also taken from the randomly sampled instance z; both are newly combined samples. In line with the nature of the data features, the Shapley value is used to distribute the value of the data features according to their prediction contribution, and the result is fair. The specific calculation steps are shown in Table 1.

Table 1 Revenue split algorithm based on approximate Shapley value

104. Design an auction mechanism for data characteristics, and test whether the data is suitable for machine learning and trading based on factors such as data quality. If the transaction can be carried out, the mechanism design is carried out; otherwise, the mechanism design is not carried out.

105. Data dynamic pricing system and data transaction control panel based on the multiplicative weight update algorithm. Based on the idea of multiplicative weights and the characteristics of data transactions, a pricing algorithm based on multiplicative weight updates is designed to maximize the long-term income of the platform, so that the income generated by the chosen prices matches the income that the optimal price in hindsight would have obtained and the average regret of the participants is 0. This is conducive to maximizing the utility of both buyers and sellers, forming a healthy transaction relationship, and giving full play to the value of data. The data transaction control panel summarizes the obtained auction prices and other information and displays them in a variety of visual forms such as graphics.

Example 2

An important feature of the present invention is the feature selection and feature Shapley value estimation algorithm based on machine learning to obtain the distribution of individual feature prediction contributions, as well as the correlation trend and global importance of features. This embodiment further illustrates this.

In the model selection process (see Figure 2), the feature data (201) are collected by sensors and the like and then divided into a training set (202) and a validation set (209). On the training set, optimal feature selection under a fixed number of features uses CV-RFE feature selection (203) to obtain the optimal number of features (204); optimal feature selection under a variable number of features performs feature combination and arrangement (205) to determine the optimal combination of features (206). Machine learning (207) is performed on the training set with the optimal number of features and the optimal combination to obtain a prediction model (208), and finally model verification (210) is performed on the data of the validation set (209) to obtain the optimal number of features and the optimal combination of features.

In the reasoning stage (see Figure 3), the sorted feature vectors are fed into the machine learning prediction model (301) to obtain the prediction results (302), and then all the feature data and prediction results are brought into the Shapley value analysis model (303). Finally, the global importance of the features, the trend of the correlation between features and prediction results, and the distribution of the prediction contributions of individual feature data are obtained. Based on the established model, the contribution analysis of prediction results using the Shapley value method can be divided into two levels. At the global level (306, 305), the distribution of the Shapley values can be used to describe the specific influence, regularity, and correlation of the features; at the local level (304), the quantified contribution of each feature to each sample prediction can be given. Once the Shapley value algorithm has given the value contribution of each feature, that contribution can be weighed against the cost of data collection.

Example 3

Fig. 4 shows the data feature auction transaction pricing mechanism of the multiplicative weight update algorithm disclosed in an embodiment of the present invention. The multiplicative weight update process maintains a weight for each pricing strategy and randomly selects strategies over repeated iterations to achieve the maximum long-term operating income. Assume that a decision set contains α alternative decisions, each corresponding to a specific income β (the income is not known a priori), and that multiple rounds of selection are performed on it. In each round, the current weight of each decision is multiplied by an income factor related to the income of the current round to give the updated weight, and the decision-making party repeatedly makes choices and obtains the corresponding benefits. After multiple rounds, the weight of the strategy with the highest income becomes prominent, and the probability of that strategy being selected increases significantly.

The core idea of the multiplicative weight update algorithm is illustrated by the example of predicting data auction prices from expert opinions. Assume that the auction price trend is random and that the state of the auction price (fall or rise) is to be predicted through the opinions of experts; all N experts form a set C. Before the data auction, the suggestion of an expert i in C is randomly selected to predict the trend of the data auction (down or up). If the expert's prediction is wrong, the loss is 1; if the prediction is correct, the loss is 0. Since expert i is randomly selected, the algorithm aims to keep its predictions close to those of the best-performing expert in the long run: in the next round of prediction, an expert who predicted correctly is more likely to be selected. This is done by maintaining weights for the group of experts and following the opinion of the weighted majority of experts in each round. Let the initial weight of each of the N experts be 1, and let each round's forecast be one of two outcomes (down or up). A parameter η (η < 0.5) is introduced as a factor related to income, and in the next round of selection the weight of each expert who predicted wrongly is reduced by a factor of (1 − η). After T rounds of selection, the error of the algorithm is bounded above.
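The expert example can be sketched as a deterministic weighted-majority procedure in Python. This is a minimal illustration of the update rule described above, with assumptions labeled: the function name, the encoding of "up" as 1 and "down" as 0, the tie-breaking toward "up", and the sample predictions are all hypothetical.

```python
def weighted_majority(expert_predictions, outcomes, eta=0.25):
    """Follow the weighted majority and shrink the weights of experts who were wrong.

    expert_predictions[t][i] is expert i's forecast in round t (1 = up, 0 = down).
    outcomes[t] is the realized trend in round t.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts                   # every expert starts with weight 1
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        up = sum(w for w, p in zip(weights, preds) if p == 1)
        down = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if up >= down else 0            # ties go to "up" (assumed)
        if guess != outcome:
            mistakes += 1
        # Multiply the weight of every wrong expert by (1 - eta)
        weights = [
            w * (1.0 - eta) if p != outcome else w
            for w, p in zip(weights, preds)
        ]
    return weights, mistakes

# Expert 0 is always right, expert 1 always wrong, expert 2 always says "up"
outcomes = [1, 0, 1, 0]
predictions = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
final_weights, n_mistakes = weighted_majority(predictions, outcomes)
```

After a few rounds the always-correct expert keeps its full weight while the others decay geometrically, so its opinion increasingly determines the weighted majority.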

The multiplicative weight update algorithm mainly has the following four steps:

Step 1: The data seller sets the current price of the data as p_n. There are N data buyers, who purchase the data sequentially; data buyer n quotes b_n for the data to be purchased, and for every buyer n ∈ [N] the quotation b_n comes from a closed and bounded set B whose diameter D satisfies D < ∞, that is, b_n ∈ B.

Step 2: The income function of the data buyer is G(p_n, b_n), which depends on the buyer's quotation b_n and the current price p_n. Different quotations and different current prices lead to different incomes for buyer n.

Step 3: The data seller determines the buyer's payment function RF(p_n, b_n) from the current price p_n and the buyer's quotation b_n. RF is a Lipschitz function of the price and is used to calculate the buyer's final payment:

$$\left|RF\left(p^{(1)}, b\right) - RF\left(p^{(2)}, b\right)\right| \le L\,\left|p^{(1)} - p^{(2)}\right|$$

where L is the Lipschitz coefficient, b is the buyer's quotation, and p^{(1)} and p^{(2)} are two prices.

Step 4: The data buyer pays the fee R_n, takes away the data prediction result, and completes a single transaction; the data seller updates the data price to p_{n+1}, returns to the first step, and starts the next round of pricing.
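The four steps can be tied together in a short simulation. The Python sketch below is an illustration under stated assumptions rather than the disclosed algorithm: the candidate price grid, the update factor `1 + delta * revenue / b_max`, the function names, and the simple posted-price revenue function (used here in place of the payment function RF) are all hypothetical.

```python
import random

def posted_price_revenue(price, bid):
    """Hypothetical revenue: the buyer pays the price only if the bid covers it."""
    return price if bid >= price else 0.0

def multiplicative_weights_pricing(bids, candidate_prices, revenue_fn, delta=0.1, seed=0):
    """Choose a price each round and reweight all candidate prices by their revenue."""
    rng = random.Random(seed)
    b_max = max(candidate_prices)                 # used to scale revenue into [0, 1]
    weights = [1.0] * len(candidate_prices)
    posted, total_revenue = [], 0.0
    for bid in bids:
        # Step 1: draw the current price in proportion to the weights
        price = rng.choices(candidate_prices, weights=weights, k=1)[0]
        posted.append(price)
        # Steps 2 and 3: the buyer's bid determines the payment for this round
        total_revenue += revenue_fn(price, bid)
        # Step 4: update every candidate price by the revenue it would have earned
        weights = [
            w * (1.0 + delta * revenue_fn(c, bid) / b_max)
            for w, c in zip(weights, candidate_prices)
        ]
    return weights, posted, total_revenue

prices = [0.2, 0.4, 0.6, 0.8, 1.0]
weights, posted, revenue = multiplicative_weights_pricing([0.6] * 200, prices, posted_price_revenue)
best_price = prices[weights.index(max(weights))]
```

Because each round reveals the buyer's quotation, every candidate price can be scored in hindsight, and the weight of the price that would have earned the most grows fastest.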

Assume that the Lipschitz and bounded-quotation conditions hold. Let p_{n: n ∈ [N]} be the output of the pricing algorithm, let L be the Lipschitz constant of the payment function RF, and let B_max be the largest element of the bounded quotation set B. Then by choosing the algorithm parameters:

Then the total average regret value is bounded:

where B_max ∈ R is the buyer's maximum quotation in the set B, and B_net(ε) ⊂ R is the minimal ε-net of B, which means a minimal set K such that

for all x ∈ B, there is an x_0 ∈ K with |x − x_0| ≤ ε. The elements of B_net(ε) are the different prices tested in the multiplicative weight update algorithm, and N refers to the number of different prices.

Table 2 Data characteristics Auction transaction pricing mechanism symbol description

Example 4

The present invention also describes a data dynamic pricing control panel disclosed in the embodiment (see Fig. 5). The third-party data trading platform enters an introduction to the auction data (501), such as the industry information and attributes of the data, for data buyers to view. Multiple data buyers enter the auction market anonymously and decide, based on the data-related information, whether to take part in the auction (502). Buyers who take part submit a bid (503); buyers who do not take part wait for the next round of data auction. If only one data buyer bids, the data goes to that buyer; if multiple buyers bid, the auction follows the principle that the highest bidder wins. Finally, one buyer bids successfully (504) and the data goes to that buyer, while the remaining buyers wait for the next round of data auction. The third-party data trading platform summarizes the transaction records (505) obtained in the above auction steps, such as transaction volume, year-on-year transaction amount, and the proportion of buyers by industry, displays them in various visual forms such as graphics, and uses them to support information analysis and other decision-making.

Example 5

This embodiment provides the information interaction process of the Shapley-value-based data feature combination pricing method, as shown in FIG. 6. It is described from the perspective of the data buyer, the third-party data trading platform server, and the seller's control panel terminal, and includes the following steps:

The third-party data trading platform acquires, on site, the feature data generated in the data seller's production and operation (601), and then performs CV-RFE feature selection and ranking (603) on the acquired data to obtain a ranked feature combination;

The Shapley value model (604) is applied to these feature data to obtain each feature's prediction contribution, the trend between features and prediction results, the global importance of the features, and so on;

The third-party data trading platform conducts dynamic transaction pricing for the auction (605). The data buyer decides whether to purchase the data according to the value of the auctioned data and whether it can increase the revenue of the enterprise. The third-party data trading platform collects the auction prices and identifies abnormal ones, such as prices that are too low or too high (606);

The auction price information is summarized (607) and sent (608) to the control panel terminal, which publishes the result information (609).
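Step 606 does not state how an auction price is judged to be too low or too high. One possible screening rule, shown here purely as an assumption, is the interquartile-range rule; the function name and the multiplier `k` are hypothetical.

```python
import statistics

def flag_abnormal_bids(bids, k=1.5):
    """Split bids into (normal, abnormal) with the interquartile-range rule.

    The IQR rule is an illustrative assumption, not the patent's own test.
    Requires at least two bids.
    """
    q1, _, q3 = statistics.quantiles(bids, n=4)   # quartiles of the bids
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread  # acceptance band
    normal = [b for b in bids if low <= b <= high]
    abnormal = [b for b in bids if b < low or b > high]
    return normal, abnormal
```

Bids flagged as abnormal could then be excluded before the price information is summarized in step 607.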

Because data is reproducible, replicating data dilutes the income of each data source, which is described by a penalty function. The overall income distribution S_n (0 < S_n ≤ 1) is fixed, where S_n = G(P_n, b_n) is the benefit of the data buyer. A penalty function e (0 < e < 1) is introduced to handle data replication, and the problem is solved with the robust Shapley value algorithm (see FIG. 7). Given R_n = PD(S_n, Y_n; M, G), S_n is the overall income distribution, Y_n is the prediction task, M is the machine learning prediction algorithm, G is the prediction gain function, and PD is the robust Shapley value algorithm for profit distribution (Table 3). Given a similarity measure SM and the i-th replica A′_i of A, the payment distribution function of A is R_n(A); the output R_n(A) of the robust Shapley value algorithm (Table 3) is ε-robust to replication.

Table 3 Robust Shapley value algorithm

Referring to FIG. 7, take as examples the cases in which a dataset on the market has no replica, 1 replica, and 2 replicas. When the dataset has no replica (701), the overall income distribution is S_n. When there is 1 replica on the market (702), there are 2 datasets in total and the penalty function e is introduced, so the income distributed to each dataset is (1/2)·S·e. When there are 2 replicas on the market (703), there are 3 datasets and the penalty becomes e², so the income distributed to each dataset is (1/3)·S·e², and so on. The more replicas of a dataset there are on the market, the less income each one receives. After the price S is determined each time, when the same data is sold to multiple users, the data is priced according to the replica price. If the data is copied into i samples, the selling price S_n of each sample is:

Here, S is the actual selling price when there is only one copy of the data, which differs from the buyer's quotation b_n and from the price p_n set by the seller; e is the penalty factor.
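The worked cases of FIG. 7 (S with no replica, S·e/2 with 1 replica, S·e²/3 with 2 replicas) can be reproduced with a short sketch. Extending the pattern to k replicas as S·e^k/(k+1) is an extrapolation of the "and so on" in the text, not a formula quoted from it, and the function and parameter names are hypothetical.

```python
def income_per_dataset(single_copy_price, penalty, replicas):
    """Income distributed to each dataset when `replicas` copies also exist.

    Follows the FIG. 7 pattern: S, S*e/2, S*e**2/3, ... (extrapolated).
    """
    if not 0 < penalty < 1:
        raise ValueError("penalty factor e must satisfy 0 < e < 1")
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    datasets = replicas + 1  # the original plus its replicas
    return single_copy_price * penalty ** replicas / datasets
```

With S = 100 and e = 0.9, the income per dataset falls from 100 to about 45 with one replica and to about 27 with two, which shows how quickly replication dilutes each source's share.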

Example 6

This embodiment discloses a schematic structural diagram of a device for the Shapley-value-based data feature combination dynamic pricing method. The device may be an electronic device. It may include: a processor 801, which transmits the effective information; a memory 802, which stores data such as the feature data; a communication interface 803, which is the interface between the central processing unit and the standard communication subsystem; and a control panel 804, which performs the screen display. After feature selection and ranking are performed on the enterprise's production and operation data (803), the data analyzed with the Shapley value is put up for auction and the result is transmitted to the control panel terminal.

Please refer to FIG. 9, which is a schematic structural diagram of another device for the Shapley-value-based data feature combination dynamic pricing method disclosed in an embodiment of the present invention. The pricing device may be an electronic device. The three algorithms of the present invention are carried out in the following steps:

① Feature selection and ranking: the device obtains data from cascaded sensors and files through the acquisition unit 901 and sends it to the calculation unit for analysis. After receiving the signal, the calculation unit 902, together with the control unit 903, predicts and ranks the feature data. The storage unit 904 then stores the predicted and ranked feature data and, once the work is complete, transmits the result to the output unit 905.

② Feature Shapley value estimation by constructing ghost data instances: the ranked feature combination data above is input to the acquisition unit 901 and handed to the calculation unit 902 for analysis. The control unit 903 designs and constructs instances containing ghost data to calculate the marginal contribution, its mean, and so on; the result is then stored and output.

③ Dynamic data pricing based on the multiplicative weights update algorithm: the current price set by the seller, the number of buyers, and the buyers' quotations are input to the acquisition unit 901. The calculation unit 902 and the control unit 903 calculate the buyer's payment according to the buyer's income function and payment function. The storage unit 904 stores the relevant data, and the seller's updated data price for the next round is output.
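Step ② can be illustrated with a minimal Monte Carlo sketch of the ghost-instance estimate. It assumes a sampling scheme in which, at each iteration, a random instance z and a random feature ordering are drawn, and two ghost instances are built that differ only in feature j. The function and parameter names are hypothetical, and `predict` stands in for whatever learning model the device has trained.

```python
import random

def shapley_estimate(predict, x, data, j, iterations=1000, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x."""
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(iterations):
        z = rng.choice(data)                 # randomly drawn instance
        order = list(range(n_features))
        rng.shuffle(order)                   # random feature ordering
        after_j = set(order[order.index(j) + 1:])
        # Ghost instance with feature j: features after j are taken from z.
        x_plus = [z[k] if k in after_j else x[k] for k in range(n_features)]
        # Ghost instance without feature j: feature j is also taken from z.
        x_minus = list(x_plus)
        x_minus[j] = z[j]
        total += predict(x_plus) - predict(x_minus)  # marginal contribution
    return total / iterations                # mean over all iterations
```

For a linear model such as `lambda v: 2 * v[0] + 3 * v[1]`, with x = [1, 1] and an all-zero background instance, the estimate for feature 0 is 2 and for feature 1 is 3, matching each feature's share of the prediction.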

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims

A data feature combination pricing method based on the Shapley value, characterized in that it comprises the following steps:

collecting the feature variables of the feature data set provided by the seller and preprocessing them;

constructing a learning model based on machine learning, and selecting the optimal feature classification variables from the feature classification variables;

when selecting the optimal variables, estimating the feature Shapley value based on ghost data instances to calculate the marginal contribution and the average Shapley value of the selected feature variables;

and, when selecting the optimal variables, using the Shapley value to allocate value to each feature according to its marginal contribution, quantifying the impact of different input features on the output prediction of the trained model, and keeping the features whose marginal contribution meets the set requirement;

detecting whether the data can be used for machine learning and trading; if so, the data buyer and the data seller construct a transaction, and the predicted value of the current data obtained through the constructed learning model is used as the payment price of the data.
The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the process of selecting the optimal feature classification variables from the feature classification variables comprises the following steps:

Use all feature subvariable data to train the learning model based on machine learning;

Sort the importance of the feature variables, and select the top k features with the largest importance values;

Evaluate the model on the validation set, recalculate and rank the importance of each feature variable;

Split the training set into a new training set and a new validation set, use the new training set and all feature variables to train the model, use the validation set to evaluate the model, calculate and rank the importance of all feature variables.
The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that estimating the Shapley value of a feature variable based on ghost data instances includes randomly drawing an instance from the feature variables, constructing an instance with a given feature and an instance without that feature, and using these two instances as the ghost data instances.
The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the marginal contribution of a feature variable is expressed as:

φ_j^m = f̂(x^m_{+j}) − f̂(x^m_{−j})

where φ_j^m is the marginal contribution of the j-th feature of instance x at the m-th iteration; f̂(x^m_{+j}) is the prediction for instance x at the m-th iteration obtained with the instance that has feature j, and x^m_{+j} is the feature vector in which the features after the j-th feature of instance x are randomly replaced by the features of instance z at the m-th iteration; f̂(x^m_{−j}) is the prediction for instance x at the m-th iteration obtained with the instance that lacks feature j, and x^m_{−j} is the feature vector in which the j-th feature of instance x and the features after it are randomly replaced by the features of instance z at the m-th iteration.
The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the process of pricing the feature variables comprises the following steps:

S41. Before trading with the data buyer, the data seller first sets the price p_n of the transaction data, the number of buyers, and the buyers' quotations, and calculates the data buyer's income function;

S42. The final payment of the data buyer is calculated according to the buyer's income function; the data buyer pays the fee and trades the selected feature variables;

S43. The seller updates the data price based on the multiplicative weights update algorithm, returns to S41, and starts the next round of pricing.
The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the payment R_n made by the data buyer is:

where G(b_n, p_n) is the buyer's profit function when the seller sets the price of the transaction data to p_n and the buyer's quotation is b_n.
The data feature combination pricing method based on the Shapley value according to claim 6, characterized in that the buyer's profit function is determined by the price of the transaction data set by the seller and the buyer's quotation. When the quotation b_n is less than the price p_n set by the seller, the buyer's profit increases as b_n increases, reaching its maximum when b_n equals p_n; when b_n is greater than p_n, the buyer's utility remains at its maximum and the buyer's payment also remains at its maximum.
The data feature combination pricing method based on the Shapley value according to claim 5, characterized in that, after the price S is determined each time, when the same data is sold to multiple users, the data is priced according to the replica price; if the data is copied into i samples, the selling price S_n of each sample is:

where S is the selling price when there is only one copy of the data, and e is the penalty factor.
A data feature combination pricing system based on the Shapley value, characterized in that it includes a feature selection subsystem and a pricing subsystem; the feature selection subsystem screens the features, and the pricing subsystem runs a pricing auction on the screened features;

The feature selection subsystem includes a machine learning model and a Shapley analysis model. The machine learning model performs training and prediction on the data, ranks the predicted values as the importance of the features, and sends the K most important features to the Shapley analysis model for analysis; the Shapley analysis model calculates the marginal contribution and the average Shapley value of the feature variables;

In the pricing subsystem, data buyers submit quotations based on the prices set by data sellers.
An electronic device for data feature combination pricing based on the Shapley value, comprising a processor and a memory, characterized in that the memory stores the data feature combination pricing method based on the Shapley value according to any one of claims 1 to 8, and the processor can run the method stored in the memory.