WO2023082969A1 - Data feature combination pricing method and system based on shapley value and electronic device - Google Patents

Data feature combination pricing method and system based on shapley value and electronic device

Info

Publication number
WO2023082969A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
shapley
buyer
features
Prior art date
Application number
PCT/CN2022/126712
Other languages
French (fr)
Chinese (zh)
Inventor
余海燕
刘珂
缪红霞
Original Assignee
重庆邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 重庆邮电大学
Publication of WO2023082969A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0283 Price estimation or determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/08 Auctions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Definitions

  • the present invention relates to machine learning, in particular to a data feature combination pricing method, system and electronic equipment based on Shapley values.
  • a bank uses financial technology to analyze various data, applying machine learning to the purchased feature data to make predictions, which provides an important tool for addressing the problem of information asymmetry.
  • the bank will not only use the data about the enterprise within the banking system, but also use the valuable external data that can be obtained about the operating ability of the enterprise.
  • By capturing the trajectory of the enterprise's production and operation it provides financial institutions with reliable "credit data", which not only improves the possibility of successful loans, but also reduces transaction costs and credit service thresholds.
  • This data transaction process is realized through a third-party data trading platform, which can guarantee the privacy and security of the buyer's data to a certain extent and can also ensure, through dynamic market pricing, that the price paid by the data buyer is reasonable.
  • the third-party trading platform needs to price the purchased data in the market, and provide the data and payment fees required by both parties to the transaction.
  • companies that successfully purchase data need to sign a confidentiality agreement with the platform. The data is limited to the company's own business use and cannot be disseminated or re-sold.
  • the third-party data trading platform builds a data feature selection model and a feature value distribution algorithm that approximates the Shapley value, and can judge which feature variables have the greatest impact on the results and which feature variables have less impact on the results based on the obtained results.
  • the buyer pays attention to the set of features with greater influence, and controls risks and reduces losses through machine learning results to a certain extent.
  • the purchase of this data can obtain specific information of the corresponding industry, which provides support for the loan evaluation and analysis of the industry, and can also reduce loan risks.
  • data sellers can also get a profit.
  • the third-party trading platform provides data dynamic pricing methods and systems.
  • a feature selection algorithm based on increasing prediction accuracy is used.
  • the combination of recursive feature elimination method, cross-validation and feature combination can effectively select the data features, and then carry out information mining analysis on the selected data features.
  • the present invention proposes a data feature contribution distribution method based on the Shapley value, which can calculate the corresponding effect of each feature (marginal contribution to prediction accuracy).
  • the monitored feature data of the transaction is priced dynamically by means of an auction and the multiplicative weight update algorithm.
  • the improved multiplicative weight update algorithm realizes dynamic pricing of data features, which is conducive to fully realizing the value of data and bringing additional income to enterprises.
  • the present invention proposes a data feature combination pricing method, system and electronic device based on the Shapley value, the method including the following:
  • the data buyer and seller construct a transaction, and obtain the predicted value of the current data through the constructed learning model as the payment price of the data.
  • the process of selecting the optimal feature classification variables from the feature classification variables includes the following steps:
  • the estimation of the Shapley value of the feature variable based on the ghost data instance includes randomly selecting an instance from the feature variable, constructing an instance with a certain feature and an instance without that feature, and combining the two instances as a ghost data instance.
  • the process of pricing the characteristic variable includes the following steps:
  • Before trading with the data buyer, the data seller first determines the price p_n of the transaction data, the number of buyers and the buyer's quotation, and calculates the data buyer's income function;
  • the seller updates the data price based on the multiplicative weight update algorithm, returns to S41, and starts the next round of pricing.
  • G(b_n, p_n) is the buyer's profit function when the seller sets the price of the transaction data as p_n and the buyer's quotation is b_n.
  • the seller's income function is determined according to the price of the transaction data set by the seller and the quotation of the buyer.
  • when the seller's price is fixed and the quotation b_n is smaller than the price p_n of the transaction data set by the seller, the buyer's profit increases as the quotation b_n increases, until the quotation b_n equals the price p_n, where the profit reaches its maximum; when the quotation b_n is greater than the price p_n, the buyer's utility remains at the maximum value and the fee paid by the buyer also remains at the same maximum value.
  • the selling price Sn of each sample is:
  • S is the selling price when there is only one piece of data
  • e is the penalty factor
  • the present invention proposes a data feature combination pricing system based on the Shapley value, including a feature selection subsystem and a pricing subsystem.
  • the feature selection subsystem screens features, and the pricing subsystem performs pricing auctions on the screened features;
  • the feature subsystem includes the machine learning model and the Shapley analysis model.
  • the machine learning model performs training and prediction based on the data, sorts the predicted values as the importance of the features, and sends the K features with the greatest importance to the Shapley analysis model for analysis;
  • the Shapley analysis model calculates the marginal contribution and the average Shapley value of the feature variable;
  • the present invention also proposes an electronic device for data feature combination pricing based on the Shapley value, including a processor and a memory, where the memory stores any one of the aforementioned Shapley value-based data feature combination pricing methods,
  • and the processor is capable of running the Shapley value-based data feature combination pricing method stored in the memory.
  • the designed data transaction model and real-time dynamic pricing algorithm can maximize the long-term profit of the enterprise; at the same time, the feature data obtained from the auction also gives data buyers such as banks or insurance companies decision support for loan evaluation, reducing losses on loans and compensation.
  • the auction data obtained by the third-party trading platform can be visualized through the transaction control panel to quickly extract key information.
  • Fig. 1 is a schematic diagram of the overall architecture of the dynamic pricing based on the combination of Shapley value data features disclosed in the embodiment of the present invention.
  • Fig. 2 is a schematic diagram of data feature selection and sorting based on the Shapley value disclosed in the embodiment of the present invention.
  • Fig. 3 is a schematic diagram of a characteristic Shapley value based on machine learning disclosed in an embodiment of the present invention.
  • Fig. 4 is a schematic diagram of auction pricing based on data characteristics disclosed in the embodiment of the present invention.
  • Fig. 5 is a schematic diagram of a data dynamic pricing control panel disclosed in an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of information interaction of a Shapley value-based data feature combination dynamic pricing method disclosed in an embodiment of the present invention
  • Fig. 7 is a schematic diagram of introducing a penalty function into the Shapley value to account for data replicability, as disclosed by the embodiment of the present invention.
  • Fig. 8 is a schematic structural diagram of a Shapley value data feature combination dynamic pricing method device disclosed in an embodiment of the present invention.
  • Fig. 9 is a schematic structural diagram of another Shapley value data feature combination dynamic pricing method device disclosed in an embodiment of the present invention.
  • the present invention proposes a data feature combination pricing method based on the Shapley value, which specifically includes the following steps:
  • the quality inspector judges whether the data can be traded, and if so, the transaction proceeds.
  • to handle missing data, the quality inspector first deletes the columns (attributes) with a missing rate higher than 10%, and then deletes the rows (tuples) that contain missing data;
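The two-stage deletion described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `drop_missing`, the row-as-dict representation and the use of `None` to mark a missing value are all assumptions made for the example.

```python
def drop_missing(rows, max_missing_rate=0.10):
    """Clean a table given as a list of dicts (column name -> value).

    A value of None marks a missing entry (an assumption of this sketch).
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    n = len(rows)
    # Stage 1: keep only the columns whose missing rate does not exceed the threshold.
    kept = [c for c in columns
            if sum(r[c] is None for r in rows) / n <= max_missing_rate]
    # Stage 2: among the kept columns, drop every row that has a missing value.
    return [{c: r[c] for c in kept}
            for r in rows
            if all(r[c] is not None for c in kept)]


# Example: column "c" is missing in 20% of rows and is dropped;
# column "b" is missing in exactly 10% of rows and is kept.
table = [{"a": i, "b": 2 * i, "c": 3 * i} for i in range(10)]
table[0]["c"] = None
table[1]["c"] = None
table[2]["b"] = None
cleaned = drop_missing(table)
```

Deleting sparse columns first means that a row is not discarded merely because of a column that would have been removed anyway.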
  • the labeled data undergoes multiple rounds of manual inspection;
  • the quality inspector conducts a round of sampling inspection on the marked data, inspecting 50% of the marked data by random sampling or stratified sampling. If all the marked data are qualified in the first round, then in the second round of inspection 25% of the labeled data will be inspected for quality;
  • the quality inspector needs to conduct a full sample inspection of the labeler's data labeling in the second round of inspection;
  • the amount of labeled data inspected in the second round of sampling inspection will be doubled compared with the first round;
  • the amount of labeled data inspected in the second round of sampling inspection will increase by 30% compared with the first round;
  • a learning model based on machine learning is constructed to predict a single instance.
  • the prediction process is the "payment", the "revenue" is the actual prediction for the instance minus the average predicted value over all samples, and the Shapley value of a feature is the average marginal contribution of the feature over all feature sequences, so that the contribution of each feature to the prediction result is divided fairly.
  • the constructed feature Shapley value estimation solves the feature value distribution (contribution) problem based on the Shapley value. Specifically, the difference between the prediction for a specific instance and the average prediction over the data set is used as the revenue to be distributed among the features of this instance; two random instances are used to simulate the presence or absence of a feature in order to calculate the marginal value of the feature in a specific instance; and the mean of the absolute values of its Shapley values is used as the global value of the feature in the data set.
  • the data features in the experimental dataset work together in a machine learning algorithm to produce a predicted value.
  • the Shapley value is used to allocate the value of each feature according to its marginal value (contribution) and to quantify the impact of different input features on the prediction output of the trained model. The distribution of feature values balances prediction accuracy against prediction cost and determines whether a given feature is selected.
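The estimation described above, where two constructed instances simulate the presence and absence of one feature, can be illustrated with a standard permutation-sampling approximation of the Shapley value. The sketch below is one common way to do this and is not claimed to be the patent's exact procedure; `predict`, `shapley_estimate`, `global_importance` and the sample counts are hypothetical names and values chosen for the example.

```python
import random


def shapley_estimate(predict, x, data, j, n_samples=1000, rng=None):
    """Monte Carlo estimate of the Shapley value of feature j for instance x.

    predict: callable taking a feature list and returning a number.
    data:    list of instances used as the background population.
    """
    rng = rng or random.Random(0)
    n_features = len(x)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice(data)                 # random background instance
        order = list(range(n_features))
        rng.shuffle(order)                   # random feature sequence
        from_x = set(order[:order.index(j)])  # features that precede j keep x's values
        # Instance "with" feature j: x's value for j and for the preceding features.
        with_j = [x[i] if (i in from_x or i == j) else z[i]
                  for i in range(n_features)]
        # Instance "without" feature j: identical, except j is taken from z.
        without_j = [x[i] if i in from_x else z[i]
                     for i in range(n_features)]
        total += predict(with_j) - predict(without_j)  # marginal contribution
    return total / n_samples


def global_importance(predict, data, j, n_samples=200, rng=None):
    """Mean absolute Shapley value of feature j over all instances in data."""
    rng = rng or random.Random(0)
    values = [shapley_estimate(predict, x, data, j, n_samples, rng) for x in data]
    return sum(abs(v) for v in values) / len(values)
```

For a linear model the estimate for feature j reduces to the weight of j times the difference between the instance's value and the average sampled background value, which gives a simple sanity check.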
  • This embodiment proposes a dynamic pricing method based on the combination of data features of the Shapley value.
  • the schematic diagram of the general architecture is shown in Figure 1, which includes the following steps:
  • the sensor data of the third-party data trading platform is associated with relevant historical files to obtain a characteristic data set.
  • Sensor data comes from data sets collected by data sellers using sensors in all aspects of production and operation.
  • features are selected by recursively reducing the size of the feature set under investigation until the number of remaining features reaches the required number of features.
  • Cross-validation is used to finally determine the k features with increasing prediction accuracy. The k features generated by the above-mentioned CV-RFE and the incremental screening of prediction accuracy are then combined to output all subsets, resulting in 2^k - 1 feature subsets (excluding the empty set). All feature subsets are then split into a training set and a verification set and used to train the model, and the accuracy of each feature subset is calculated as the mean over multiple rounds of iterative experiments. Finally, the feature subsets are output together with their corresponding accuracy.
  • Data dynamic pricing system and data transaction control panel based on the multiplicative weight update algorithm. Based on the idea of multiplicative weights and the characteristics of data transactions, a pricing algorithm based on multiplicative weight updates is designed to maximize the long-term income of the platform, so that the revenue generated by the chosen prices approaches the revenue that the optimal price in hindsight would have obtained.
  • the average regret value of participants is 0, which is conducive to the maximum utility of both buyers and sellers, forming a benign transaction relationship, and giving full play to the value of data; the data transaction control panel summarizes the obtained auction price and other information, and displays it in a variety of visual forms such as charts.
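The enumeration of the 2^k - 1 non-empty feature subsets can be sketched as follows. The helper names `nonempty_subsets` and `rank_subsets` are illustrative, and `score` stands in for whatever cross-validated accuracy the trained model reports; neither is taken from the patent.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Return every non-empty subset of the given features, smallest first."""
    return [subset
            for size in range(1, len(features) + 1)
            for subset in combinations(features, size)]


def rank_subsets(features, score):
    """Pair each subset with its score and sort from best to worst.

    score: callable mapping a tuple of feature names to a number,
           for example a mean validation accuracy (assumed, user-supplied).
    """
    scored = [(score(s), s) for s in nonempty_subsets(features)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


subsets = nonempty_subsets(["f1", "f2", "f3"])  # 2**3 - 1 = 7 subsets
```

Because the number of subsets doubles with every added feature, exhaustive enumeration is only practical after the feature count has already been reduced to a small k, which is why the selection step comes first.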
  • An important feature of the present invention is the feature selection and feature Shapley value estimation algorithm based on machine learning to obtain the distribution of individual feature prediction contributions, as well as the correlation trend and global importance of features. This embodiment further illustrates this.
  • the feature data (201) is collected by sensors and the like and then divided into a training set (202) and a verification set (209). On the training set, optimal feature selection under a fixed number of features uses CV-RFE feature selection (203) to obtain the optimal number of features (204); optimal feature selection under a variable number of features performs feature combination arrangement (205) and determines the optimal combination of features (206).
  • the sorted feature vectors are used in the prediction model of machine learning (301) to obtain the prediction results (302), and then all the feature data and prediction results are brought into the Shapley value analysis model (303). Finally, the global importance of features, the trend of correlation between features and prediction results, and the distribution of prediction contributions of individual feature data are obtained.
  • the contribution analysis of prediction results using the Shapley value method can be divided into two levels. On the global level (306, 305), the distribution of the Shapley value can be used to describe the specific influence, law and correlation of features; at the local level (304), the quantified contribution of each feature in each sample prediction can be given. After using the Shapley value algorithm to get the value contribution of each feature, it can be balanced with the cost of data collection.
  • Fig. 4 shows the data feature auction pricing mechanism based on the multiplicative weight update algorithm disclosed in an embodiment of the present invention.
  • the multiplicative weight update process maintains the weight of each pricing strategy and randomly selects strategies over repeated iterations to achieve the maximum long-term operating income.
  • a decision set contains a number of alternative decisions, each corresponding to a specific income (the income is not known a priori), and multiple rounds of selection are performed on it.
  • in each round, the current weight of each decision is multiplied by an income factor related to the income of the current round to update the weight, and the decision-making party repeatedly makes choices and obtains the corresponding benefits. After multiple rounds, the weight of the strategy with the highest profit will become prominent, and the probability of that strategy being selected will increase significantly.
  • the core idea of the multiplicative weight update algorithm is illustrated as follows. Assume that the auction price trend is random, and that the aim is to predict the state of the auction price (fall or rise) from the opinions of experts; all N experts form a set C. Before the data auction, the suggestion of an expert i in C is randomly selected to predict the trend of the data auction (down or up). If the expert's prediction is wrong, the loss is 1; if the prediction is correct, the loss is 0. Since expert i is randomly selected for prediction, in order to make better decisions the algorithm aims to keep its predictions close to those of the best-performing expert in the long run; that is, in the next round of prediction, an expert who made a correct prediction is more likely to be selected.
  • each round follows the opinion of the weighted majority of experts.
  • let the initial weight of each of the N experts be 1
  • each round of forecasting chooses one of two outcomes (down or up); introduce the parameter η (η ≤ 0.5) as a factor related to income, and in the next round of selection, penalize each expert whose prediction was wrong by multiplying its weight by (1 - η).
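The weighted-majority scheme just described can be written out directly. This is a minimal sketch: the function name `weighted_majority`, the encoding of "up" as 1 and "down" as 0, the tie-breaking rule and the symbol `eta` for the penalty parameter are assumptions of the example rather than details given in the source.

```python
def weighted_majority(expert_predictions, outcomes, eta=0.5):
    """Run the weighted-majority algorithm over a sequence of rounds.

    expert_predictions: one list per round, holding each expert's guess (1 = up, 0 = down).
    outcomes:           the true result of each round (1 or 0).
    eta:                penalty parameter, at most 0.5.
    Returns the final weights and the number of mistakes the algorithm made.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts              # every expert starts with weight 1
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        up = sum(w for w, p in zip(weights, preds) if p == 1)
        down = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if up >= down else 0       # follow the weighted majority (ties go to "up")
        if guess != outcome:
            mistakes += 1
        # Shrink the weight of every expert that predicted wrongly this round.
        weights = [w * (1 - eta) if p != outcome else w
                   for w, p in zip(weights, preds)]
    return weights, mistakes


rounds = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]
weights, mistakes = weighted_majority(rounds, outcomes=[1, 1, 0], eta=0.5)
```

In this small run the first expert is always right, so its weight stays at 1 while the other two are halved every round, and the algorithm stops following the mistaken majority after a single error.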
  • the error of the algorithm has an upper bound. The multiplicative weight update algorithm mainly has the following four steps:
  • Step 1: The data seller sets the current price of the data as p_n; data buyers, indexed by n, purchase the data sequentially; data buyer n quotes b_n for the data to be purchased, and for any n ∈ [N]
  • the buyers' bids b_n all come from a closed and bounded set B whose diameter D is finite (D < ∞), that is, b_n ∈ B.
  • Step 2: The income function of the data buyer is G(p_n, b_n), which depends on the buyer's quotation b_n and the existing price p_n. Different quotations and different current prices lead to different income for buyer n.
  • Step 3: The data seller determines the buyer's payment function RF(p_n, b_n) based on the existing price p_n and the buyer's quotation b_n; it is a Lipschitz function and is used to calculate the buyer's final payment.
  • L is the Lipschitz coefficient
  • b is the buyer's quotation
  • p^(1) and p^(2) are two prices.
  • Step 4: The data buyer pays the fee R_n, takes away the data prediction result, and completes a single transaction; the data seller updates the data price to p_{n+1}, returns to the first step, and starts the next round of pricing.
  • B_max ∈ R is the buyer's maximum offer in the set B
  • B_net(ε) ⊂ R is the minimal ε-net of B, which means that for all x ∈ B, there is an x_0 ∈ B_net(ε) such that the distance between x and x_0 is at most ε
  • the elements of B_net(ε) are the different prices tested in the multiplicative weight update algorithm, and N refers to the number of different prices.
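The four steps and the price grid above can be tied together in a short simulation. Everything specific in this sketch is an assumption made for illustration, since the source formulas are not reproduced in this extract: the gain is taken as min(b, p) / p, the payment `min(bid, price)**2 / (2 * price)` is what Myerson's payment identity yields for that assumed gain, the update factor is `1 + delta * payment / b_max`, and `b_max` is taken as the largest candidate price. The names `payment`, `run_pricing`, `price_grid` and `delta` are hypothetical.

```python
import random


def payment(price, bid):
    """Assumed payment rule: rises with the bid up to the posted price, then stays flat."""
    return min(bid, price) ** 2 / (2 * price)


def run_pricing(bids, price_grid, delta=0.1, seed=0):
    """Multiplicative weight update over a grid of candidate prices.

    bids:       sequence of buyer quotations, one buyer per round.
    price_grid: candidate prices (all positive), playing the role of the net of B.
    Returns the final normalized weights and the total revenue collected.
    """
    rng = random.Random(seed)
    b_max = max(price_grid)
    weights = [1.0] * len(price_grid)
    revenue = 0.0
    for bid in bids:
        # Step 1: the seller posts a price drawn in proportion to the weights.
        price = rng.choices(price_grid, weights=weights, k=1)[0]
        # Steps 2-3: the buyer quotes, and the payment function gives the fee.
        revenue += payment(price, bid)
        # Step 4: every candidate price is rewarded by the fee it would have earned.
        weights = [w * (1 + delta * payment(c, bid) / b_max)
                   for w, c in zip(weights, price_grid)]
        total = sum(weights)
        weights = [w / total for w in weights]   # normalize to avoid overflow
    return weights, revenue


grid = [2.0, 4.0, 6.0, 8.0, 10.0]
weights, revenue = run_pricing([6.0] * 300, grid)
best_price = grid[weights.index(max(weights))]
```

Under the assumed payment rule, a candidate price below the bid earns half of that price and a candidate above the bid earns less the higher it is, so with constant bids the weight concentrates on the grid price closest to the bid. The update uses the bid to score all candidate prices at once, which is possible here because the seller observes the quotation in every round.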
  • the present invention also describes a data dynamic pricing control panel disclosed in the embodiment (see FIG. 5 ).
  • the third-party data trading platform enters an introduction (501) to the auction data, such as the industry information and attributes of the data, for data buyers to view.
  • Multiple data buyers enter the auction market anonymously and choose whether to take part in the auction (502) based on the data-related information. If they choose to take part, they place a buyer's bid (503); if not, they wait for the next round of data auction. If only one data buyer bids, the data goes to that buyer; if multiple buyers bid, the auction is conducted according to the principle of "the highest bidder wins".
  • the third-party data trading platform summarizes the transaction records (505) obtained in the above auction steps, such as transaction volume, year-on-year transaction amount and the proportion of buyers by industry, displays them in various visual forms such as charts, and supplements them with related analysis to support decision-making.
  • This embodiment provides an information interaction process of a Shapley value data feature combination pricing method, as shown in Figure 6, which is described from the perspective of the data buyer, the third-party data transaction platform server, and the seller's control terminal panel, including the following steps:
  • the third-party data trading platform acquires, by on-site transmission, various feature data from the data seller's production and operation (601), and then performs CV-RFE feature selection and sorting (603) on the acquired feature data to obtain a combined and sorted set of feature data;
  • the third-party data transaction platform conducts dynamic transaction pricing on the auction (605).
  • the data buyer determines whether to purchase the data according to the value of the auction data and whether it can increase the revenue of the enterprise.
  • the third-party data trading platform collects auction prices, and judges abnormal auction prices such as being too low or too high (606);
  • the payment distribution function of A is R_n(A); the output R_n(A) of the sticky Shapley value algorithm (Table 3) has ε-replication-robust gains.
  • the datasets on the market have no replica, 1 replica and 2 replicas.
  • when there is no replica, the overall income distribution is S_n; when there is 1 replica in the market (702), there are a total of 2 data sets in the market and the penalty function e is introduced, so the income distributed to each data set is (1/2)·S·e; when there are 2 replicas in the market (703), there are 3 data sets in the market and the penalty function becomes e^2, so the income distributed to each data set is (1/3)·S·e^2, and so on.
  • each time the price S is determined, if the same data is sold to multiple users, the data is priced according to the data copy price. If the data is copied into i samples, the selling price S_n of each sample is:
  • S is the selling price when there is only one piece of data, which is different from the quotation b_n and the price p_n of the data set by the seller.
  • S is the actual selling price when there is only one piece of data; e is the penalty factor.
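The replica pricing pattern stated above (S with no replica, S·e/2 with one replica, S·e^2/3 with two) can be captured in a small helper. The closed form used here, S·e^r/(r + 1) for r replicas, is an extrapolation of those three stated cases and is an assumption of this sketch, as are the function name `price_per_copy` and the example values.

```python
def price_per_copy(base_price, penalty, replicas):
    """Income distributed to each data set when `replicas` copies exist besides the original.

    base_price: selling price S when there is only one piece of data.
    penalty:    penalty factor e (a value below 1 lowers the total as copies multiply).
    replicas:   number of additional copies on the market (0 means no replica).
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    copies = replicas + 1                      # original plus its replicas
    return base_price * penalty ** replicas / copies


# With S = 12 and e = 0.5: 12.0 for the original alone,
# 3.0 each with one replica, 1.0 each with two replicas.
prices = [price_per_copy(12.0, 0.5, r) for r in range(3)]
```

With a penalty factor below 1, the combined income across all copies (copies times the per-copy price) shrinks as replicas are added, so duplicating a data set does not increase what its seller collects.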
  • the data testing device may be electronic equipment.
  • the data testing device may include a processor 801, a memory 802, a communication interface 803 and a control panel 804. The processor 801 transmits effective information, and the memory 802 is responsible for storing data such as feature data. After feature selection and sorting of the enterprise's production and operation data is performed (803), the data analyzed by the Shapley value is used for auction and the result is transmitted to the control panel terminal. The communication interface 803 refers to the interface between the central processing unit and the standard communication subsystem, and the control panel 804 performs screen display.
  • FIG. 9 is a schematic structural diagram of another Shapley value data feature combination dynamic pricing method device disclosed in an embodiment of the present invention.
  • the pricing device may be an electronic device.
  • The three algorithms of the present invention are carried out according to the following steps:
  • the device passes the data obtained by cascading sensors and files through the acquisition unit 901 and sends it to the calculation unit for analysis.
  • after the calculation unit 902 receives the signal, it predicts and sorts the feature data through the control unit 903 and so on; the storage unit 904 then stores the predicted and sorted feature data, and the storage unit transmits the result to the output unit 905 after the work is completed.
  • (3) Dynamic pricing of data based on the multiplicative weight update algorithm: the existing price set by the seller, the number of buyers and their quotations are input to the acquisition unit 901, and the buyer's payment fee is calculated by the calculation unit 902 and the control unit 903 according to the buyer's income function and payment function,
  • the storage unit 904 stores the relevant data, and outputs the seller's updated data price for the next round.

Abstract

The present invention relates to machine learning, and in particular to a data feature combination pricing method and system based on a Shapley value and an electronic device. The method comprises: collecting feature variables of a feature data set provided by a seller, and preprocessing the feature variables; constructing a learning model based on machine learning, and selecting an optimal feature classification variable from feature classification variables; estimating feature Shapley values constructed on the basis of ghost data instances so as to calculate marginal contribution and an average Shapley value of selected feature variables; and according to the marginal contribution and the average Shapley value of the feature variables, determining whether the feature variables can be subjected to transaction, and if yes, performing the transaction. In the embodiments of the present invention, long-term benefit maximization of a data provider can be realized, risk assessment of the data seller on the data buyer company is met, and the risk loss is reduced.

Description

Data feature combination pricing method, system and electronic device based on Shapley value
Technical field
The present invention relates to machine learning, in particular to a data feature combination pricing method, system and electronic device based on Shapley values.
Background art
Advances in data analysis brought about by machine learning and data mining have made the big data being generated immensely valuable, so data has become a new type of asset. Enterprises generate massive amounts of data in the course of their operations, and this collected data can also be traded to increase the enterprise's income and maximize its returns. Unlike traditional commodities, data is high-volume, diverse, high-velocity and replicable. Data is also highly dependent on timeliness: data that is no longer timely has a significant effect on its price. In addition, the value of data is uncertain, diverse and sparse. Pricing data therefore remains a relatively new and difficult problem.
For example, a bank uses financial technology to analyze various data, applying machine learning to purchased feature data to make predictions, which provides an important tool for addressing information asymmetry. When lending to an enterprise, the bank uses not only the data about the enterprise held within the banking system, but also any valuable external data it can obtain about the enterprise's operating capability. It obtains data about the enterprise through purchase and other means and uses machine learning to analyze the enterprise's operating capability in order to reduce loan risk. Capturing the trajectory of the enterprise's production and operation provides financial institutions with reliable "credit data", which both raises the likelihood of a successful loan and lowers transaction costs and the threshold for credit services.
This data transaction process is carried out through a third-party data trading platform, which can guarantee the privacy and security of the buyer's data to a certain extent and can also ensure, through dynamic market pricing, that the price paid by the data buyer is reasonable. The third-party trading platform needs to set a market price for the purchased data and provide the two parties to the transaction with the data and the payment they respectively require. To protect the interests of the data seller and the third-party data trading platform, an enterprise that successfully purchases data must sign a confidentiality agreement with the platform: the data is limited to the enterprise's own business use and may not be disseminated or resold.
The third-party data trading platform builds a data feature selection model and a feature value distribution algorithm that approximates the Shapley value. From the results, it can judge which feature variables affect the outcome most and which affect it less. The buyer focuses on the set of more influential features and, to a certain extent, uses the machine learning results to control risk and reduce losses. For a bank, purchasing this data yields specific information about the corresponding industry, which supports loan evaluation and analysis for that industry and can also reduce loan risk. At the same time, the data seller earns revenue.
The third-party trading platform provides a dynamic data pricing method and system. To handle massive and redundant data features, it uses a feature selection algorithm based on increasing prediction accuracy. Using a random forest prediction algorithm, recursive feature elimination, cross-validation and feature combination are combined to select data features effectively, and information mining analysis is then carried out on the selected features. Because different data features contribute differently to the prediction, the present invention proposes a data feature contribution distribution method based on the Shapley value, which can calculate the effect of each feature (its marginal contribution to prediction accuracy). Finally, the monitored feature data being traded is priced dynamically by means of an auction and the multiplicative weight update algorithm. Based on the payment function of Myerson's optimal auction, the improved multiplicative weight update algorithm prices data features dynamically, which helps to fully realize the value of data and brings additional income to enterprises.
Summary of the invention
To enable a third-party data trading platform to make full use of the features obtained from inspecting an enterprise's products to auction data, so that the buyer can extract key information from the purchased data and also obtain information about the industry of the data seller, the present invention proposes a Shapley-value-based data feature combination pricing method, system, and electronic device. The method includes:
collecting the feature variables of the feature data set provided by the seller and preprocessing them;
building a learning model based on machine learning and selecting the optimal feature classification variables from the feature classification variables;
when selecting the optimal variables, estimating feature Shapley values using constructed ghost data instances, and thereby computing the marginal contribution and average Shapley value of each selected feature variable;
also when selecting the optimal variables, using the Shapley value to allocate value to each feature according to its marginal contribution, quantifying the effect of the different input features on the predictions output by the trained model, and retaining the features whose marginal contribution meets the set requirement;
checking whether the data can be used for machine learning and trading; if it can, the data buyer and seller set up a transaction, and the prediction for the current data obtained from the learning model is used as the price paid for the data.
Further, selecting the optimal feature classification variables from the feature classification variables includes the following steps:
training the machine-learning-based learning model on the data of all feature variables;
ranking the feature variables by importance and selecting the k features with the largest importance values;
evaluating the model on the validation set, then recomputing and re-ranking the importance of each feature variable;
splitting the training set into a new training set and a new validation set, training the model on the new training set with all feature variables, evaluating the model on the validation set, and computing and ranking the importance of all feature variables.
Further, estimating the Shapley value of a feature variable using constructed ghost data instances includes randomly drawing an instance from the feature variables, constructing one instance that contains a given feature and one instance that does not contain it, and using these two instances as the ghost data instances.
Further, the marginal contribution of a feature variable is expressed as:
$$\phi_j^m = \hat{f}\left(x_{+j}^m\right) - \hat{f}\left(x_{-j}^m\right)$$
where $\phi_j^m$ is the marginal contribution of the $j$-th feature of instance $x$ in the $m$-th iteration; $\hat{f}(x_{+j}^m)$ is the prediction for instance $x$ in the $m$-th iteration made with the instance that includes feature $j$; $x_{+j}^m$ is the feature vector obtained in the $m$-th iteration by randomly replacing the features of instance $x$ that come after the $j$-th feature with features of instance $z$; $\hat{f}(x_{-j}^m)$ is the prediction for instance $x$ in the $m$-th iteration made with the instance that excludes feature $j$; and $x_{-j}^m$ is the feature vector obtained in the $m$-th iteration by randomly replacing the $j$-th feature of instance $x$, together with the features after it, with features of instance $z$.
Further, pricing the feature variables includes the following steps:
S41. Before trading with a data buyer, the data seller first sets the price $p_n$ of the data to be traded, the number of buyers, and the buyers' bids, and computes the data buyer's gain function;
S42. The data buyer's final payment is computed from the buyer's gain function; the data buyer pays the fee, and the selected feature variables are traded;
S43. The seller updates the data price using the multiplicative weights update algorithm and returns to S41 to begin the next round of pricing.
Further, the fee $R_n$ paid by the data buyer is expressed as:
where $G(b_n, p_n)$ is the buyer's gain function when the seller sets the price of the traded data to $p_n$ and the buyer bids $b_n$.
Further, this gain function is determined by the price the seller sets for the traded data and by the buyer's bid. With the seller's price fixed, when the bid $b_n$ is below the price $p_n$, the buyer's gain increases as $b_n$ increases, and it reaches its maximum when $b_n$ equals $p_n$. When the bid $b_n$ exceeds the price $p_n$, the buyer's utility stays at its maximum and the fee paid by the buyer also stays at its maximum.
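The expression for $R_n$ is not present in the text above. As a hedged illustration only, the sketch below assumes the Myerson-style payment $R_n = b_n\,G(b_n, p_n) - \int_0^{b_n} G(z, p_n)\,dz$, which matches the reference to Myerson's optimal auction in the background section, and a hypothetical gain function that rises linearly with the bid and saturates at the posted price. Both `gain` and `myerson_payment` are illustrative names, not functions defined by the patent.

```python
def gain(bid, price):
    """Hypothetical buyer gain G(b_n, p_n): grows linearly with the bid and
    saturates at 1.0 once the bid reaches the posted price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return min(max(bid, 0.0), price) / price


def myerson_payment(bid, price, steps=10_000):
    """Assumed payment rule: b * G(b, p) minus the integral of G(z, p) for z
    from 0 to b, with the integral approximated by the midpoint rule."""
    if bid <= 0:
        return 0.0
    width = bid / steps
    integral = sum(gain((k + 0.5) * width, price) for k in range(steps)) * width
    return bid * gain(bid, price) - integral


if __name__ == "__main__":
    # With the price fixed at 1.0, the payment grows with the bid and then
    # stays flat once the bid exceeds the price.
    for b in (0.5, 1.0, 2.0, 4.0):
        print(b, round(gain(b, 1.0), 4), round(myerson_payment(b, 1.0), 4))
```

Under these assumptions the payment stops growing once the bid passes the posted price, which is the behaviour the paragraph above describes for both the buyer's utility and the fee.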
Further, after each price is determined, when the same data is sold to multiple users, the data is priced according to a replication price. If the data is replicated into $i$ copies, the selling price $S_n$ of each copy is:
Figure PCTCN2022126712-appb-000007
where $S$ is the selling price when there is only one copy of the data, and $e$ is the penalty factor.
The present invention proposes a Shapley-value-based data feature combination pricing system comprising a feature selection subsystem and a pricing subsystem. The feature selection subsystem screens the features, and the pricing subsystem prices and auctions the features that pass the screening.
The feature selection subsystem includes a machine learning model and a Shapley analysis model. The machine learning model is trained on the data and makes predictions; the predicted values are used as feature importances for ranking, and the K most important features are passed to the Shapley analysis model for analysis. The Shapley analysis model computes the marginal contribution and the average Shapley value of each feature variable.
In the pricing subsystem, the data buyer trades according to the price set by the data seller.
The present invention also proposes a Shapley-value-based data feature combination pricing electronic device comprising a processor and a memory. The memory stores any of the foregoing Shapley-value-based data feature combination pricing methods according to claim 1, and the processor is able to run the method stored in the memory.
The present invention has the following advantages:
1. For feature selection in the prediction process, it combines cross-validated recursive feature elimination with feature permutation and combination, takes prediction accuracy into account, and designs a feature selection method that unites the two.
2. It can adapt feature selection to different prediction costs and select the most valuable input features for the prediction model.
3. For the problem of allocating the prediction contribution of data features, it applies the approximate Shapley value method to explain data features both globally and locally.
4. The data transaction model and real-time dynamic pricing algorithm it designs can maximize an enterprise's long-term profit. The auctioned feature data also gives data buyers such as banks or insurance companies decision support for loan evaluation, reducing losses from loans and claims.
5. The third-party trading platform visualizes the auction data it obtains through a transaction control panel, so that key information can be extracted quickly.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall architecture for Shapley-value-based dynamic pricing of data feature combinations, as disclosed in an embodiment of the present invention;
Fig. 2 is a schematic diagram of Shapley-value-based data feature selection and ranking, as disclosed in an embodiment of the present invention;
Fig. 3 is a schematic diagram of machine-learning-based feature Shapley values, as disclosed in an embodiment of the present invention;
Fig. 4 is a schematic diagram of auction pricing based on data features, as disclosed in an embodiment of the present invention;
Fig. 5 is a schematic diagram of a dynamic data pricing control panel, as disclosed in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the information exchange in a Shapley-value-based dynamic pricing method for data feature combinations, as disclosed in an embodiment of the present invention;
Fig. 7 is a schematic diagram of introducing a penalty function into the Shapley value to account for data replicability, as disclosed in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an apparatus for the Shapley-value-based dynamic pricing method for data feature combinations, as disclosed in an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of another apparatus for the Shapley-value-based dynamic pricing method for data feature combinations, as disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The present invention proposes a Shapley-value-based data feature combination pricing method, which includes the following steps:
collecting the feature variables of the feature data set provided by the seller and preprocessing them;
building a learning model based on machine learning and selecting the optimal feature classification variables from the feature classification variables;
estimating feature Shapley values using constructed ghost data instances, and thereby computing the marginal contribution and average Shapley value of each selected feature variable;
a quality inspector determines whether the data can be traded and, if it can, the transaction proceeds.
Mature methods for data quality inspection already exist in the field. The embodiments of the present invention use one sampling-assisted, real-time inspection method for data quality as an illustration. It includes the following:
After the platform's annotators finish the data labeling task, the quality inspector first removes missing data: columns (attributes) with a missing rate above 10% are deleted first, and then rows (tuples) that contain missing data are deleted. The labeled data then goes through multiple rounds of manual inspection;
In the first round of manual inspection, the quality inspector performs a sampling inspection of the labeled data, drawing 50% of the labeled data by random or stratified sampling and inspecting it. If all of the labeled data passes in the first round, 25% of the labeled data is inspected in the second round;
If more than 50% of the labeled data fails in the first round, the quality inspector must inspect all of the annotator's labeled data in the second round;
If more than 10% and less than 50% of the labeled data fails in the first round, the amount of labeled data inspected in the second round of sampling is double that of the first round;
If less than 10% of the labeled data fails in the first round, the amount of labeled data inspected in the second round of sampling is 30% larger than in the first round;
If less than 1% of the labeled data fails in the first round, the data can be traded;
The inspection process above is repeated until the conditions for data trading are met.
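The escalation rules above can be sketched as a small decision function. This is an illustrative reading, not the patent's implementation: the function name, the action labels, and the handling of boundary cases are assumptions. In particular, it assumes that a defect rate of exactly 10% or exactly 50% falls in the "double the sample" band, and that the 25% confirmation round applies only when the first round finds no defects at all.

```python
def plan_next_round(defect_rate, sampled_fraction=0.5, first_round=True):
    """Return (action, fraction of labeled data to inspect next round).

    defect_rate      -- share of inspected labels that failed (0.0 to 1.0)
    sampled_fraction -- share of the labeled data inspected in this round
    first_round      -- True if this was the first inspection round
    """
    if not 0.0 <= defect_rate <= 1.0:
        raise ValueError("defect_rate must be between 0 and 1")
    if defect_rate > 0.50:
        # More than half failed: inspect everything next round.
        return ("full_inspection", 1.0)
    if defect_rate >= 0.10:
        # Between 10% and 50% failed: double the inspected amount.
        return ("resample", min(1.0, sampled_fraction * 2.0))
    if defect_rate >= 0.01:
        # Below 10% but not yet below 1%: inspect 30% more.
        return ("resample", min(1.0, sampled_fraction * 1.3))
    if first_round and defect_rate == 0.0:
        # Everything passed in the first round: confirm on a 25% sample.
        return ("confirm", 0.25)
    # Fewer than 1% failed: the data may be traded.
    return ("release", 0.0)


if __name__ == "__main__":
    for rate in (0.6, 0.2, 0.05, 0.0, 0.005):
        print(rate, plan_next_round(rate))
```

Calling the function once per round, and feeding its returned fraction into the next call, reproduces the "repeat until the trading condition is met" loop described above.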
In this embodiment, a learning model based on machine learning is built to perform the prediction task for a single instance. The prediction is the "payout", and the "gain" is the actual prediction for the instance minus the average prediction over all samples. The Shapley value of a feature is that feature's average marginal contribution over all feature orderings, which divides the contribution of each feature to the prediction fairly.
The feature Shapley value estimation constructed in the present invention addresses the allocation of feature value (contribution) based on Shapley values. Specifically, the difference between the prediction for a particular instance and the average prediction over the data set is taken as the Shapley value (gain) of that instance's features. Two random instances are used to simulate the presence and absence of a feature, the marginal value of the feature in the particular instance is computed, and the mean of the absolute values of its Shapley values is taken as the feature's global value in the data set. The data features in the experimental data set work together in the machine learning algorithm to produce a prediction. The Shapley value allocates value to each feature according to its marginal value (contribution) and quantifies the effect of the different input features on the predictions output by the trained model. The allocation of feature value balances prediction accuracy against prediction cost and determines whether a given feature is kept or dropped.
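The global value described above, the mean of the absolute Shapley values of a feature across the data set, can be sketched as follows. The function name, the input layout (one row of per-feature Shapley values per instance), and the example feature names are assumptions made for illustration.

```python
def global_importance(shapley_rows, feature_names):
    """Rank features by mean absolute Shapley value across instances.

    shapley_rows  -- list of rows; row[j] is the Shapley value of feature j
                     for one instance
    feature_names -- names of the features, in column order
    """
    if not shapley_rows:
        raise ValueError("need at least one row of Shapley values")
    n = len(shapley_rows)
    means = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shapley_rows) / n
        means.append((name, mean_abs))
    # Largest global value first.
    return sorted(means, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-instance Shapley values for two features.
    rows = [[0.2, -0.1], [-0.4, 0.1]]
    for name, value in global_importance(rows, ["income", "age"]):
        print(name, round(value, 3))
```

Taking absolute values before averaging keeps a feature that pushes predictions up for some instances and down for others from appearing unimportant.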
Embodiment 1
This embodiment proposes a Shapley-value-based dynamic pricing method for data feature combinations. The overall architecture is shown in Fig. 1. The method includes the following steps:
101. The third-party data trading platform associates sensor data with the relevant historical records to obtain a feature data set. The sensor data comes from data sets that the data seller collects with sensors at each stage of production and operation.
102. Perform feature selection and ranking on the collected feature data. First, apply cross-validated recursive feature elimination (CV-RFE): assign a weight to each of the $S$ features used for training, then train a random forest prediction model on these original data features. This step yields the weight of each input feature. Next, take the absolute value of each weight and remove the features with the smallest absolute weights, which eliminates several features in each pass. Finally, run the next round of training on the new feature set. Treating the first two steps as one round, repeat the rounds recursively until the number of remaining features reaches the required number; features are thus selected by recursively shrinking the feature set under consideration. Cross-validation is then used to determine the $k$ features along which prediction accuracy increases. All combinations of these $k$ features are enumerated, which gives $2^k - 1$ feature subsets (excluding the empty set). For each subset, a training set and a validation set are set up, the model is trained, and the subset's accuracy is computed; the accuracy is averaged over multiple rounds of experiments. Finally, each feature subset is output together with its accuracy.
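The subset enumeration at the end of step 102 can be sketched with the standard library. The helper names are illustrative, and the `score` callback is a stand-in for the step the patent describes (training the random forest on a subset and averaging its validation accuracy over several runs), which is not reproduced here.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty subset of `features` as a tuple.

    For k features this yields 2**k - 1 subsets.
    """
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def best_subset(features, score):
    """Return (subset, score) for the highest-scoring non-empty subset.

    `score` is a caller-supplied function mapping a tuple of feature names
    to an accuracy-like number; here it is only a placeholder for model
    training and validation.
    """
    scored = [(subset, score(subset)) for subset in nonempty_subsets(features)]
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    names = ["f1", "f2", "f3"]
    print(len(list(nonempty_subsets(names))))  # 2**3 - 1 subsets
    # Toy scoring rule used only for demonstration.
    print(best_subset(names, lambda s: len(s)))
```

Because the number of subsets doubles with every added feature, this exhaustive search is only practical after CV-RFE has reduced the feature set to a small $k$.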
103. Estimate feature Shapley values using constructed ghost data instances. The Shapley value is a feature's contribution to the prediction; the value function is the payoff function of a coalition of players (features). Computing the exact Shapley value of the $i$-th feature requires evaluating the predictions of all coalitions of feature values with and without feature $i$. The number of coalitions grows exponentially with the number of features, and an approximate Shapley value computed by Monte Carlo sampling over $M$ iterations addresses this problem:
$$\hat{\phi}_j = \frac{1}{M}\sum_{m=1}^{M}\left(\hat{f}\left(x_{+j}^m\right) - \hat{f}\left(x_{-j}^m\right)\right)$$
where $\hat{f}(x_{+j}^m)$ is the prediction for instance $x$ in the $m$-th iteration made with the instance that includes feature $j$, with the feature values after feature $j$ replaced by the feature values of a randomly sampled instance $z$. The vector $x_{-j}^m$ is nearly identical to $x_{+j}^m$, except that the value of feature $j$ in $x_{-j}^m$ is also taken from the randomly sampled instance $z$; both are newly assembled samples. Given the nature of data features, using the Shapley value to allocate the value of data features according to their prediction contribution gives a fair result. The specific computation steps are shown in Table 1.
Table 1. Revenue allocation algorithm based on approximate Shapley values
Figure PCTCN2022126712-appb-000013
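Table 1 survives only as an image reference, so its exact steps cannot be reproduced. The sketch below is one common Monte Carlo estimator that matches the description in step 103: in each iteration it draws a random instance $z$ and a random feature ordering, builds two ghost instances that differ only in feature $j$, and averages the difference in predictions. The function name, its parameters, and the toy linear model are assumptions for illustration.

```python
import random


def approx_shapley(predict, x, data, j, iterations=1000, rng=None):
    """Monte Carlo estimate of the Shapley value of feature j for instance x.

    predict    -- function mapping a feature list to a numeric prediction
    x          -- the instance being explained (list of feature values)
    data       -- background data set from which ghost values are sampled
    j          -- index of the feature of interest
    iterations -- number of Monte Carlo iterations
    """
    rng = rng or random.Random(0)
    n_features = len(x)
    total = 0.0
    for _ in range(iterations):
        z = rng.choice(data)               # random background instance
        order = list(range(n_features))
        rng.shuffle(order)                 # random feature ordering
        position = order.index(j)
        # Both ghost instances start from z; features ordered before j
        # take their values from x.
        x_plus = list(z)
        x_minus = list(z)
        for idx in order[:position]:
            x_plus[idx] = x[idx]
            x_minus[idx] = x[idx]
        x_plus[j] = x[j]                   # with feature j taken from x
        # x_minus keeps z's value for feature j (feature j "absent").
        total += predict(x_plus) - predict(x_minus)
    return total / iterations


if __name__ == "__main__":
    # Toy linear model used only for demonstration.
    model = lambda v: 2.0 * v[0] + 3.0 * v[1] - 1.0 * v[2]
    background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
    print(approx_shapley(model, [1.0, 2.0, 3.0], background, j=1))
```

For a linear model each iteration contributes the coefficient of feature $j$ times the difference between the instance's value and the sampled background value, which gives a simple way to sanity-check the estimator.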
104. Design the data feature auction mechanism, and check whether the data is suitable for machine learning and trading based on factors such as data quality. If the data can be traded, carry out the mechanism design; otherwise, do not.
105. Provide a dynamic data pricing system and a data transaction control panel based on the multiplicative weights update algorithm. Drawing on the multiplicative weights idea and the characteristics of data transactions, a pricing algorithm that updates weights multiplicatively is designed to maximize the platform's long-term revenue, so that the average regret between the revenue produced by the chosen prices and the revenue of the best price in hindsight is zero. This helps both buyers and sellers maximize their own utility, builds a healthy trading relationship, and realizes the value of the data. The data transaction control panel summarizes the auction prices and other information obtained and presents them in graphical and other visual forms.
Embodiment 2
An important aspect of the present invention is that machine-learning-based feature selection and the feature Shapley value estimation algorithm yield the allocation of prediction contribution for each individual feature, together with correlation trends and the global importance of features. This embodiment describes this further.
During model selection (see Fig. 2), feature data collected by sensors and other means (201) is divided into a training set (202) and a validation set (209). On the training set, optimal feature selection proceeds in two branches. With a fixed number of features, CV-RFE feature selection (203) yields the optimal number of features (204). With a variable number of features, the features are combined and arranged (205) to determine the optimal combination of features (206). The optimal number of features and the optimal combination obtained from the training set are used for machine learning (207), which yields a prediction model (208). Finally, the model is validated (210) on the validation set (209), giving the optimal number of features and the optimal feature combination.
In the inference stage (see Fig. 3), the ranked feature vectors are fed to the machine learning prediction model (301) to obtain predictions (302), and the feature data and the predictions are then passed to the Shapley value analysis model (303). The result is the global importance of the features, the trend of the correlation between features and predictions, and the allocation of prediction contribution for individual feature data. Based on the model that has been built, the contribution analysis of the predictions with the Shapley value method works at two levels. At the global level (306, 305), the distribution of Shapley values describes the specific effects, patterns, and correlations of the features. At the local level (304), it gives the quantified contribution of each feature to each sample's prediction. Once the Shapley value algorithm has produced the value contribution of each feature, that value can be weighed against the cost of collecting the data.
Embodiment 3
Fig. 4 shows a pricing mechanism for data feature auction transactions based on the multiplicative weights update algorithm, as disclosed in an embodiment of the present invention. The multiplicative weights update process maintains a weight for each pricing strategy and selects strategies at random over repeated iterations in order to maximize revenue over the long run. Suppose a decision set contains $\alpha$ candidate decisions, each with a specific gain $\beta$ that is not known in advance. Selection is repeated over many rounds. In each round, the current weight of every decision is multiplied by a gain factor related to that round's gain, and the weights are updated. The decision maker keeps choosing and collecting the corresponding gains. After many rounds, the weight of the strategy with the highest gain stands out, and the probability that this strategy is selected increases markedly.
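A minimal sketch of the weight update just described follows. It assumes gains scaled to the range [0, 1] and a gain factor of the form $1 + \eta \cdot \text{gain}$; the patent does not state the exact factor, so this form, the function names, and the example gains are illustrative assumptions.

```python
def mw_update(weights, gains, eta=0.1):
    """Multiply each strategy's weight by (1 + eta * gain).

    `gains` are assumed to lie in [0, 1]; `eta` is the learning rate.
    """
    return [w * (1.0 + eta * g) for w, g in zip(weights, gains)]


def selection_probabilities(weights):
    """Normalize weights into the probabilities used to pick a strategy."""
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three hypothetical pricing strategies with fixed per-round gains.
    gains = [0.2, 0.9, 0.5]
    weights = [1.0, 1.0, 1.0]
    for _ in range(200):
        weights = mw_update(weights, gains)
    print([round(p, 4) for p in selection_probabilities(weights)])
```

After enough rounds almost all of the selection probability sits on the strategy with the highest gain, which is the effect the paragraph above describes.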
The core idea of the multiplicative weights update algorithm is illustrated by predicting data auction prices from expert opinions. Assume the auction price moves randomly and that its direction (fall or rise) is to be predicted from the opinions of experts; all N experts form a set C. Before each data auction, the advice of some expert i in C is chosen at random to predict the direction of the auction price. If the expert predicts wrongly, a cost of 1 is recorded; if the expert predicts correctly, the cost is 0. Because expert i is chosen at random, the goal of the algorithm is to keep its long-run performance close to that of the best-performing expert, that is, to make experts who predicted correctly more likely to be chosen in the next round. This is done by maintaining a weight for each expert and, in each round, following the weighted majority opinion. The initial weight of each of the N experts is 1, and each round's prediction is a binary choice (fall or rise). A parameter η (η < 0.5) is introduced as the payoff-related factor: in the next round, the weight of every expert who predicted wrongly is multiplied by (1 − η) as a penalty. After T rounds of selection, the number of mistakes made by the algorithm is bounded above by
Figure PCTCN2022126712-appb-000014
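The expert example can be sketched in a few lines of Python. This is a minimal illustration of the deterministic weighted-majority variant, not the patent's own implementation; the function name `weighted_majority`, the 0/1 encoding of fall/rise, and the toy forecasts are assumptions made for the example.

```python
def weighted_majority(predictions, outcomes, eta=0.25):
    """Weighted-majority forecasting with multiplicative penalties.

    predictions[t][i] is expert i's forecast in round t (1 = rise, 0 = fall).
    outcomes[t] is the realized direction in round t.
    Returns the final weights and the number of mistakes made by the algorithm.
    """
    n_experts = len(predictions[0])
    weights = [1.0] * n_experts          # every expert starts with weight 1
    mistakes = 0
    for forecasts, truth in zip(predictions, outcomes):
        weight_up = sum(w for w, f in zip(weights, forecasts) if f == 1)
        weight_down = sum(weights) - weight_up
        guess = 1 if weight_up >= weight_down else 0   # follow the weighted majority
        if guess != truth:
            mistakes += 1
        # Experts that were wrong this round are down-weighted by (1 - eta).
        weights = [w * (1.0 - eta) if f != truth else w
                   for w, f in zip(weights, forecasts)]
    return weights, mistakes


# Toy data (assumed): expert 0 is always right, expert 1 always wrong,
# expert 2 always says "rise".
outcomes = [1, 0, 1, 1]
predictions = [[y, 1 - y, 1] for y in outcomes]
final_weights, n_mistakes = weighted_majority(predictions, outcomes)
```

In this toy run the always-correct expert keeps weight 1 while the others shrink, which is the behaviour the paragraph describes: good experts come to dominate the vote.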
The multiplicative weights update algorithm has the following four main steps:
Step 1: The data seller sets the current price of the data to p_n. Data buyers, indexed by n ∈ [N], arrive one after another to purchase the data, and buyer n bids b_n for the data. Every bid b_n comes from a closed, bounded set B with diameter D < ∞, that is, b_n ∈ B.
Step 2: The data buyer's gain function is G(p_n, b_n). It depends on the buyer's bid b_n and the current price p_n; different bids and different current prices yield different gains for buyer n.
Step 3: Based on the current price p_n and the buyer's bid b_n, the data seller determines the buyer's payment function RF(p_n, b_n), which is used to compute the buyer's final payment. This function is Lipschitz:
Figure PCTCN2022126712-appb-000015
where L is the Lipschitz constant, b is the buyer's bid, and p^(1) and p^(2) are two prices.
Step 4: The data buyer pays the fee R_n and takes away the data prediction result, which completes a single transaction. The data seller updates the data price to p_{n+1} and returns to Step 1 to start the next round of pricing.
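The four steps can be sketched as a pricing loop over a finite grid of candidate prices. This is only a sketch under stated assumptions: the payment rule `soft_posted_price` is a hypothetical Lipschitz stand-in for RF (the patent gives RF only as a figure), and the names `mwu_pricing`, `eta`, and `delta` are illustrative.

```python
import random


def soft_posted_price(price, bid, delta=0.1):
    """Hypothetical payment rule RF(p, b): the buyer pays the posted price when
    the bid covers it, and the payment decays linearly to 0 as the price
    exceeds the bid by up to `delta` (this keeps RF Lipschitz in the price)."""
    overshoot = max(0.0, price - bid)
    return price * max(0.0, 1.0 - overshoot / delta)


def mwu_pricing(bids, price_grid, payment=soft_posted_price, eta=0.1, seed=0):
    """Multiplicative weights update over candidate prices.

    Each round: post a price sampled in proportion to the weights (Step 1),
    collect the payment for the arriving bid (Steps 2-4), then reward every
    candidate price by the payment it would have earned on that bid.
    """
    rng = random.Random(seed)
    scale = max(price_grid)                      # normalizes gains into [0, 1]
    weights = [1.0] * len(price_grid)
    revenue = 0.0
    for bid in bids:
        posted = rng.choices(price_grid, weights=weights, k=1)[0]
        revenue += payment(posted, bid)
        gains = [payment(p, bid) / scale for p in price_grid]
        weights = [w * (1.0 + eta * g) for w, g in zip(weights, gains)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize to avoid overflow
    return weights, revenue


grid = [0.2, 0.4, 0.6, 0.8, 1.0]
weights, revenue = mwu_pricing([0.6] * 200, grid)
best_price = grid[weights.index(max(weights))]
```

With every buyer bidding 0.6 in this toy run, the weight concentrates on the candidate price 0.6, the highest price the buyers still accept.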
Suppose the Lipschitz and bounded-bid conditions hold. Let p_n, n ∈ [N], be the output of the pricing algorithm, let L be the Lipschitz constant of the payment function RF, and let B_max be the largest element of the bounded bid set B. Then, by choosing the algorithm parameters
Figure PCTCN2022126712-appb-000016
the total average regret is bounded:
Figure PCTCN2022126712-appb-000017
where B_max ∈ R is the maximum buyer bid in the set B, and B_net(ε) ∈ R is the minimal ε-net of B, meaning that
Figure PCTCN2022126712-appb-000018
for every x ∈ B there is an x_0 ∈ K such that |x − x_0| ≤ ε. The elements of B_net(ε) are the different prices tried by the multiplicative weights algorithm, and N is the number of different prices.
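To make the ε-net concrete, the sketch below builds a grid of candidate prices for the special case where B is the interval [0, B_max]. That interval shape is an assumption for illustration (the text only requires B to be closed and bounded), and `epsilon_net` is a hypothetical helper name.

```python
import math


def epsilon_net(b_max, eps):
    """Return candidate prices such that every x in [0, b_max] lies within
    eps of some candidate. Centers are placed at eps, 3*eps, 5*eps, ...,
    so each candidate covers an interval of width 2*eps."""
    n_points = math.ceil(b_max / (2 * eps))
    return [min((2 * k + 1) * eps, b_max) for k in range(n_points)]


net = epsilon_net(1.0, 0.1)
# Largest distance from any test point in [0, 1] to its nearest candidate price.
worst_gap = max(min(abs(i / 100 - c) for c in net) for i in range(101))
```

A smaller ε gives a finer price grid and a smaller discretization error, at the cost of more candidate prices for the weights update to track.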
Table 2. Symbols used in the data feature auction pricing mechanism
Figure PCTCN2022126712-appb-000019
Example 4
This embodiment describes a dynamic data pricing control panel (see Fig. 5). The third-party data trading platform enters an introduction to the data being auctioned (501), such as its industry information and attributes, for data buyers to view. Multiple data buyers enter the auction market anonymously and decide, based on this information, whether to take part in the auction (502). Buyers who take part place bids (503); those who do not wait for the next round of data auction. If only one data buyer bids, the data goes to that buyer; if several buyers bid, the auction follows the "highest bidder wins" principle. In the end one buyer wins the bidding (504) and obtains the data, and the remaining buyers wait for the next round. The third-party data trading platform summarizes the transaction records (505) produced by these auction steps, such as transaction volume, year-on-year transaction amount, and the share of buyers by industry, presents them in visual forms such as charts, and supplements them with related analysis to support decision-making.
Example 5
This embodiment gives the information exchange process of the Shapley-value-based data feature combination pricing method, shown in Fig. 6 and described from the perspectives of the data buyer, the third-party data trading platform server, and the seller's control panel terminal. It includes the following steps:
The third-party data trading platform receives the on-site feature data generated in the data seller's production and operations (601), and applies CV-RFE feature selection and ranking (603) to the acquired feature data to obtain the feature combination and its ranking;
The Shapley value model (604) is applied to these feature data to obtain prediction contributions, the trend between features and prediction results, the global importance of features, and so on;
The third-party data trading platform prices the auction dynamically (605). The data buyer decides whether to purchase the data according to the value of the auctioned data and whether it can increase the enterprise's revenue, joining the auction if so and staying out otherwise. The third-party data trading platform collects the auction prices and identifies abnormal ones, such as prices that are too low or too high (606);
The auction price information is summarized (607) and sent (608) to the control panel terminal, which publishes the results (609).
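The CV-RFE step (603) can be outlined as recursive elimination with importances averaged over cross-validation folds. The sketch below abstracts model training behind a hypothetical callback `importance_fn(active_features, fold)`; the callback, the feature names, and the toy importance values are assumptions, not taken from the patent.

```python
from statistics import mean


def cv_rfe_ranking(features, importance_fn, n_folds=3):
    """Rank features from most to least important.

    In each pass, importances are computed on every fold with the features
    still active, averaged, and the weakest feature is eliminated.
    """
    active = list(features)
    dropped = []                                   # weakest feature first
    while len(active) > 1:
        per_fold = [importance_fn(active, fold) for fold in range(n_folds)]
        avg = {f: mean(scores[f] for scores in per_fold) for f in active}
        weakest = min(active, key=avg.get)
        active.remove(weakest)
        dropped.append(weakest)
    return active + dropped[::-1]


def toy_importance(active, fold):
    """Stand-in for 'train a model on this fold and read its importances'."""
    base = {"temperature": 0.5, "pressure": 0.3, "batch_id": 0.1}
    return {f: base[f] + 0.01 * fold for f in active}


ranking = cv_rfe_ranking(["batch_id", "temperature", "pressure"], toy_importance)
```

In a real pipeline the callback would fit the chosen learning model on the fold's training split and return its feature importances; the elimination loop itself would stay the same.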
Because data can be copied, replication dilutes the revenue of each data source (modeled with a penalty function). The overall revenue allocation S_n (0 < S_n ≤ 1) is fixed, and S_n = G(p_n, b_n) is the data buyer's gain. A penalty function e (0 < e < 1) is introduced to handle data replication, and the problem is solved with a robust Shapley value algorithm (see Fig. 7). Let R_n = PD(S_n, Y_n; M, G), where S_n is the overall revenue allocation, Y_n is the prediction task, M is the machine learning prediction algorithm, G is the prediction gain function, and PD is the robust Shapley value algorithm for revenue (Table 3). Given a similarity measure SM and the i-th replica A′_i of A, the payment allocation function of A is R_n(A); the R_n(A) output by the robust Shapley value algorithm (Table 3) is the ε-replication-robust revenue.
Table 3. Robust Shapley value algorithm
Figure PCTCN2022126712-appb-000020
Figure PCTCN2022126712-appb-000021
Referring to Fig. 7, consider a data set on the market with no replica, one replica, and two replicas. With no replica (701), the overall revenue allocation is S_n. With one replica (702), there are two copies of the data set on the market; with the penalty function e, each copy is allocated (1/2)·S·e. With two replicas (703), there are three copies on the market and the penalty becomes e^2, so each copy is allocated (1/3)·S·e^2, and so on. The more replicas of a data set there are on the market, the less revenue each copy receives. Each time the price S has been determined and the same data is sold to multiple users, the data is priced according to the replication price; if the data is replicated into i samples, the selling price S_n of each sample is:
Figure PCTCN2022126712-appb-000022
where S is the actual selling price when there is only one copy of the data, which differs from the bid b_n and from the price p_n set by the seller, and e is the penalty factor.
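The three cases of Fig. 7 follow one pattern: with k replicas there are k + 1 copies on the market and each receives S·e^k / (k + 1). The sketch below encodes only that pattern as read from the worked cases (701 to 703); the function name and the numeric values are illustrative assumptions, and the formula in the figure itself is not reproduced here.

```python
def per_copy_revenue(base_price, n_replicas, penalty):
    """Revenue allocated to each copy when `n_replicas` replicas exist.

    base_price: price S when only one copy of the data exists.
    penalty:    penalty factor e, with 0 < e < 1.
    """
    if not 0.0 < penalty < 1.0:
        raise ValueError("penalty factor e must lie strictly between 0 and 1")
    if n_replicas < 0:
        raise ValueError("number of replicas cannot be negative")
    n_copies = n_replicas + 1
    return base_price * penalty ** n_replicas / n_copies


# S = 1.0 and e = 0.5 are arbitrary illustrative values.
shares = [per_copy_revenue(1.0, k, 0.5) for k in range(3)]
```

Each additional replica lowers the per-copy share twice over: once through the extra factor of e and once through the larger number of copies.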
Example 6
An embodiment of the present invention discloses a schematic structural diagram of a device for the Shapley-value-based dynamic pricing method for data feature combinations. The data testing device may be an electronic device and may include: a processor 801, which transmits the effective information; a memory 802, which stores the feature data and other material; after feature selection and ranking of the enterprise's production and operations data (803), the data analyzed with the Shapley value is auctioned and the result is sent to the control panel terminal; a communication interface 803, which is the interface between the central processor and the standard communication subsystem; and a control panel 804, which displays the output.
Referring to Fig. 9, an embodiment of the present invention discloses a schematic structural diagram of another device for the Shapley-value-based dynamic pricing method for data feature combinations. The pricing device may be an electronic device. The three algorithms of the present invention proceed as follows:
① Feature selection and ranking: the data obtained from sensors and cascaded records passes through the acquisition unit 901 and is handed to the computing unit for analysis. After the computing unit 902 receives the signal, the control unit 903 carries out prediction, ranking and related processing on the feature data; the storage unit 904 then stores the predicted and ranked feature data and, once finished, passes the result to the output unit 905.
② Feature Shapley value estimation with constructed ghost data instances: the ranked feature combination data obtained above is input to 901 and handed to the computing unit 902 for analysis; the control unit 903 constructs instances containing ghost data and computes the marginal contribution of each instance, their mean, and so on; the result is then stored and output.
③ Dynamic data pricing with the multiplicative weights update algorithm: the current price set by the seller, the number of buyers, and their bids are input to the acquisition unit 901; the computing unit 902 and the control unit 903 compute the fee paid by the buyer from the buyer's gain function and the payment function; the storage unit 904 stores the related data; and the seller's updated data price for the next round is output.
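Algorithm ② can be illustrated with the usual Monte Carlo sampling estimate of a feature's Shapley value: in each iteration a background instance z is drawn, two ghost instances are built that differ only in whether feature j comes from x or from z, and the difference of the two predictions is one marginal contribution. The sketch assumes a random feature ordering per iteration and a caller-supplied `predict` function; these details and all names are assumptions, not the patent's exact procedure.

```python
import random


def shapley_estimate(predict, x, j, background, n_iter=1000, seed=0):
    """Estimate the Shapley value of feature j for instance x by sampling.

    predict:    callable mapping a feature list to a number.
    background: list of instances from which the replacement instance z is drawn.
    """
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_iter):
        z = rng.choice(background)
        order = list(range(n_features))
        rng.shuffle(order)                       # random feature ordering
        position = order.index(j)
        with_j = list(z)                         # ghost instance containing x[j]
        without_j = list(z)                      # ghost instance with z[j] instead
        for idx in order[:position]:             # features before j come from x
            with_j[idx] = x[idx]
            without_j[idx] = x[idx]
        with_j[j] = x[j]
        total += predict(with_j) - predict(without_j)   # one marginal contribution
    return total / n_iter                        # average over iterations


def linear_model(v):
    """Toy model used only for this illustration."""
    return 2.0 * v[0] + 3.0 * v[1] - 1.0 * v[2]


phi_1 = shapley_estimate(linear_model, x=[1.0, 2.0, 3.0], j=1,
                         background=[[0.0, 0.0, 0.0]])
```

For a linear model and a single all-zero background instance, the contribution of feature j reduces to its coefficient times x[j], which gives a simple check on the estimator.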
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

  1. A data feature combination pricing method based on the Shapley value, characterized in that it comprises the following steps:
    collecting the feature variables of the feature data set provided by the seller and preprocessing them;
    constructing a machine-learning-based learning model and selecting the optimal feature classification variables from the feature classification variables;
    when selecting the optimal variables, estimating feature Shapley values based on constructed ghost data instances, thereby calculating the marginal contribution and the average Shapley value of the selected feature variables;
    when selecting the optimal variables, using the Shapley value to allocate the value of each feature according to its marginal contribution, quantifying the influence of the different input features on the output prediction of the trained model, and retaining the features whose marginal contribution meets the set requirement;
    checking whether the data can be used for machine learning and trading; if so, the data buyer and seller set up a transaction, and the predicted value for the current data obtained through the constructed learning model is used as the payment price of the data.
  2. The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that selecting the optimal feature classification variables from the feature classification variables comprises the following steps:
    training the machine-learning-based learning model with the data of all feature classification variables;
    ranking the feature variables by importance and selecting the top k features with the largest importance values;
    evaluating the model on the validation set, recalculating the importance of each feature variable, and ranking the feature variables;
    splitting the training set into a new training set and a new validation set, training the model with the new training set and all feature variables, evaluating the model on the validation set, and calculating and ranking the importance of all feature variables.
  3. The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that estimating the Shapley value of a feature variable based on constructed ghost data instances comprises randomly drawing an instance from the feature variables, constructing one instance that contains a given feature and one instance that does not contain that feature, and using these two instances as the ghost data instances.
  4. The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the marginal contribution of a feature variable is expressed as:
    Figure PCTCN2022126712-appb-100001
    where
    Figure PCTCN2022126712-appb-100002
    is the marginal contribution of the j-th feature of instance x in the m-th iteration;
    Figure PCTCN2022126712-appb-100003
    is the prediction for instance x in the m-th iteration obtained using the instance with feature j;
    Figure PCTCN2022126712-appb-100004
    is the feature vector in the m-th iteration in which the features of instance x after the j-th feature are randomly replaced by features of instance z;
    Figure PCTCN2022126712-appb-100005
    is the prediction for instance x in the m-th iteration obtained using the instance without feature j;
    Figure PCTCN2022126712-appb-100006
    is the feature vector in the m-th iteration in which the j-th feature of instance x and the features after it are randomly replaced by features of instance z.
  5. The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that pricing the feature variables comprises the following steps:
    S41. before trading with the data buyer, the data seller sets the price p_n of the transaction data, the number of buyers, and the buyers' bids, and the data buyer's gain function is calculated;
    S42. the data buyer's final payment is calculated from the buyer's gain function; the data buyer pays the fee and the selected feature variables are traded;
    S43. the seller updates the data price based on the multiplicative weights update algorithm and returns to S41 to start the next round of pricing.
  6. The data feature combination pricing method based on the Shapley value according to claim 1, characterized in that the fee R_n paid by the data buyer is expressed as:
    Figure PCTCN2022126712-appb-100007
    where G(b_n, p_n) is the buyer's gain function when the seller sets the price of the transaction data to p_n and the buyer bids b_n.
  7. The data feature combination pricing method based on the Shapley value according to claim 6, characterized in that the seller's revenue function is determined by the price the seller sets for the transaction data and by the buyer's bid; with the seller's price fixed, when the bid b_n is below the price p_n set by the seller, the buyer's gain increases as the bid b_n increases, reaching its maximum when the bid b_n equals the price p_n; when the bid b_n exceeds the price p_n set by the seller, the buyer's utility stays at its maximum and the fee paid by the buyer also stays at its maximum.
  8. The data feature combination pricing method based on the Shapley value according to claim 5, characterized in that each time the price S has been determined and the same data is sold to multiple users, the data is priced according to the replication price; if the data is replicated into i samples, the selling price S_n of each sample is:
    Figure PCTCN2022126712-appb-100008
    where S is the selling price when there is only one copy of the data and e is the penalty factor.
  9. A data feature combination pricing system based on the Shapley value, characterized in that it comprises a feature selection subsystem and a pricing subsystem, wherein the feature selection subsystem screens the features and the pricing subsystem prices and auctions the screened features;
    the feature selection subsystem comprises a machine learning model and a Shapley analysis model; the machine learning model is trained on the data and makes predictions, ranks the features using the predicted values as feature importance, and sends the K most important features to the Shapley analysis model for analysis; the Shapley analysis model calculates the marginal contribution and the average Shapley value of the feature variables;
    in the pricing subsystem, the data buyer proceeds according to the data seller's pricing.
  10. A data feature combination pricing electronic device based on the Shapley value, comprising a processor and a memory, characterized in that the memory stores the data feature combination pricing method based on the Shapley value according to any one of claims 1 to 8, and the processor is able to run the data feature combination pricing method based on the Shapley value stored in the memory.
PCT/CN2022/126712 2021-11-11 2022-10-21 Data feature combination pricing method and system based on shapley value and electronic device WO2023082969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111332244.6 2021-11-11
CN202111332244.6A CN113919886A (en) 2021-11-11 2021-11-11 Data feature combination pricing method and system based on Shapley value and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023082969A1 true WO2023082969A1 (en) 2023-05-19

Family

ID=79246015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126712 WO2023082969A1 (en) 2021-11-11 2022-10-21 Data feature combination pricing method and system based on shapley value and electronic device

Country Status (2)

Country Link
CN (1) CN113919886A (en)
WO (1) WO2023082969A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919886A (en) * 2021-11-11 2022-01-11 重庆邮电大学 Data feature combination pricing method and system based on Shapley value and electronic equipment
US20230315885A1 (en) * 2022-04-04 2023-10-05 Gursimran Singh Systems, methods, and computer-readable media for secure and private data valuation and transfer
CN114780742B (en) * 2022-04-19 2023-02-24 中国水利水电科学研究院 Construction and use method of flow scheduling knowledge-graph question-answering system of irrigation area
CN115116240A (en) * 2022-06-27 2022-09-27 中国科学院电工研究所 Lantern-free intersection vehicle cooperative control method and system
CN114997549B (en) * 2022-08-08 2022-10-28 阿里巴巴(中国)有限公司 Interpretation method, device and equipment of black box model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7512558B1 (en) * 2000-05-03 2009-03-31 Quantum Leap Research, Inc. Automated method and system for facilitating market transactions
CN111028080A (en) * 2019-12-09 2020-04-17 北京理工大学 Multi-arm slot machine and Shapley value-based crowd sensing data dynamic transaction method
CN111325353A (en) * 2020-02-28 2020-06-23 深圳前海微众银行股份有限公司 Method, device, equipment and storage medium for calculating contribution of training data set
CN113159835A (en) * 2021-04-07 2021-07-23 远光软件股份有限公司 Power generation side electricity price quotation method and device based on artificial intelligence, storage medium and electronic equipment
CN113435938A (en) * 2021-07-06 2021-09-24 牡丹江大学 Distributed characteristic data selection method in electric power spot market
CN113919886A (en) * 2021-11-11 2022-01-11 重庆邮电大学 Data feature combination pricing method and system based on Shapley value and electronic equipment

Also Published As

Publication number Publication date
CN113919886A (en) 2022-01-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891766

Country of ref document: EP

Kind code of ref document: A1