CN116821817A - Data prediction method and device based on joint tree model and computer equipment - Google Patents


Info

Publication number: CN116821817A
Application number: CN202310595002.9A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 陈奎, 那崇宁, 杨耀, 卢冰洁
Original and current assignee: Zhejiang Lab
Application filed by Zhejiang Lab

Classifications

    • G06F18/24323: Pattern recognition; classification techniques relating to the number of classes; tree-organised classifiers
    • G06F18/2148: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G06F18/27: Pattern recognition; regression, e.g. linear or logistic regression
    • G06N20/20: Machine learning; ensemble learning
    • G06N5/01: Computing arrangements using knowledge-based models; dynamic search techniques; heuristics; dynamic trees; branch-and-bound


Abstract

The application relates to a data prediction method, apparatus, computer device, storage medium and computer program product based on a joint tree model. The method comprises the following steps: first, a local data set and an initial third-party data set are acquired; a local training tree model is then determined based on the local data set, and a local joint tree model is determined based on a joint data set, where the joint data set comprises the local data set and the initial third-party data set; information entropy values of tree nodes of the local training tree model and the local joint tree model are determined based on the local data set; the gain degree with which the initial third-party data set participates in training is then determined based on the information entropy values, a target third-party data set is determined based on the gain degree, and a target joint tree model is determined based on the local data set and the target third-party data set; finally, data to be predicted is input into the target joint tree model to obtain a prediction result.

Description

Data prediction method and device based on joint tree model and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data prediction method and apparatus based on a joint tree model, a computer device, and a storage medium.
Background
Multi-party secure storage of single-scene data and heterogeneous distribution of cross-scene data are important forms of associated scene data distribution. With the development of multi-party data model training technology, how to obtain a model with better performance and stronger robustness is a key problem in studying the value of multi-party data models.
A secure multi-party computation (Secure Multi-party Computation) platform provides a way to train models on multi-party data securely, realizing joint modeling over multi-party data while meeting the privacy-protection requirement that multi-source data never leaves its local environment. However, in real scenarios, the data provided by a third-party data provider carries a certain amount of noise, manifested as data of low model value and artificially generated invalid data, and this noise grows as the data volume grows. When the noise in the third-party data outweighs the model value of the data, joint modeling with the third-party data cannot improve the performance and robustness of the model, and the joint modeling effect falls below that of a single data source.
Accordingly, there is a need in the related art for a way to identify the data noise of third-party data and reduce the noise interference of third-party data in joint modeling.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data prediction method, apparatus, computer device and computer-readable storage medium based on a joint tree model that can identify the data noise of third-party data and reduce the noise interference of third-party data in joint modeling.
In a first aspect, the present application provides a data prediction method based on a joint tree model. The method comprises the following steps:
acquiring a local data set and an initial third party data set;
determining a local training tree model based on the local data set, determining a local joint tree model based on a joint data set, wherein the joint data set comprises the local data set and an initial third party data set;
determining information entropy values of tree nodes of the local training tree model and the local joint tree model based on the local data set;
determining the gain degree of the initial third-party data set participating in training based on the information entropy value, determining a target third-party data set based on the gain degree, and determining a target joint tree model based on the local data set and the target third-party data set;
and inputting the data to be predicted into the target joint tree model to obtain a prediction result.
Optionally, in one embodiment of the present application, the acquiring the local data set and the initial third party data set includes:
and matching the local data set with the data characteristics of the candidate third party data set, and determining the initial third party data set based on a matching result.
Optionally, in an embodiment of the present application, matching the local data set with data features of a candidate third party data set, and determining the initial third party data set based on a matching result includes:
and calculating the similarity of the data feature sets based on the data feature sets of the local data set and the candidate third-party data set, and taking the candidate third-party data set as the initial third-party data set if the similarity of the data feature sets accords with a preset condition.
Optionally, in one embodiment of the present application, the acquiring the local data set includes:
and acquiring a local data set, and splitting the local data set into a training data set and a testing data set by adopting a statistical principle or a scene experience principle.
Optionally, in an embodiment of the present application, the determining a local training tree model based on the local data set, and determining a local joint tree model based on the joint data set includes:
calculating training data characteristic value dividing intervals of tree nodes based on the training data set, calculating training information entropy values based on the training data characteristic value dividing intervals, and determining a local training tree model;
and calculating joint data characteristic value dividing intervals of tree nodes based on the joint data set, calculating joint information entropy values based on the joint data characteristic value dividing intervals, and determining a local joint tree model.
Optionally, in an embodiment of the present application, the determining a target third party data set based on the gain level, and determining a target joint tree model based on the local data set and the target third party data set includes:
and if the gain degree accords with a preset threshold, determining the initial third party data set as a target third party data set, and determining a corresponding target joint tree model based on the training data set and the target third party data set.
Optionally, in an embodiment of the present application, the matching the local data set with the data features of the candidate third party data set, and before determining the initial third party data set based on the matching result, further includes:
and constructing a data feature mapping table, and unifying the data feature names of the local data set and the candidate third party data set based on the data feature mapping table.
In a second aspect, the application further provides a data prediction device based on the joint tree model. The device comprises:
a data set acquisition module for acquiring a local data set and an initial third party data set;
a local tree model determination module for determining a local training tree model based on the local data set, determining a local joint tree model based on a joint data set, wherein the joint data set comprises the local data set and an initial third party data set;
an information entropy value determining module for determining information entropy values of tree nodes of the local training tree model and the local joint tree model based on the local data set;
the joint tree model determining module is used for determining the gain degree of the initial third-party data set participating in training based on the information entropy value, determining a target third-party data set based on the gain degree, and determining a target joint tree model based on the local data set and the target third-party data set;
and the prediction result determining module is used for inputting the data to be predicted into the target joint tree model to obtain a prediction result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the methods described in the above embodiments.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method described in the above embodiments.
The data prediction method, apparatus, computer device and storage medium based on the joint tree model comprise the following steps: first, a local data set and an initial third-party data set are acquired; a local training tree model is then determined based on the local data set, and a local joint tree model is determined based on a joint data set, where the joint data set comprises the local data set and the initial third-party data set; information entropy values of tree nodes of the local training tree model and the local joint tree model are determined based on the local data set; the gain degree with which the initial third-party data set participates in training is determined based on the information entropy values, a target third-party data set is determined based on the gain degree, and a target joint tree model is determined based on the local data set and the target third-party data set; finally, data to be predicted is input into the target joint tree model to obtain a prediction result. By calculating the gain degree with which each third-party data set participates in joint tree model training, the data noise of the third-party data is effectively identified and removed, reducing the noise interference of third-party data in joint modeling.
Drawings
FIG. 1 is an application environment diagram of a data prediction method based on a joint tree model in one embodiment;
FIG. 2 is a flow diagram of a data prediction method based on a joint tree model in one embodiment;
FIG. 3 is a schematic diagram of the structure of a tree model in one embodiment;
FIG. 4 is a flow diagram of splitting a local data set in one embodiment;
FIG. 5 is a schematic diagram of a method of computing a partition in one embodiment;
FIG. 6 is a flow diagram of a method of data prediction based on a joint tree model in one embodiment;
FIG. 7 is a block diagram of a data prediction device based on a joint tree model in one embodiment;
fig. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The data prediction method based on the joint tree model provided by the embodiment of the application can be applied to the application environment shown in fig. 1, where the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device or portable wearable device, where the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device and the like, and the portable wearable device may be a smart watch, smart bracelet, headset or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a data prediction method based on a joint tree model is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
s201: a local data set and an initial third party data set are acquired.
In an embodiment of the present application, first, a local data set and an initial third-party data set are acquired. The local data set is the data set required by the modeling party for joint tree model training; the data is structured data whose formats include ID, numerical and enumeration types, for example loan client data and repayment detail data in a bank credit business. For data of text or image type, structured data is obtained through operations such as data extraction and conversion. The initial third-party data set is the data statistical information required for joint tree model training, provided by a plurality of third-party data providers to the modeling party. In particular, to meet confidentiality requirements on the computation, the statistical information is usually encrypted statistical information based on secure multi-party computation or on various cryptographic techniques, where the encryption approaches include privacy methods such as homomorphic encryption and differential privacy.
S203: a local training tree model is determined based on the local data set, and a local joint tree model is determined based on a joint data set, wherein the joint data set includes the local data set and an initial third party data set.
The tree model is a learning model structure with strong interpretability and logical clarity, and has been widely researched and applied in academia and industry. The training process of a tree model depends on statistical information of the training data. As shown in fig. 3, the tree model structure comprises the tree structure and node parameter information, and training a tree model is a top-down process of exploring the optimal tree model structure and tree node parameters based on the characteristics of the tree structure.
In the embodiment of the application, a local training tree model is determined by training with the local data set, and a local joint tree model is determined by training with the data statistical information of the local data set and the initial third-party data set. Here, the local training tree model and the local joint tree model refer to tree models in which only part of the nodes have completed calculation; as shown in fig. 3, nodes 1 through fea of the current tree model have completed calculation, and the remaining nodes have not yet been trained.
S205: information entropy values of tree nodes of the local training tree model and the local joint tree model are determined based on the local data set.
In the embodiment of the application, the information entropy value of the local training tree model at the current node is calculated based on the local data set, and the information entropy value of the local joint tree model at the current node is likewise calculated based on the local data set. Information entropy describes the uncertainty over the possible events of an information source and may be calculated using the following formula:

entropy(X) = -∑ p(x)·log p(x), summed over all possible values x of X

where the variable X represents a discrete random variable obeying a certain distribution, x represents a possible value of the discrete random variable, and p(x) represents the probability that the random variable X takes the value x.
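For illustration only (not part of the claimed method), the entropy calculation can be sketched in Python, assuming p(x) is estimated from the empirical frequencies of a discrete sample; the helper name is hypothetical:

    import math
    from collections import Counter

    def entropy(values: list) -> float:
        """Shannon entropy of a discrete sample: -sum over x of p(x) * log2(p(x))."""
        if not values:
            return 0.0
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    print(entropy(["a", "a", "b", "b"]))  # 1.0: maximal uncertainty over two equiprobable values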
S207: and determining the gain degree of the initial third-party data set participating in training based on the information entropy value, determining a target third-party data set based on the gain degree, and determining a target joint tree model based on the local data set and the target third-party data set.
In the embodiment of the application, the gain degree of the initial third-party data set participating in training is calculated based on the information entropy value obtained from the local training tree model and the information entropy value obtained from the local joint tree model. The gain degree refers to the difference between the information entropy value corresponding to the local joint tree model and the information entropy value corresponding to the local training tree model. Whether the initial third-party data set participates in joint tree model training is determined based on this difference: an initial third-party data set that participates in training is determined to be a target third-party data set, and the joint tree model obtained by training with the local data set and the target third-party data set is determined to be the target joint tree model.
S209: and inputting the data to be predicted into the target joint tree model to obtain a prediction result.
In the embodiment of the application, after the target joint tree model is determined, data to be predicted, for example text data or image data, is input into the target joint tree model to obtain a prediction result. The target joint tree model is used for predicting text data or image data, and the prediction comprises classification tasks, regression tasks and the like. A classification task performs attribute classification on the data and outputs classification results, with different application scenarios having different attribute classification requirements; a regression task predicts the data according to relevant parameters to obtain a determined result. Specifically, the model may be adjusted according to the actually required function, which is not specifically limited herein. In specific applications, the target joint tree model can be used in bank credit business scenarios, insurance business scenarios and the like; the data to be predicted can be related data such as loan client data and repayment detail data; predicting such data is a classification task, and the output prediction result is a classification result such as a risk grade classification, anomaly classification or risk attribute classification, with the specific classification requirements set as needed during modeling. In one embodiment of the application, the application scenario is a bank credit business, and the input data to be predicted is repayment detail data, specifically the repayment behavior of a number of clients over the same historical time period. Risk grades are classified according to the actual repayment ability of the borrower, and the final output falls into five grades: normal, special mention, substandard, doubtful and loss; the results can further be classified by risk attribute, where the substandard, doubtful and loss grades are classified as non-performing loans. In addition, optionally, in the embodiment of the present application, the application scenario may also be an image recognition scenario, where the data to be predicted is image data, for example photographs of different birds. The image data is first converted to obtain structured data, the structured data is then input into the target joint tree model for prediction, and the birds in the photographs are classified by type attribute, for example which bird in the picture is a magpie and which is a sparrow.
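As a minimal illustrative sketch of the prediction step (the node structure and field names are assumptions, not the patented model), inputting a structured record into a trained tree model amounts to a root-to-leaf traversal:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TreeNode:
        feature: Optional[str] = None      # splitting feature at an internal node
        threshold: Optional[float] = None  # learned split point
        left: Optional["TreeNode"] = None
        right: Optional["TreeNode"] = None
        value: Optional[str] = None        # prediction stored at a leaf

    def predict(node: TreeNode, x: dict) -> str:
        """Traverse the trained tree from the root to a leaf to obtain the prediction result."""
        while node.value is None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value

    # hypothetical two-leaf model for a credit scenario
    root = TreeNode(feature="overdue_days", threshold=30.0,
                    left=TreeNode(value="normal"), right=TreeNode(value="doubtful"))
    print(predict(root, {"overdue_days": 12}))  # normal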
In the data prediction method based on the joint tree model, first, a local data set and an initial third-party data set are acquired. A local training tree model is then determined based on the local data set, and a local joint tree model is determined based on a joint data set, where the joint data set comprises the local data set and the initial third-party data set. Information entropy values of tree nodes of the local training tree model and the local joint tree model are determined based on the local data set. The gain degree with which the initial third-party data set participates in training is then determined based on the information entropy values, a target third-party data set is determined based on the gain degree, and a target joint tree model is determined based on the local data set and the target third-party data set. Finally, data to be predicted is input into the target joint tree model to obtain a prediction result. By calculating the gain degree with which each third-party data set participates in joint tree model training, the data noise of the third-party data is effectively identified and removed, reducing the noise interference of third-party data in joint modeling.
In one embodiment of the application, the acquiring the local data set and the initial third party data set includes:
And matching the local data set with the data characteristics of the candidate third party data set, and determining the initial third party data set based on a matching result.
In one embodiment of the application, after the local data set and the candidate third-party data sets are acquired, the data features of the local data set are matched against the data features of each candidate third-party data set, and the initial third-party data set is screened out based on the matching result, providing a preliminary screening of the third-party data sets that participate in joint tree model training. The matching may be performed by calculating the similarity between data feature sets.
In this embodiment, matching the local data set against the data features of the candidate third-party data sets and screening the initial third-party data set based on the matching result allows the third-party data sets to be screened preliminarily, filtering out the part of the third-party data sets that would unduly affect the training result.
In one embodiment of the present application, the matching the local data set with the data features of the candidate third party data set, and determining the initial third party data set based on the matching result includes:
and calculating the similarity of the data feature sets based on the data feature sets of the local data set and the candidate third-party data set, and taking the candidate third-party data set as the initial third-party data set if the similarity of the data feature sets accords with a preset condition.
In one embodiment of the present application, the data features of the local data set and a candidate third-party data set are matched by calculating the similarity between their data feature sets. Specifically, the local data feature set is feas(data(model)) and the candidate third-party data feature set is feas(data(add(k))); the difference set diff_fea(k) of the data feature sets is calculated by the formula
diff_fea(k)=feas(data(model))-feas(data(add(k)))
Its specific meaning is the set of data features contained in the local data feature set but not contained in the candidate third-party data feature set. The similarity between the local data feature set and the candidate third-party data feature set is then calculated as

dist_fea(k) = 1 - |diff_fea(k)| / |feas(data(model))|

where the symbol |*| represents the number of elements in *, and * represents a data feature set. Alternatively, the similarity may be calculated based on the included angle between vectorized representations of the data features.
If the data feature set similarity meets a preset condition, the current candidate third-party data set is determined to participate in the joint tree model training process. Specifically, the preset condition is realized by setting a preset feature threshold thresh(fea), whose value range is a decimal between 0 and 1 and which is determined based on business experience, statistical rules of the data, or manual judgment. When dist_fea(k) >= thresh(fea), the data features of the candidate third-party data set and the local data set are sufficiently similar, and the current candidate third-party data set participates in the joint tree model training process; when dist_fea(k) < thresh(fea), the similarity is too low, and the current candidate third-party data set does not participate in the joint tree model training process.
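A minimal sketch of this feature-set screening, assuming the reconstructed dist_fea formula above; the data and the threshold value 0.8 are illustrative only:

    def dist_fea(local_feas: set, candidate_feas: set) -> float:
        """Similarity of a candidate third-party feature set to the local feature set."""
        diff = local_feas - candidate_feas            # diff_fea(k): local features absent from the candidate
        return 1.0 - len(diff) / len(local_feas)

    thresh_fea = 0.8                                  # preset feature threshold thresh(fea)
    local = {"name_1", "name_2", "name_3", "name_4"}
    candidates = {
        "k1": {"name_1", "name_2", "name_3", "name_4", "name_5"},  # dist_fea = 1.0 -> participates
        "k2": {"name_1", "name_9"},                                # dist_fea = 0.25 -> excluded
    }
    initial = {k: f for k, f in candidates.items() if dist_fea(local, f) >= thresh_fea}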
In this embodiment, by calculating the similarity of the data feature sets, a part of the third party data set with a larger difference from the local data set can be screened out, so as to reduce the calculation cost.
In one embodiment of the application, the acquiring the local data set includes:
and acquiring a local data set, and splitting the local data set into a training data set and a testing data set by adopting a statistical principle or a scene experience principle.
In one embodiment of the present application, as shown in fig. 4, after the local data set data(model) is acquired, it is split into a training data set data(model,train) and a test data set data(model,test) based on a statistical principle or a scene experience principle. The training data set and the test data set satisfy an orthogonal relationship, that is, no identical data exists in both data sets, expressed as
data(model)=data(model,train)⊕data(model,test)
In one embodiment of the application, the statistical principle refers to obtaining the training data set and the test data set from the local data set through random sampling, such that the two sets share no common data vector; a data vector denotes a row vector in a data table, one row vector representing one piece of data. The random sampling ratio involved in this process can be adjusted based on actual needs, with a default ratio of 0.2. The scene experience principle refers to splitting the training data set and the test data set from the local data provided by the modeling party based on scene experience and business requirements; in this case the two sets may contain common data vectors, and the split ratio and manner depend on the business requirements.
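A minimal sketch of the statistical-principle split, assuming an orthogonal random split with the default ratio 0.2 (the function name and seed handling are assumptions):

    import random

    def split_local_dataset(data_model: list, test_ratio: float = 0.2, seed: int = 0):
        """Split data(model) into orthogonal data(model,train) and data(model,test)."""
        rng = random.Random(seed)
        indices = list(range(len(data_model)))
        rng.shuffle(indices)
        n_test = int(len(data_model) * test_ratio)
        test_idx = set(indices[:n_test])
        train = [row for i, row in enumerate(data_model) if i not in test_idx]
        test = [row for i, row in enumerate(data_model) if i in test_idx]
        return train, test  # no data vector appears in both sets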
In this embodiment, the local data set is split so that part of it participates in training and part in testing, making the training result of the joint tree model more accurate.
In one embodiment of the application, the determining a local training tree model based on the local data set, the determining a local joint tree model based on the joint data set comprises:
s301: and calculating training data characteristic value dividing intervals of the tree nodes based on the training data set, calculating training information entropy values based on the training data characteristic value dividing intervals, and determining a local training tree model.
S303: and calculating joint data characteristic value dividing intervals of tree nodes based on the joint data set, calculating joint information entropy values based on the joint data characteristic value dividing intervals, and determining a local joint tree model.
In one embodiment of the present application, the joint tree model training process is a process of calculating the optimal tree node parameter values and the optimal information entropy value based on the training data set and the third-party data sets; the node parameter is a data feature value partition, where partitioning a data feature value means dividing the value range of the feature into several subintervals whose boundaries are the partition points. As shown in fig. 5, assuming the value range of a data feature is [A, B], a partition T = [t1, t2] is defined; the partition T divides the value range [A, B] into three subintervals, namely set(1) = [A, t1], set(2) = [t1, t2] and set(3) = [t2, B]. The data feature value partition corresponding to the optimal information entropy value is the optimal partition.
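For illustration (a hypothetical helper, not the claimed procedure), dividing a feature's value range [A, B] by a partition T can be sketched as:

    def partition_intervals(a: float, b: float, cuts: list) -> list:
        """Divide the value range [a, b] into subintervals at the partition points in `cuts`."""
        bounds = [a] + sorted(cuts) + [b]
        return list(zip(bounds[:-1], bounds[1:]))

    # partition T = [t1, t2] divides [A, B] into set(1), set(2) and set(3)
    print(partition_intervals(0.0, 10.0, [3.0, 7.0]))  # [(0.0, 3.0), (3.0, 7.0), (7.0, 10.0)]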
In one embodiment of the present application, the partition is calculated as follows. For the current node of the joint tree model, the corresponding feature name is denoted fea, and the initial value set of the feature is set(fea), with default set(fea) = empty set (a set with no elements). The value set of the local data at feature fea is set(fea, model), and the initial set is updated as set(fea) = set(fea, model). For the third-party data sets {data(1), data(2), ..., data(M)}, assume the interval similarity threshold is thresh(domain) and the value set of third-party data(1) at fea is set(fea, add(1)). The similarity similarity(fea, model, add(1)) between the local value set set(fea, model) and the third-party value set is calculated as follows:

similarity(fea, model, add(1)) = |inner(set(fea, model), set(fea, add(1)))| / |union(set(fea, model), set(fea, add(1)))|

Here, inner(set(fea, model), set(fea, add(1))) denotes the intersection of set(fea, model) and set(fea, add(1)), union(set(fea, model), set(fea, add(1))) denotes their union, and the symbol |*| denotes the length of the set, that is, the difference between the maximum and minimum values in the set.
When similarity(fea, model, add(1)) > thresh(domain), the value set of the local data at feature fea and the value set of third-party data(1) at feature fea are sufficiently similar, and the initial set(fea) is updated, that is
set(fea)=union(set(fea),set(fea,add(1)))
Otherwise, no operation is performed. Thereafter, the value set of fea is updated sequentially in the same way based on the third-party data sets data(2), data(3), ..., data(M).
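A sketch of this value-set update, assuming value sets are represented as closed intervals and |*| is the interval length as stated above; names and the threshold are illustrative:

    def interval_similarity(local: tuple, third: tuple) -> float:
        """|inner(set1, set2)| / |union(set1, set2)| for interval-shaped value sets."""
        lo, hi = max(local[0], third[0]), min(local[1], third[1])
        if lo >= hi:
            return 0.0  # disjoint intervals: empty intersection
        return (hi - lo) / (max(local[1], third[1]) - min(local[0], third[0]))

    def update_value_set(set_fea: tuple, third_sets: list, thresh_domain: float = 0.5) -> tuple:
        """Sequentially merge third-party value sets whose similarity exceeds thresh(domain)."""
        for s in third_sets:
            if interval_similarity(set_fea, s) > thresh_domain:
                set_fea = (min(set_fea[0], s[0]), max(set_fea[1], s[1]))  # union of intervals
        return set_fea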
In one embodiment of the present application, the information entropy value is calculated based on a data feature value partition. Specifically, assume that the value range of the joint tree model at node fea is [A, B], where B >= A, and that the parameter value of fea divides the interval into disjoint set(1) and set(2), with [A, B] = union(set(1), set(2)). On set(1) and set(2), the information entropy values of the test data set and the third-party data(k) at node fea of the joint tree model are calculated; that is, the information entropy value of the joint tree model at node fea is expressed as:
entropy(data(k’))=entropy(set(1))+entropy(set(2))
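Reusing the entropy helper sketched earlier, the node-level entropy entropy(set(1)) + entropy(set(2)) can be read as the sum of label entropies over the two subintervals (field names are assumptions):

    import math
    from collections import Counter

    def entropy(values: list) -> float:
        if not values:
            return 0.0
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def node_entropy(rows: list, feature: str, split: float, label: str = "y") -> float:
        """entropy(set(1)) + entropy(set(2)) for a candidate split point on `feature`."""
        left = [r[label] for r in rows if r[feature] <= split]   # labels falling into set(1)
        right = [r[label] for r in rows if r[feature] > split]   # labels falling into set(2)
        return entropy(left) + entropy(right)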
In one embodiment of the application, using the above calculation of data feature value partitions and information entropy values, the training data feature value partition of each tree node is calculated based on the training data set, the training information entropy value is calculated based on that partition, and the local training tree model is determined; the joint data feature value partition of each tree node is calculated based on the joint data set, the joint information entropy value is calculated based on that partition, and the local joint tree model is determined.
In this embodiment, by calculating the data feature value partitions of the tree nodes based on the training data set and the joint data set, calculating the information entropy values based on those partitions, and thereby determining the local training tree model and the local joint tree model, the joint tree model can be trained more accurately, and the subsequent judgment of the gain degree becomes more intuitive and clear.
In one embodiment of the present application, the determining a target third party data set based on the gain level and determining a target joint tree model based on the local data set and the target third party data set includes:
and if the gain degree accords with a preset threshold, determining the initial third party data set as a target third party data set, and determining a corresponding target joint tree model based on the training data set and the target third party data set.
In one embodiment of the present application, the information entropy values of the tree nodes of the local training tree model and each local joint tree model, namely entropy(data(model)), entropy(data(1)), entropy(data(2)), ..., entropy(data(M)), are calculated based on the test data set within the local data set. Then the difference between the information entropy value of each local joint tree model and that of the local training tree model is calculated in turn, giving the value gain entropy_add(data(k')); the value gain is the gain degree, that is
entropy_add(data(k’))=entropy(data(k’))-entropy(data(model))
A value-gain preset threshold thresh(value) is set, and the initial third-party data set is determined to be a target third-party data set based on it. The value range of the preset threshold is a decimal between 0 and 1, namely [0, 1]; the value is set based on business experience, statistical rules of the data, or multiple rounds of testing to find a relatively optimal value.
In one embodiment of the present application, if entropy_add(data(k')) >= thresh(value), the initial third-party data set data(k') is marked as data(k', y), which means that data(k') participates in the parameter calculation of the joint tree model at the current node fea, and it is determined to be a target third-party data set. If entropy_add(data(k')) < thresh(value), the initial third-party data set data(k') is marked as data(k', n), which indicates that the data value gain provided by data(k') at the current node fea is insufficient, and it does not participate in the parameter calculation of the joint tree model at the current node.
In one embodiment of the application, the determined target third-party data sets and the training data set together form a target data set, expressed as {data(k'', y), k'' = 1, 2, 3, ..., P}, where k'' denotes the provider number of a target third-party data set and P denotes the number of target third-party data sets. The node parameter value and the corresponding information entropy value of the joint tree model at the current node fea are calculated based on the target data set to obtain the target joint tree model.
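A sketch of the gain-based selection, directly mirroring entropy_add(data(k')) = entropy(data(k')) - entropy(data(model)) and the threshold comparison above; all values are illustrative:

    def select_target_datasets(entropy_model: float, joint_entropies: dict, thresh_value: float) -> dict:
        """Mark data(k', y) for third-party sets whose value gain meets thresh(value)."""
        targets = {}
        for k, e in joint_entropies.items():
            entropy_add = e - entropy_model      # gain degree of data(k') at the current node
            if entropy_add >= thresh_value:
                targets[k] = entropy_add         # data(k', y): participates in node calculation
        return targets

    print(select_target_datasets(0.90, {"d1": 0.97, "d2": 0.91}, thresh_value=0.05))  # keeps only "d1"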
In this embodiment, by judging the gain degree brought by the participation of the third party data set in the training of the joint tree model, the influence of the third party data set on the training of the joint tree model can be intuitively determined, and the third party data set participating in the training of the joint tree model can be determined.
In one embodiment of the present application, the matching the local data set with the data features of the candidate third party data set further includes, before determining the initial third party data set based on the matching result:
and constructing a data feature mapping table, and unifying the data feature names of the local data set and the candidate third party data set based on the data feature mapping table.
In one embodiment of the application, since the data feature names provided by different third-party data providers may differ, with the same feature written in different formats, a data feature mapping table is constructed to represent the data feature names uniformly and facilitate the matching of data feature values.
In one embodiment of the present application, the data feature set of the local data set data (model) is denoted as fea (data (model)), in particular
fea(data(model))=[fea_1,fea_2,...,fea_T]
T denotes the number of local data features, i.e. the local data feature set contains T features, with feature names fea_1, fea_2, ..., fea_T.
The kth third-party data set is data(k'), and the corresponding data feature set is fea(data(k')), denoted as
fea(data(k’))=[fea_1(k’),fea_2(k’),...,fea_T(k’)],
T(k') denotes the number of features of the kth third-party data set, with feature names fea_1(k'), fea_2(k'), ..., fea_T(k').
In one embodiment of the present application, the function expression fun_name(table_map(k'), *) indicates that the feature name * is mapped to the corresponding feature name based on the data feature mapping table. In a specific application, the local data feature set is as follows
fea(model)=[name_1,name_2,name_3,name_4],
Third party data feature set is
fea(data(k*)) = [name_1', name_2', name_3', name_4', name_5'], and the data feature mapping table table_map(k*) is as follows
Local data features | Third-party data features
name_1 | name_2'
name_2 | name_1'
name_3 | name_3'
name_4 | name_4'
unknown | name_5'
By using the function expression fun_name(table_map(k*), *), the following results are obtained:
fun_name(table_map(k*),name_1)=name_2’
fun_name(table_map(k*),name_2)=name_1’
fun_name(table_map(k*),name_3)=name_3’
fun_name(table_map(k*),name_4)=name_4’
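A sketch of the mapping lookup, using the example table above (the dictionary representation is an assumption):

    # table_map(k*) from the example above: local feature name -> third-party feature name
    table_map_k = {"name_1": "name_2'", "name_2": "name_1'",
                   "name_3": "name_3'", "name_4": "name_4'"}

    def fun_name(table_map: dict, local_name: str) -> str:
        """Map a local feature name to the third party's name for the same feature."""
        return table_map[local_name]

    assert fun_name(table_map_k, "name_1") == "name_2'"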
in this embodiment, by constructing the data feature mapping table, the data feature names of the local data set and the third party data set are unified, so that the efficiency of data feature matching can be improved, and the matching error can be reduced.
In the following, the procedure of the data prediction method based on the joint tree model is described in a specific embodiment, as shown in fig. 6. First, a local data set data(model) and a plurality of candidate third-party data sets data(1), data(2), ..., data(N) are obtained, where N represents the number of third-party data providers. Then, the local data set data(model) is split into a training data set data(model,train) and a test data set data(model,test) by adopting a statistical principle or a scene experience principle; specifically, in the initial state, the set sizes of data(model,train) and data(model,test) satisfy the following condition
|data(model,train)|:|data(model,test)|=8:2
The expression "data size" means the data set. The aggregate size ratio of the data set data (model) and the data set data (model) can be set according to the service requirement, and is not limited to 8:2. And then, constructing a data feature mapping table, and unifying the data feature names of the local data set and the candidate third party data set based on the data feature mapping table. Specifically, a data feature map table_map (1), a table_map (2), … … and a table_map (N) are constructed, and the data feature names of the initial third party data sets are collectively represented by the data feature names in the local data set based on the data feature map tables. Specifically, after the data feature names are unified, the initial third party data sets are data (1 '), data (2 '), … … and data (N '), and N ' represents the number of third party data providers, and N ' =n. And then, matching the data characteristics of the local data set and the data characteristics of the candidate third-party data set, determining the initial third-party data set based on a matching result, calculating the similarity of the data characteristics based on the data characteristics of the local data set and the data characteristics of the initial third-party data set, and taking the candidate third-party data set as the initial third-party data set if the similarity of the data characteristics meets a preset condition. After feature matching, the initial third party data sets participating in the joint tree training are data (1), data (2), … … and data (M '), wherein M' represents the number of third party data providers, and M '< N'. Then, calculating a training data characteristic value dividing interval of the tree node based on the training data set, calculating a training information entropy value based on the training data characteristic value dividing interval, and determining a local training tree model; and calculating joint data characteristic value dividing intervals of tree nodes based on a joint data set, calculating joint information entropy values based on the joint data characteristic value dividing intervals, and determining a local joint tree model, wherein the joint data set comprises the training data set and an initial third-party data set. Then, determining information entropy values of tree nodes of the local training tree model and the local joint tree model based on the test data set; determining the gain degree of the initial third party data set participating in training based on the information entropy value, specifically, the gain degree is entopy_add (data (1 ')), entopy_add (data (2 ')), … …, entopy_add (data (M ')), if the gain degree accords with a preset threshold value, determining the initial third party data set as a target third party data set, specifically, the target third party data sets are data (1), data (2), … … and data (P '), wherein P ' represents the number of third party data providers, and P ' < M '. And determining a corresponding target joint tree model based on the training data set and the target third party data set. And finally, inputting the data to be predicted into the target joint tree model to obtain a prediction result.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; the order of their execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a data prediction device based on the joint tree model, which is used for realizing the data prediction method based on the joint tree model. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of the data prediction apparatus based on the joint tree model provided below may be referred to the limitation of the data prediction method based on the joint tree model hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 7, there is provided a data prediction apparatus 700 based on a joint tree model, including: a data set acquisition module 701, a local tree model determination module 703, an information entropy value determination module 705, a joint tree model determination module 707, and a prediction result determination module 709, wherein:
a data set acquisition module 701, configured to acquire a local data set and an initial third party data set;
a local tree model determination module 703 for determining a local training tree model based on the local data set, determining a local joint tree model based on a joint data set, wherein the joint data set comprises the local data set and an initial third party data set;
an information entropy value determining module 705, configured to determine, based on the local dataset, information entropy values of tree nodes of the local training tree model and the local joint tree model;
a joint tree model determining module 707 configured to determine a gain level of the initial third party data set participating in training based on the information entropy value, determine a target third party data set based on the gain level, and determine a target joint tree model based on the local data set and the target third party data set;
the prediction result determining module 709 is configured to input data to be predicted into the target joint tree model to obtain a prediction result.
The data prediction device based on the joint tree model further comprises a data characteristic matching module. In one embodiment of the present application, the data feature matching module is configured to match the local data set with data features of a candidate third party data set, and determine the initial third party data set based on a matching result.
In one embodiment of the present application, the data feature matching module is further configured to calculate the data feature set similarity based on the data feature sets of the local data set and a candidate third-party data set, and if the data feature set similarity meets a preset condition, to take the candidate third-party data set as the initial third-party data set.
In one embodiment of the present application, the data set obtaining module is further configured to obtain a local data set, and split the local data set into a training data set and a test data set by adopting a statistical principle or a scene experience principle.
In one embodiment of the present application, the local tree model determining module is further configured to calculate a training data feature value dividing interval of the tree node based on the training data set, calculate a training information entropy value based on the training data feature value dividing interval, and determine a local training tree model;
And calculating joint data characteristic value dividing intervals of tree nodes based on the joint data set, calculating joint information entropy values based on the joint data characteristic value dividing intervals, and determining a local joint tree model.
In an embodiment of the present application, the joint tree model determining module is further configured to determine the initial third party data set as a target third party data set if the gain level meets a preset threshold, and determine a corresponding target joint tree model based on the local data set and the target third party data set.
In one embodiment of the present application, the data feature matching module is further configured to construct a data feature mapping table, and unify data feature names of the local data set and the candidate third party data set based on the data feature mapping table.
The respective modules in the data prediction apparatus based on the joint tree model may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by a processor, implements a data prediction method based on a joint tree model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method of data prediction based on a joint tree model, the method comprising:
acquiring a local data set and an initial third party data set;
determining a local training tree model based on the local data set, determining a local joint tree model based on a joint data set, wherein the joint data set comprises the local data set and an initial third party data set;
determining information entropy values of tree nodes of the local training tree model and the local joint tree model based on the local data set;
determining the gain degree of the initial third-party data set participating in training based on the information entropy value, determining a target third-party data set based on the gain degree, and determining a target joint tree model based on the local data set and the target third-party data set;
and inputting the data to be predicted into the target joint tree model to obtain a prediction result.
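For illustration only, the flow recited in claim 1 can be read as: train a tree on the local data alone, train a second tree on the joined data, and keep the third-party data only if it measurably lowers the information entropy of the tree nodes. The following Python sketch assumes scikit-learn's DecisionTreeClassifier as the tree learner, takes the sample-weighted node-entropy difference as the "gain degree", and uses an illustrative threshold of 0.05; none of these specifics is fixed by the claim.

```python
# Illustrative sketch of the flow in claim 1, not the patented implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_node_entropy(model):
    # Average the per-node entropies recorded by the fitted tree,
    # weighted by the number of samples reaching each node.
    t = model.tree_
    return np.average(t.impurity, weights=t.n_node_samples)

def predict_with_joint_tree(local_X, local_y, third_X, query_local, query_third,
                            gain_threshold=0.05):
    # Local training tree model: built on the local data set alone.
    local_model = DecisionTreeClassifier(criterion="entropy", random_state=0)
    local_model.fit(local_X, local_y)

    # Local joint tree model: built on the joint data set
    # (local features side by side with third-party features).
    joint_X = np.hstack([local_X, third_X])
    joint_model = DecisionTreeClassifier(criterion="entropy", random_state=0)
    joint_model.fit(joint_X, local_y)

    # Gain degree: entropy reduction attributable to the third-party data.
    gain = weighted_node_entropy(local_model) - weighted_node_entropy(joint_model)

    if gain >= gain_threshold:
        # Third-party set accepted; the joint model is the target joint tree model.
        return joint_model.predict(np.hstack([query_local, query_third]))
    # Otherwise fall back to the purely local model.
    return local_model.predict(query_local)
```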
2. The method of claim 1, wherein acquiring the local data set and the initial third party data set comprises:
matching the local data set with the data characteristics of the candidate third party data set, and determining the initial third party data set based on a matching result.
3. The method of claim 2, wherein matching the local data set with the data characteristics of the candidate third party data set and determining the initial third party data set based on the matching result comprises:
calculating a data feature set similarity based on the data feature sets of the local data set and the candidate third party data set, and taking the candidate third party data set as the initial third party data set if the data feature set similarity meets a preset condition.
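Claim 3 leaves both the similarity measure and the preset condition open; a Jaccard overlap of the two feature-name sets with an illustrative cutoff of 0.5 is one minimal reading, sketched below (function and parameter names are hypothetical).

```python
def feature_set_similarity(local_features, candidate_features):
    # Jaccard similarity between the two data feature sets.
    a, b = set(local_features), set(candidate_features)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def select_initial_third_party(local_features, candidates, min_similarity=0.5):
    # candidates: mapping of provider name -> list of feature names.
    # Keep only candidates whose feature overlap meets the preset condition.
    return {name: feats for name, feats in candidates.items()
            if feature_set_similarity(local_features, feats) >= min_similarity}
```

Providers that clear this bar become initial third party data sets and proceed to the entropy-gain test of claims 1 and 6.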
4. The method of claim 1, wherein acquiring the local data set comprises:
acquiring the local data set, and splitting the local data set into a training data set and a testing data set according to a statistical principle or a scenario experience principle.
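As one concrete reading of the "statistical principle", the split can be delegated to scikit-learn; the 80/20 ratio and the stratification below are illustrative choices, not requirements of claim 4.

```python
from sklearn.model_selection import train_test_split

def split_local_data(X, y, test_size=0.2, seed=42):
    # Stratified split preserves the class balance of the local data set;
    # returns train_X, test_X, train_y, test_y.
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)
```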
5. The method of claim 4, wherein determining the local training tree model based on the local data set and determining the local joint tree model based on the joint data set comprises:
calculating training data feature value dividing intervals of tree nodes based on the training data set, calculating training information entropy values based on the training data feature value dividing intervals, and determining the local training tree model;
and calculating joint data feature value dividing intervals of tree nodes based on the joint data set, calculating joint information entropy values based on the joint data feature value dividing intervals, and determining the local joint tree model.
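Both models in claim 5 rest on two primitives: enumerating the feature value dividing intervals at a tree node and scoring them by information entropy. The sketch below uses the common binary midpoint construction; the claim does not fix the interval scheme, so that choice is an assumption.

```python
import numpy as np

def entropy(y):
    # Shannon entropy of a label vector, in bits.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def candidate_thresholds(x):
    # Midpoints between consecutive distinct feature values; each midpoint
    # splits the feature range into two dividing intervals.
    v = np.unique(x)
    return (v[:-1] + v[1:]) / 2.0

def interval_entropy(x, y, threshold):
    # Sample-weighted entropy of the two intervals induced by `threshold`.
    # x and y are numpy arrays; midpoints guarantee both sides are non-empty.
    mask = x <= threshold
    n = len(y)
    return (mask.sum() / n) * entropy(y[mask]) + ((~mask).sum() / n) * entropy(y[~mask])
```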
6. The method of claim 4, wherein determining the target third party data set based on the gain degree and determining the target joint tree model based on the local data set and the target third party data set comprises:
if the gain degree meets a preset threshold, determining the initial third party data set as the target third party data set, and determining a corresponding target joint tree model based on the training data set and the target third party data set.
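Claim 6 is the acceptance gate: the initial third party data set is promoted only when the gain degree clears the preset threshold. A minimal sketch, again assuming a scikit-learn tree and an illustrative threshold value:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_target_joint_model(train_X, train_y, third_X, gain, gain_threshold=0.05):
    # Promote the third-party data only if the gain degree meets the threshold;
    # otherwise the caller falls back to the local training tree model.
    if gain < gain_threshold:
        return None
    joint_X = np.hstack([train_X, third_X])  # training set joined with target third-party set
    return DecisionTreeClassifier(criterion="entropy", random_state=0).fit(joint_X, train_y)
```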
7. The method of claim 2, wherein, prior to matching the local data set with the data characteristics of the candidate third party data set and determining the initial third party data set based on the matching result, the method further comprises:
constructing a data feature mapping table, and unifying the data feature names of the local data set and the candidate third party data set based on the data feature mapping table.
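The mapping table of claim 7 can be as simple as a dictionary from third-party column names to the local vocabulary; the entries below are hypothetical examples.

```python
# Hypothetical mapping table: third-party feature names -> local feature names.
FEATURE_MAP = {
    "cust_age": "age",
    "yearly_income": "annual_income",
    "sex": "gender",
}

def unify_feature_names(columns, mapping=FEATURE_MAP):
    # Rename third-party features into the local naming scheme;
    # names absent from the table pass through unchanged.
    return [mapping.get(c, c) for c in columns]
```

Applying this before the similarity matching of claims 2 and 3 keeps identical features from being counted as distinct merely because of naming.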
8. A data prediction apparatus based on a joint tree model, the apparatus comprising:
the data set acquisition module is used for acquiring a local data set and an initial third party data set;
the local tree model determination module is used for determining a local training tree model based on the local data set and determining a local joint tree model based on a joint data set, wherein the joint data set comprises the local data set and the initial third party data set;
the information entropy value determining module is used for determining information entropy values of tree nodes of the local training tree model and the local joint tree model based on the local data set;
the joint tree model determining module is used for determining the gain degree of the initial third-party data set participating in training based on the information entropy value, determining a target third-party data set based on the gain degree, and determining a target joint tree model based on the local data set and the target third-party data set;
and the prediction result determining module is used for inputting the data to be predicted into the target joint tree model to obtain a prediction result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202310595002.9A 2023-05-23 2023-05-23 Data prediction method and device based on joint tree model and computer equipment Pending CN116821817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310595002.9A CN116821817A (en) 2023-05-23 2023-05-23 Data prediction method and device based on joint tree model and computer equipment

Publications (1)

Publication Number Publication Date
CN116821817A true CN116821817A (en) 2023-09-29

Family

ID=88117640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310595002.9A Pending CN116821817A (en) 2023-05-23 2023-05-23 Data prediction method and device based on joint tree model and computer equipment

Country Status (1)

Country Link
CN (1) CN116821817A (en)

Similar Documents

Publication Publication Date Title
CN112231592A (en) Network community discovery method, device, equipment and storage medium based on graph
CN111475838A (en) Graph data anonymizing method, device and storage medium based on deep neural network
CN116468543A (en) Credit risk assessment method, device, equipment and medium based on federal learning
CN116030312B (en) Model evaluation method, device, computer equipment and storage medium
CN112966054A (en) Enterprise graph node relation-based ethnic group division method and computer equipment
CN116308644A (en) Recommendation method, device, equipment, storage medium and product of target product
CN116401238A (en) Deviation monitoring method, apparatus, device, storage medium and program product
CN115758271A (en) Data processing method, data processing device, computer equipment and storage medium
CN116821817A (en) Data prediction method and device based on joint tree model and computer equipment
CN117235584B (en) Picture data classification method, device, electronic device and storage medium
CN116976464A (en) Unbiased federal learning training method, unbiased federal learning training apparatus, computer device, and storage medium
CN114998634B (en) Image processing method, image processing device, computer equipment and storage medium
CN117312892A (en) User clustering method, device, computer equipment and storage medium
CN118035423A (en) Information query method, device, computer equipment and storage medium
CN117033591A (en) Problem solving method, device, computer equipment and storage medium
CN117634751A (en) Data element evaluation method, device, computer equipment and storage medium
CN116881122A (en) Test case generation method, device, equipment, storage medium and program product
CN117114278A (en) Site selection method, device, computer equipment and storage medium
CN115952358A (en) Product recommendation method and device, computer equipment and storage medium
CN116910115A (en) Group query method, device, computer equipment and storage medium
CN114756654A (en) Dynamic place name and address matching method and device, computer equipment and storage medium
CN116881543A (en) Financial resource object recommendation method, device, equipment, storage medium and product
CN116861326A (en) Report classification method, report classification device, computer equipment, storage medium and program product
CN117435279A (en) Personal resource management method, device, computer equipment and storage medium
CN117150311A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination