CN110737731A

CN110737731A - accumulation fund user data refinement analysis system and method based on decision tree

Info

Publication number: CN110737731A
Application number: CN201911022440.6A
Authority: CN
Inventors: 李子龙; 鲍蓉; 潘晓博
Original assignee: Xuzhou University of Technology
Current assignee: Xuzhou University of Technology
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2020-01-31
Anticipated expiration: 2039-10-25
Also published as: CN110737731B

Abstract

The invention discloses a system and method for refinement analysis of accumulation fund user data based on a decision tree. The system comprises a data acquisition module, a data storage module, a data preprocessing module and a data analysis module. The data acquisition module acquires multi-source accumulation fund user data, identifies entities, entity attributes and the relationships among entities, and eliminates conflicts in the multi-source data. The data storage module stores the converted relational data in a relational database. The data preprocessing module converts the original relational data into the characteristic data used by the decision tree in the user refinement analysis. The data analysis module performs refinement analysis of the user characteristic data using the decision tree. Finally, the analysis result is displayed to the user in chart form.

Description

accumulation fund user data refinement analysis system and method based on decision tree

Technical Field

The invention relates to a decision-tree-based system and method for refinement analysis of public accumulation fund user data, and belongs to the technical field of public accumulation fund data analysis and management.

Background

At present, the informatization of public accumulation fund services has become an inevitable trend, and reasonable, effective management of user data is very important for the public accumulation fund management department. By analyzing public accumulation fund users in detail, different management strategies can be applied to different user segments so as to enhance service functions and the level of management.

To give the public accumulation fund management department command of decision data, business data is often analyzed with decision tree or clustering methods. However, the analysis results are still not fine-grained enough, and the analysis queries the relational database directly, which involves a large volume of data and many tables. As a result, query, processing and access are inefficient, and the implementation cycle for business requirements is long.

Disclosure of Invention

Purpose of the invention: to overcome the defects of the prior art, the invention provides a system and method for refinement analysis of accumulation fund user data based on decision trees. The original relational model data is preprocessed to extract characteristic data for decision tree refinement analysis, and a new decision-tree-based method for refinement analysis of accumulation fund users is designed on the basis of that characteristic data, thereby providing timely and accurate decision support for the accumulation fund management department.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:

A system for refining and analyzing accumulation fund user data based on decision trees comprises the following components:

the data acquisition module is used for acquiring data information related to the accumulation fund user from each terminal device (a mobile terminal, a computer terminal, a video monitoring terminal and the like), identifying entities, entity attributes and relations among the entities and eliminating conflicts existing in the multi-source accumulation fund user data;

the data storage module is used for converting the entity obtained after the conflict is eliminated and the relation between the entities into relational model data, and storing the relational model data into a relational database in computer storage equipment to form original relational data;

the data preprocessing module is used for converting the original relational data into characteristic data used by a decision tree in the data analysis module and exists in the form of a relational database view;

the data analysis module is used for carrying out refinement analysis on the user characteristic data in the relational database view through the decision tree, training to generate a decision tree refinement analysis model, then passing the test data of the accumulation fund user to the decision tree, where it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimated values of all the leaf nodes;

and the data display module is used for displaying the refined classification result to the user on terminal equipment (a mobile phone, a computer and the like) in chart form.

Further, when generating the decision tree, the data analysis module starts from the root node, selects a feature from the feature set of the sample data to test, and allocates the sample data to child nodes according to the test result; the sample data is tested and allocated recursively in this way until it reaches the leaf nodes.

Further, the process by which the data analysis module allocates the sample data to the child nodes includes:

1) If the feature values of the selected feature are discrete and finite, hard assignment is used, i.e. sample data is assigned to only one of the child nodes according to the test result.

2) If the feature values of the selected feature are ordered and continuous, soft assignment is used, i.e. sample data is assigned to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function t(x, γ, δ), where x is a given sample data item, and γ and δ can be the mean and the variance of the data values of the selected feature (the two parameters can be learned by a relevant machine learning algorithm). When the value of t is 1, the sample data is assigned to the left child node; when the value of t is 0, it is assigned to the right child node; otherwise, it is assigned to both the left and right child nodes.

3) Each sample data item used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample data set under that node. The membership of all sample data under the root node is 1 by default, and for a given node N, the membership of its child node NC is defined recursively as:

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)

where x is the sample data, μ_NC is the degree of membership of node NC, μ_N is the degree of membership of node N, and t_N is the test function of node N.
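For illustration only, one piecewise linear form consistent with the behaviour described above (t equal to 1, equal to 0, or in between) can be written as follows. The breakpoints γ ± δ/2, the requirement δ > 0, and the complementary weight 1 − t for the right child are assumptions made for this sketch and are not stated in the text.

```latex
t(x,\gamma,\delta)=
\begin{cases}
1, & x \le \gamma-\delta/2,\\[4pt]
\dfrac{\gamma+\delta/2-x}{\delta}, & \gamma-\delta/2 < x < \gamma+\delta/2,\\[4pt]
0, & x \ge \gamma+\delta/2,
\end{cases}
\qquad
\begin{aligned}
\mu_{NC_{\mathrm{left}}}(x)  &= \mu_N(x)\,t_N(x,\gamma,\delta),\\
\mu_{NC_{\mathrm{right}}}(x) &= \mu_N(x)\,\bigl(1-t_N(x,\gamma,\delta)\bigr).
\end{aligned}
```

Under these assumptions, a sample with x = 8500 tested at the root (μ_N = 1) with γ = 8000 and δ = 2000 gives t = (9000 − 8500)/2000 = 0.25, so the left child receives membership 0.25 and the right child 0.75.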

Further, the data analysis module obtains the refined classification result for the accumulation fund user test data from the trained decision tree by a formula over the leaf nodes, in which x' is the test data; each leaf node i has a probability estimate for the c-th class and a degree of membership; y^c denotes the probability output for class c; y denotes the output for the class with the largest probability; C denotes the number of classes; and leaves denotes the set of leaf nodes.
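A minimal sketch of how the leaf estimates might be combined is given below. It assumes that the per-class output y^c is the membership-weighted average of the leaf probability estimates and that y is the class with the largest output; the function name `classify` and the sample numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def classify(leaves: Sequence[Tuple[float, Sequence[float]]]) -> Tuple[int, List[float]]:
    """Combine the leaves reached by one test sample x'.

    leaves: (membership of x' in the leaf, per-class probability estimates).
    Returns (index of the winning class, per-class outputs y^c).
    """
    total = sum(mu for mu, _ in leaves)
    if total <= 0:
        raise ValueError("sample reached no leaf with positive membership")
    n_classes = len(leaves[0][1])
    # Assumed aggregation: membership-weighted average of the leaf estimates.
    outputs = [
        sum(mu * probs[c] for mu, probs in leaves) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: outputs[c])
    return best, outputs


# Two leaves reached with memberships 0.75 and 0.25 (illustrative numbers).
best_class, outputs = classify([(0.75, [0.9, 0.1]), (0.25, [0.2, 0.8])])
```

Dividing by the total membership keeps the outputs on a probability scale even if the memberships of the reached leaves do not sum to 1.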

A method for refining and analyzing accumulation fund user data based on decision tree, comprising the following steps:

step A: collecting data information related to the accumulation fund user from each terminal device through a data collection module, identifying the entity, the entity attribute and the relationship among the entities, and eliminating the conflict existing in the multi-source accumulation fund user data;

step B: converting the entities obtained after the conflicts are eliminated, and the relationships between the entities, into relational model data through a data storage module, and storing the relational model data in a relational database to form original relational data;

step C: converting the original relational data into characteristic data used by the decision tree in a data analysis module through a data preprocessing module, and storing the characteristic data in the form of a relational database view;

step D: performing refinement analysis on the user characteristic data in the relational database view through the data analysis module, training to generate a decision tree refinement analysis model, passing the test data of an accumulation fund user to the decision tree, where it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimated values of all the leaf nodes;

step E: displaying the refined classification result to the user on the terminal equipment in chart form through a data display module.

Further, the data collected in step A includes structured, semi-structured and unstructured data, and entity linking technology can be used to identify entities and the relationships between them from the different data sources.

Further, conflicts among the entities, entity attributes and relationships obtained from different data sources are eliminated by manual judgment and identification; these are mainly attribute conflicts, name conflicts and structure conflicts.
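Once a reviewer has decided how a conflict should be resolved, the decision can be recorded as a rule and applied to every incoming record. The sketch below covers only name conflicts (different field names for one attribute) and one kind of attribute conflict (different units); all field names, the rule tables and the `normalize` helper are hypothetical and do not come from the source.

```python
# Hypothetical rules a reviewer might record after manual judgment;
# none of these field names come from the source.
CANONICAL_NAME = {
    "cust_name": "name",
    "customer": "name",
    "deposit_cents": "monthly_deposit",
}
UNIT_DIVISOR = {"deposit_cents": 100}  # minor currency units -> major units


def normalize(record: dict) -> dict:
    """Apply the recorded name and unit rules to one source record."""
    out = {}
    for field, value in record.items():
        if field in UNIT_DIVISOR:
            value = value / UNIT_DIVISOR[field]
        out[CANONICAL_NAME.get(field, field)] = value
    return out


# The same customer as described by two different sources.
merged = [
    normalize({"cust_name": "Li", "monthly_deposit": 1200.0}),
    normalize({"customer": "Li", "deposit_cents": 120000}),
]
```

After normalization both records carry the same field names and units, so they can be compared and stored under one relational schema. Structure conflicts would need separate handling.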

Further, step C includes: determining, according to the application requirements (such as risk control or customer service), the characteristic attributes used by the decision tree in the refinement analysis of accumulation fund users; establishing a transformation relationship between the characteristic data and the original relational data; and storing the extracted characteristic data in a relational database view. For example, if the characteristic attribute used by the decision tree is each customer's average annual income while the relational database records each customer's monthly income, the customer data in the database must be aggregated by year to obtain the average annual income, which establishes the transformation between the average annual income (the characteristic data) and the customer income stored in the relational database.
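The average-annual-income example can be sketched with an in-memory SQLite database. The table name `monthly_income`, the view name `customer_features`, the column names and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE monthly_income (
        customer_id INTEGER, year INTEGER, month INTEGER, income REAL
    );
    INSERT INTO monthly_income VALUES
        (1, 2018, 1, 5000), (1, 2018, 2, 7000),
        (1, 2019, 1, 6000), (1, 2019, 2, 8000),
        (2, 2019, 1, 3000);

    -- Feature view: total income per year, then the mean over years.
    CREATE VIEW customer_features AS
    SELECT customer_id, AVG(annual_income) AS avg_annual_income
    FROM (
        SELECT customer_id, year, SUM(income) AS annual_income
        FROM monthly_income
        GROUP BY customer_id, year
    ) AS yearly
    GROUP BY customer_id;
    """
)

rows = conn.execute(
    "SELECT customer_id, avg_annual_income FROM customer_features ORDER BY customer_id"
).fetchall()
```

Because the feature is exposed as a view, the decision tree code reads one flat relation and does not need to know how the raw tables are laid out.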

The conversion between the characteristic data and the original relational data generally consists of querying, processing and integrating data from a single table or from multiple joined tables in the relational database; the converted result is the characteristic data to be extracted. The conversion can be implemented as an independent program module, and conversion rules can be passed to the module as parameters so that different users can use it. To keep the characteristic data in the database view up to date, a timer can be set so that the conversion from original relational data to characteristic data runs continuously; the update period can be chosen by users according to their needs.
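A conversion module that takes its rule as a parameter might look like the sketch below. It assumes the features are materialized into a table that is rebuilt on each refresh rather than kept as a live view; the function `refresh_features`, the table names and the rule are hypothetical.

```python
import sqlite3


def refresh_features(conn: sqlite3.Connection, target: str, rule_sql: str) -> int:
    """Rebuild the feature table `target` from `rule_sql`, a SELECT over raw tables.

    Returns the number of feature rows produced.
    """
    if not target.isidentifier():
        raise ValueError(f"unsafe table name: {target!r}")
    with conn:  # commits when the block exits without error
        conn.execute(f"DROP TABLE IF EXISTS {target}")
        conn.execute(f"CREATE TABLE {target} AS {rule_sql}")
    return conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposits (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO deposits VALUES (?, ?)", [(1, 100.0), (1, 300.0)])

# The conversion rule is plain data, so different users can supply their own.
rule = "SELECT customer_id, AVG(amount) AS avg_deposit FROM deposits GROUP BY customer_id"
first = refresh_features(conn, "deposit_features", rule)

# New raw data arrives; the next scheduled refresh picks it up.
conn.execute("INSERT INTO deposits VALUES (2, 50.0)")
second = refresh_features(conn, "deposit_features", rule)
```

A periodic scheduler (for example a cron job) could call `refresh_features` at whatever update period the user chooses.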

Further, the training and generation process of the decision tree refinement analysis model in step D specifically includes:

When the decision tree is generated, starting from the root node, a feature is selected from the feature set of the sample data for testing and the sample data is distributed to the child nodes according to the test result; the sample data is tested and distributed recursively in this way until the leaf nodes are reached.

To select the appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio, Gini index, etc.
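The criteria named above have standard textbook definitions, sketched here for a candidate split that partitions the labels into groups. The helper names and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of the class labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def information_gain(labels, groups) -> float:
    """Entropy of the parent minus the size-weighted entropy of the groups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups if g)


def gain_ratio(labels, groups) -> float:
    """Information gain divided by the entropy of the split sizes."""
    n = len(labels)
    split_info = sum(-(len(g) / n) * log2(len(g) / n) for g in groups if g)
    return information_gain(labels, groups) / split_info if split_info > 0 else 0.0


labels = ["low", "low", "high", "high"]
good_split = [["low", "low"], ["high", "high"]]  # separates the classes
poor_split = [["low", "high"], ["low", "high"]]  # leaves them mixed
```

A feature whose split yields a higher information gain (or gain ratio), or lower impurity in the resulting groups, would be preferred at that node.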

A conventional decision tree distributes sample data to child nodes according to the test result at a node in such a way that a sample can be distributed to only one child node. The present invention, by contrast, can distribute sample data to a plurality of child nodes according to the characteristics of the selected feature. Because data collected from the real world carries a certain degree of ambiguity, combining soft and hard distribution lets the decision tree reflect this ambiguity, thereby improving the refined classification. Specifically:

1) If the feature values of the selected feature are discrete and finite, hard distribution is adopted, that is, sample data can be distributed to only one child node according to the test result.

2) If the feature values of the selected features are ordered and continuous, a soft distribution is used, that is, sample data may be distributed to a plurality of child nodes according to the test result.

The test uses a piecewise linear fuzzy function:

in the formula, x is a certain sample data, and γ and δ can be the mean and variance of the data value corresponding to a certain feature (these two parameters can also be obtained by learning through a relevant machine learning algorithm); when the value of the function t is 1, sample data is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; otherwise, the samples are distributed to the left and right child nodes simultaneously.

3) Each sample data item used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample data set under that node. The membership of all sample data under the root node is 1 by default, which reflects the fact that all samples belong to it.

Given a node N, the degree of membership of its child nodes NC may be defined recursively as:

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)
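The hard and soft distribution rules and the recursive membership definition can be combined into a single traversal, sketched below. Several points are assumptions made for illustration: the tree is binary, the fuzzy test falls linearly from 1 to 0 across the interval γ ± δ/2, the right child receives μ_N(x)(1 − t), and the `Node` layout, feature names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass
class Node:
    name: str = ""                             # leaf label (leaves only)
    feature: Optional[str] = None              # None marks a leaf
    discrete: bool = False                     # True -> hard, False -> soft
    left_values: FrozenSet[str] = frozenset()  # discrete values sent left
    gamma: float = 0.0
    delta: float = 1.0                         # must be positive
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Assumed form: 1 below gamma - delta/2, 0 above gamma + delta/2, linear between."""
    upper = gamma + delta / 2
    return min(1.0, max(0.0, (upper - x) / delta))


def leaf_memberships(node: Node, sample: dict, mu: float = 1.0) -> Dict[str, float]:
    """Return {leaf name: membership} for every leaf the sample reaches."""
    if node.feature is None:
        return {node.name: mu}
    value = sample[node.feature]
    if node.discrete:
        t = 1.0 if value in node.left_values else 0.0  # hard: exactly one child
    else:
        t = fuzzy_test(value, node.gamma, node.delta)  # soft: possibly both
    reached: Dict[str, float] = {}
    # Left child gets mu * t; the right-child weight mu * (1 - t) is an assumption.
    for child, weight in ((node.left, mu * t), (node.right, mu * (1.0 - t))):
        if weight > 0.0:
            reached.update(leaf_memberships(child, sample, weight))
    return reached


income = Node(feature="income", gamma=8000.0, delta=2000.0,
              left=Node(name="A"), right=Node(name="B"))
root = Node(feature="employer", discrete=True, left_values=frozenset({"public"}),
            left=income, right=Node(name="C"))
result = leaf_memberships(root, {"employer": "public", "income": 8500.0})
```

In this example the discrete test at the root sends the sample entirely to the left, and the continuous income test then splits its membership between leaves A and B, so the sample reaches two leaves with memberships that sum to 1.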

Further, according to the decision tree obtained by training in step D, when the test data x' of an accumulation fund user is passed to the decision tree and reaches a plurality of leaf nodes, the refined classification result is given according to the estimated values of all those leaf nodes by a formula in which x' is the test data and each leaf node i supplies a probability estimate for the c-th class.

Compared with the prior art, the decision-tree-based public accumulation fund user data refinement analysis system and method of the invention have the following advantages: 1. characteristic data for decision tree refinement analysis is extracted by preprocessing the original relational model data, and a new decision-tree-based refinement analysis method for public accumulation fund users is designed on the basis of this characteristic data, which greatly improves the efficiency of analyzing and managing public accumulation fund user data;

2. unlike the traditional way of generating decision trees, the invention divides data among child nodes by combining soft and hard distribution, and can provide the accumulation fund management department with more efficient data resources and stronger refinement analysis capability, thereby providing a scientific basis for the effective management of accumulation fund users.

Drawings

FIG. 1 is a flow block diagram of the method for refinement analysis of accumulation fund user data based on decision trees;

FIG. 2 is a schematic flow chart of the data transformation relationship of the accumulation fund user in the present invention.

Detailed Description

The invention is further described with reference to the following figures.

the data storage module is used for converting the entities obtained after the conflict is eliminated and the relationship between the entities into relational model data and storing the relational model data into a relational database to form original relational data;

The sample data allocation process specifically includes:

in the formula, x is certain sample data, and gamma and delta are the mean value and the variance of the data value corresponding to the selected characteristic; when the value of the function t is 1, sample data is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; in other cases, the samples are simultaneously distributed to the left child node and the right child node;

3) Each sample data item used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample data set under that node, and the membership of all sample data under the root node is 1 by default. Given a node N, the membership of its child node NC is defined recursively as:

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)

Finally, according to the decision tree obtained by training, the refined classification result of the accumulation fund user test data is obtained by a formula in which x' is the test data and each leaf node i supplies a probability estimate for the c-th class.

FIG. 1 shows a method for refinement analysis of accumulation fund user data based on decision trees, which includes the following steps:

step A: collecting data from a plurality of different data sources related to a public accumulation fund user through a data collection module (the collected data comprises structured data, semi-structured data and unstructured data), identifying entities, entity attributes and relations among the entities from the different data sources by adopting intelligent technologies such as an entity link technology, deep learning and the like, eliminating conflicts among the entities, the entity attributes and the relations among the entities from the different data sources, and mainly eliminating attribute conflicts, name conflicts and structure conflicts;

step C: according to different application requirements (such as risk control, customer service and the like), determining the characteristic attributes used by the decision tree in the refinement analysis of accumulation fund users through a data preprocessing module, establishing a transformation relationship between the characteristic data and the original relational data, and storing the extracted characteristic data in a relational database view;

The transformation between the feature data and the original relational data generally consists of querying, processing and integrating data from a single table or from multiple joined tables in the relational database; the transformed result is the feature data to be extracted. The transformation can be implemented as an independent program module, and transformation rules can be passed to the module as parameters so that different users can use it. To keep the feature data in the database view up to date, a timer can be set so that the transformation from original relational data to feature data runs continuously; the update period can be chosen by users according to their needs.

Specifically, when generating a decision tree, starting from the root node, a feature is selected from the feature set of the sample data to be tested and the sample data is assigned to child nodes according to the test result; the sample data is tested and assigned recursively in this way until a leaf node is reached.

Conventional decision trees use hard distribution of sample data to child nodes based on the test result at a node, i.e., sample data can be distributed to only one of the child nodes. The present invention additionally provides soft distribution based on the characteristics of the selected feature, i.e., sample data can be distributed to a plurality of child nodes.

The test uses a piecewise linear fuzzy function:

in the formula, x is a certain sample data, and γ and δ can be the mean and variance of the data value corresponding to a certain feature (these two parameters can be obtained by learning through a relevant machine learning algorithm); when the value of the function t is 1, sample data is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; otherwise, the samples are distributed to the left and right child nodes simultaneously.

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)

Finally, according to the decision tree obtained by training, when the test data x' of an accumulation fund user is passed to the decision tree and reaches a plurality of leaf nodes, a refined classification result is given according to the estimated values of all those leaf nodes by a formula in which x' is the test data and each leaf node i supplies a probability estimate for the c-th class.

Step E: displaying the refined classification result to the user on terminal equipment (a mobile phone, a computer and the like) in chart form through a data display module.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A system for refinement analysis of accumulation fund user data based on a decision tree, characterized by comprising:

the data acquisition module is used for acquiring data information related to the accumulation fund user from each terminal device, identifying the entity, the entity attribute and the relationship among the entities, and eliminating the conflict existing in the multi-source accumulation fund user data;

and the data display module is used for displaying the refined classification result to the user on the terminal equipment in chart form.

2. The decision-tree-based accumulation fund user data refinement analysis system according to claim 1, wherein, when generating the decision tree, the data analysis module starts from the root node, selects a feature from the feature set of the sample data for testing, and assigns the sample data to child nodes according to the test result, so that the sample data is tested and assigned recursively until it reaches the leaf nodes.

3. The decision-tree-based accumulation fund user data refining analysis system according to claim 2, wherein the process of assigning sample data to child nodes in the data analysis module specifically comprises:

1) if the feature values of the selected feature are discrete and finite, hard distribution is adopted, that is, sample data can be distributed to only one child node according to the test result;

in the formula, x is sample data, and gamma and delta are the mean value and variance of the data value corresponding to the selected characteristic; when the value of the function t is 1, sample data is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; in other cases, the samples are simultaneously distributed to the left child node and the right child node;

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)

4. The decision tree-based accumulation fund user data refining analysis system as claimed in claim 3, wherein the refined classification result of the accumulation fund user test data x' according to the trained decision tree in the data analysis module is obtained by the following formula:

where x' is the test data; the formula uses the probability estimate of the c-th class at the i-th leaf node and the degree of membership of the i-th leaf node; y^c denotes the probability output for class c; y denotes the output for the class with the largest probability; C denotes the number of classes; and leaves denotes the set of leaf nodes.

5. A method for refinement analysis of accumulation fund user data based on a decision tree, characterized in that the method comprises the following steps:

6. The method for refinement analysis of accumulation fund user data based on a decision tree as claimed in claim 5, wherein the data collected in step A includes structured data, semi-structured data and unstructured data, and entity linking technology is used to identify entities and the relationships between entities from the collected data.

7. The method for refinement analysis of accumulation fund user data based on a decision tree as claimed in claim 5, wherein the conflicts eliminated in step A include attribute conflicts, name conflicts and structure conflicts.

8. The method for refining and analyzing accumulation fund user data based on decision tree as claimed in claim 5, wherein the transformation of the feature data and the original relational data in step C is to search, process and integrate the data connected to a single table or multiple tables in the relational database, and the transformed result is the feature data to be extracted.

9. The method for refining analysis of accumulation fund user data based on decision tree as claimed in claim 5, wherein the training process of the decision tree refining analysis model in step D specifically comprises:

when a decision tree is generated, features are selected from the feature set of sample data to be tested from a root node, the sample data is distributed to child nodes of the sample data according to a test result, the sample data is tested and distributed recursively in such a way until a leaf node is reached, and finally the sample data is distributed to the leaf node;

the sample data allocation process specifically includes:

3) each sample data item used in the decision tree is assigned a membership value at the corresponding node of the decision tree, which indicates the degree to which the sample data belongs to the sample data set under that node; the membership of all sample data under the root node is 1 by default, and the membership of the child node NC of the node N is defined recursively as:

μ_NC(x)＝μ_N(x)t_N(x,γ,δ)

10. The method for refining analysis of accumulation fund user data based on decision tree as claimed in claim 9, wherein the refined classification result of the accumulation fund user test data x' based on the trained decision tree in step D is obtained by the following formula:

where x' is the test data, and the formula uses the probability estimate of the c-th class at the i-th leaf node,