CN110737731B

CN110737731B - Decision tree-based public accumulation user data refinement analysis system and method

Info

Publication number: CN110737731B
Application number: CN201911022440.6A
Authority: CN
Inventors: 李子龙; 鲍蓉; 潘晓博
Original assignee: Xuzhou University of Technology
Current assignee: Xuzhou University of Technology
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2023-12-29
Anticipated expiration: 2039-10-25
Also published as: CN110737731A

Abstract

The invention discloses a decision tree-based public accumulation user data refinement analysis system and method. The system comprises: a data acquisition module for acquiring multi-source public accumulation user data, identifying entities, entity attributes and relationships among the entities, and eliminating conflicts in the multi-source data; a data storage module for storing the converted relational data in a relational database; a data preprocessing module for converting the original relational data into the feature data used by the decision tree in user refinement analysis; and a data analysis module for performing refinement analysis on the user feature data using the decision tree, with the analysis result finally displayed to the user in chart form. The invention preprocesses the original relational model data, extracts the feature data used for decision tree refinement analysis, and on that basis designs a new decision tree-based method for refinement analysis of public accumulation user data, so that timely and accurate decision support can be provided to the public accumulation management department.

Description

Decision tree-based public accumulation user data refinement analysis system and method

Technical Field

The invention relates to a decision tree-based public accumulation user data refinement analysis system and method, and belongs to the technical field of public accumulation data analysis management.

Background

At present, informatization of the public accumulation business has become an inevitable trend, and reasonable and effective management of public accumulation user data is of great importance to public accumulation management departments. By performing refinement analysis on public accumulation users, different strategies can be applied to the business of different user segments, thereby improving service functions and the level of management. For this reason, how to use public accumulation user data to analyze users in detail has become an increasingly important problem for public accumulation management departments everywhere.

To give the public accumulation management department the data it needs for decisions, business data is often analyzed with decision tree or clustering methods, but the results are still not fine-grained enough. Moreover, the analysis usually queries the relational database directly, which involves a large volume of data and many tables, so query, processing and access are inefficient and business requirements take a long time to implement. At present, the business of each public accumulation service unit spans a long time range, involves complex processing logic and faces strict timing requirements, and traditional user analysis methods based on public accumulation data cannot meet the timeliness requirements of the public accumulation service departments.

Disclosure of Invention

The invention aims to: in order to overcome the defects in the prior art, the invention provides a decision tree-based system and method for refinement analysis of public accumulation user data. The original relational model data is preprocessed to extract the feature data used for decision tree refinement analysis, and a new decision tree-based method for refinement analysis of public accumulation users is designed on the basis of that feature data, so that timely and accurate decision support can be provided to the public accumulation management department.

The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:

a decision tree-based public accumulation user data refinement analysis system, comprising the following components:

the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal device (mobile terminal, computer terminal, video monitoring terminal and the like), identifying the entity, entity attribute and relationship among the entities, and eliminating conflicts existing in the multi-source public accumulation user data;

the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database in the computer storage equipment to form original relation data;

the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data being held in the form of a relational database view;

the data analysis module performs refinement analysis on the user characteristic data in the relational database view through a decision tree, trains and generates a decision tree refinement analysis model, transmits test data of the public accumulation user to the decision tree and reaches a plurality of leaf nodes, and finally gives a refinement classification result according to estimated values of all the leaf nodes;

and the data display module is used for displaying the refined classification result to the user in a form of a chart on the terminal equipment (a mobile phone end, a computer end and the like).

Further, when the decision tree is generated in the data analysis module, one feature is selected from the feature set of the sample data from the root node for testing, the sample data is distributed to the child nodes according to the test result, the sample data is tested and distributed recursively until the leaf node is reached, and finally the sample data is distributed to the leaf node.
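
The recursive test-and-distribute procedure can be illustrated with a minimal sketch for discrete features only (hard allocation). It is not the patented implementation: the sample layout (a feature dictionary paired with a label), the function names, the example feature names and the use of the Gini index as the selection criterion are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split(samples, feature):
    """Hard allocation: each sample goes to exactly one group per feature value."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[features[feature]].append((features, label))
    return groups


def build_tree(samples, features):
    """Recursively test one feature per node until a leaf is reached."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1 or not features:
        # Leaf: all labels agree, or no feature is left to test.
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    def weighted_gini(feature):
        groups = split(samples, feature).values()
        return sum(len(g) / len(samples) * gini([l for _, l in g]) for g in groups)

    best = min(features, key=weighted_gini)
    rest = [f for f in features if f != best]
    children = {value: build_tree(group, rest)
                for value, group in split(samples, best).items()}
    return {"feature": best, "children": children}


# Hypothetical user records: (features, segment label).
samples = [
    ({"employed": "yes", "city": "X"}, "low_risk"),
    ({"employed": "yes", "city": "Y"}, "low_risk"),
    ({"employed": "no", "city": "X"}, "high_risk"),
    ({"employed": "no", "city": "Y"}, "high_risk"),
]
tree = build_tree(samples, ["city", "employed"])
```

In this toy data the feature `employed` separates the two labels completely, so it is chosen at the root and both children become leaves.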

Further, the process of distributing the sample data to the child nodes in the data analysis module specifically includes:

1) If the feature values of the selected features are discrete and finite, a hard allocation is used, i.e., one sample of data can be allocated to only one of the child nodes based on the results of the test.

2) If the feature values of the selected features are ordered, continuous, a soft allocation is used, i.e., one sample of data is allocated to one or more child nodes based on the results of a test that uses a piecewise linear fuzzy function:

wherein x is a sample, and γ and δ may be taken as the mean and variance of the data values of the selected feature (the two parameters can also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; otherwise, the sample is distributed to both the left and right child nodes.

3) Each sample data used in the decision tree is endowed with a membership value in a corresponding node of the decision tree, which indicates the degree to which the sample data belongs to a sample data set under the node; the membership of all sample data under the root node is given by default 1, while for a given node N, the membership of its child node NC is recursively defined as:

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$

wherein x is sample data, $\mu_{NC}$ is the membership degree of the child node NC, $\mu_N$ is the membership degree of the node N, and $t_N$ is the test function of the node N.
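
The soft allocation rule and the membership recursion can be sketched as follows. The formula of the piecewise linear fuzzy function is not reproduced in this text, so the sketch assumes one common shape: t equals 1 below γ − δ/2, 0 above γ + δ/2, and falls linearly in between. That shape, the use of 1 − t for the right child, and all function names are assumptions, not the patent's own definitions.

```python
def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Assumed piecewise linear test: 1 means left child only, 0 means right child only."""
    if delta <= 0:
        # Degenerate width: fall back to a crisp (hard) threshold at gamma.
        return 1.0 if x <= gamma else 0.0
    lower, upper = gamma - delta / 2, gamma + delta / 2
    if x <= lower:
        return 1.0
    if x >= upper:
        return 0.0
    # Transition zone: the sample is shared between both children.
    return (upper - x) / delta


def child_memberships(mu_parent: float, x: float, gamma: float, delta: float):
    """Return (left, right) memberships.

    The left value follows mu_NC = mu_N * t_N from the text; using the
    complement 1 - t for the right child is an assumption of this sketch.
    """
    t = fuzzy_test(x, gamma, delta)
    return mu_parent * t, mu_parent * (1.0 - t)


left, right = child_memberships(1.0, x=11.0, gamma=10.0, delta=4.0)
```

With γ = 10 and δ = 4 the transition zone is the interval from 8 to 12, so a sample at 11 is distributed to both children, with the larger share going to the right child.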

Further, the data analysis module obtains the refined classification result for the public accumulation user test data x', according to the decision tree obtained by training, by the following formula:

wherein x' is the test data; the two leaf-node terms in the formula are the probability estimate of the c-th class at the i-th leaf node and the membership degree of x' at the i-th leaf node; $y^c$ represents the probability output for category c; y represents the probability output of the most probable category; C represents the number of categories; and leave represents the set of leaf nodes.
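
The classification formula itself is not reproduced in this text. One form that is consistent with the definitions given here is a membership-weighted sum over the leaves reached by x'. The symbols $p_i^c$ (class probability estimate at leaf i) and $\mu_i(x')$ (membership of x' at leaf i) are names assumed for this sketch, and whether the original formula also normalizes by the total membership cannot be recovered from the text.

```latex
y^{c} = \sum_{i \in \mathrm{leave}} \mu_i(x')\, p_i^{c}, \qquad c = 1, \dots, C,
\qquad\qquad
y = \max_{1 \le c \le C} y^{c}
```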

A decision tree-based public accumulation user data refinement analysis method comprises the following steps:

step A: the data information related to the public accumulation user is collected from each terminal device through the data collection module, the entity attribute and the relation among the entities are identified, and the conflict existing in the multi-source public accumulation user data is eliminated;

step B: the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;

step C: converting the original relational data into characteristic data used by a decision tree in a data analysis module through a data preprocessing module, and enabling the characteristic data to exist in a relational database view mode;

step D: carrying out refinement analysis on user characteristic data in a relational database view through a data analysis module, training to generate a decision tree refinement analysis model, transmitting test data of a public accumulation user to the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refinement classification result according to estimated values of all the leaf nodes;

step E: and displaying the result of the refined classification to the user in a form of a chart on the terminal equipment through the data display module.

Further, the data collected in the step a includes structured data, semi-structured data and unstructured data, and the entity-to-entity relationship can be identified from different data sources by using entity linking technology.

Further, the conflict among the entities from different data sources, the entity attributes and the relationships among the entities is eliminated by means of manual judgment and identification, and the attribute conflict, the name conflict and the structure conflict are mainly eliminated.
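
Since the conflicts are resolved by manual judgment, software can at most prepare the list of candidates for a reviewer. The sketch below shows one way to do that for name conflicts and attribute conflicts; the alias table, source names, field names and function names are all hypothetical, and structure conflicts are not covered.

```python
from collections import defaultdict

# Hypothetical synonym table for name conflicts: source field -> unified attribute name.
FIELD_ALIASES = {"monthly_pay": "monthly_income", "income_month": "monthly_income"}


def unify_names(record: dict) -> dict:
    """Map differently named fields that mean the same thing onto one attribute name."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def find_attribute_conflicts(sources: dict) -> list:
    """sources maps source name -> {entity_id: record}.

    Returns (entity_id, attribute, {source: value}) for every attribute whose
    values disagree across sources, so a person can decide which value to keep.
    """
    seen = defaultdict(dict)  # (entity_id, attribute) -> {source: value}
    for source, records in sources.items():
        for entity_id, record in records.items():
            for attribute, value in unify_names(record).items():
                seen[(entity_id, attribute)][source] = value
    return [
        (entity_id, attribute, values)
        for (entity_id, attribute), values in seen.items()
        if len(set(values.values())) > 1
    ]


sources = {
    "mobile": {"u1": {"monthly_pay": 8000, "employer": "A"}},
    "counter": {"u1": {"monthly_income": 8500, "employer": "A"}},
}
conflicts = find_attribute_conflicts(sources)
```

Here the two sources agree on the employer but report different incomes under different field names, so only the income is flagged for manual review.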

Further, the step C specifically includes: according to different application requirements (such as risk control, customer service and the like), determining characteristic attributes used by a decision tree in the detail analysis of the public accumulation fund users, establishing a transformation relation between the characteristic data and original relational data (for example, the characteristic attributes used in the decision tree are annual average income of each customer, and the monthly income situation of each customer is recorded in a relational database, so that the average income of each customer needs to be counted according to years for the customer data in the database, thereby establishing a transformation relation between the annual average income of the characteristic data and the income situation of the customers in the relational database), and storing the extracted characteristic data into a relational database view.
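
The annual average income example can be sketched with an in-memory SQLite database. The table name, column names and sample rows are assumptions made for illustration; the patent does not specify a schema or a database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Original relational data: one row per customer per month (hypothetical schema).
CREATE TABLE monthly_income (customer_id TEXT, year INTEGER, month INTEGER, income REAL);

-- Feature data exposed as a view: average income per customer per year.
CREATE VIEW annual_avg_income AS
    SELECT customer_id, year, AVG(income) AS avg_income
    FROM monthly_income
    GROUP BY customer_id, year;
""")
conn.executemany(
    "INSERT INTO monthly_income VALUES (?, ?, ?, ?)",
    [("u1", 2018, 1, 6000.0), ("u1", 2018, 2, 8000.0), ("u1", 2019, 1, 9000.0)],
)
rows = conn.execute(
    "SELECT customer_id, year, avg_income FROM annual_avg_income "
    "ORDER BY customer_id, year"
).fetchall()
```

The decision tree would then read its feature values from `annual_avg_income` rather than from the monthly table.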

The transformation process of the feature data and the original relational data generally searches, processes and integrates the data connected by a single table or multiple tables in the relational database, and the transformed result is the feature data to be extracted. For this transformation, it can be written as a separate program module and the transformation rules can be passed as parameters to the module for use by different users. In order to ensure that the feature data in the database view is up-to-date, a timer may be set to continuously perform the conversion from the original relational data to the feature data, and the update period may be determined by the user according to his own needs.
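
A separate transformation module that receives its rules as parameters and is re-run on a timer might look like the sketch below. It is an illustration under stated assumptions: the rule format (a target name plus a query), the table names and the simple loop standing in for a timer are all hypothetical. The sketch stores each result as a table snapshot so that there is something for the timer to refresh; an ordinary view would instead be recomputed on every read.

```python
import sqlite3
import time


def refresh_feature(conn, rule):
    """Rebuild one feature table from a transformation rule passed in as a parameter."""
    # The rule values are interpolated into SQL, so they must come from
    # trusted configuration and never from end-user input.
    with conn:
        conn.execute(f'DROP TABLE IF EXISTS {rule["target"]}')
        conn.execute(f'CREATE TABLE {rule["target"]} AS {rule["query"]}')


def run_periodically(task, period_seconds, iterations):
    """Stand-in for a timer: run the task, wait one period, repeat."""
    for _ in range(iterations):
        task()
        time.sleep(period_seconds)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposits (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO deposits VALUES (?, ?)",
    [("u1", 500.0), ("u1", 700.0), ("u2", 300.0)],
)

# Hypothetical rule: total deposit per customer.
rule = {
    "target": "feature_total_deposit",
    "query": "SELECT customer_id, SUM(amount) AS total_deposit "
             "FROM deposits GROUP BY customer_id",
}
run_periodically(lambda: refresh_feature(conn, rule), period_seconds=0, iterations=2)
features = conn.execute(
    "SELECT customer_id, total_deposit FROM feature_total_deposit ORDER BY customer_id"
).fetchall()
```

A different user would pass a different `rule` and a different `period_seconds` without changing the module itself.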

Further, the training generation process of the decision tree refinement analysis model in the step D specifically includes:

when generating the decision tree, starting from the root node, selecting a feature in the feature set of the sample data for testing, distributing the sample data to its child nodes according to the test result, recursively testing and distributing the sample data until reaching the leaf node, and finally distributing the sample data to the leaf node.

To select the appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain rate, gini index, etc.
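
Two of the named criteria can be computed as in the following sketch, which uses the standard textbook definitions of entropy-based information gain and the Gini index. The function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction when `labels` is split into the label lists in `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups if g)
    return entropy(labels) - remainder


labels = ["a", "a", "b", "b"]
perfect_split = information_gain(labels, [["a", "a"], ["b", "b"]])
useless_split = information_gain(labels, [["a", "b"], ["a", "b"]])
```

A feature whose split separates the classes completely gets the full entropy of the node as its gain, while a split that leaves each group as mixed as the parent gains nothing.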

A conventional decision tree distributes sample data to child nodes by hard allocation according to the test result at each node, that is, each sample can be distributed to only one child node. In the invention, depending on the nature of the selected feature, soft allocation may be used instead, that is, a sample may be distributed to several child nodes. Since data collected from the real world carries a certain degree of ambiguity and uncertainty, combining soft and hard allocation lets the decision tree reflect that ambiguity and thereby improves the refined classification. Specifically:

1) If the feature values of the selected feature are discrete, limited, a hard allocation is used, i.e. a sample of data can only be allocated to one of the child nodes according to the test results.

2) If the feature values of the selected features are ordered and continuous, a soft allocation is used, i.e. a sample of data may be allocated to a plurality of sub-nodes according to the test results.

The test uses a piecewise linear fuzzy function:

wherein x is a sample, and γ and δ may be taken as the mean and variance of the data values of the selected feature (the two parameters can also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; otherwise, the sample is distributed to both the left and right child nodes.

3) Each sample data used in the decision tree is assigned a membership value in the corresponding node of the decision tree indicating the degree to which the sample data belongs to the sample data set under that node. And membership of all sample data under the root node is given by default 1, reflecting the fact that all samples belong to it.

Given node N, its membership of child node NC can be recursively defined as:

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$
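
As a worked illustration of the recursion, suppose a sample x passes from the root to a child $N_1$ and then to a child $N_2$ of $N_1$, and that the test functions at the root and at $N_1$ evaluate to 0.25 and 0.6 respectively. These two values are invented for the example, and each node has its own γ and δ.

```latex
\mu_{\mathrm{root}}(x) = 1, \qquad
\mu_{N_1}(x) = \mu_{\mathrm{root}}(x)\, t_{\mathrm{root}}(x, \gamma, \delta) = 1 \times 0.25 = 0.25, \qquad
\mu_{N_2}(x) = \mu_{N_1}(x)\, t_{N_1}(x, \gamma, \delta) = 0.25 \times 0.6 = 0.15
```

The membership therefore shrinks multiplicatively along the path, and it becomes 0 as soon as any test on the path returns 0.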

Further, in the step D, according to the decision tree obtained by training, when the test data x' of a certain public accumulation user is passed to the decision tree and reaches a plurality of leaf nodes, the refined classification result is finally given according to the estimated values of all the leaf nodes reached:
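
The formula is not reproduced in this text, so the following sketch only illustrates the idea of combining all reached leaves. It assumes that each leaf stores per-class probability estimates, that the score of a class is the membership-weighted sum of those estimates, that the fuzzy test has the piecewise linear shape shown in the code, and that the right child receives the complement 1 − t. The tree layout, feature name and class names are hypothetical.

```python
def fuzzy_test(x, gamma, delta):
    """Assumed piecewise linear test: 1 below gamma - delta/2, 0 above gamma + delta/2."""
    lower, upper = gamma - delta / 2, gamma + delta / 2
    if x <= lower:
        return 1.0
    if x >= upper:
        return 0.0
    return (upper - x) / delta


def leaf_memberships(node, sample, mu=1.0):
    """Yield (membership, class_probabilities) for every leaf the sample reaches."""
    if mu == 0.0:
        return  # The sample does not belong to this subtree at all.
    if "probs" in node:
        yield mu, node["probs"]
        return
    t = fuzzy_test(sample[node["feature"]], node["gamma"], node["delta"])
    yield from leaf_memberships(node["left"], sample, mu * t)
    yield from leaf_memberships(node["right"], sample, mu * (1.0 - t))


def classify(tree, sample):
    """Return the best class and the membership-weighted score of every class."""
    scores = {}
    for mu, probs in leaf_memberships(tree, sample):
        for label, p in probs.items():
            scores[label] = scores.get(label, 0.0) + mu * p
    return max(scores, key=scores.get), scores


# Hypothetical one-split tree on a continuous feature.
tree = {
    "feature": "annual_income", "gamma": 100000.0, "delta": 40000.0,
    "left": {"probs": {"basic": 0.9, "premium": 0.1}},
    "right": {"probs": {"basic": 0.2, "premium": 0.8}},
}
label, scores = classify(tree, {"annual_income": 110000.0})
```

An income of 110000 falls inside the transition zone, so the sample reaches both leaves with memberships 0.25 and 0.75, and the class scores are blended accordingly.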

The beneficial effects are that: compared with the prior art, the system and the method for refining and analyzing the public accumulation fund user data based on the decision tree have the following advantages:

1. the feature data for the detailed analysis of the decision tree is extracted from the original relation model data by preprocessing, and a new method for the detailed analysis of the public accumulation user based on the decision tree is designed on the basis of the feature data, so that the data analysis and management efficiency of the public accumulation user is greatly improved;

2. different from the traditional decision tree generation mode, the invention adopts a soft and hard combination mode to divide data into sub-nodes, and can provide more efficient data resources and stronger refinement analysis capability for the public accumulation management department, thereby providing scientific basis for effective management of public accumulation users.

Drawings

FIG. 1 is a flow chart of the decision tree-based method for refining and analyzing public accumulation fund user data;

FIG. 2 is a flow chart of the transformation relationship of the public accumulation user data in the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;

The sample data distribution process specifically comprises the following steps:

wherein x is a sample, and γ and δ are the mean and variance of the data values of the selected feature; when the value of the function t is 1, the sample is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; in other cases, the sample is distributed to both the left child node and the right child node;

3) Each sample data used in the decision tree is assigned a membership value in the corresponding node of the decision tree, indicating the degree to which the sample data belongs to the sample data set under the node, and the membership of all sample data under the root node is set to 1 by default. Given a node N, the membership degree of its child node NC is recursively defined as:

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$

Finally, according to the decision tree obtained by training, the refined classification result of the public accumulation user test data is obtained by the following formula:

FIG. 1 shows a decision tree-based detailed analysis method for public accumulation fund user data, which comprises the following steps:

step A: collecting data (including structured data, semi-structured data and unstructured data) from a plurality of different data sources related to public accumulation users through the data acquisition module, identifying entities, entity attributes and relationships among the entities from the different data sources by intelligent techniques such as entity linking and deep learning, and eliminating conflicts among the entities, entity attributes and relationships obtained from the different data sources, mainly attribute conflicts, name conflicts and structure conflicts;

step C: according to different application requirements (such as risk control, customer service and the like), determining characteristic attributes used by a decision tree in the detail analysis of the user of the accumulation fund through a data preprocessing module, establishing a transformation relationship between characteristic data and original relationship data, and storing the extracted characteristic data into a relational database view;

for the transformation of the feature data and the original relational data, the data connected by a single table or multiple tables in the relational database is generally searched, processed and integrated, and the transformed result is the feature data to be extracted. For this transformation, it can be written as a separate program module and the transformation rules can be passed as parameters to the module for use by different users. In order to ensure that the feature data in the database view is up-to-date, a timer may be set to continuously perform the conversion from the original relational data to the feature data, and the update period may be determined by the user according to his own needs.

specifically, when generating the decision tree, starting from the root node, selecting a feature from the feature set of the sample data for testing, and distributing the sample data to its child nodes according to the test result, so as to recursively test and distribute the sample data until reaching a leaf node, and finally distributing the sample data to a certain leaf node.

A conventional decision tree distributes sample data to child nodes by hard allocation according to the test result at each node, that is, each sample can be distributed to only one child node. In the invention, depending on the nature of the selected feature, soft allocation may be used instead, that is, a sample may be distributed to several child nodes. Specifically:

The test uses a piecewise linear fuzzy function:

wherein x is a sample, and γ and δ may be taken as the mean and variance of the data values of the selected feature (the two parameters can also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; otherwise, the sample is distributed to both the left and right child nodes.

Given node N, its membership of child node NC can be recursively defined as:

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$

Finally, according to the decision tree obtained by training, when the test data x' of a certain public accumulation user is transmitted to the decision tree and reaches a plurality of leaf nodes, a refined classification result is finally given according to the estimated values of all the leaf nodes:

Step E: and displaying the refined classification result to the user in a chart mode on terminal equipment (a mobile phone end, a computer end and the like) through a data display module.

The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims

1. A decision tree-based public accumulation user data refinement analysis system, comprising:

the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal equipment, identifying the entity, the entity attribute and the relation among the entities, and eliminating the conflict existing in the multi-source public accumulation user data;

the data display module is used for displaying the result of the refined classification to the user in a chart mode on the terminal equipment;

when a decision tree is generated in the data analysis module, starting from a root node, selecting a feature in a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until a leaf node is reached, and finally distributing the sample data to the leaf node;

the process of distributing the sample data to the child nodes in the data analysis module specifically comprises the following steps:

1) If the feature value of the selected feature is discrete and limited, adopting a hard allocation mode, namely, allocating one piece of sample data to one of the child nodes according to the test result;

，

wherein x is sample data, and γ and δ are the mean and variance of the data values of the selected feature; when the value of the function t is 1, the sample is distributed to the left child node; when the value of the function t is 0, the sample is distributed to the right child node; in other cases, the sample is distributed to both the left child node and the right child node;

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta),$$

wherein x is sample data, $\mu_{NC}$ is the membership degree of the child node NC, $\mu_N$ is the membership degree of the node N, and $t_N$ is the test function of the node N;

the refined classification result of the data analysis module for the public accumulation user test data x', according to the decision tree obtained by training, is obtained by the following formula:

，

wherein x' is the test data; the two leaf-node terms in the formula are the probability estimate of the c-th class at the i-th leaf node and the membership degree of x' at the i-th leaf node; $y^c$ represents the probability output for category c; y represents the probability output of the most probable category; C represents the number of categories; and leave represents the set of leaf nodes.

2. An analysis method using the decision tree-based public accumulation user data refinement analysis system of claim 1, comprising the steps of:

step E: displaying the refined classification result to a user in a chart mode on terminal equipment through a data display module;

the data acquired in the step A comprises structured data, semi-structured data and unstructured data, and an entity link technology is adopted to identify the entity and the relation between the entities from the acquired data;

the conflicts eliminated in the step A comprise attribute conflicts, name conflicts and structure conflicts;

the transformation process of the characteristic data and the original relational data in the step C is to search, process and integrate the data connected by a single table or a plurality of tables in the relational database, and the transformed result is the characteristic data to be extracted;

the training generation process of the decision tree refinement analysis model in the step D specifically comprises the following steps:

when generating a decision tree, starting from a root node, selecting a feature from a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until reaching a leaf node, and finally distributing the sample data to the leaf node;

the process of distributing the sample data to the leaf nodes specifically comprises the following steps:

，

each sample data used in the decision tree is assigned a membership value in the corresponding node of the decision tree, which indicates the degree to which the sample data belongs to the sample data set under the node; the membership of all sample data under the root node is set to 1 by default, and for a given node N, the membership of its child node NC is recursively defined as:

$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta),$$

wherein x is sample data, $\mu_{NC}$ is the membership degree of the child node NC, $\mu_N$ is the membership degree of the node N, and $t_N$ is the test function of the node N;

in the step D, according to the decision tree obtained by training, the refined classification result of the public accumulation fund user test data x' is obtained by the following formula:

，