CN110737731B - Decision tree-based public accumulation user data refinement analysis system and method - Google Patents
- Publication number
- CN110737731B CN201911022440.6A
- Authority
- CN
- China
- Prior art keywords
- data
- node
- decision tree
- sample
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a decision tree-based system and method for the refinement analysis of public accumulation fund user data. The system comprises: a data acquisition module for acquiring multi-source public accumulation fund user data, identifying entities, entity attributes, and relationships among entities, and eliminating conflicts in the multi-source data; a data storage module for storing the converted relational data in a relational database; a data preprocessing module for converting the original relational data into the feature data used by the decision tree in user refinement analysis; and a data analysis module for performing refinement analysis on the user feature data with the decision tree and finally presenting the analysis result to the user as a chart. The invention preprocesses the original relational model data, extracts the feature data used for decision tree refinement analysis, and on that basis designs a new decision tree-based method for the refinement analysis of public accumulation fund user data, so that timely and accurate decision support can be provided to public accumulation fund management departments.
Description
Technical Field
The invention relates to a decision tree-based public accumulation user data refinement analysis system and method, and belongs to the technical field of public accumulation data analysis management.
Background
At present, informatization of the public accumulation fund business has become an inevitable trend, and reasonable, effective management of public accumulation fund user data is of great importance to public accumulation fund management departments. By performing refinement analysis on fund users, different measures can be applied to manage the business of different user segments, improving service capability and management level. How to use public accumulation fund user data to analyze users in detail has therefore become a problem of growing concern to local public accumulation fund management departments.
To give public accumulation fund management departments access to decision data, business data is usually analyzed with decision tree or clustering methods, but the results are still not fine-grained enough. The analysis also typically queries the relational database directly, which involves a large volume of data and many tables, so query, processing, and access are inefficient and business requirements take a long time to implement. At present, the business units of the public accumulation fund deal with data spanning long time ranges, complex processing logic, and strict time requirements, and traditional user analysis methods based on public accumulation fund data cannot meet the timeliness requirements of the fund's business departments.
Disclosure of Invention
Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a decision tree-based system and method for the refinement analysis of public accumulation fund user data. Feature data for decision tree refinement analysis is extracted from the original relational model data by preprocessing, and a new decision tree-based method for the refinement analysis of public accumulation fund users is designed on that basis, providing timely and accurate decision support to public accumulation fund management departments.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
a decision tree-based public accumulation user data refinement analysis system, comprising the following components:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal device (mobile terminal, computer terminal, video monitoring terminal and the like), identifying the entity, entity attribute and relationship among the entities, and eliminating conflicts existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database in the computer storage equipment to form original relation data;
the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data being held in a relational database view;
the data analysis module performs refinement analysis on the user feature data in the relational database view through a decision tree: it trains and generates a decision tree refinement analysis model, passes the test data of a public accumulation fund user down the decision tree, where the data may reach several leaf nodes, and finally gives a refined classification result according to the estimated values of all those leaf nodes;
and the data display module is used for displaying the refined classification result to the user in a form of a chart on the terminal equipment (a mobile phone end, a computer end and the like).
Further, when the decision tree is generated in the data analysis module, one feature is selected from the feature set of the sample data from the root node for testing, the sample data is distributed to the child nodes according to the test result, the sample data is tested and distributed recursively until the leaf node is reached, and finally the sample data is distributed to the leaf node.
Further, the process of distributing the sample data to the child nodes in the data analysis module specifically includes:
1) If the feature values of the selected features are discrete and finite, a hard allocation is used, i.e., one sample of data can be allocated to only one of the child nodes based on the results of the test.
2) If the values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function t(x, γ, δ), where x is a sample and γ and δ may be taken as the mean and variance of the data values of the selected feature (both parameters can also be learned by a suitable machine learning algorithm). When t evaluates to 1, the sample is allocated to the left child node; when t evaluates to 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is given a membership value at each node it reaches, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, and for a given node N the membership at its child node NC is defined recursively as:
$$\mu_{NC}(x) = \mu_{N}(x)\, t_{N}(x, \gamma, \delta)$$
where x is a sample, $\mu_{NC}$ is the membership degree at node NC, $\mu_{N}$ is the membership degree at node N, and $t_{N}$ is the test function of node N.
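The soft test in item 2) can be illustrated with a minimal Python sketch. The formula itself is not reproduced in this text, so the shape used here is an assumption: t is 1 below γ − δ/2, 0 above γ + δ/2, and falls linearly in between. The function names `fuzzy_test` and `child_weights`, and the use of 1 − t as the weight of the right child, are likewise hypothetical.

```python
def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Piecewise linear test t(x, gamma, delta) with values in [0, 1].

    Assumed shape (not taken from the patent text): the transition is
    centred on gamma and has total width delta.
    """
    if delta <= 0:
        # Degenerate width: fall back to a crisp (hard) threshold at gamma.
        return 1.0 if x <= gamma else 0.0
    lower = gamma - delta / 2.0
    upper = gamma + delta / 2.0
    if x <= lower:
        return 1.0  # sample goes only to the left child
    if x >= upper:
        return 0.0  # sample goes only to the right child
    return (upper - x) / delta  # sample is shared by both children


def child_weights(x: float, gamma: float, delta: float) -> tuple:
    """Return (left_weight, right_weight) for one sample at one node."""
    t = fuzzy_test(x, gamma, delta)
    return t, 1.0 - t  # assumption: the right child receives 1 - t


print(child_weights(11.0, gamma=10.0, delta=4.0))  # (0.25, 0.75)
```

With these assumed parameters, a value well below 10 goes entirely left, a value well above 10 goes entirely right, and values inside the band from 8 to 12 are shared between both children.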
Further, the data analysis module obtains the refined classification result for the public accumulation fund user test data from the trained decision tree by a formula over the following quantities:
x' is the test data; for the i-th leaf node there is a probability estimate of the c-th class and a membership degree of x' at that leaf; y_c is the probability output for class c; y is the largest of the class probability outputs; C is the number of classes; and leave is the set of leaf nodes.
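One plausible reading of these definitions, offered as an assumption because the formula is not reproduced in this text, is a membership-weighted sum over the leaf nodes followed by a maximum over the classes. The symbols $p_i^{c}$ (probability estimate of class c at leaf i) and $\mu_i(x')$ (membership degree of x' at leaf i) are labels chosen here for illustration:

```latex
y_c(x') = \sum_{i \in \mathrm{leave}} \mu_i(x')\, p_i^{c},
\qquad
y = \max_{c \in \{1, \dots, C\}} y_c(x')
```

Under this reading, a sample that reaches a single leaf with membership 1 simply receives that leaf's class probabilities, which matches the behaviour of a conventional hard decision tree.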
A decision tree-based public accumulation user data refinement analysis method comprises the following steps:
step A: the data information related to the public accumulation user is collected from each terminal device through the data collection module, the entity attribute and the relation among the entities are identified, and the conflict existing in the multi-source public accumulation user data is eliminated;
step B: the entities and inter-entity relationships obtained after conflict elimination are converted into relational model data by the data storage module, and the relational model data is stored in a relational database to form the original relational data;
step C: converting the original relational data into characteristic data used by a decision tree in a data analysis module through a data preprocessing module, and enabling the characteristic data to exist in a relational database view mode;
step D: carrying out refinement analysis on user characteristic data in a relational database view through a data analysis module, training to generate a decision tree refinement analysis model, transmitting test data of an accumulation fund user to a decision tree and reaching a plurality of leaf nodes, and finally giving a refinement classification result according to estimated values of all the leaf nodes;
step E: and displaying the result of the refined classification to the user in a form of a chart on the terminal equipment through the data display module.
Further, the data collected in step A includes structured, semi-structured, and unstructured data, and entities, entity attributes, and relationships among entities can be identified from the different data sources by using entity linking technology.
Further, conflicts among the entities, entity attributes, and inter-entity relationships from different data sources are eliminated by manual judgment and identification; the conflicts mainly eliminated are attribute conflicts, naming conflicts, and structural conflicts.
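Since the conflicts are resolved by manual judgment, software can at most surface candidates for review. The following Python sketch flags attribute conflicts only, meaning the same attribute of the same entity reported with different values by different sources. The record layout, the source names, and the function `find_attribute_conflicts` are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def find_attribute_conflicts(records):
    """Group values by (entity, attribute) and keep those that disagree.

    `records` is an iterable of (source, entity_id, attribute, value)
    tuples; this layout is an assumption made for the example.
    """
    seen = defaultdict(dict)  # (entity_id, attribute) -> {value: [sources]}
    for source, entity_id, attribute, value in records:
        seen[(entity_id, attribute)].setdefault(value, []).append(source)
    # More than one distinct value means the sources disagree.
    return {key: values for key, values in seen.items() if len(values) > 1}


records = [
    ("mobile_app", "U1", "employer", "Acme"),
    ("counter", "U1", "employer", "Acme Ltd"),
    ("mobile_app", "U1", "city", "Nanjing"),
    ("counter", "U1", "city", "Nanjing"),
]
conflicts = find_attribute_conflicts(records)
for (entity_id, attribute), values in conflicts.items():
    print(entity_id, attribute, values)
```

Each flagged entry lists the competing values and the sources that reported them, which is the information a reviewer would need to decide which value to keep.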
Further, step C specifically includes: determining, according to the application requirement (such as risk control or customer service), the feature attributes used by the decision tree in the refinement analysis of public accumulation fund users; establishing a transformation relationship between the feature data and the original relational data; and storing the extracted feature data in a relational database view. For example, if a feature attribute used in the decision tree is each customer's annual average income while the relational database records each customer's monthly income, the customer data in the database must be averaged by year, which establishes the transformation relationship between the annual average income feature and the customer income records in the relational database.
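The annual average income example can be sketched with Python's built-in `sqlite3` module. The table name `monthly_income`, the view name `annual_avg_income`, the column names, and the sample rows are all hypothetical; the patent does not specify a schema or a database product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw table: one row per customer per month.
    CREATE TABLE monthly_income (
        customer_id INTEGER NOT NULL,
        year        INTEGER NOT NULL,
        month       INTEGER NOT NULL,
        income      REAL    NOT NULL
    );
    INSERT INTO monthly_income VALUES
        (1, 2018, 1, 5000.0), (1, 2018, 2, 7000.0),
        (1, 2019, 1, 8000.0),
        (2, 2018, 1, 3000.0), (2, 2018, 2, 3000.0);

    -- Feature view: average income per customer per year.
    CREATE VIEW annual_avg_income AS
    SELECT customer_id, year, AVG(income) AS avg_income
    FROM monthly_income
    GROUP BY customer_id, year;
    """
)

rows = conn.execute(
    "SELECT customer_id, year, avg_income FROM annual_avg_income "
    "ORDER BY customer_id, year"
).fetchall()
print(rows)  # [(1, 2018, 6000.0), (1, 2019, 8000.0), (2, 2018, 3000.0)]
```

The decision tree code would then read its features from the view rather than from the raw monthly records.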
The transformation from the original relational data to the feature data generally queries, processes, and integrates data from a single table or from several joined tables in the relational database, and the result of the transformation is the feature data to be extracted. The transformation can be written as a separate program module, with the transformation rules passed to the module as parameters so that different users can use it. To keep the feature data in the database view up to date, a timer may be set to repeat the conversion from the original relational data to the feature data, with the update period chosen by the user as needed.
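A separate module that takes its transformation rule as a parameter might look like the Python sketch below. It departs from the text in one respect, which is an assumption made for the example: the features are materialized into a table that is rebuilt on each refresh, instead of being exposed as a view. The `FeatureRule` class, the `refresh_features` function, and all table and column names are hypothetical.

```python
import sqlite3
from dataclasses import dataclass

ALLOWED_AGGREGATES = {"AVG", "SUM", "MIN", "MAX", "COUNT"}


@dataclass(frozen=True)
class FeatureRule:
    """Hypothetical transformation rule passed to the module as a parameter."""
    feature_table: str   # where the extracted feature data is stored
    source_table: str    # raw relational table to read from
    key_columns: tuple   # grouping columns, e.g. ("customer_id", "year")
    value_column: str    # column to aggregate
    aggregate: str       # one of ALLOWED_AGGREGATES


def refresh_features(conn: sqlite3.Connection, rule: FeatureRule) -> None:
    """Rebuild the feature table from the raw data according to the rule.

    Identifiers are interpolated into the SQL text, so rules must come
    from trusted configuration, never from end-user input.
    """
    if rule.aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {rule.aggregate}")
    keys = ", ".join(rule.key_columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f"DROP TABLE IF EXISTS {rule.feature_table}")
        conn.execute(
            f"CREATE TABLE {rule.feature_table} AS "
            f"SELECT {keys}, {rule.aggregate}({rule.value_column}) AS value "
            f"FROM {rule.source_table} GROUP BY {keys}"
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_income (customer_id INTEGER, year INTEGER, income REAL)"
)
conn.executemany(
    "INSERT INTO monthly_income VALUES (?, ?, ?)",
    [(1, 2019, 5000.0), (1, 2019, 7000.0)],
)
rule = FeatureRule(
    "annual_avg_income", "monthly_income", ("customer_id", "year"), "income", "AVG"
)

refresh_features(conn, rule)
first = conn.execute("SELECT value FROM annual_avg_income").fetchall()

conn.execute("INSERT INTO monthly_income VALUES (1, 2019, 9000.0)")
refresh_features(conn, rule)  # a timer would trigger this call periodically
second = conn.execute("SELECT value FROM annual_avg_income").fetchall()
print(first, second)  # [(6000.0,)] [(7000.0,)]
```

A periodic job, for example one scheduled with `threading.Timer`, could call `refresh_features` at whatever interval the user chooses.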
Further, the training generation process of the decision tree refinement analysis model in the step D specifically includes:
when generating the decision tree, starting from the root node, selecting a feature in the feature set of the sample data for testing, distributing the sample data to its child nodes according to the test result, recursively testing and distributing the sample data until reaching the leaf node, and finally distributing the sample data to the leaf node.
To select an appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio, or the Gini index.
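These criteria have standard textbook definitions, which the Python sketch below implements for a discrete feature. The function names and the toy data are hypothetical, and the patent does not state which criterion it prefers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # the feature takes a single value, so it cannot split
    return information_gain(values, labels) / split_info


feature = ["a", "a", "b", "b"]
target = ["x", "x", "y", "y"]
print(information_gain(feature, target), gain_ratio(feature, target), gini(target))
```

In this toy data the feature separates the two classes perfectly, so the information gain equals the full entropy of the labels.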
A conventional decision tree allocates sample data to child nodes by hard allocation according to the test result at each node; that is, a sample can be allocated to only one child node. In the invention, depending on the nature of the selected feature, soft allocation may be used instead, so that a sample may be allocated to several child nodes. Because data collected from the real world carries a degree of ambiguity and uncertainty, combining soft and hard allocation lets the decision tree reflect that ambiguity better and thereby improves the refined classification. Specifically:
1) If the values of the selected feature are discrete and finite, hard allocation is used, i.e., a sample can be allocated to only one of the child nodes according to the test result.
2) If the values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample may be allocated to several child nodes according to the test result.
The test uses a piecewise linear fuzzy function t(x, γ, δ), where x is a sample and γ and δ may be taken as the mean and variance of the data values of the selected feature (both parameters can also be learned by a suitable machine learning algorithm). When t evaluates to 1, the sample is allocated to the left child node; when t evaluates to 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is given a membership value at each node it reaches, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, reflecting the fact that all samples belong to it.
Given a node N, the membership at its child node NC can be defined recursively as:
$$\mu_{NC}(x) = \mu_{N}(x)\, t_{N}(x, \gamma, \delta)$$
where x is a sample, $\mu_{NC}$ is the membership degree at node NC, $\mu_{N}$ is the membership degree at node N, and $t_{N}$ is the test function of node N.
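The recursion can be followed down a small tree to obtain the membership of a sample at every leaf it reaches. The Python sketch below assumes that the left child receives μ·t and the right child receives μ·(1 − t); the text gives only the general child formula, so the right-child weight is an assumption. The `Node` class, the feature names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Node:
    name: str
    # t_N(x) with values in [0, 1]; None marks a leaf node.
    test: Optional[Callable[[dict], float]] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_memberships(node: Node, x: dict, mu: float = 1.0,
                     out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Propagate the membership mu of sample x from `node` down to the leaves."""
    if out is None:
        out = {}
    if mu == 0.0:
        return out  # the sample does not reach this subtree
    if node.test is None:
        out[node.name] = out.get(node.name, 0.0) + mu
        return out
    t = node.test(x)
    leaf_memberships(node.left, x, mu * t, out)           # mu_NC = mu_N * t_N
    leaf_memberships(node.right, x, mu * (1.0 - t), out)  # assumed complement
    return out


tree = Node(
    "root",
    # Soft test on a continuous feature: a linear ramp from 4000 to 8000.
    test=lambda x: min(1.0, max(0.0, (8000.0 - x["income"]) / 4000.0)),
    left=Node(
        "employer",
        # Hard test on a discrete feature: the result is exactly 0 or 1.
        test=lambda x: 1.0 if x["employer"] == "public" else 0.0,
        left=Node("leaf_A"),
        right=Node("leaf_B"),
    ),
    right=Node("leaf_C"),
)

print(leaf_memberships(tree, {"income": 7000.0, "employer": "public"}))
# {'leaf_A': 0.25, 'leaf_C': 0.75}
```

The root membership defaults to 1, the hard test sends the whole incoming membership to one side, and the soft test splits it, so the sample ends up in two leaves with memberships that sum to 1.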
Further, in step D, when the test data x' of a public accumulation fund user is passed down the trained decision tree and reaches several leaf nodes, the refined classification result is finally given according to the estimated values of all those leaf nodes, by a formula over the following quantities:
x' is the test data; for the i-th leaf node there is a probability estimate of the c-th class and a membership degree of x' at that leaf; y_c is the probability output for class c; y is the largest of the class probability outputs; C is the number of classes; and leave is the set of leaf nodes.
The beneficial effects are that: compared with the prior art, the system and the method for refining and analyzing the public accumulation fund user data based on the decision tree have the following advantages: 1. the feature data for the detailed analysis of the decision tree is extracted from the original relation model data by preprocessing, and a new method for the detailed analysis of the public accumulation user based on the decision tree is designed on the basis of the feature data, so that the data analysis and management efficiency of the public accumulation user is greatly improved;
2. Unlike the conventional way of generating a decision tree, the invention allocates data to child nodes by combining soft and hard allocation, which can give public accumulation fund management departments more useful data resources and stronger refinement analysis capability, thereby providing a scientific basis for the effective management of public accumulation fund users.
Drawings
FIG. 1 is a flow chart diagram of a decision tree-based method for refining and analyzing public accumulation fund user data;
FIG. 2 is a flow chart of the transformation relationship of the public accumulation user data in the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
A decision tree-based public accumulation user data refinement analysis system, comprising the following components:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal device (mobile terminal, computer terminal, video monitoring terminal and the like), identifying the entity, entity attribute and relationship among the entities, and eliminating conflicts existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;
the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data being held in a relational database view;
the data analysis module performs refinement analysis on the user feature data in the relational database view through a decision tree: it trains and generates a decision tree refinement analysis model, passes the test data of a public accumulation fund user down the decision tree, where the data may reach several leaf nodes, and finally gives a refined classification result according to the estimated values of all those leaf nodes;
and the data display module is used for displaying the refined classification result to the user in a form of a chart on the terminal equipment (a mobile phone end, a computer end and the like).
When generating the decision tree, starting from the root node, selecting a feature in the feature set of the sample data for testing, distributing the sample data to its child nodes according to the test result, recursively testing and distributing the sample data until reaching the leaf node, and finally distributing the sample data to the leaf node.
The sample data distribution process specifically comprises the following steps:
1) If the feature values of the selected features are discrete and finite, a hard allocation is used, i.e., one sample of data can be allocated to only one of the child nodes based on the results of the test.
2) If the values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function t(x, γ, δ), where x is a sample and γ and δ are the mean and variance of the data values of the selected feature. When t evaluates to 1, the sample is allocated to the left child node; when t evaluates to 0, the sample is allocated to the right child node; in all other cases, the sample is allocated to both the left and right child nodes;
3) Each sample used in the decision tree is given a membership value at each node it reaches, indicating the degree to which the sample belongs to the sample set under that node, and the membership of every sample at the root node is 1 by default. Given a node N, the membership at its child node NC is defined recursively as:
$$\mu_{NC}(x) = \mu_{N}(x)\, t_{N}(x, \gamma, \delta)$$
where x is a sample, $\mu_{NC}$ is the membership degree at node NC, $\mu_{N}$ is the membership degree at node N, and $t_{N}$ is the test function of node N.
Finally, the refined classification result for the public accumulation fund user test data is obtained from the trained decision tree by a formula over the following quantities:
x' is the test data; for the i-th leaf node there is a probability estimate of the c-th class and a membership degree of x' at that leaf; y_c is the probability output for class c; y is the largest of the class probability outputs; C is the number of classes; and leave is the set of leaf nodes.
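A minimal Python sketch of this final step is given here. The formula is not reproduced in this text, so the sketch assumes that each class output is the membership-weighted sum of the leaf probability estimates and that the result is the class with the largest output. The function `classify`, the leaf names, and the class labels `low_risk` and `high_risk` are hypothetical.

```python
def classify(memberships, leaf_probs):
    """Combine leaf estimates into class outputs and pick the largest.

    memberships: {leaf_name: membership degree of the test sample}
    leaf_probs:  {leaf_name: {class_label: probability estimate}}
    """
    classes = sorted({c for probs in leaf_probs.values() for c in probs})
    scores = {c: 0.0 for c in classes}
    for leaf, mu in memberships.items():
        for c, p in leaf_probs[leaf].items():
            scores[c] += mu * p  # assumed: membership-weighted sum
    best = max(scores, key=scores.get)
    return best, scores


# The test sample reaches two leaves with memberships 0.25 and 0.75.
memberships = {"leaf_A": 0.25, "leaf_C": 0.75}
leaf_probs = {
    "leaf_A": {"low_risk": 0.8, "high_risk": 0.2},
    "leaf_C": {"low_risk": 0.2, "high_risk": 0.8},
}
best, scores = classify(memberships, leaf_probs)
print(best, scores)
```

Here the leaf with the larger membership dominates, so the sample is labelled `high_risk` even though one of the leaves it reaches favours `low_risk`.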
FIG. 1 shows a decision tree-based detailed analysis method for public accumulation fund user data, which comprises the following steps:
step A: data is collected through the data acquisition module from multiple different data sources related to public accumulation fund users (the collected data includes structured, semi-structured, and unstructured data); entities, entity attributes, and relationships among entities are identified from each data source using intelligent technologies such as entity linking and deep learning; and conflicts among the entities, entity attributes, and inter-entity relationships from the different data sources are eliminated, chiefly attribute conflicts, naming conflicts, and structural conflicts;
step B: the entities and inter-entity relationships obtained after conflict elimination are converted into relational model data by the data storage module, and the relational model data is stored in a relational database to form the original relational data;
step C: according to different application requirements (such as risk control, customer service and the like), determining characteristic attributes used by a decision tree in the detail analysis of the user of the accumulation fund through a data preprocessing module, establishing a transformation relationship between characteristic data and original relationship data, and storing the extracted characteristic data into a relational database view;
the transformation from the original relational data to the feature data generally queries, processes, and integrates data from a single table or from several joined tables in the relational database, and the result of the transformation is the feature data to be extracted. The transformation can be written as a separate program module, with the transformation rules passed to the module as parameters so that different users can use it. To keep the feature data in the database view up to date, a timer may be set to repeat the conversion from the original relational data to the feature data, with the update period chosen by the user as needed.
Step D: carrying out refinement analysis on user characteristic data in a relational database view through a data analysis module, training to generate a decision tree refinement analysis model, transmitting test data of an accumulation fund user to a decision tree and reaching a plurality of leaf nodes, and finally giving a refinement classification result according to estimated values of all the leaf nodes;
specifically, when generating the decision tree, starting from the root node, selecting a feature from the feature set of the sample data for testing, and distributing the sample data to its child nodes according to the test result, so as to recursively test and distribute the sample data until reaching a leaf node, and finally distributing the sample data to a certain leaf node.
To select an appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio, or the Gini index.
A conventional decision tree allocates sample data to child nodes by hard allocation according to the test result at each node; that is, a sample can be allocated to only one child node. In the invention, depending on the nature of the selected feature, soft allocation may be used instead, so that a sample may be allocated to several child nodes. Specifically:
1) If the values of the selected feature are discrete and finite, hard allocation is used, i.e., a sample can be allocated to only one of the child nodes according to the test result.
2) If the values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample may be allocated to several child nodes according to the test result.
The test uses a piecewise linear fuzzy function t(x, γ, δ), where x is a sample and γ and δ may be taken as the mean and variance of the data values of the selected feature (both parameters can also be learned by a suitable machine learning algorithm). When t evaluates to 1, the sample is allocated to the left child node; when t evaluates to 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is given a membership value at each node it reaches, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, reflecting the fact that all samples belong to it.
Given a node N, the membership at its child node NC can be defined recursively as:
$$\mu_{NC}(x) = \mu_{N}(x)\, t_{N}(x, \gamma, \delta)$$
where x is a sample, $\mu_{NC}$ is the membership degree at node NC, $\mu_{N}$ is the membership degree at node N, and $t_{N}$ is the test function of node N.
Finally, when the test data x' of a public accumulation fund user is passed down the trained decision tree and reaches several leaf nodes, the refined classification result is given according to the estimated values of all those leaf nodes, by a formula over the following quantities:
x' is the test data; for the i-th leaf node there is a probability estimate of the c-th class and a membership degree of x' at that leaf; y_c is the probability output for class c; y is the largest of the class probability outputs; C is the number of classes; and leave is the set of leaf nodes.
Step E: and displaying the refined classification result to the user in a chart mode on terminal equipment (a mobile phone end, a computer end and the like) through a data display module.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.
Claims (2)
1. A decision tree-based public accumulation user data refinement analysis system, comprising:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal equipment, identifying the entity, the entity attribute and the relation among the entities, and eliminating the conflict existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;
the data preprocessing module is used for converting the original relational data into the characteristic data used by the decision tree in the data analysis module, the characteristic data existing in the form of a relational database view;
the data analysis module performs refinement analysis on the user characteristic data in the relational database view through a decision tree, trains and generates a decision tree refinement analysis model, transmits test data of the public accumulation user to the decision tree and reaches a plurality of leaf nodes, and finally gives a refinement classification result according to estimated values of all the leaf nodes;
the data display module is used for displaying the result of the refined classification to the user in a chart mode on the terminal equipment;
when a decision tree is generated in the data analysis module, starting from a root node, selecting a feature in a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until a leaf node is reached, and finally distributing the sample data to the leaf node;
the process of distributing the sample data to the child nodes in the data analysis module specifically comprises the following steps:
1) If the feature values of the selected feature are discrete and finite, a hard allocation is used, i.e., one piece of sample data is allocated to exactly one of the child nodes according to the test result;
2) If the feature values of the selected feature are ordered and continuous, a soft allocation is used, i.e., one piece of sample data is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
,
wherein x is sample data, and γ and δ are the mean value and variance of the data values corresponding to the selected feature; when the value of the function t is 1, the sample data is allocated to the left child node; when the value of the function t is 0, the sample data is allocated to the right child node; in all other cases, the sample data is allocated to both the left child node and the right child node;
3) Each sample data used in the decision tree is endowed with a membership value in a corresponding node of the decision tree, which indicates the degree to which the sample data belongs to a sample data set under the node; the membership of all sample data under the root node is given by default 1, while for a given node N, the membership of its child node NC is recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$,
wherein x is sample data, $\mu_{NC}$ is the membership degree of the node NC, $\mu_N$ is the membership degree of the node N, and $t_N$ is the test function corresponding to the node N;
the data analysis module obtains the refined classification result for the public accumulation user test data x' from the trained decision tree by the following formula:
,
wherein x' is the test data; the two per-leaf terms of the formula denote, respectively, the probability estimate of the c-th class at the i-th leaf node and the membership degree of x' at the i-th leaf node; $y_c$ represents the probability output for category c; y represents the probability output of the category with the largest probability; C represents the number of categories; and leave represents the set of leaf nodes.
2. An analysis method using the decision tree-based public accumulation user data refinement analysis system of claim 1, comprising the steps of:
step A: the data information related to the public accumulation user is collected from each terminal device through the data collection module, the entity attribute and the relation among the entities are identified, and the conflict existing in the multi-source public accumulation user data is eliminated;
step B: the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;
step C: converting the original relational data, through the data preprocessing module, into the characteristic data used by the decision tree in the data analysis module, the characteristic data existing in the form of a relational database view;
step D: carrying out refinement analysis on the user characteristic data in the relational database view through the data analysis module, training to generate a decision tree refinement analysis model, transmitting test data of the public accumulation user to the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refinement classification result according to the estimated values of all the leaf nodes;
step E: displaying the refined classification result to a user in a chart mode on terminal equipment through a data display module;
the data acquired in the step A comprises structured data, semi-structured data and unstructured data, and an entity link technology is adopted to identify the entity and the relation between the entities from the acquired data;
the conflicts eliminated in the step A comprise attribute conflicts, name conflicts and structure conflicts;
the transformation process of the characteristic data and the original relational data in the step C is to search, process and integrate the data connected by a single table or a plurality of tables in the relational database, and the transformed result is the characteristic data to be extracted;
the training generation process of the decision tree refinement analysis model in the step D specifically comprises the following steps:
when generating a decision tree, starting from a root node, selecting a feature from a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until reaching a leaf node, and finally distributing the sample data to the leaf node;
the process of distributing the sample data to the leaf nodes specifically comprises the following steps:
1) If the feature values of the selected feature are discrete and finite, a hard allocation is used, i.e., one piece of sample data is allocated to exactly one of the child nodes according to the test result;
2) If the feature values of the selected feature are ordered and continuous, a soft allocation is used, i.e., one piece of sample data is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
,
wherein x is sample data, and γ and δ are the mean value and variance of the data values corresponding to the selected feature; when the value of the function t is 1, the sample data is allocated to the left child node; when the value of the function t is 0, the sample data is allocated to the right child node; in all other cases, the sample data is allocated to both the left child node and the right child node;
3) Each sample data used in the decision tree is endowed with a membership value in a corresponding node of the decision tree, which indicates the degree to which the sample data belongs to the sample data set under the node; the membership of all sample data under the root node is given by default 1, while for a given node N, the membership of its child node NC is recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$,
wherein x is sample data, $\mu_{NC}$ is the membership degree of the node NC, $\mu_N$ is the membership degree of the node N, and $t_N$ is the test function corresponding to the node N;
in the step D, according to the decision tree obtained by training, the refined classification result of the public accumulation fund user test data x' is obtained by the following formula:
,
wherein x' is the test data; the two per-leaf terms of the formula denote, respectively, the probability estimate of the c-th class at the i-th leaf node and the membership degree of x' at the i-th leaf node; $y_c$ represents the probability output for category c; y represents the probability output of the category with the largest probability; C represents the number of categories; and leave represents the set of leaf nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911022440.6A CN110737731B (en) | 2019-10-25 | 2019-10-25 | Decision tree-based public accumulation user data refinement analysis system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110737731A (en) | 2020-01-31 |
CN110737731B (en) | 2023-12-29 |
Family
ID=69271378
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911022440.6A Active CN110737731B (en) | 2019-10-25 | 2019-10-25 | Decision tree-based public accumulation user data refinement analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110737731B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102811162A (en) * | 2011-06-03 | 2012-12-05 | 弗卢克公司 | Method and apparatus for detecting network attacks using a flow based technique |
CN107808245A (en) * | 2017-10-25 | 2018-03-16 | 冶金自动化研究设计院 | Based on the network scheduler system for improving traditional decision-tree |
CN109325844A (en) * | 2018-06-25 | 2019-02-12 | 南京工业大学 | Network loan borrower credit evaluation method under multidimensional data |
CN109522957A (en) * | 2018-11-16 | 2019-03-26 | 上海海事大学 | The method of harbour gantry crane machine work status fault classification based on decision Tree algorithms |
CN109830303A (en) * | 2019-02-01 | 2019-05-31 | 上海众恒信息产业股份有限公司 | Clinical data mining analysis and aid decision-making method based on internet integration medical platform |
Non-Patent Citations (3)
Title |
---|
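Analysis of housing loan repayment behavior based on decision trees; Wu Jianxin (吴剑新); Huang Zhangshu (黄章树); Journal of Chongqing Technology and Business University (Natural Science Edition), No. 03; pp. 265-269 * |
Decision forest data mining model based on generalized information theory; Wang Limin (王利民); Zang Xuebai (臧雪柏); Cao Chunhong (曹春红); Journal of Jilin University (Engineering and Technology Edition), No. 01; pp. 155-158 * |
Super-resolution random forest learning algorithm optimized by Gaussian membership; Zhou Wenyi (周文谊) et al.; Computer Engineering and Applications; Vol. 52, No. 23; pp. 208-212 * |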
Also Published As
Publication number | Publication date |
---|---|
CN110737731A (en) | 2020-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||