CN110737731A - accumulation fund user data refinement analysis system and method based on decision tree - Google Patents


Info

Publication number
CN110737731A
Authority
CN
China
Prior art keywords
data
decision tree
node
sample data
accumulation fund
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911022440.6A
Other languages
Chinese (zh)
Other versions
CN110737731B (en)
Inventor
李子龙
鲍蓉
潘晓博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuzhou University of Technology
Original Assignee
Xuzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou University of Technology
Priority to CN201911022440.6A
Publication of CN110737731A
Application granted
Publication of CN110737731B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and method for refined analysis of accumulation fund user data based on a decision tree. The system comprises a data acquisition module, a data storage module, a data preprocessing module and a data analysis module. The data acquisition module acquires multi-source accumulation fund user data, identifies entities, entity attributes and the relationships among entities, and eliminates conflicts in the multi-source data. The data storage module stores the converted relational data in a relational database. The data preprocessing module converts the original relational data into the characteristic data used by the decision tree in the refined user analysis. The data analysis module performs refined analysis of the user characteristic data using the decision tree, and the analysis result is finally displayed to the user in chart form.

Description

accumulation fund user data refinement analysis system and method based on decision tree
Technical Field
The invention relates to a decision-tree-based system and method for refined analysis of public accumulation fund user data, and belongs to the technical field of public accumulation fund data analysis and management.
Background
At present, the informatization of public accumulation fund services has become an inevitable trend, and reasonable and effective management of public accumulation fund user data is very important for the public accumulation fund management department. By analyzing public accumulation fund users in detail, the department can apply different management strategies to different user segments, thereby enhancing its service function and management level.
To give the public accumulation fund management department the data it needs for decisions, business data is often analyzed with decision tree or clustering methods. However, the analysis results are still not fine-grained enough, and the analysis searches the relational database directly, which involves a large volume of data and many tables. As a result, query, processing and access efficiency is low, and business requirements take a long time to implement.
Disclosure of Invention
To overcome the defects in the prior art, the invention provides a system and method for refined analysis of accumulation fund user data based on a decision tree. The original relational model data is preprocessed to extract the characteristic data used for decision tree refinement analysis, and on the basis of this characteristic data a new decision-tree-based method for refined analysis of accumulation fund users is designed, so as to provide the accumulation fund management department with timely and accurate decision support.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
A system for refining and analyzing accumulation fund user data based on decision trees comprises the following components:
the data acquisition module is used for acquiring data information related to the accumulation fund user from each terminal device (a mobile terminal, a computer terminal, a video monitoring terminal and the like), identifying entities, entity attributes and relations among the entities and eliminating conflicts existing in the multi-source accumulation fund user data;
the data storage module is used for converting the entity obtained after the conflict is eliminated and the relation between the entities into relational model data, and storing the relational model data into a relational database in computer storage equipment to form original relational data;
the data preprocessing module is used for converting the original relational data into the characteristic data used by the decision tree in the data analysis module, the characteristic data existing in the form of a relational database view;
the data analysis module is used for performing refinement analysis on the user characteristic data in the relational database view through the decision tree, training to generate a decision tree refinement analysis model, then passing the test data of an accumulation fund user into the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimates of all these leaf nodes;
and the data display module is used for displaying the refined classification result to the user on terminal equipment (a mobile phone, a computer and the like) in chart form.
Further, when generating the decision tree, the data analysis module starts from the root node, selects a feature from the feature set of the sample data for testing, and assigns the sample data to the child nodes according to the test result; the sample data is tested and assigned recursively in this way until it reaches the leaf nodes.
Further, the process by which the data analysis module assigns the sample data to the child nodes includes:
1) If the feature values of the selected feature are discrete and finite, hard assignment is used, i.e. the sample data is assigned to only one of the child nodes according to the test result.
2) If the feature values of the selected feature are ordered and continuous, soft assignment is used, i.e. the sample data is assigned to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
[Equation image BDA0002247659550000021: definition of the piecewise linear fuzzy function $t(x, \gamma, \delta)$; not reproduced in this text.]
In the formula, $x$ is a sample, and $\gamma$ and $\delta$ can be the mean and variance of the data values of the selected feature (the two parameters can also be learned by a suitable machine learning algorithm). When the value of the function $t$ is 1, the sample is assigned to the left child node; when the value is 0, the sample is assigned to the right child node; otherwise, the sample is assigned to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, and for a given node N, the membership of its child node NC is defined recursively as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where $x$ is the sample data, $\mu_{NC}$ is the degree of membership of node NC, $\mu_N$ is the degree of membership of node N, and $t_N$ is the test function of node N.
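The soft assignment in items 2) and 3) can be illustrated with a short Python sketch. The equation image for t is not reproduced in this text, so the code assumes one common piecewise linear shape: t is 1 below γ − δ, 0 above γ + δ, and falls linearly in between. That shape, the use of δ as a half-width, and the 1 − t membership for the right child are assumptions made for illustration, not the patent's exact definition; the function names are hypothetical.

```python
def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Assumed piecewise linear test: 1 -> left child only, 0 -> right child only."""
    if delta <= 0:
        # Degenerate width: fall back to a hard threshold at gamma.
        return 1.0 if x <= gamma else 0.0
    if x <= gamma - delta:
        return 1.0
    if x >= gamma + delta:
        return 0.0
    # Linear ramp from 1 at (gamma - delta) down to 0 at (gamma + delta).
    return (gamma + delta - x) / (2.0 * delta)


def child_memberships(mu_parent: float, x: float, gamma: float, delta: float):
    """Return (left, right) memberships using mu_NC(x) = mu_N(x) * t_N(x, gamma, delta)."""
    t = fuzzy_test(x, gamma, delta)
    # The (1 - t) rule for the right child is an assumption of this sketch.
    return mu_parent * t, mu_parent * (1.0 - t)


# The root membership is 1 by default, as stated in item 3).
left, right = child_memberships(1.0, x=5.0, gamma=5.0, delta=2.0)
print(left, right)
```

A sample exactly at γ is shared equally between both children under this assumed shape, while samples further than δ from γ behave as in a conventional hard split.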
Further, the data analysis module obtains the refined classification result of the accumulation fund user's test data from the trained decision tree by the following formulas:
[Equation images BDA0002247659550000031 and BDA0002247659550000032 are not reproduced in this text.]
where $x'$ is the test data; the symbol in image BDA0002247659550000033 represents the probability estimate of the $c$-th class at the $i$-th leaf node; the symbol in image BDA0002247659550000034 represents the degree of membership of the $i$-th leaf node; $y_c$ indicates the probability output for category $c$; $y$ indicates the probability output for the largest category; $C$ indicates the number of categories; and leaves indicates the set of leaf nodes.
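How leaf estimates might be combined can be sketched in Python. Because the formula images are not reproduced here, the sketch assumes that $y_c$ is the membership-weighted average of the leaf class estimates and that the predicted category is the one with the largest $y_c$; the normalization by total membership and the function name `classify` are assumptions.

```python
from typing import List, Sequence, Tuple


def classify(leaves: Sequence[Tuple[float, Sequence[float]]]) -> Tuple[int, List[float]]:
    """Each entry is (membership of x' at leaf i, class probability estimates at leaf i)."""
    num_classes = len(leaves[0][1])
    total = sum(mu for mu, _ in leaves)
    if total == 0:
        raise ValueError("test sample reached no leaf with positive membership")
    # y_c: membership-weighted average of the leaf estimates (assumed form).
    y = [sum(mu * probs[c] for mu, probs in leaves) / total for c in range(num_classes)]
    # Predicted category: index of the largest probability output.
    best = max(range(num_classes), key=lambda c: y[c])
    return best, y


# Hypothetical test sample that reaches two leaves with memberships 0.75 and 0.25.
label, outputs = classify([(0.75, [1.0, 0.0]), (0.25, [0.0, 1.0])])
print(label, outputs)
```

With hard assignment only one leaf has non-zero membership, so the result reduces to that leaf's own class estimates.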
A method for refining and analyzing accumulation fund user data based on decision tree, comprising the following steps:
Step A: collecting data information related to accumulation fund users from each terminal device through a data collection module, identifying entities, entity attributes and the relationships among entities, and eliminating conflicts in the multi-source accumulation fund user data;
Step B: converting the entities and the relationships between entities obtained after conflict elimination into relational model data through a data storage module, and storing the relational model data in a relational database to form original relational data;
Step C: converting the original relational data into the characteristic data used by the decision tree in the data analysis module through a data preprocessing module, and storing the characteristic data in the form of a relational database view;
Step D: refining and analyzing the user characteristic data in the relational database view through a data analysis module, training to generate a decision tree refinement analysis model, passing the test data of an accumulation fund user into the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimates of all these leaf nodes;
Step E: displaying the refined classification result to the user on the terminal equipment in chart form through a data display module.
Further, the data collected in step A includes structured data, semi-structured data and unstructured data, and entities and the relationships between entities can be identified from the different data sources using entity linking technology.
Further, conflicts among entities, entity attributes and relationships between entities from different data sources are eliminated by manual judgment and identification, mainly attribute conflicts, naming conflicts and structural conflicts.
Further, step C includes: determining, according to different application requirements (such as risk control and customer service), the characteristic attributes used by the decision tree in the refinement analysis of accumulation fund users; establishing a transformation relationship between the characteristic data and the original relational data; and storing the extracted characteristic data in a relational database view. For example, if the characteristic attribute used by the decision tree is each customer's average annual income while the relational database records each customer's monthly income, the customer data in the database must be aggregated by year to compute each customer's average annual income, which establishes the transformation relationship between the average annual income (characteristic data) and the customer income stored in the relational database.
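The income example above can be sketched with Python's built-in sqlite3 module. The table name `monthly_income`, the view name `customer_features` and the sample rows are hypothetical, and averaging the per-year income totals is only one plausible reading of "average annual income".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_income (      -- hypothetical original relational data
        customer_id INTEGER,
        year        INTEGER,
        month       INTEGER,
        income      REAL
    );
    INSERT INTO monthly_income VALUES
        (1, 2018, 1, 4000.0), (1, 2018, 2, 6000.0),
        (1, 2019, 1, 8000.0), (1, 2019, 2, 2000.0),
        (2, 2019, 1, 3000.0);

    -- Feature view: total income per customer and year, then averaged over years.
    CREATE VIEW customer_features AS
    SELECT customer_id, AVG(annual_income) AS avg_annual_income
    FROM (
        SELECT customer_id, year, SUM(income) AS annual_income
        FROM monthly_income
        GROUP BY customer_id, year
    )
    GROUP BY customer_id;
""")

# The decision tree would read its features from the view, not from the raw table.
features = dict(conn.execute(
    "SELECT customer_id, avg_annual_income FROM customer_features ORDER BY customer_id"
))
print(features)
```

Note that a year with missing months yields a lower annual total under this rule; a production transformation would need an explicit policy for incomplete years.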
Further, the transformation between the characteristic data and the original relational data generally consists of searching, processing and integrating data from a single table or from several joined tables in the relational database; the result of the transformation is the characteristic data to be extracted. The transformation can be implemented as an independent program module, and the transformation rule can be passed to the module as a parameter so that different users can use it. To keep the characteristic data in the database view up to date, a timer can be set so that the transformation from original relational data to characteristic data runs continuously; the update period can be chosen by users according to their needs.
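A module that takes the transformation rule as a parameter and re-runs it on a timer might look like the following Python sketch. An ordinary database view is evaluated at query time, so the sketch assumes the features are materialized into a table that has to be rebuilt periodically. The names `refresh_features`, `schedule_refresh`, `deposits` and `features` are hypothetical.

```python
import sqlite3
import threading


def refresh_features(conn: sqlite3.Connection, rule_sql: str, target: str = "features") -> None:
    """Rebuild the feature table from a transformation rule passed as a parameter.

    `target` and `rule_sql` are interpolated into SQL, so they must come from
    trusted configuration, never from end-user input.
    """
    conn.execute(f"DROP TABLE IF EXISTS {target}")
    conn.execute(f"CREATE TABLE {target} AS {rule_sql}")
    conn.commit()


def schedule_refresh(db_path: str, rule_sql: str, period_seconds: float,
                     stop: threading.Event) -> None:
    """Re-run the transformation every period_seconds until `stop` is set."""
    def run() -> None:
        if stop.is_set():
            return
        conn = sqlite3.connect(db_path)   # each run opens its own connection
        try:
            refresh_features(conn, rule_sql)
        finally:
            conn.close()
        timer = threading.Timer(period_seconds, run)
        timer.daemon = True               # do not keep the process alive
        timer.start()
    run()


# One-off refresh on hypothetical data; schedule_refresh would wrap this for a database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposits (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO deposits VALUES (?, ?)", [(1, 100.0), (1, 300.0), (2, 50.0)])
rule = "SELECT customer_id, AVG(amount) AS avg_deposit FROM deposits GROUP BY customer_id"
refresh_features(conn, rule)
rows = conn.execute(
    "SELECT customer_id, avg_deposit FROM features ORDER BY customer_id"
).fetchall()
print(rows)
```

Different users can supply different `rule_sql` strings and periods without changing the module, which matches the parameterized design described above.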
Further, the training and generation process of the decision tree refinement analysis model in step D specifically includes:
When the decision tree is generated, starting from the root node, a feature is selected from the feature set of the sample data for testing, and the sample data is assigned to the child nodes according to the test result; the sample data is tested and assigned recursively in this way until it reaches the leaf nodes.
To select the appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio, Gini index, etc.
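The three criteria named above can be sketched for a discrete feature in plain Python. This is a generic textbook formulation rather than code from the patent; the feature values and labels in the example are hypothetical, and a continuous feature would first need candidate thresholds.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of a label sequence."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain divided by the entropy of the feature values themselves."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical feature that separates the two classes perfectly.
values = ["urban", "urban", "rural", "rural"]
labels = ["low_risk", "low_risk", "high_risk", "high_risk"]
print(information_gain(values, labels), gain_ratio(values, labels), gini(labels))
```

At each node the feature with the best score under the chosen criterion would be selected for the test.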
A conventional decision tree uses hard assignment of sample data to child nodes according to the test result at the node, that is, a sample can be assigned to only one of the child nodes. The present invention instead can use soft assignment according to the characteristics of the selected feature, that is, a sample can be assigned to several child nodes. Because data collected from the real world has a certain degree of ambiguity, combining soft and hard assignment lets the decision tree reflect this ambiguity and thereby improves the refined classification. Specifically:
1) If the feature values of the selected feature are discrete and finite, hard assignment is adopted, that is, the sample data can be assigned to only one of the child nodes according to the test result.
2) If the feature values of the selected features are ordered and continuous, a soft distribution is used, that is, sample data may be distributed to a plurality of child nodes according to the test result.
The test uses a piecewise linear fuzzy function:
[Equation image BDA0002247659550000051: definition of the piecewise linear fuzzy function $t(x, \gamma, \delta)$; not reproduced in this text.]
In the formula, $x$ is a sample, and $\gamma$ and $\delta$ can be the mean and variance of the data values of the selected feature (these two parameters can also be learned by a suitable machine learning algorithm). When the value of the function $t$ is 1, the sample is assigned to the left child node; when the value is 0, the sample is assigned to the right child node; otherwise, the sample is assigned to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, which reflects the fact that all samples belong to the root.
Given a node N, the degree of membership of its child nodes NC may be defined recursively as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where $x$ is the sample data, $\mu_{NC}$ is the degree of membership of node NC, $\mu_N$ is the degree of membership of node N, and $t_N$ is the test function of node N.
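The recursive membership rule can be traced through a small tree that mixes hard and soft nodes. In this Python sketch the `Node` layout, the feature names and the clipped linear form of the test function are all assumptions made for illustration, as is giving the right child the membership μ_N(x)(1 − t).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    feature: Optional[str] = None      # None marks a leaf
    gamma: float = 0.0                 # soft split: centre (assumed role)
    delta: float = 1.0                 # soft split: half-width, must be > 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    branches: Dict[str, "Node"] = field(default_factory=dict)  # hard split by category


def soft_test(x: float, gamma: float, delta: float) -> float:
    # Assumed piecewise linear shape, clipped to the interval [0, 1].
    return min(1.0, max(0.0, (gamma + delta - x) / (2.0 * delta)))


def leaf_memberships(node: Node, sample: dict, mu: float = 1.0) -> List[Tuple[str, float]]:
    """Return (leaf name, membership) for every leaf the sample reaches."""
    if node.feature is None:
        return [(node.name, mu)]
    value = sample[node.feature]
    if node.branches:                  # discrete feature: hard assignment to one child
        return leaf_memberships(node.branches[value], sample, mu)
    t = soft_test(value, node.gamma, node.delta)
    reached: List[Tuple[str, float]] = []
    if t > 0:                          # mu_NC(x) = mu_N(x) * t_N(x, gamma, delta)
        reached += leaf_memberships(node.left, sample, mu * t)
    if t < 1:                          # right-child rule is an assumption
        reached += leaf_memberships(node.right, sample, mu * (1.0 - t))
    return reached


# Hypothetical tree: hard split on employer type, then a soft split on income.
income_node = Node("income", feature="income", gamma=5000.0, delta=1000.0,
                   left=Node("B"), right=Node("C"))
root = Node("root", feature="employer",
            branches={"public": Node("A"), "private": income_node})

result = leaf_memberships(root, {"employer": "private", "income": 5500.0})
print(result)
```

The root call uses the default membership of 1, and the memberships of the reached leaves sum to 1 under these assumptions.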
Further, according to the decision tree obtained by training in step D, when the test data $x'$ of an accumulation fund user is passed into the decision tree and reaches a plurality of leaf nodes, the refined classification result is given according to the estimates of all these leaf nodes:
[Equation image BDA0002247659550000053 is not reproduced in this text.]
where $x'$ is the test data; one symbol (its image is missing from this text) represents the probability estimate of the $c$-th class at the $i$-th leaf node; the symbol in image BDA0002247659550000055 represents the degree of membership of the $i$-th leaf node; $y_c$ indicates the probability output for category $c$; $y$ indicates the probability output for the largest category; $C$ indicates the number of categories; and leaves indicates the set of leaf nodes.
Compared with the prior art, the decision-tree-based system and method for refined analysis of public accumulation fund user data have the following advantages:
1. The characteristic data for decision tree refinement analysis is extracted from the preprocessed original relational model data, and a new decision-tree-based method for refined analysis of public accumulation fund users is designed on the basis of this characteristic data, which greatly improves the efficiency of public accumulation fund user data analysis and management;
2. Unlike the traditional way of generating a decision tree, the invention assigns data to child nodes by combining soft and hard assignment, which gives the accumulation fund management department more efficient data resources and a stronger capacity for refined analysis, and thus provides a scientific basis for the effective management of accumulation fund users.
Drawings
FIG. 1 is a flow block diagram of the method for refining and analyzing accumulation fund user data based on a decision tree;
FIG. 2 is a schematic flow chart of the data transformation relationship of the accumulation fund user in the present invention.
Detailed Description
The invention is further described with reference to the following figures.
A system for refining and analyzing accumulation fund user data based on decision trees comprises the following components:
the data acquisition module is used for acquiring data information related to the accumulation fund user from each terminal device (a mobile terminal, a computer terminal, a video monitoring terminal and the like), identifying entities, entity attributes and relations among the entities and eliminating conflicts existing in the multi-source accumulation fund user data;
the data storage module is used for converting the entities obtained after the conflict is eliminated and the relationship between the entities into relational model data and storing the relational model data into a relational database to form original relational data;
the data preprocessing module is used for converting the original relational data into the characteristic data used by the decision tree in the data analysis module, the characteristic data existing in the form of a relational database view;
the data analysis module is used for performing refinement analysis on the user characteristic data in the relational database view through the decision tree, training to generate a decision tree refinement analysis model, then passing the test data of an accumulation fund user into the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimates of all these leaf nodes;
and the data display module is used for displaying the refined classification result to the user on terminal equipment (a mobile phone, a computer and the like) in chart form.
When the decision tree is generated, features are selected from the feature set of the sample data for testing from the root node, the sample data is distributed to the child nodes according to the test result, the sample data is tested and distributed recursively until the leaf nodes are reached, and finally the sample data is distributed to the leaf nodes.
The sample data allocation process specifically includes:
1) If the feature values of the selected feature are discrete and finite, hard assignment is used, i.e. the sample data is assigned to only one of the child nodes according to the test result.
2) If the feature values of the selected feature are ordered and continuous, soft assignment is used, i.e. the sample data is assigned to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
[Equation image BDA0002247659550000071: definition of the piecewise linear fuzzy function $t(x, \gamma, \delta)$; not reproduced in this text.]
In the formula, $x$ is a sample, and $\gamma$ and $\delta$ are the mean and variance of the data values of the selected feature. When the value of the function $t$ is 1, the sample is assigned to the left child node; when the value is 0, the sample is assigned to the right child node; in other cases, the sample is assigned to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default. Given a node N, the membership of its child node NC is defined recursively as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where $x$ is the sample data, $\mu_{NC}$ is the degree of membership of node NC, $\mu_N$ is the degree of membership of node N, and $t_N$ is the test function of node N.
And finally, according to the decision tree obtained by training, obtaining a refined classification result of the test data of the accumulation fund user according to the following formula:
[Equation images BDA0002247659550000072 and BDA0002247659550000073 are not reproduced in this text.]
where $x'$ is the test data; the symbol in image BDA0002247659550000074 represents the probability estimate of the $c$-th class at the $i$-th leaf node; the symbol in image BDA0002247659550000075 represents the degree of membership of the $i$-th leaf node; $y_c$ indicates the probability output for category $c$; $y$ indicates the probability output for the largest category; $C$ indicates the number of categories; and leaves indicates the set of leaf nodes.
FIG. 1 shows a method for refining and analyzing accumulation fund user data based on a decision tree, which includes the following steps:
Step A: collecting, through a data collection module, data from a plurality of different data sources related to public accumulation fund users (the collected data comprises structured, semi-structured and unstructured data); identifying entities, entity attributes and the relationships among entities from the different data sources using intelligent technologies such as entity linking and deep learning; and eliminating conflicts among the entities, entity attributes and relationships from the different data sources, mainly attribute conflicts, naming conflicts and structural conflicts;
Step B: converting the entities and the relationships between entities obtained after conflict elimination into relational model data through a data storage module, and storing the relational model data in a relational database to form original relational data;
Step C: determining, through a data preprocessing module and according to different application requirements (such as risk control and customer service), the characteristic attributes used by the decision tree in the refinement analysis of accumulation fund users, establishing a transformation relationship between the characteristic data and the original relational data, and storing the extracted characteristic data in a relational database view;
The transformation between the characteristic data and the original relational data generally consists of searching, processing and integrating data from a single table or from several joined tables in the relational database; the result of the transformation is the characteristic data to be extracted. The transformation can be implemented as an independent program module, and the transformation rule can be passed to the module as a parameter so that different users can use it. To keep the characteristic data in the database view up to date, a timer can be set so that the transformation from original relational data to characteristic data runs continuously; the update period can be chosen by users according to their needs.
Step D: refining and analyzing the user characteristic data in the relational database view through a data analysis module, training to generate a decision tree refinement analysis model, passing the test data of an accumulation fund user into the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimates of all these leaf nodes;
Specifically, when generating the decision tree, starting from the root node, a feature is selected from the feature set of the sample data for testing, and the sample data is assigned to the child nodes according to the test result; the sample data is tested and assigned recursively in this way until a leaf node is reached.
To select the appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio, Gini index, etc.
A conventional decision tree uses hard assignment of sample data to child nodes based on the test result at the node, i.e. a sample can be assigned to only one of the child nodes, whereas the present invention can use soft assignment based on the characteristics of the selected feature, i.e. a sample can be assigned to several child nodes.
1) If the feature values of the selected feature are discrete and finite, hard assignment is adopted, that is, the sample data can be assigned to only one of the child nodes according to the test result.
2) If the feature values of the selected features are ordered and continuous, a soft distribution is used, that is, sample data may be distributed to a plurality of child nodes according to the test result.
The test uses a piecewise linear fuzzy function:
[Equation image BDA0002247659550000091: definition of the piecewise linear fuzzy function $t(x, \gamma, \delta)$; not reproduced in this text.]
In the formula, $x$ is a sample, and $\gamma$ and $\delta$ can be the mean and variance of the data values of the selected feature (these two parameters can also be learned by a suitable machine learning algorithm). When the value of the function $t$ is 1, the sample is assigned to the left child node; when the value is 0, the sample is assigned to the right child node; otherwise, the sample is assigned to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, which reflects the fact that all samples belong to the root.
Given a node N, the degree of membership of its child nodes NC may be defined recursively as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where $x$ is the sample data, $\mu_{NC}$ is the degree of membership of node NC, $\mu_N$ is the degree of membership of node N, and $t_N$ is the test function of node N.
Finally, according to the decision tree obtained by training, when the test data $x'$ of an accumulation fund user is passed into the decision tree and reaches a plurality of leaf nodes, a refined classification result is given according to the estimates of all these leaf nodes:
[Equation images BDA0002247659550000092 and BDA0002247659550000093 are not reproduced in this text.]
where $x'$ is the test data; the symbol in image BDA0002247659550000094 represents the probability estimate of the $c$-th class at the $i$-th leaf node; the symbol in image BDA0002247659550000095 represents the degree of membership of the $i$-th leaf node; $y_c$ indicates the probability output for category $c$; $y$ indicates the probability output for the largest category; $C$ indicates the number of categories; and leaves indicates the set of leaf nodes.
Step E: displaying the refined classification result to the user on terminal equipment (a mobile phone, a computer and the like) in chart form through a data display module.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (10)

1. A system for refining and analyzing accumulation fund user data based on a decision tree, characterized by comprising:
the data acquisition module is used for acquiring data information related to the accumulation fund user from each terminal device, identifying the entity, the entity attribute and the relationship among the entities, and eliminating the conflict existing in the multi-source accumulation fund user data;
the data storage module is used for converting the entities obtained after the conflict is eliminated and the relationship between the entities into relational model data and storing the relational model data into a relational database to form original relational data;
the data preprocessing module is used for converting the original relational data into characteristic data used by a decision tree in the data analysis module and exists in the form of a relational database view;
the data analysis module is used for carrying out thinning analysis on the user characteristic data in the relational database view through the decision tree, training to generate a decision tree thinning analysis model, then transmitting the test data of the accumulation fund user to the decision tree and reaching a plurality of leaf nodes, and finally giving a thinning classification result according to the estimation values of all the leaf nodes;
and the data display module is used for displaying the detailed classification result to the user on the terminal equipment in a chart mode.
2. The decision-tree-based accumulation fund user data refinement analysis system according to claim 1, wherein, in the data analysis module, when the decision tree is generated, starting from the root node, a feature is selected from the feature set of the sample data for testing, and the sample data is assigned to the child nodes according to the test result; the sample data is tested and assigned recursively in this way until a leaf node is reached, and is finally assigned to the leaf nodes.
3. The decision-tree-based accumulation fund user data refining analysis system according to claim 2, wherein the process of assigning sample data to child nodes in the data analysis module specifically comprises:
1) if the values of the selected feature are discrete and finite, hard assignment is used, i.e. the sample data is assigned to a child node according to the test result;
2) if the values of the selected feature are ordered and continuous, soft assignment is used, i.e. the sample data is assigned to one or more child nodes according to the test result, using a piecewise linear fuzzy function:
Figure FDA0002247659540000011
In the formula, x is the sample data, and γ and δ are the mean and variance of the data values of the selected feature. When the function t takes the value 1, the sample data is assigned to the left child node; when t takes the value 0, it is assigned to the right child node; in all other cases, it is assigned to both the left and the right child node at the same time;
3) each sample data item used in the decision tree is assigned a membership value in the corresponding node of the decision tree, indicating the degree to which the sample data belongs to the sample data set under that node; the membership of all sample data under the root node is set to 1 by default, and for a given node N, the membership of its child node NC is defined recursively as:
$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$
where x is the sample data, $\mu_{NC}$ is the membership degree of node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function of node N.
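The soft assignment in items 2) and 3) can be sketched in Python as follows. The exact piecewise linear function appears only in the formula image, so the ramp used here (1 below γ − δ/2, 0 above γ + δ/2, linear in between) is an assumed form chosen for illustration. Using 1 − t for the right child is likewise an assumption, since the claim states only the product of the parent membership and the test function. The function names are hypothetical.

```python
def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Assumed piecewise linear test function t(x, gamma, delta) in [0, 1].

    Hypothetical form: 1 left of the transition band, 0 right of it,
    and a linear ramp of width delta centred on gamma in between.
    """
    if delta <= 0:
        # Degenerate band: fall back to a crisp threshold at gamma.
        return 1.0 if x <= gamma else 0.0
    lower = gamma - delta / 2.0
    upper = gamma + delta / 2.0
    if x <= lower:
        return 1.0
    if x >= upper:
        return 0.0
    return (upper - x) / delta


def child_memberships(mu_parent: float, x: float, gamma: float, delta: float):
    """Return (left, right) memberships of sample x under one node.

    Left child follows the claim: mu_NC(x) = mu_N(x) * t_N(x, gamma, delta).
    Right child uses the complement 1 - t_N, which is an assumption.
    """
    t = fuzzy_test(x, gamma, delta)
    return mu_parent * t, mu_parent * (1.0 - t)


# Root membership is 1 by default, as stated in item 3).
left, right = child_memberships(1.0, x=5.0, gamma=5.0, delta=2.0)
print(left, right)
```

With this assumed form, a sample lying exactly at γ is shared equally between both children, while samples outside the band behave as in an ordinary crisp decision tree.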
4. The decision tree-based accumulation fund user data refining analysis system as claimed in claim 3, wherein the refined classification result of the accumulation fund user test data x' according to the trained decision tree in the data analysis module is obtained by the following formula:
Figure FDA0002247659540000021
Figure FDA0002247659540000022
where x' is the test data, the symbol shown in Figure FDA0002247659540000023 represents the probability estimate of the c-th class at the i-th leaf node, the symbol shown in Figure FDA0002247659540000024 represents the membership degree of the i-th leaf node, $y_c$ represents the probability output for class c, y represents the probability output of the maximum class, C represents the number of classes, and leaves represents the set of leaf nodes.
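The formula images for this claim are not reproduced in the text, so the following Python sketch rests on an assumed reading of the symbol definitions: each class output is taken to be the membership-weighted sum of the leaf probability estimates, $y_c = \sum_{l \in leaves} \mu_l(x')\, p_l^c$, and the predicted class is the one with the largest output. The function name and the data layout are hypothetical.

```python
from typing import List, Tuple


def classify(leaves: List[Tuple[float, List[float]]]) -> Tuple[int, List[float]]:
    """Combine leaf estimates into per-class outputs (assumed weighted sum).

    Each entry of `leaves` is (membership of x' in the leaf,
    per-class probability estimates stored at that leaf).
    Returns (index of the class with the largest output, all class outputs).
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    num_classes = len(leaves[0][1])
    outputs = [0.0] * num_classes
    for membership, class_probs in leaves:
        for c in range(num_classes):
            outputs[c] += membership * class_probs[c]
    best = max(range(num_classes), key=lambda c: outputs[c])
    return best, outputs


# Test sample reaching two leaves with memberships 0.75 and 0.25.
best_class, class_outputs = classify([(0.75, [0.5, 0.5]), (0.25, [0.25, 0.75])])
print(best_class, class_outputs)
```

Because every reached leaf contributes in proportion to its membership, a sample near a split boundary gets a blended estimate rather than the estimate of a single leaf.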
5. A decision-tree-based accumulation fund user data refinement analysis method, characterized in that the method comprises the following steps:
Step A: collecting data information related to accumulation fund users from each terminal device through the data acquisition module, identifying the entities, the entity attributes and the relationships among the entities, and eliminating the conflicts existing in the multi-source accumulation fund user data;
Step B: converting the entities obtained after the conflicts are eliminated, and the relationships among them, into relational model data through the data storage module, and storing the relational model data in a relational database to form original relational data;
Step C: converting the original relational data into the feature data used by the decision tree in the data analysis module through the data preprocessing module, and storing the feature data in the form of a relational database view;
Step D: carrying out refinement analysis on the user feature data in the relational database view through the data analysis module: training to generate a decision tree refinement analysis model, then passing the test data of an accumulation fund user into the decision tree so that it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimated values of all the leaf nodes;
Step E: displaying the refined classification result to the user on the terminal equipment in chart form through the data display module.
6. The decision-tree-based accumulation fund user data refinement analysis method according to claim 5, wherein the data collected in step A includes structured data, semi-structured data and unstructured data, and an entity linking technique is used to identify the entities and the relationships between entities from the collected data.
7. The decision-tree-based accumulation fund user data refinement analysis method according to claim 5, wherein the conflicts eliminated in step A include attribute conflicts, name conflicts and structure conflicts.
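The claim names the three conflict types without saying how each is resolved. As a rough illustration only, the Python sketch below interprets them in the usual schema-integration sense: a name conflict is the same attribute under different field names, an attribute conflict is the same attribute with different types or units, and a structure conflict is the same information modelled as a nested entity in one source and as flat attributes in another. All field names, the alias table and the yuan/fen units are hypothetical.

```python
# Hypothetical alias table: source-specific field names -> canonical names.
FIELD_ALIASES = {"uid": "user_id", "id_no": "user_id", "monthly_pay": "monthly_deposit"}


def normalize_record(record: dict) -> dict:
    """Map one source record onto a flat, canonical schema."""
    flat = {}
    for key, value in record.items():
        # Structure conflict: flatten a nested "deposit" entity into attributes.
        if key == "deposit" and isinstance(value, dict):
            flat["monthly_deposit"] = value.get("amount")
            flat["deposit_unit"] = value.get("unit", "yuan")
        else:
            # Name conflict: rename synonyms to the canonical field name.
            flat[FIELD_ALIASES.get(key, key)] = value
    # Attribute conflict: convert amounts to one unit (yuan) and one type (float).
    if flat.get("monthly_deposit") is not None:
        amount = float(flat["monthly_deposit"])
        if flat.pop("deposit_unit", "yuan") == "fen":
            amount /= 100.0
        flat["monthly_deposit"] = amount
    else:
        flat.pop("deposit_unit", None)
    return flat


# Two sources describing the same user in different shapes.
source_a = {"uid": "A1", "monthly_pay": "1200"}
source_b = {"user_id": "A1", "deposit": {"amount": 120000, "unit": "fen"}}
print(normalize_record(source_a) == normalize_record(source_b))
```

After normalization the two records are identical, so they can be merged into one entity before being stored as relational data.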
8. The decision-tree-based accumulation fund user data refinement analysis method according to claim 5, wherein the conversion of the original relational data into feature data in step C consists of querying, processing and integrating data from a single table, or from multiple joined tables, in the relational database, and the result of the conversion is the feature data to be extracted.
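A minimal sketch of this conversion, using Python's built-in `sqlite3` module, is given below. The table names, column names and the chosen aggregate features are hypothetical; the patent does not specify a schema or a database engine. The view joins two tables and produces one feature row per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical original relational data (names are illustrative only).
    CREATE TABLE users (user_id TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE deposits (user_id TEXT, month TEXT, amount REAL);

    INSERT INTO users VALUES ('A1', 30), ('B2', 45);
    INSERT INTO deposits VALUES
        ('A1', '2019-01', 1000.0), ('A1', '2019-02', 1500.0),
        ('B2', '2019-01', 2000.0);

    -- Feature data exposed as a view: one row per user, joined across tables.
    CREATE VIEW user_features AS
    SELECT u.user_id,
           u.age,
           COUNT(d.month) AS deposit_months,
           AVG(d.amount)  AS avg_deposit
    FROM users AS u
    LEFT JOIN deposits AS d ON d.user_id = u.user_id
    GROUP BY u.user_id, u.age;
    """
)

# The analysis step would read its feature vectors from the view.
rows = conn.execute("SELECT * FROM user_features ORDER BY user_id").fetchall()
print(rows)
```

Keeping the features in a view rather than a copied table means they are recomputed from the original relational data on each query, so they stay consistent with it.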
9. The method for refining analysis of accumulation fund user data based on decision tree as claimed in claim 5, wherein the training process of the decision tree refining analysis model in step D specifically comprises:
when the decision tree is generated, starting from the root node, a feature is selected from the feature set of the sample data for testing, and the sample data is assigned to the child nodes according to the test result; the sample data is tested and assigned recursively in this way until a leaf node is reached, and is finally assigned to the leaf nodes;
the sample data allocation process specifically includes:
1) if the values of the selected feature are discrete and finite, hard assignment is used, i.e. the sample data is assigned to a child node according to the test result;
2) if the values of the selected feature are ordered and continuous, soft assignment is used, i.e. the sample data is assigned to one or more child nodes according to the test result, using a piecewise linear fuzzy function:
Figure FDA0002247659540000031
In the formula, x is the sample data, and γ and δ are the mean and variance of the data values of the selected feature. When the function t takes the value 1, the sample data is assigned to the left child node; when t takes the value 0, it is assigned to the right child node; in all other cases, it is assigned to both the left and the right child node at the same time;
3) each sample data item used in the decision tree is assigned a membership value in the corresponding node of the decision tree, indicating the degree to which the sample data belongs to the sample data set under that node; the membership of all sample data under the root node is set to 1 by default, and the membership of the child node NC of a node N is defined recursively as:
$$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$$
where x is the sample data, $\mu_{NC}$ is the membership degree of node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function of node N.
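The recursive test-and-assign procedure of this claim can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: the `Node` class is hypothetical, the piecewise linear test is an assumed ramp of width δ centred on γ (the actual function is only given in the formula image), the right child is assumed to receive the complement 1 − t, and a hard split is modelled simply as δ = 0. Feature selection during training is not shown because the claim does not state the selection criterion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; a node without children is a leaf."""
    name: str
    feature: Optional[str] = None   # feature tested at an internal node
    gamma: float = 0.0              # centre of the transition band
    delta: float = 0.0              # width of the band; 0 means a hard split
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def soft_test(x: float, gamma: float, delta: float) -> float:
    """Assumed piecewise linear test: 1 -> left, 0 -> right, otherwise both."""
    if delta <= 0:
        return 1.0 if x <= gamma else 0.0
    return min(1.0, max(0.0, (gamma + delta / 2.0 - x) / delta))


def route(node: Node, sample: Dict[str, float], mu: float = 1.0) -> List[Tuple[str, float]]:
    """Recursively pass a sample down the tree; return (leaf name, membership) pairs."""
    if node.left is None or node.right is None:
        return [(node.name, mu)]
    t = soft_test(sample[node.feature], node.gamma, node.delta)
    reached: List[Tuple[str, float]] = []
    if t > 0.0:
        reached += route(node.left, sample, mu * t)
    if t < 1.0:
        # Complement for the right child is an assumption.
        reached += route(node.right, sample, mu * (1.0 - t))
    return reached


# Soft split on "balance" at the root, hard split on "years" below it.
tree = Node("root", feature="balance", gamma=10.0, delta=4.0,
            left=Node("low"),
            right=Node("high", feature="years", gamma=5.0, delta=0.0,
                       left=Node("high_new"), right=Node("high_old")))
leaf_memberships = route(tree, {"balance": 11.0, "years": 7.0})
print(leaf_memberships)
```

The sample falls inside the transition band of the root, so it reaches two leaves, and its memberships in those leaves sum to the root membership of 1.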
10. The method for refining analysis of accumulation fund user data based on decision tree as claimed in claim 9, wherein the refined classification result of the accumulation fund user test data x' based on the trained decision tree in step D is obtained by the following formula:
Figure FDA0002247659540000041
Figure FDA0002247659540000042
where x' is the test data, the symbol shown in Figure FDA0002247659540000043 represents the probability estimate of the c-th class at the i-th leaf node, the symbol shown in Figure FDA0002247659540000044 represents the membership degree of the i-th leaf node, $y_c$ represents the probability output for class c, y represents the probability output of the maximum class, C represents the number of classes, and leaves represents the set of leaf nodes.
CN201911022440.6A 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method Active CN110737731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911022440.6A CN110737731B (en) 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911022440.6A CN110737731B (en) 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method

Publications (2)

Publication Number Publication Date
CN110737731A true CN110737731A (en) 2020-01-31
CN110737731B CN110737731B (en) 2023-12-29

Family

ID=69271378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911022440.6A Active CN110737731B (en) 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method

Country Status (1)

Country Link
CN (1) CN110737731B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811162A (en) * 2011-06-03 2012-12-05 弗卢克公司 Method and apparatus for detecting network attacks using a flow based technique
CN107808245A (en) * 2017-10-25 2018-03-16 冶金自动化研究设计院 Based on the network scheduler system for improving traditional decision-tree
CN109325844A (en) * 2018-06-25 2019-02-12 南京工业大学 Network loan borrower credit evaluation method under multidimensional data
CN109522957A (en) * 2018-11-16 2019-03-26 上海海事大学 The method of harbour gantry crane machine work status fault classification based on decision Tree algorithms
CN109830303A (en) * 2019-02-01 2019-05-31 上海众恒信息产业股份有限公司 Clinical data mining analysis and aid decision-making method based on internet integration medical platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴剑新;黄章树;: "基于决策树的住房贷款还贷行为分析", 重庆工商大学学报(自然科学版), no. 03, pages 265 - 269 *
周文谊等: "高斯隶属度优化的超分辨率随机森林学习算法", 《计算机工程与应用》, vol. 52, no. 23, pages 208 - 212 *
王利民;臧雪柏;曹春红;: "基于广义信息论的决策森林数据挖掘模型", 吉林大学学报(工学版), no. 01, pages 155 - 158 *

Also Published As

Publication number Publication date
CN110737731B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN107239892B (en) Regional talent supply and demand balance quantitative analysis method based on big data
WO2019200752A1 (en) Semantic understanding-based point of interest query method, device and computing apparatus
CN107203849B (en) Regional talent supply quantitative analysis method based on big data
CN106326482A (en) System of visualized big data collection and analysis and file conversion and method thereof
CN115309906B (en) Intelligent data classification method based on knowledge graph technology
CN115794803B (en) Engineering audit problem monitoring method and system based on big data AI technology
CN108647729A (en) A kind of user's portrait acquisition methods
CN110688549A (en) Artificial intelligence classification method and system based on knowledge system map construction
CN113360599A (en) Multi-source heterogeneous information convergence cooperative processing platform based on content identification
WO2024131524A1 (en) Depression diet management method based on food image segmentation
CN111143689A (en) Method for constructing recommendation engine according to user requirements and user portrait
CN115809229A (en) Evaluation management method and system based on multi-dimensional data attributes
CN117575011B (en) Customer data management method and system based on big data
CN112363996A (en) Method, system, and medium for building a physical model of a power grid knowledge graph
CN110737731B (en) Decision tree-based public accumulation user data refinement analysis system and method
CN115358797A (en) Comprehensive energy user energy behavior analysis method and system based on cluster analysis method and storage medium
CN114168751B (en) Medical text label identification method and system based on medical knowledge conceptual diagram
CN112488236B (en) Integrated unsupervised student behavior clustering method
Chen Characteristic scales, scaling, and geospatial analysis
CN114969375A (en) Method and system for giving artificial intelligence learning to machine based on psychological knowledge
KR20220095654A (en) Social data collection and analysis system
CN112035680A (en) Knowledge graph construction method of intelligent auxiliary learning machine
CN109657684A (en) A kind of image, semantic analytic method based on Weakly supervised study
CN118113854B (en) Online consultation method and system based on gynecological nursing knowledge base
CN117786182B (en) Business data storage system and method based on ERP system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant