CN110737731B - Decision tree-based public accumulation user data refinement analysis system and method - Google Patents

Decision tree-based public accumulation user data refinement analysis system and method

Info

Publication number
CN110737731B
CN110737731B CN201911022440.6A CN201911022440A CN110737731B CN 110737731 B CN110737731 B CN 110737731B CN 201911022440 A CN201911022440 A CN 201911022440A CN 110737731 B CN110737731 B CN 110737731B
Authority
CN
China
Prior art keywords
data
node
decision tree
sample
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911022440.6A
Other languages
Chinese (zh)
Other versions
CN110737731A (en)
Inventor
李子龙
鲍蓉
潘晓博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuzhou University of Technology
Original Assignee
Xuzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou University of Technology
Priority to CN201911022440.6A
Publication of CN110737731A
Application granted
Publication of CN110737731B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a decision tree-based public accumulation user data refinement analysis system and method, comprising the following modules: a data acquisition module for acquiring multi-source public accumulation user data, identifying entities, entity attributes and relationships among the entities, and eliminating conflicts in the multi-source data; a data storage module for storing the converted relational data in a relational database; a data preprocessing module for converting the original relational data into the feature data used by the decision tree in the user refinement analysis; and a data analysis module for performing refinement analysis on the user feature data with the decision tree, the analysis result finally being displayed to the user in chart form. In the invention, the original relational model data is preprocessed, the feature data for decision tree refinement analysis is extracted, and a new decision tree-based method for refinement analysis of public accumulation user data is designed on the basis of this feature data, so that timely and accurate decision support can be provided to the public accumulation management department.

Description

Decision tree-based public accumulation user data refinement analysis system and method
Technical Field
The invention relates to a decision tree-based public accumulation user data refinement analysis system and method, and belongs to the technical field of public accumulation data analysis management.
Background
At present, informatization of the public accumulation fund business is an inevitable trend, and managing the data of public accumulation users reasonably and effectively is of great importance to public accumulation management departments. By performing refinement analysis on public accumulation users, a department can apply different measures to manage the business of different user segments, thereby improving its service capability and management level. How to use public accumulation user data to analyze users in detail has therefore become an increasingly important problem for public accumulation management departments everywhere.
To give the public accumulation management department the data it needs for decisions, service data is usually analyzed with a decision tree or a clustering method, but the results are still not fine-grained enough. In addition, the analysis generally searches and analyzes the relational database directly, which involves large volumes of data and many tables, so querying, processing and access are inefficient and service requirements take a long time to implement. At present, the business of each public accumulation fund service unit covers long time ranges, involves complex processing logic and has increasingly strict timeliness requirements, so traditional user analysis methods based on public accumulation fund data cannot meet the timeliness requirements of the public accumulation fund service departments.
Disclosure of Invention
Purpose of the invention: to overcome the defects of the prior art, the invention provides a decision tree-based system and method for refinement analysis of public accumulation user data. Feature data for decision tree refinement analysis is extracted from the original relational model data by preprocessing, and a new decision tree-based method for refinement analysis of public accumulation users is designed on the basis of this feature data, thereby providing timely and accurate decision support for the public accumulation management department.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
a decision tree-based public accumulation user data refinement analysis system, comprising the following components:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal device (mobile terminal, computer terminal, video monitoring terminal and the like), identifying the entity, entity attribute and relationship among the entities, and eliminating conflicts existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database in the computer storage equipment to form original relation data;
the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data existing in the form of a relational database view;
the data analysis module performs refinement analysis on the user characteristic data in the relational database view through a decision tree, trains and generates a decision tree refinement analysis model, transmits test data of the public accumulation user to the decision tree and reaches a plurality of leaf nodes, and finally gives a refinement classification result according to estimated values of all the leaf nodes;
and the data display module is used for displaying the refined classification result to the user in a form of a chart on the terminal equipment (a mobile phone end, a computer end and the like).
Further, when the decision tree is generated in the data analysis module, one feature is selected from the feature set of the sample data from the root node for testing, the sample data is distributed to the child nodes according to the test result, the sample data is tested and distributed recursively until the leaf node is reached, and finally the sample data is distributed to the leaf node.
Further, the process of distributing the sample data to the child nodes in the data analysis module specifically includes:
1) If the feature values of the selected features are discrete and finite, a hard allocation is used, i.e., one sample of data can be allocated to only one of the child nodes based on the results of the test.
2) If the feature values of the selected feature are ordered and continuous, soft allocation is used, i.e., one sample is allocated to one or more child nodes based on the result of a test that uses a piecewise linear fuzzy function t(x, γ, δ),
where x is a sample's data value, and γ and δ may be taken as the mean and variance of the data values of the corresponding feature (the two parameters may also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is allocated to the left child node; when the value of the function t is 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is given a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node; the membership of every sample at the root node is 1 by default, while for a given node N, the membership of its child node NC is recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where x is the sample data, $\mu_{NC}$ is the membership degree of child node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function at node N.
Further, the data analysis module obtains the refined classification result of the public accumulation user test data from the trained decision tree by a formula over the leaf nodes,
where x' is the test data; the formula combines, for the i-th leaf node, its probability estimate for the c-th class with its membership degree; $y_c$ is the probability output for class c; y is the probability output of the most probable class; C is the number of classes; and leave is the set of leaf nodes.
A decision tree-based public accumulation user data refinement analysis method comprises the following steps:
step A: the data information related to the public accumulation user is collected from each terminal device through the data collection module, the entity attribute and the relation among the entities are identified, and the conflict existing in the multi-source public accumulation user data is eliminated;
step B: converting, through the data storage module, the entities obtained after conflict elimination and the relationships among the entities into relational model data, and storing the relational model data in a relational database to form the original relational data;
step C: converting, through the data preprocessing module, the original relational data into the feature data used by the decision tree in the data analysis module, the feature data existing in the form of a relational database view;
step D: performing, through the data analysis module, refinement analysis on the user feature data in the relational database view, training to generate a decision tree refinement analysis model, passing the test data of a public accumulation user to the decision tree, where it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimated values of all those leaf nodes;
step E: and displaying the result of the refined classification to the user in a form of a chart on the terminal equipment through the data display module.
Further, the data collected in the step a includes structured data, semi-structured data and unstructured data, and the entity-to-entity relationship can be identified from different data sources by using entity linking technology.
Further, the conflict among the entities from different data sources, the entity attributes and the relationships among the entities is eliminated by means of manual judgment and identification, and the attribute conflict, the name conflict and the structure conflict are mainly eliminated.
Further, step C specifically includes: according to the application requirements (such as risk control or customer service), determining the feature attributes used by the decision tree in the refinement analysis of public accumulation users; establishing a transformation relation between the feature data and the original relational data; and storing the extracted feature data in a relational database view. For example, if the decision tree uses each customer's annual average income as a feature attribute while the relational database records each customer's monthly income, the customers' income must be averaged by year, which establishes the transformation relation between the annual average income feature and the income records in the relational database.
The transformation from the original relational data to the feature data generally searches, processes and integrates data from a single table or from multiple joined tables in the relational database, and its result is the feature data to be extracted. The transformation can be written as a separate program module that takes the transformation rules as parameters, so that different users can use it. To keep the feature data in the database view up to date, a timer may be set to rerun the conversion from the original relational data to the feature data periodically, with the update period chosen by the user as needed.
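The annual average income example can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table `monthly_income`, the view `annual_avg_income`, their columns and the sample rows are all hypothetical and do not come from the patent.

```python
import sqlite3

# Hypothetical original relational data: one row per customer per month.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_income ("
    "customer_id INTEGER, year INTEGER, month INTEGER, income REAL)"
)
rows = [
    (1, 2018, 1, 5000.0), (1, 2018, 2, 7000.0),
    (1, 2019, 1, 8000.0),
    (2, 2018, 1, 3000.0), (2, 2018, 2, 3000.0),
]
conn.executemany("INSERT INTO monthly_income VALUES (?, ?, ?, ?)", rows)

# The view encodes the transformation rule:
# average income per customer per year, derived from the monthly records.
conn.execute(
    "CREATE VIEW annual_avg_income AS "
    "SELECT customer_id, year, AVG(income) AS avg_income "
    "FROM monthly_income GROUP BY customer_id, year"
)

# The decision tree would read its feature data from the view.
features = conn.execute(
    "SELECT customer_id, year, avg_income FROM annual_avg_income "
    "ORDER BY customer_id, year"
).fetchall()
print(features)
```

In SQLite a plain view is evaluated at query time, so it always reflects the current rows; a timer-driven refresh of the kind the text describes would matter if the feature data were materialized into a table instead.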
Further, the training generation process of the decision tree refinement analysis model in the step D specifically includes:
when generating the decision tree, starting from the root node, selecting a feature in the feature set of the sample data for testing, distributing the sample data to its child nodes according to the test result, recursively testing and distributing the sample data until reaching the leaf node, and finally distributing the sample data to the leaf node.
To select an appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio or the Gini index.
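The three criteria named above can be sketched for a discrete feature as follows. These are the standard textbook definitions; the patent does not state which variant it uses, and the helper names (`entropy`, `split_by` and so on) are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def split_by(values, labels):
    """Group the labels by the value of the candidate feature."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_by(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_index(values, labels):
    """Weighted Gini impurity of the groups produced by the split."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in split_by(values, labels))


labels = ["x", "x", "y", "y"]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells nothing about the class
print(information_gain(informative, labels), gini_index(informative, labels))
print(information_gain(uninformative, labels), gini_index(uninformative, labels))
```

A feature with higher information gain (or gain ratio), or with lower weighted Gini impurity, is the better candidate for the test at a node.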
A conventional decision tree allocates sample data to child nodes in a hard manner according to the test result at each node, that is, a sample can be allocated to only one child node. Depending on the nature of the selected feature, the invention may instead use soft allocation, in which a sample may be allocated to several child nodes. Since data collected from the real world carries a certain degree of ambiguity and uncertainty, combining soft and hard allocation lets the decision tree better reflect that ambiguity and thereby improves the refined classification. Specifically:
1) If the feature values of the selected feature are discrete and finite, hard allocation is used, i.e., a sample can be allocated to only one of the child nodes according to the test result.
2) If the feature values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample may be allocated to several child nodes according to the test result.
The test uses a piecewise linear fuzzy function t(x, γ, δ),
where x is a sample's data value, and γ and δ may be taken as the mean and variance of the data values of the corresponding feature (the two parameters may also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is allocated to the left child node; when the value of the function t is 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, reflecting the fact that all samples belong to it.
Given a node N, the membership of its child node NC can be recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where x is the sample data, $\mu_{NC}$ is the membership degree of child node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function at node N.
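The soft test and the membership recursion can be sketched as follows. The exact piecewise linear function is not given in this text, so the form below is an assumption: t is 1 below γ − δ/2, 0 above γ + δ/2, and falls linearly in between. Giving the right child the complementary weight 1 − t is likewise an assumption. Only the product rule for a child's membership comes from the text.

```python
def t(x, gamma, delta):
    """Assumed piecewise linear fuzzy test (not the patent's exact formula).

    Returns 1.0 to the left of the transition band, 0.0 to the right of it,
    and a linearly decreasing value inside the band of width delta around gamma.
    """
    if delta <= 0:
        # Degenerate band: fall back to a hard threshold at gamma.
        return 1.0 if x <= gamma else 0.0
    lo, hi = gamma - delta / 2, gamma + delta / 2
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / delta


def child_memberships(mu_parent, x, gamma, delta):
    """Propagate membership from a node to its children.

    Left child: mu_parent * t, as in the recursion in the text.
    Right child: mu_parent * (1 - t), an assumption of this sketch.
    """
    left = mu_parent * t(x, gamma, delta)
    right = mu_parent * (1.0 - t(x, gamma, delta))
    return left, right


# A sample at the root (membership 1) whose value falls inside the band
# is shared between both children.
print(child_memberships(1.0, 11, gamma=10, delta=4))
```

With this choice the two child memberships always sum to the parent's membership, so a sample that follows several paths keeps a total weight of 1 across the leaves it reaches.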
Further, in step D, when the test data x' of a public accumulation user is passed to the trained decision tree and reaches a plurality of leaf nodes, the refined classification result is finally given according to the estimated values of all those leaf nodes by a formula over the leaf nodes,
where x' is the test data; the formula combines, for the i-th leaf node, its probability estimate for the c-th class with its membership degree; $y_c$ is the probability output for class c; y is the probability output of the most probable class; C is the number of classes; and leave is the set of leaf nodes.
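A minimal sketch of how the leaf estimates might be combined, assuming that the output for each class is the membership-weighted sum of the leaf probability estimates and that the final class is the one with the largest output. This reading is inferred from the symbol definitions above, not taken from a formula in the text; the function name `classify` and the example numbers are illustrative.

```python
def classify(leaves):
    """Combine the leaves reached by a test sample x'.

    leaves: list of (membership, class_probabilities) pairs, one per leaf,
            where class_probabilities has one entry per class.
    Returns the per-class outputs y_c and the index of the largest one.
    """
    num_classes = len(leaves[0][1])
    y = [sum(mu * probs[c] for mu, probs in leaves) for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: y[c])
    return y, best


# Two leaves reached with memberships 0.25 and 0.75, two classes.
leaves = [(0.25, [0.9, 0.1]), (0.75, [0.2, 0.8])]
scores, label = classify(leaves)
print(scores, label)
```

Here the second leaf dominates because the sample belongs to it more strongly, so class index 1 is selected even though the first leaf strongly favors class index 0.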
The beneficial effects are that: compared with the prior art, the system and the method for refining and analyzing the public accumulation fund user data based on the decision tree have the following advantages: 1. the feature data for the detailed analysis of the decision tree is extracted from the original relation model data by preprocessing, and a new method for the detailed analysis of the public accumulation user based on the decision tree is designed on the basis of the feature data, so that the data analysis and management efficiency of the public accumulation user is greatly improved;
2. different from the traditional decision tree generation mode, the invention adopts a soft and hard combination mode to divide data into sub-nodes, and can provide more efficient data resources and stronger refinement analysis capability for the public accumulation management department, thereby providing scientific basis for effective management of public accumulation users.
Drawings
FIG. 1 is a flow chart diagram of a decision tree-based method for refining and analyzing public accumulation fund user data;
FIG. 2 is a flow chart of the transformation relationship of the public accumulation user data in the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
A decision tree-based public accumulation user data refinement analysis system, comprising the following components:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal device (mobile terminal, computer terminal, video monitoring terminal and the like), identifying the entity, entity attribute and relationship among the entities, and eliminating conflicts existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;
the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data existing in the form of a relational database view;
the data analysis module performs refinement analysis on the user characteristic data in the relational database view through a decision tree, trains and generates a decision tree refinement analysis model, transmits test data of the public accumulation user to the decision tree and reaches a plurality of leaf nodes, and finally gives a refinement classification result according to estimated values of all the leaf nodes;
and the data display module is used for displaying the refined classification result to the user in a form of a chart on the terminal equipment (a mobile phone end, a computer end and the like).
When generating the decision tree, starting from the root node, selecting a feature in the feature set of the sample data for testing, distributing the sample data to its child nodes according to the test result, recursively testing and distributing the sample data until reaching the leaf node, and finally distributing the sample data to the leaf node.
The sample data distribution process specifically comprises the following steps:
1) If the feature values of the selected features are discrete and finite, a hard allocation is used, i.e., one sample of data can be allocated to only one of the child nodes based on the results of the test.
2) If the feature values of the selected feature are ordered and continuous, soft allocation is used, i.e., one sample is allocated to one or more child nodes based on the result of a test that uses a piecewise linear fuzzy function t(x, γ, δ),
where x is a sample's data value, and γ and δ are the mean and variance of the data values of the selected feature; when the value of the function t is 1, the sample is allocated to the left child node; when the value of the function t is 0, the sample is allocated to the right child node; in other cases, the sample is allocated to both the left and right child nodes;
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node, and the membership of every sample at the root node is 1 by default. Given a node N, the membership of its child node NC is recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where x is the sample data, $\mu_{NC}$ is the membership degree of child node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function at node N.
Finally, the refined classification result of the public accumulation user test data is obtained from the trained decision tree by a formula over the leaf nodes,
where x' is the test data; the formula combines, for the i-th leaf node, its probability estimate for the c-th class with its membership degree; $y_c$ is the probability output for class c; y is the probability output of the most probable class; C is the number of classes; and leave is the set of leaf nodes.
FIG. 1 shows a decision tree-based detailed analysis method for public accumulation fund user data, which comprises the following steps:
step A: through the data acquisition module, collecting data (including structured, semi-structured and unstructured data) from a plurality of different data sources related to public accumulation users; identifying the entities, entity attributes and relationships among entities in each data source by using intelligent techniques such as entity linking and deep learning; and eliminating conflicts among the entities, entity attributes and relationships from the different data sources, chiefly attribute conflicts, naming conflicts and structural conflicts;
step B: converting, through the data storage module, the entities obtained after conflict elimination and the relationships among the entities into relational model data, and storing the relational model data in a relational database to form the original relational data;
step C: according to the application requirements (such as risk control or customer service), determining, through the data preprocessing module, the feature attributes used by the decision tree in the refinement analysis of public accumulation users, establishing a transformation relation between the feature data and the original relational data, and storing the extracted feature data in a relational database view;
for the transformation of the feature data and the original relational data, the data connected by a single table or multiple tables in the relational database is generally searched, processed and integrated, and the transformed result is the feature data to be extracted. For this transformation, it can be written as a separate program module and the transformation rules can be passed as parameters to the module for use by different users. In order to ensure that the feature data in the database view is up-to-date, a timer may be set to continuously perform the conversion from the original relational data to the feature data, and the update period may be determined by the user according to his own needs.
Step D: performing, through the data analysis module, refinement analysis on the user feature data in the relational database view, training to generate a decision tree refinement analysis model, passing the test data of a public accumulation user to the decision tree, where it reaches a plurality of leaf nodes, and finally giving a refined classification result according to the estimated values of all those leaf nodes;
specifically, when generating the decision tree, starting from the root node, selecting a feature from the feature set of the sample data for testing, and distributing the sample data to its child nodes according to the test result, so as to recursively test and distribute the sample data until reaching a leaf node, and finally distributing the sample data to a certain leaf node.
To select an appropriate feature from the feature set, different quantitative evaluation criteria may be used, such as information gain, information gain ratio or the Gini index.
A conventional decision tree allocates sample data to child nodes in a hard manner according to the test result at each node, that is, a sample can be allocated to only one child node. Depending on the nature of the selected feature, the invention may instead use soft allocation, in which a sample may be allocated to several child nodes. Specifically:
1) If the feature values of the selected feature are discrete and finite, hard allocation is used, i.e., a sample can be allocated to only one of the child nodes according to the test result.
2) If the feature values of the selected feature are ordered and continuous, soft allocation is used, i.e., a sample may be allocated to several child nodes according to the test result.
The test uses a piecewise linear fuzzy function t(x, γ, δ),
where x is a sample's data value, and γ and δ may be taken as the mean and variance of the data values of the corresponding feature (the two parameters may also be learned by a suitable machine learning algorithm); when the value of the function t is 1, the sample is allocated to the left child node; when the value of the function t is 0, the sample is allocated to the right child node; otherwise, the sample is allocated to both the left and right child nodes.
3) Each sample used in the decision tree is assigned a membership value at the corresponding node of the decision tree, indicating the degree to which the sample belongs to the sample set under that node. The membership of every sample at the root node is 1 by default, reflecting the fact that all samples belong to it.
Given a node N, the membership of its child node NC can be recursively defined as:
$\mu_{NC}(x) = \mu_N(x)\, t_N(x, \gamma, \delta)$
where x is the sample data, $\mu_{NC}$ is the membership degree of child node NC, $\mu_N$ is the membership degree of node N, and $t_N$ is the test function at node N.
Finally, when the test data x' of a public accumulation user is passed to the trained decision tree and reaches a plurality of leaf nodes, the refined classification result is given according to the estimated values of all those leaf nodes by a formula over the leaf nodes,
where x' is the test data; the formula combines, for the i-th leaf node, its probability estimate for the c-th class with its membership degree; $y_c$ is the probability output for class c; y is the probability output of the most probable class; C is the number of classes; and leave is the set of leaf nodes.
Step E: and displaying the refined classification result to the user in a chart mode on terminal equipment (a mobile phone end, a computer end and the like) through a data display module.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (2)

1. A decision tree-based public accumulation user data refinement analysis system, comprising:
the data acquisition module is used for acquiring data information related to the public accumulation user from each terminal equipment, identifying the entity, the entity attribute and the relation among the entities, and eliminating the conflict existing in the multi-source public accumulation user data;
the data storage module is used for converting the entity obtained after the conflict elimination and the relation between the entities into relation model data, and storing the relation model data into a relation database to form original relation data;
the data preprocessing module is used for converting the original relational data into the feature data used by the decision tree in the data analysis module, the feature data existing in the form of a relational database view;
the data analysis module performs refinement analysis on the user characteristic data in the relational database view through a decision tree, trains and generates a decision tree refinement analysis model, transmits test data of the public accumulation user to the decision tree and reaches a plurality of leaf nodes, and finally gives a refinement classification result according to estimated values of all the leaf nodes;
the data display module is used for displaying the result of the refined classification to the user in a chart mode on the terminal equipment;
when a decision tree is generated in the data analysis module, starting from a root node, selecting a feature in a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until a leaf node is reached, and finally distributing the sample data to the leaf node;
the process of distributing the sample data to the child nodes in the data analysis module specifically comprises the following steps:
1) If the feature value of the selected feature is discrete and limited, adopting a hard allocation mode, namely, allocating one piece of sample data to one of the child nodes according to the test result;
2) If the feature values of the selected feature are ordered and continuous, a soft allocation mode is adopted, namely, one piece of sample data is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
wherein x is sample data, and γ and δ are the mean value and variance of the data values under the condition corresponding to the selected feature; when the value of the function t is 1, the sample data is allocated to the left child node; when the value of the function t is 0, the sample data is allocated to the right child node; in all other cases, the sample data is allocated to both the left child node and the right child node;
3) Each piece of sample data used in the decision tree is given a membership value at the corresponding node of the decision tree, which indicates the degree to which the sample data belongs to the sample data set under that node; the membership of all sample data under the root node is 1 by default, while for a given node N, the membership of its child node NC is recursively defined as:
wherein x is sample data, μ_NC is the membership degree of the node NC, μ_N is the membership degree of the node N, and t_N is the test function corresponding to the node N;
the refined classification result of the data analysis module for the public accumulation user test data x', according to the decision tree obtained by training, is obtained by the following formula:
wherein x' is the test data; the formula combines the probability estimate of the c-th class at the i-th leaf node with the membership degree of the i-th leaf node; y_c is the probability output for category c; y is the probability output of the category with the largest output; C is the number of categories; and Leave is the set of leaf nodes.
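The piecewise linear fuzzy function and the membership recursion referred to in claim 1 were images in the original and are missing from this text. As a rough sketch only, the code below assumes one common form: t falls linearly from 1 to 0 across a band of width δ centered on γ, and the left and right children receive memberships μ_N·t_N and μ_N·(1 − t_N). The band limits, the names `fuzzy_test` and `child_memberships`, and the sample values are assumptions rather than the patent's stated formulas.

```python
from typing import Tuple


def fuzzy_test(x: float, gamma: float, delta: float) -> float:
    """Assumed piecewise linear test: 1 below the band, 0 above it, linear inside."""
    lo = gamma - delta / 2.0
    hi = gamma + delta / 2.0
    if x <= lo:
        return 1.0  # hard assignment to the left child
    if x >= hi:
        return 0.0  # hard assignment to the right child
    return (hi - x) / delta  # inside the band: sample goes to both children


def child_memberships(mu_parent: float, x: float,
                      gamma: float, delta: float) -> Tuple[float, float]:
    """Assumed recursion: left gets mu_N * t_N(x), right gets mu_N * (1 - t_N(x))."""
    t = fuzzy_test(x, gamma, delta)
    return mu_parent * t, mu_parent * (1.0 - t)


# Hypothetical split with gamma = 10 and delta = 4, starting at the root (membership 1).
left_mu, right_mu = child_memberships(1.0, 9.0, gamma=10.0, delta=4.0)
print(left_mu, right_mu)
```

With this form the two child memberships always sum to the parent's membership, so the memberships of all leaves reached by a sample sum to 1.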
2. An analysis method using the decision tree-based public accumulation user data refinement analysis system of claim 1, comprising the steps of:
step A: the data information related to the public accumulation user is collected from each terminal device through the data collection module, the entity attribute and the relation among the entities are identified, and the conflict existing in the multi-source public accumulation user data is eliminated;
step B: converting, through the data storage module, the entities obtained after the conflict elimination and the relations between the entities into relation model data, and storing the relation model data into a relational database to form original relation data;
step C: converting the original relational data into characteristic data used by a decision tree in a data analysis module through a data preprocessing module, and enabling the characteristic data to exist in a relational database view mode;
step D: carrying out refinement analysis on user characteristic data in a relational database view through a data analysis module, training to generate a decision tree refinement analysis model, transmitting test data of an accumulation fund user to a decision tree and reaching a plurality of leaf nodes, and finally giving a refinement classification result according to estimated values of all the leaf nodes;
step E: displaying the refined classification result to a user in a chart mode on terminal equipment through a data display module;
the data acquired in the step A comprises structured data, semi-structured data and unstructured data, and an entity link technology is adopted to identify the entity and the relation between the entities from the acquired data;
the conflicts eliminated in the step A comprise attribute conflicts, name conflicts and structure conflicts;
in the step C, the original relational data is converted into the characteristic data by searching, processing and integrating data from a single table or from a plurality of joined tables in the relational database, and the converted result is the characteristic data to be extracted;
the training generation process of the decision tree refinement analysis model in the step D specifically comprises the following steps:
when generating a decision tree, starting from a root node, selecting a feature from a feature set of sample data for testing, distributing the sample data to child nodes thereof according to a test result, recursively testing and distributing the sample data until reaching a leaf node, and finally distributing the sample data to the leaf node;
the process of distributing the sample data to the leaf nodes specifically comprises the following steps:
1) If the feature value of the selected feature is discrete and limited, adopting a hard allocation mode, namely, allocating one piece of sample data to one of the child nodes according to the test result;
2) If the feature values of the selected feature are ordered and continuous, a soft allocation mode is adopted, namely, one piece of sample data is allocated to one or more child nodes according to the result of a test that uses a piecewise linear fuzzy function:
wherein x is sample data, and γ and δ are the mean value and variance of the data values under the condition corresponding to the selected feature; when the value of the function t is 1, the sample data is allocated to the left child node; when the value of the function t is 0, the sample data is allocated to the right child node; in all other cases, the sample data is allocated to both the left child node and the right child node;
each piece of sample data used in the decision tree is given a membership value at the corresponding node of the decision tree, which indicates the degree to which the sample data belongs to the sample data set under that node; the membership of all sample data under the root node is 1 by default, while for a given node N, the membership of its child node NC is recursively defined as:
wherein x is sample data, μ_NC is the membership degree of the node NC, μ_N is the membership degree of the node N, and t_N is the test function corresponding to the node N;
in the step D, according to the decision tree obtained by training, the refined classification result of the public accumulation fund user test data x' is obtained by the following formula:
wherein x' is the test data; the formula combines the probability estimate of the c-th class at the i-th leaf node with the membership degree of the i-th leaf node; y_c is the probability output for category c; y is the probability output of the category with the largest output; C is the number of categories; and Leave is the set of leaf nodes.
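Step C of claim 2 describes the characteristic data as a relational database view built by integrating one or more tables. A minimal sketch of that idea, using Python's built-in `sqlite3` module, might look as follows. The table names `users` and `deposits`, their columns, and the aggregate features are purely hypothetical; the patent does not specify a schema or a database engine.

```python
import sqlite3

# In-memory database standing in for the relational store of step B (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, employer_type TEXT);
CREATE TABLE deposits (user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'public'), (2, 'private');
INSERT INTO deposits VALUES (1, 500.0), (1, 700.0), (2, 300.0);

-- Feature view: one row per user, joining and aggregating the raw tables.
CREATE VIEW user_features AS
SELECT u.user_id,
       u.employer_type,                 -- discrete feature
       COUNT(d.amount) AS deposit_count,
       AVG(d.amount)   AS mean_deposit  -- continuous feature
FROM users u
LEFT JOIN deposits d ON d.user_id = u.user_id
GROUP BY u.user_id, u.employer_type;
""")

rows = conn.execute("SELECT * FROM user_features ORDER BY user_id").fetchall()
print(rows)
```

Because the features live in a view rather than a copied table, they are recomputed from the original relational data whenever the view is queried.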
CN201911022440.6A 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method Active CN110737731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911022440.6A CN110737731B (en) 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method

Publications (2)

Publication Number Publication Date
CN110737731A CN110737731A (en) 2020-01-31
CN110737731B CN110737731B (en) 2023-12-29

Family

ID=69271378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911022440.6A Active CN110737731B (en) 2019-10-25 2019-10-25 Decision tree-based public accumulation user data refinement analysis system and method

Country Status (1)

Country Link
CN (1) CN110737731B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant