CN111444944A

CN111444944A - Information screening method, device, equipment and storage medium based on decision tree

Info

Publication number: CN111444944A
Application number: CN202010182272.3A
Authority: CN
Inventors: 高越
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2020-07-24

Abstract

The application relates to the technical field of data analysis, and in particular to a decision tree based information screening method, device, equipment and storage medium. The method comprises the following steps: acquiring original data, and extracting keywords in the original data to generate a user portrait; classifying the user portraits to generate user groups with category labels; counting the number of users in each user group, and determining the purity value of each category label; determining the position of each category label in the decision tree according to the purity value of each category label to obtain a decision tree model; inputting user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to the minimum leaf node of the decision tree model according to the limiting conditions of each level of nodes in the decision tree model, and summarizing the screened user information. The method addresses the problem that, when a decision tree is used to analyze client information, the characteristics of users cannot be accurately obtained and high-quality clients cannot be screened from the client information data in a timely and effective manner.

Description

Information screening method, device, equipment and storage medium based on decision tree

Technical Field

The present application relates to the field of data analysis technologies, and in particular, to a method, an apparatus, a device, and a storage medium for information screening based on a decision tree.

Background

The decision tree is a comprehensive evaluation and classification analysis tool, and is mainly characterized in that main influence factors are rapidly screened from a plurality of influence factors with complex interaction through a series of complex calculations such as multi-factor regression and the like, and hierarchical analysis is effectively carried out, so that the function of accurately predicting or classifying research objects is achieved.

At present, when client information is analyzed by the existing decision tree, the characteristics of a user cannot be accurately obtained, so that high-quality clients cannot be timely and effectively screened from huge client information data.

Disclosure of Invention

Based on this, the information screening method, the device, the equipment and the storage medium based on the decision tree are provided for solving the problem that the characteristics of the user cannot be accurately obtained when the current decision tree analyzes the client information, so that the high-quality client cannot be timely and effectively screened from huge client information data.

A method for screening information based on decision trees comprises the following steps:

acquiring original data, extracting key words in the original data, and generating a user portrait according to the key words;

classifying the user portraits according to preset classification conditions to generate a user group with a class label;

counting the number of users in each user group, and determining the purity value of each category of label according to the number of the users;

determining the position of each class label in a decision tree according to the purity value of each class label to obtain a decision tree model;

and inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.

In one possible embodiment, the obtaining raw data, extracting a keyword from the raw data, and generating a user portrait according to the keyword includes:

acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;

acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;

traversing the dynamic data, obtaining user behavior information corresponding to the attribute key words, and creating the user portrait according to the attribute key words and the user behavior information.

In one possible embodiment, the classifying the user portraits according to a preset classification condition, and generating the user group with the category label includes:

inputting the sample data which accords with the preset classification condition into a preset support vector machine model as a training set for training to obtain a trained classification model;

inputting the user portrait into the trained classification model for classification to obtain a plurality of initial user groups;

calculating the association degree between the initial user groups, and if the association degree between any two initial user groups is greater than a preset association degree threshold value, combining the corresponding two initial user groups into one user group;

and obtaining a plurality of user groups with the category labels by taking the category names corresponding to the user groups as the category labels.

In one possible embodiment, the counting the number of users in each user group, and determining the purity value of each category label according to the number of users includes:

acquiring entity names of users in the user group with the category labels, packaging the users belonging to the same entity name into a user group, and counting the number of the users in the user group;

calculating the proportion of the number of users in each user group to the total number of users, and substituting the proportions into a preset purity calculation formula to obtain the purity value of the user group, wherein the purity calculation formula is as follows:

$$H = -\sum_{i} p(i)\,\log_2 p(i)$$

in the formula, H represents the purity, and p(i) represents the ratio of the number of users of the i-th user group to the total number of users in the user group.

In one possible embodiment, the determining, according to the purity value of each class label, a position of each class label in the decision tree to obtain a decision tree model includes:

acquiring a first class label corresponding to the maximum value in the purity values, and taking the first class label as a root node of the decision tree;

acquiring a second class label corresponding to the minimum value in the purity values, and taking the second class label as the minimum leaf node of the decision tree;

and arranging other category labels according to the purity values from large to small, and sequentially filling the labels between the root node and the minimum leaf node to obtain the decision tree model.

In one possible embodiment, after the inputting the user information in the original data to the root node of the decision tree model, and screening the original data step by step from the root node to the smallest leaf node of the decision tree model according to the constraint condition of each level of nodes in the decision tree model, and summarizing the screened user information, the method further includes:

summarizing the information of the pending users that did not pass the minimum leaf node, and sending the pending user information to an auditing terminal for auditing;

and receiving the auditing result of the auditing terminal, re-inputting the correction information in the auditing result into the decision tree model for re-screening, and deleting the user information output by the minimum leaf node of the decision tree model.

An information screening device based on decision tree comprises the following modules:

the portrait generation module is used for acquiring original data, extracting key words in the original data and generating a user portrait according to the key words;

the label endowing module is used for classifying the user portraits according to preset classification conditions to generate a user group with a class label;

the purity value generation module is used for counting the number of users in each user group and determining the purity value of each class of label according to the number of the users;

the decision tree generation module is used for determining the positions of the various types of labels in the decision tree according to the purity values of the various types of labels to obtain a decision tree model;

and the user screening module is used for inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.

In one possible embodiment, the portrait generation module is further configured to:

A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above decision tree based information screening method.

A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above decision tree-based information screening method.

Compared with the existing mechanism, the method and the device have the advantages that the original data are obtained, the key words in the original data are extracted, and the user portrait is generated according to the key words; the user portraits are classified according to preset classification conditions to generate a user group with a class label; the number of users in each user group is counted, and the purity value of each category label is determined according to the number of users; the position of each class label in a decision tree is determined according to the purity value of each class label to obtain a decision tree model; and the user information in the original data is input into a root node of the decision tree model, the original data is screened step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and the screened user information is summarized. This addresses the problem that, when a decision tree analyzes client information, the characteristics of the user cannot be accurately obtained and high-quality clients cannot be screened from huge client information data in a timely and effective manner.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application.

FIG. 1 is a flowchart illustrating an overall decision tree based information screening method according to an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating a portrait generation process in a decision tree based information screening method according to an embodiment of the present application;

FIG. 3 is a block diagram of a decision tree based information screening apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Fig. 1 is an overall flowchart of a decision tree-based information screening method according to an embodiment of the present application, where the decision tree-based information screening method includes the following steps:

S1, acquiring original data, extracting key words in the original data, and generating a user portrait according to the key words;

Specifically, the original data in this step may be data generated when the user purchases a policy or browses the housekeeper app, such as the user's behavior log data, purchase history data and collected goods. The keywords queried in the original data may be the user's birthday, age, residential community, constellation and gender, and the model of the terminal used. The keywords are combined to obtain a user portrait, for example: user Zhang San purchased 10 flight delay insurance policies in the past 30 days and browsed the housekeeper app for 10 hours; user Li Si placed 30 orders within 1 month, each with a payment amount of more than 200. The user portrait mainly comprises time, place and person. Each user action is essentially a random event, which can be described in detail as: which user did what, at what time and at what location.

S2, classifying the user portraits according to preset classification conditions to generate a user group with a class label;

Specifically, the preset classification condition is generated through a label model of the user, where the model includes inherent attribute labels, user preference labels, model labels, and the like. A model label is based on the data collected for the user within a certain period of time (for example, the user may prefer one high-end brand in one period and other high-end brands in another), so it changes dynamically over time. Dynamic label maintenance, i.e. a label management function, is therefore added, so that the model labels of a user are updated as the user's data changes. For example, Zhang San's labels are: label A (5 times), label B (2 times), label C (1 time). The weight of Zhang San's behavior type A is a/(a + b + c) = 5/(5 + 2 + 1) = 5/8. Time decay is also considered: the earlier a user browsed or purchased a commodity, the less it influences the current labels; for example, a black-and-white television purchased years ago carries less weight than a color television purchased 10 years later. The time attenuation coefficient formula is: 1/[100 + (current time - time of past purchase)]. The user label weight calculation formula is: label weight = number of behaviors × time attenuation coefficient × behavior type weight.

Examples are:

Name	Number of purchases	Time of purchase	Behavior	Label
Zhang San	5	2009-01-01	purchased accident insurance	A
Zhang San	2	2019-01-01	purchased accident insurance	B

According to the label weight calculation formula:

Zhang San's A label weight: 5 × [1/(100 + (2019 - 2009))] × 5/8 = 5 × (1/110) × 5/8 ≈ 0.0284

Zhang San's B label weight: 2 × [1/(100 + (2019 - 2019))] × 2/8 = 2 × (1/100) × 2/8 = 0.005

Zhang San can then be assigned to different user groups according to the label weights.
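The label weight calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming time is measured in whole years as in the example; the function names `time_decay` and `label_weight` are hypothetical and do not appear in the application.

```python
def time_decay(current_year: int, purchase_year: int) -> float:
    """Time attenuation coefficient: 1 / [100 + (current time - time of past purchase)]."""
    return 1.0 / (100 + (current_year - purchase_year))


def label_weight(count: int, total_count: int, current_year: int, purchase_year: int) -> float:
    """Label weight = number of behaviors * time attenuation coefficient * behavior type weight."""
    type_weight = count / total_count  # e.g. a / (a + b + c)
    return count * time_decay(current_year, purchase_year) * type_weight


# Label A: 5 of the 5 + 2 + 1 = 8 recorded behaviors, purchase made in 2009, evaluated in 2019.
weight_a = label_weight(count=5, total_count=8, current_year=2019, purchase_year=2009)
print(round(weight_a, 4))  # 0.0284
```

Because the decay term only grows with the age of the purchase, an older behavior always contributes less than an equally frequent recent one.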

S3, counting the number of users in each user group, and determining the purity value of each category label according to the number of the users;

the purity value is a ratio of the number of users meeting a preset condition in the user group to the total number.

For example, the data of user group A is as follows:

According to travel insurance purchase (Yes, No) and annual income (less than 100W, greater than or equal to 100W), a data classification table is obtained:

Annual income	50	102	110	120
High-quality customer	No	Yes	Yes	Yes
Dividing point	50	100	110	120

According to the purity calculation formula:

H(D) = -0.75*log2(0.75) - 0.25*log2(0.25)

= 0.75*0.415 + 0.25*2

= 0.311 + 0.5 = 0.811

According to the above example, the proportion of high-quality customers is 3/4 and the proportion of non-high-quality customers is 1/4; the high-quality customers all have annual incomes greater than 100W.
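The purity calculation in this example can be reproduced with a short Python sketch. It assumes the purity value is the base-2 entropy used in the worked calculation above; the function name `purity` and the list of "Yes"/"No" labels are illustrative only.

```python
import math
from collections import Counter


def purity(labels):
    """Return H = -sum(p(i) * log2(p(i))) over the groups found in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Four customers, as in the example: one is not a high-quality customer, three are.
customers = ["No", "Yes", "Yes", "Yes"]
h = purity(customers)
print(round(h, 3))  # 0.811
```

Counting with `Counter` mirrors the step of counting the users in each group before taking the proportions p(i).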

S4, determining the positions of the various category labels in the decision tree according to the purity values of the various category labels to obtain a decision tree model;

specifically, the higher the purity value, the higher the position in the decision tree model, for example, the purity value of the annual income is 0.999, the purity value of the manager is 0.92, and the purity value of the age is 0.56, then the annual income is at the root node of the decision tree model, the manager is the first-level leaf node, and the age is the second-level leaf node.
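As a rough sketch of this ordering step, the labels can simply be sorted by purity value, with the first entry taken as the root node and the last as the minimum leaf node. The function name `order_labels` is hypothetical, and the purity values are the ones quoted in the example.

```python
def order_labels(purity_by_label):
    """Sort category labels from the highest purity value (root) to the lowest (minimum leaf)."""
    return sorted(purity_by_label, key=purity_by_label.get, reverse=True)


levels = order_labels({"age": 0.56, "annual income": 0.999, "manager": 0.92})
print(levels)  # ['annual income', 'manager', 'age']
```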

S5, inputting the user information in the original data into the root node of the decision tree model, screening the original data step by step from the root node to the minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.

Specifically, when the user information is input into the decision tree model, the root node labels of the decision tree model are firstly used as keywords to screen the user information, and a word vector conversion mode can be adopted during screening, namely, each word in the user information is subjected to word vector conversion, and then each node label of the decision tree model is subjected to word vector conversion. The Word vector conversion can adopt Word2vec and other Word vector conversion tools. If the user information contains words consistent with the root node word vectors of the decision tree model, the user information is shown to be consistent with the root node conditions of the decision tree model, and at the moment, user screening does not need to be carried out through child nodes of the decision tree model. And for the words which are not consistent with the word vectors of the root nodes in the user information, screening the word vectors of the leaf nodes of each level step by step.
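The level-by-level screening can be illustrated with the following Python sketch. It is a simplification under stated assumptions: exact string matching stands in for the word vector comparison (no Word2vec model is used here), the decision tree is represented as a plain list of node labels from the root to the minimum leaf, and the names `screen_users` and `user_3` are hypothetical.

```python
def screen_users(users, node_labels):
    """Walk each user down the levels and return (screened, pending).

    `users` maps a user name to the set of words in that user's information.
    `node_labels` lists node labels from the root to the minimum leaf.
    A user is screened in at the first level whose label appears in the
    user's words; users matching no level are returned as pending.
    """
    screened, pending = {}, []
    for name, words in users.items():
        for depth, label in enumerate(node_labels):
            if label in words:  # stand-in for comparing word vectors
                screened[name] = depth
                break
        else:
            pending.append(name)
    return screened, pending


users = {
    "Zhang San": {"annual income", "manager"},
    "Li Si": {"age"},
    "user_3": {"constellation"},
}
screened, pending = screen_users(users, ["annual income", "manager", "age"])
print(screened)  # {'Zhang San': 0, 'Li Si': 2}
print(pending)   # ['user_3']
```

A user who matches the root label stops at depth 0 and is never tested against the lower levels, which corresponds to the statement that screening through child nodes is then unnecessary.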

In the embodiment, the positions of the various category labels on the decision tree model are set by using the purity values, so that the technical effect of timely and effectively screening high-quality customers from huge customer information data is achieved.

Fig. 2 is a schematic diagram illustrating a portrait generation process in a decision tree based information screening method according to an embodiment of the present application. As shown in fig. 2, the obtaining raw data, extracting keywords from the raw data, and generating a user portrait according to the keywords includes:

S11, acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;

The original data can be divided into two categories, static data and dynamic data. Static data: data such as the user's demographic attributes, business attributes, consumption characteristics and lifestyle belong to static data, which is relatively fixed and rarely changes with user behavior. There are various ways to acquire it, data mining being a common one. The mining can take various forms, such as focus groups, in-depth user interviews, logs and questionnaires, and mainly uncovers the real psychological needs of the user through open-ended questions. The key to the questionnaire approach is how to model and analyze the collected data afterwards, the main purpose being to obtain the patterns of user attributes through hypothesis testing;

Dynamic data: the constantly changing behavior information of the user, for example a user opening a web page and purchasing personal accident insurance. Likewise, a change of the user's telephone number, the purchase of a book on insurance knowledge on site A, the booking of tickets to Shanghai on site B, the purchase of personal accident insurance or flight delay insurance, and the like are all dynamic behavior data that can be recorded.

S12, acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;

Specifically, the keywords may represent content in which the user has an interest, preference, need, and the like. A stroke similarity comparison may be used for keyword extraction: a stroke and glyph similarity calculation is performed between the words in the static data and the words in the attribute keyword list, and if the calculation result is smaller than or equal to a similarity threshold, the word in the static data is considered similar to the word in the attribute keyword list, and that list word is taken as an attribute keyword extracted from the static data.
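The thresholded matching against the keyword list can be sketched as follows. The application does not specify the stroke similarity calculation, so this example swaps in a character-level distance from Python's standard `difflib` module purely as a stand-in; the names `distance` and `extract_keywords`, the threshold of 0.2 and the sample words are all assumptions.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Dissimilarity in [0, 1]; 0 means the two strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def extract_keywords(static_words, keyword_list, threshold=0.2):
    """Return list entries whose distance to some static-data word is <= threshold."""
    found = []
    for keyword in keyword_list:
        if any(distance(word, keyword) <= threshold for word in static_words):
            found.append(keyword)
    return found


static_words = ["gender", "birthdy", "terminal"]  # "birthdy" is a deliberately imperfect match
keyword_list = ["birthday", "gender", "age"]
keywords = extract_keywords(static_words, keyword_list)
print(keywords)
```

Returning the entry from the keyword list rather than the word found in the static data follows the text: the list word is what gets recorded as the extracted attribute keyword.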

And S13, traversing the dynamic data to obtain user behavior information corresponding to the attribute key words, and creating the user portrait according to the attribute key words and the user behavior information.

Specifically, the verb adjacent to the attribute keyword in the dynamic data is the user behavior. For example, in "Manager Zhang (keyword) purchased 30 orders", the user behavior is "purchase".
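A minimal sketch of this pairing step is given below. It assumes the dynamic data has already been split into tokens and that the token directly after a keyword is the verb; a real system would need part-of-speech tagging to confirm that. The function name `build_portrait` and the token lists are hypothetical.

```python
def build_portrait(attribute_keywords, dynamic_records):
    """Pair each attribute keyword with the words that directly follow it.

    `dynamic_records` is a list of token lists, one per recorded behavior.
    """
    portrait = {keyword: [] for keyword in attribute_keywords}
    for tokens in dynamic_records:
        for i, token in enumerate(tokens[:-1]):
            if token in portrait:
                portrait[token].append(tokens[i + 1])  # adjacent word taken as the behavior
    return portrait


records = [["manager", "purchased", "30", "orders"]]
portrait = build_portrait(["manager"], records)
print(portrait)  # {'manager': ['purchased']}
```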

According to the embodiment, the user portrait is accurately established through the static data and the dynamic data, so that the accuracy of screening the original data is guaranteed.

In one embodiment, the classifying the user portraits according to a preset classification condition to generate a user group with a category label includes:

Specifically, a Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification on data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane solved for the training samples.

Specifically, the initial user groups may be, for example, product managers, sales managers, senior engineers, junior engineers, and so on.

The degree of association is a term from grey system analysis that characterizes how closely two things are related; mathematically, it refers to the degree of similarity between two functions. By calculating the degree of association, the product managers and sales managers in the previous step can be merged into business managers.
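The merging of initial user groups can be sketched as below. The grey relational calculation itself is not given in the text, so this example uses the Jaccard overlap of each group's keyword set as a stand-in association degree; the keyword sets, the threshold of 0.4 and the names `jaccard` and `merge_groups` are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two keyword sets, used here as a stand-in association degree."""
    return len(a & b) / len(a | b)


def merge_groups(groups, threshold):
    """Merge initial user groups whose pairwise association exceeds `threshold`."""
    parent = {name: name for name in groups}

    def find(name):
        # Follow parent links until the representative of the merged group is reached.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(groups, 2):
        if jaccard(groups[a], groups[b]) > threshold:
            parent[find(b)] = find(a)

    merged = {}
    for name in groups:
        merged.setdefault(find(name), []).append(name)
    return list(merged.values())


groups = {
    "product manager": {"manager", "business", "planning"},
    "sales manager": {"manager", "business", "sales"},
    "senior engineer": {"engineer", "technical"},
    "junior engineer": {"engineer", "training"},
}
merged = merge_groups(groups, threshold=0.4)
print(merged)
# [['product manager', 'sales manager'], ['senior engineer'], ['junior engineer']]
```

With these sample sets only the two manager groups exceed the threshold (association 0.5), so they are combined while the engineer groups (association 1/3) stay separate.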

In the embodiment, the user portraits are effectively grouped, so that the user labels are conveniently put into the corresponding nodes of the decision tree.

In an embodiment, the counting the number of users in each user group, and determining the purity value of each category label according to the number of users includes:

the entity names can be extracted from the user groups by adopting a knowledge graph technology.

In the embodiment, the user attribute of each user group can be effectively obtained by calculating the purity value, so that the dimension of original data screening is simplified.

In an embodiment, the determining, according to the purity value of each class label, a position of each class label in the decision tree to obtain the decision tree model includes:

Specifically, if the purity values corresponding to any two of the category labels are consistent, the logical relationship between the category labels is obtained;

if the logical relation is AND, respectively writing the category labels into a plurality of leaf nodes at the same level;

if the logical relationship is "not", then the purity value of the class label is recalculated.

In one embodiment, after the inputting the user information in the original data to the root node of the decision tree model, and performing step-by-step screening on the original data from the root node to the minimum leaf node of the decision tree model according to the constraint condition of each level of nodes in the decision tree model, and summarizing the screened user information, the method further includes:

Specifically, the parameter values that failed in the user information that did not pass the minimum leaf node are sent to the auditing terminal. For example, if the passing condition is a premium of more than 100,000 and the premium of client A is 30,000, the information of client A is sent to the auditing terminal for auditing. If the audit finds that client A pays a premium of 30,000 per year for 5 years, and the 100,000 of the minimum leaf node refers to the total amount, then this user is marked as a problem user and the premium is corrected to 150,000.
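The correction made at the auditing terminal amounts to a small calculation, sketched here assuming an annual premium of 30,000 paid for 5 years and a threshold of 100,000 on the total amount. The function name `audit_total_premium` is hypothetical.

```python
def audit_total_premium(annual_premium, years, threshold):
    """Return the corrected total premium and whether it now exceeds the threshold."""
    total = annual_premium * years
    return total, total > threshold


total, passes = audit_total_premium(annual_premium=30_000, years=5, threshold=100_000)
print(total, passes)  # 150000 True
```

The corrected total can then be fed back into the decision tree model for re-screening.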

The user information output by the minimum leaf node is the user information that does not meet the high-quality client conditions of this scheme.

In this embodiment, the error information is corrected by the audit terminal, so that the accuracy of the decision tree model for screening the user information is improved.

The technical features mentioned in any of the above corresponding embodiments or implementations are also applicable to the embodiment corresponding to fig. 3 in the present application, and the details of the subsequent similarities are not repeated.

In the above description, a method for filtering information based on a decision tree in the present application is described, and an apparatus for filtering information based on a decision tree is described below.

Fig. 3 is a block diagram of a decision tree based information screening apparatus, which can be applied to decision tree based information screening. The information screening device based on the decision tree in the embodiment of the present application can implement the steps corresponding to the information screening method based on the decision tree executed in the embodiment corresponding to fig. 1. The function realized by the information screening device based on the decision tree can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.

In one embodiment, an information filtering apparatus based on a decision tree is provided, as shown in fig. 3, including the following modules:

the portrait generation module 10 is used for acquiring original data, extracting key words in the original data, and generating a user portrait according to the key words;

the label giving module 20 is configured to classify the user portraits according to preset classification conditions, and generate a user group with a category label;

a purity value generation module 30, configured to count the number of users in each user group, and determine a purity value of each category of tag according to the number of users;

the decision tree generation module 40 is configured to determine, according to the purity value of each category of tag, a position of each category of tag in the decision tree to obtain a decision tree model;

and the user screening module 50 is configured to input the user information in the original data to a root node of the decision tree model, screen the original data step by step from the root node to a minimum leaf node of the decision tree model according to a limiting condition of each level of nodes in the decision tree model, and collect the screened user information.

In one embodiment, the portrait generation module is further specifically configured to:

In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the decision tree based information screening method in the above embodiments.

In one embodiment, a storage medium storing computer-readable instructions is provided, which when executed by one or more processors, cause the one or more processors to perform the steps of the decision tree based information screening method in the above embodiments. The storage medium may be a nonvolatile storage medium or a volatile storage medium, and the present application is not limited in particular.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-described embodiments are merely illustrative of some embodiments of the present application, which are described in more detail and detail, but are not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for screening information based on a decision tree is characterized in that the method for screening information based on the decision tree comprises the following steps:

2. The method of claim 1, wherein the obtaining raw data, extracting keywords from the raw data, and generating a user portrait according to the keywords comprises:

3. The method of claim 1, wherein the classifying the user portraits according to a preset classification condition to generate the user group with the category label comprises:

4. The method as claimed in claim 1, wherein the counting the number of users in each user group, and determining the purity value of each category label according to the number of users comprises:

5. The method according to any one of claims 1 to 4, wherein the determining the position of each class label in the decision tree according to the purity value of each class label to obtain the decision tree model comprises:

6. The method as claimed in claim 5, wherein after the inputting user information in the raw data to a root node of the decision tree model, and performing a step-by-step screening on the raw data from the root node to a minimum leaf node of the decision tree model according to a constraint condition of each level of nodes in the decision tree model, and summarizing the screened user information, the method further comprises:

7. A decision tree based information screening apparatus, comprising:

8. The decision tree-based information screening apparatus of claim 7, wherein the portrait generation module is further configured to:

9. Decision tree based information screening equipment comprising a memory and a processor, the memory having stored therein computer readable instructions, wherein the computer readable instructions, when executed by the processor, cause the processor to perform the decision tree based information screening method of any one of claims 1 to 6.

10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of decision tree based information screening of any one of claims 1 to 6.