CN111444944A - Information screening method, device, equipment and storage medium based on decision tree - Google Patents

Information screening method, device, equipment and storage medium based on decision tree

Info

Publication number
CN111444944A
Authority
CN
China
Prior art keywords
user
decision tree
information
users
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010182272.3A
Other languages
Chinese (zh)
Inventor
高越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010182272.3A
Publication of CN111444944A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0203 Market surveys; Market polls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Technology Law (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data analysis, and in particular to a decision tree-based information screening method, device, equipment and storage medium. The method comprises the following steps: acquiring original data, and extracting keywords from the original data to generate user portraits; classifying the user portraits to generate user groups with category labels; counting the number of users in each user group, and determining the purity value of each category label; determining the position of each category label in the decision tree according to its purity value to obtain a decision tree model; inputting the user information in the original data into the root node of the decision tree model, screening the original data step by step from the root node to the minimum leaf node according to the limiting conditions of the nodes at each level, and summarizing the screened user information. The method addresses the problem that, when a decision tree is used to analyze client information, the characteristics of users cannot be accurately obtained, so that high-quality clients cannot be screened from the client information data in a timely and effective manner.

Description

Information screening method, device, equipment and storage medium based on decision tree
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a method, an apparatus, a device, and a storage medium for information screening based on a decision tree.
Background
A decision tree is a comprehensive evaluation and classification analysis tool. Its main feature is that, through a series of complex calculations such as multi-factor regression, it rapidly screens out the main influencing factors from many factors with complex interactions and performs hierarchical analysis effectively, so that research objects can be accurately predicted or classified.
At present, when existing decision trees are used to analyze client information, the characteristics of users cannot be accurately obtained, so high-quality clients cannot be screened from huge volumes of client information data in a timely and effective manner.
Disclosure of Invention
Based on this, a decision tree-based information screening method, device, equipment and storage medium are provided to solve the problem that, when a current decision tree analyzes client information, the characteristics of users cannot be accurately obtained, so that high-quality clients cannot be screened from huge volumes of client information data in a timely and effective manner.
A method for screening information based on decision trees comprises the following steps:
acquiring original data, extracting key words in the original data, and generating a user portrait according to the key words;
classifying the user portraits according to preset classification conditions to generate a user group with a class label;
counting the number of users in each user group, and determining the purity value of each category of label according to the number of the users;
determining the position of each class label in a decision tree according to the purity value of each class label to obtain a decision tree model;
and inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.
In one possible embodiment, the obtaining raw data, extracting a keyword from the raw data, and generating a user portrait according to the keyword includes:
acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
traversing the dynamic data, obtaining user behavior information corresponding to the attribute key words, and creating the user portrait according to the attribute key words and the user behavior information.
In one possible embodiment, the classifying the user portraits according to a preset classification condition, and generating the user group with the category label includes:
inputting the sample data which accords with the preset classification condition into a preset support vector machine model as a training set for training to obtain a trained classification model;
inputting the user portrait into the trained classification model for classification to obtain a plurality of initial user groups;
calculating the association degree between the initial user groups, and if the association degree between any two initial user groups is greater than a preset association degree threshold value, combining the corresponding two initial user groups into one user group;
and obtaining a plurality of user groups with the category labels by taking the category names corresponding to the user groups as the category labels.
In one possible embodiment, the counting the number of users in each user group, and determining the purity value of each category label according to the number of users includes:
acquiring entity names of users in the user group with the category labels, packaging the users belonging to the same entity name into a user group, and counting the number of the users in the user group;
calculating the proportion of the number of the users in each user group to the total number of the users in the user group, and substituting the proportions into a preset purity calculation formula to obtain the purity value of the user group, wherein the purity value calculation formula is as follows:
$$H = -\sum_{i} p(i)\,\log_2 p(i)$$
in the formula, H represents purity, and p (i) represents a ratio of the number of users of the ith user group to the total number of users in the user group.
In one possible embodiment, the determining, according to the purity value of each class label, a position of each class label in the decision tree to obtain a decision tree model includes:
acquiring a first class label corresponding to the maximum value in the purity values, and taking the first class label as a root node of the decision tree;
acquiring a second class label corresponding to the minimum value in the purity values, and taking the second class label as the minimum leaf node of the decision tree;
and arranging other category labels according to the purity values from large to small, and sequentially filling the labels between the root node and the minimum leaf node to obtain the decision tree model.
In one possible embodiment, after the inputting the user information in the original data to the root node of the decision tree model, and screening the original data step by step from the root node to the smallest leaf node of the decision tree model according to the constraint condition of each level of nodes in the decision tree model, and summarizing the screened user information, the method further includes:
summarizing the information of the pending users that did not pass the minimum leaf node, and sending the information of the pending users to an auditing terminal for auditing;
and receiving the auditing result of the auditing terminal, re-inputting the correction information in the auditing result into the decision tree model for re-screening, and deleting the user information output by the minimum leaf node of the decision tree model.
An information screening device based on decision tree comprises the following modules:
the portrait generation module is used for acquiring original data, extracting key words in the original data and generating a user portrait according to the key words;
the label assignment module is used for classifying the user portraits according to preset classification conditions to generate a user group with a class label;
the purity value generation module is used for counting the number of users in each user group and determining the purity value of each class of label according to the number of the users;
the decision tree generation module is used for determining the positions of the various types of labels in the decision tree according to the purity values of the various types of labels to obtain a decision tree model;
and the user screening module is used for inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.
In one possible embodiment, the portrait generation module is further configured to:
acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
traversing the dynamic data, obtaining user behavior information corresponding to the attribute key words, and creating the user portrait according to the attribute key words and the user behavior information.
A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above decision tree based information screening method.
A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above decision tree-based information screening method.
Compared with the prior art, the present application acquires original data, extracts keywords from the original data, and generates user portraits according to the keywords; classifies the user portraits according to preset classification conditions to generate user groups with category labels; counts the number of users in each user group, and determines the purity value of each category label according to the number of users; determines the position of each category label in a decision tree according to its purity value to obtain a decision tree model; and inputs the user information in the original data into the root node of the decision tree model, screens the original data step by step from the root node to the minimum leaf node according to the limiting conditions of the nodes at each level, and summarizes the screened user information. This solves the problem that, when a decision tree analyzes client information, the characteristics of users cannot be accurately obtained, so that high-quality clients cannot be screened from huge volumes of client information data in a timely and effective manner.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application.
FIG. 1 is an overall flowchart of a decision tree-based information screening method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the user portrait generation process in a decision tree-based information screening method according to an embodiment of the present application;
FIG. 3 is a block diagram of a decision tree-based information screening apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is an overall flowchart of a decision tree-based information screening method according to an embodiment of the present application, where the decision tree-based information screening method includes the following steps:
S1, acquiring original data, extracting keywords from the original data, and generating a user portrait according to the keywords;
Specifically, the original data in this step may be data generated when the user purchases a policy or browses the housekeeper app, such as the user's behavior log data, purchase history data and favorited items. When keywords are queried in the original data, the keywords may be the user's birthday, age, residential community, zodiac sign and gender, and the model of terminal used. The keywords are combined to obtain a user portrait. For example: user Zhang San purchased 10 flight delay insurance policies in the past 30 days and browsed the housekeeper app for 10 hours; user Li Si purchased 30 orders within 1 month, with a payment amount of more than 200 for each order. A user portrait mainly comprises time, place and person. Each user action is essentially a random event, which can be described in detail as: which user, at what time, at what place, did what.
S2, classifying the user portraits according to preset classification conditions to generate a user group with a class label;
Specifically, the preset classification condition is generated by a label model of the user, which includes inherent attribute labels, user preference labels, model labels and the like. A user's model label is based on the data collected for that user within a certain period of time (the user may like one high-end brand in one period and other high-end brands in another), so it changes dynamically over time. Dynamic label maintenance, that is, a label management function, is therefore added so that the model label of a user is updated as the user data changes.
For example, Zhang San's label hits are: label A (5 times), label B (2 times), label C (1 time). The behavior type weight of Zhang San's label A is a/(a + b + c) = 5/(5 + 2 + 1) = 5/8.
Time decay: the earlier a user browsed or purchased a certain commodity, the less it influences the current label; for example, a user purchased a black-and-white television years ago and a color television 10 years later, and the earlier purchase has less influence on the current label.
Time decay coefficient formula: 1/[100 + (current time - time of past purchase)].
User label weight formula: behavior count × time decay coefficient × behavior type weight.
Example:
Name        Count     Time of purchase   Behavior
Zhang San   5 times   2009-01-01         purchased accident insurance (counted as label A)
Zhang San   2 times   2019-01-01         purchased accident insurance (counted as label B)
According to the label weight formula:
Zhang San's label A weight: 5 × [1/(100 + (2019 - 2009))] × 5/8 = 5 × (1/110) × 5/8 ≈ 0.0284
Zhang San's label B weight: 2 × [1/(100 + (2019 - …
Zhang San can be assigned to different user groups according to the label weights.
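The label weight calculation above can be sketched in Python as follows. This is a minimal illustration, assuming that time is measured in whole years (as in the 2019 - 2009 example) and that the behavior type weight is the share of a label's hits among all hits; the function names are hypothetical and not part of the application.

```python
def time_decay(current_year: int, purchase_year: int) -> float:
    """Time decay coefficient: 1 / [100 + (current time - time of past purchase)]."""
    return 1.0 / (100 + (current_year - purchase_year))


def label_weight(count: int, type_weight: float,
                 current_year: int, purchase_year: int) -> float:
    """Label weight = behavior count * time decay coefficient * behavior type weight."""
    return count * time_decay(current_year, purchase_year) * type_weight


# Label hit counts from the example: A = 5, B = 2, C = 1.
hits = {"A": 5, "B": 2, "C": 1}
total_hits = sum(hits.values())

# Behavior type weight of label A: a / (a + b + c) = 5/8.
type_weight_a = hits["A"] / total_hits

# Label A was recorded in 2009 and evaluated in 2019.
weight_a = label_weight(hits["A"], type_weight_a, 2019, 2009)
print(round(weight_a, 4))  # 0.0284
```

Because the constant 100 dominates the denominator, a ten-year gap lowers the coefficient only from 1/100 to 1/110, so the decay in this formula is mild.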
S3, counting the number of users in each user group, and determining the purity value of each category label according to the number of the users;
The purity value is the ratio of the number of users in the user group who meet a preset condition to the total number of users.
For example, the data of user group A is as follows:
Figure BDA0002412976830000071
A data classification table is obtained according to travel insurance purchase (yes, no) and annual income (less than 100W, greater than or equal to 100W):
Annual income      50    102   110   120
Premium customer   No    Yes   Yes   Yes
Split point        50    100   110   120
According to the purity calculation formula:
$$H = -\sum_{i} p(i)\,\log_2 p(i)$$
in the formula, H represents purity, and p (i) represents a ratio of the number of users of the ith user group to the total number of users in the user group.
H(D)=-0.75*log2(0.75)-0.25*log2(0.25)
=0.75*0.415+0.25*2
=0.311+0.5=0.811
According to the above example, the proportion of premium customers is 3/4 and the proportion of non-premium customers is 1/4; the annual incomes of the premium customers are all greater than 100W.
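The worked purity calculation can be checked with a short Python sketch. It assumes the purity formula is the base-2 entropy shown in the H(D) computation above, and it takes the four premium-customer flags from the classification table; the function name `purity` is illustrative only.

```python
import math


def purity(proportions):
    """H = -sum(p(i) * log2(p(i))); groups with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


# Premium-customer flags for the four users in the example table
# (annual incomes 50, 102, 110, 120).
is_premium = [False, True, True, True]
n = len(is_premium)
p_yes = sum(is_premium) / n   # 3/4
p_no = 1 - p_yes              # 1/4

h = purity([p_yes, p_no])
print(round(h, 3))  # 0.811
```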
S4, determining the positions of the various category labels in the decision tree according to the purity values of the various category labels to obtain a decision tree model;
Specifically, the higher the purity value of a label, the higher its position in the decision tree model. For example, if the purity value of annual income is 0.999, that of manager is 0.92 and that of age is 0.56, then annual income is the root node of the decision tree model, manager is the first-level leaf node and age is the second-level leaf node.
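The placement rule can be illustrated with a small Python sketch that sorts category labels by purity value. It assumes the tree is a single chain of levels, as in the example; the function name `order_labels` is hypothetical, and labels with equal purity values are not given any special treatment here.

```python
def order_labels(purity_by_label):
    """Return labels sorted from highest to lowest purity value.

    Index 0 is the root node; the last entry is the minimum leaf node.
    """
    return sorted(purity_by_label, key=purity_by_label.get, reverse=True)


# Purity values from the example above.
purities = {"age": 0.56, "annual income": 0.999, "manager": 0.92}
levels = order_labels(purities)

for depth, label in enumerate(levels):
    print(depth, label)
```

With these values the root is "annual income", followed by "manager" and then "age".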
S5, inputting the user information in the original data into the root node of the decision tree model, screening the original data step by step from the root node to the minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.
Specifically, when the user information is input into the decision tree model, the root node label of the decision tree model is first used as a keyword to screen the user information. Word vector conversion can be used during screening: each word in the user information is converted into a word vector, and each node label of the decision tree model is converted into a word vector. The conversion can use Word2vec or another word vector tool. If the user information contains a word whose vector is consistent with the root node's word vector, the user information meets the root node condition of the decision tree model, and the user does not need to be screened through the child nodes. Words in the user information that are not consistent with the root node's word vector are screened step by step against the word vectors of the leaf nodes at each level.
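The level-by-level screening can be sketched as below. This is only an illustration under a strong simplification: exact string matching stands in for the word vector comparison (no Word2vec model is loaded), and the node labels, user identifiers and the function name `screen` are hypothetical.

```python
def screen(user_words, node_labels):
    """Return the depth of the first node whose label appears in user_words.

    node_labels is ordered from the root node (index 0) to the minimum
    leaf node. Returns None when no level matches.
    Assumption: string equality replaces word-vector consistency.
    """
    words = set(user_words)
    for depth, label in enumerate(node_labels):
        if label in words:
            return depth  # matched here, so deeper nodes are not consulted
    return None


node_labels = ["annual income", "manager", "age"]
users = {
    "u1": ["annual income", "insurance"],
    "u2": ["age", "manager"],
    "u3": ["browsing"],
}

result = {uid: screen(words, node_labels) for uid, words in users.items()}
print(result)  # {'u1': 0, 'u2': 1, 'u3': None}
```

A user matched at the root (depth 0) is not passed to the child nodes, which mirrors the description above; users with no match at any level would be the ones left for further handling.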
In the embodiment, the positions of the various category labels on the decision tree model are set by using the purity values, so that the technical effect of timely and effectively screening high-quality customers from huge customer information data is achieved.
Fig. 2 is a schematic diagram of the user portrait generation process in a decision tree-based information screening method according to an embodiment of the present application. As shown in the figure, the acquiring original data, extracting keywords from the original data, and generating a user portrait according to the keywords includes:
S11, acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
The original data can be divided into two categories, static data and dynamic data. Static data: data such as the user's demographic attributes, business attributes, consumption characteristics and lifestyle are static data; they are relatively fixed and rarely change with user behavior. They can be acquired in various ways, of which data mining is common; the mining can take various forms, such as group interviews, in-depth user interviews, logs and questionnaires, and mainly uncovers the real psychological needs of users through open-ended questions. The key to the questionnaire approach is how to model and analyze the collected data afterwards, the main purpose being to derive rules about user attributes through hypothesis verification;
Dynamic data: the user's constantly changing behavior information, for example opening a web page or purchasing personal accident insurance; in addition, changing a telephone number, buying a book on insurance knowledge on site A, booking tickets to Shanghai on site B, and purchasing personal accident insurance or flight delay insurance. All such dynamic behavior data can be recorded.
S12, acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
Specifically, the keywords may represent content in which the user has an interest, preference, need and the like. Stroke similarity comparison may be used for keyword extraction: the stroke-level glyph similarity between a word in the static data and each word in the attribute keyword list is calculated, and if the result is less than or equal to a similarity threshold, the word is considered similar to that entry in the attribute keyword list, and the entry is taken as the attribute keyword extracted from the static data.
And S13, traversing the dynamic data to obtain user behavior information corresponding to the attribute key words, and creating the user portrait according to the attribute key words and the user behavior information.
Specifically, the verb adjacent to the attribute keyword in the dynamic data is the user behavior. For example, in "Manager Zhang (keyword) purchased 30 orders", the user behavior is "purchased".
According to the embodiment, the user portrait is accurately established through the static data and the dynamic data, so that the accuracy of screening the original data is guaranteed.
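Steps S11 to S13 can be sketched in Python as follows. The sketch makes several assumptions that are not in the application: the preset classification rule is a fixed set of static field names, exact matching stands in for stroke similarity, and dynamic data arrives already parsed into (subject, verb, object) triples so that the verb next to the keyword can be read directly. All names are hypothetical.

```python
# Assumed classification rule: these fields are static, everything else is dynamic.
STATIC_FIELDS = {"gender", "age", "occupation"}


def split_raw(record):
    """S11: divide one raw record into static data and dynamic data."""
    static = {k: v for k, v in record.items() if k in STATIC_FIELDS}
    dynamic = {k: v for k, v in record.items() if k not in STATIC_FIELDS}
    return static, dynamic


def build_portrait(record, keyword_list):
    """S12 and S13: extract attribute keywords, then attach user behaviors."""
    static, dynamic = split_raw(record)
    # S12: keep static values found in the attribute keyword list
    # (exact match used here in place of stroke similarity).
    keywords = [v for v in static.values() if v in keyword_list]
    # S13: the verb adjacent to a keyword is taken as the user behavior.
    events = dynamic.get("events", [])
    behaviors = {
        kw: [verb for subj, verb, _obj in events if subj == kw]
        for kw in keywords
    }
    return {"keywords": keywords, "behaviors": behaviors}


record = {
    "occupation": "manager",
    "gender": "female",
    "events": [("manager", "purchased", "30 orders"),
               ("manager", "browsed", "app")],
}
portrait = build_portrait(record, keyword_list={"manager", "engineer"})
print(portrait)
```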
In one embodiment, the classifying the user portraits according to a preset classification condition to generate a user group with a category label includes:
inputting the sample data which accords with the preset classification condition into a preset support vector machine model as a training set for training to obtain a trained classification model;
Specifically, a Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification on data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the training samples.
Inputting the user portrait into the trained classification model for classification to obtain a plurality of initial user groups;
Specifically, the initial user groups may be, for example, product managers, sales managers, senior engineers and junior engineers.
Calculating the association degree between the initial user groups, and if the association degree between any two initial user groups is greater than a preset association degree threshold value, combining the corresponding two initial user groups into one user group;
The association degree is a term from grey system analysis that characterizes how closely two things are related; mathematically, it refers to the degree of similarity between two functions. By calculating the association degree, the product manager group and the sales manager group from the previous step can, for example, be merged into a business manager group.
And obtaining a plurality of user groups with the category labels by taking the category names corresponding to the user groups as the category labels.
In the embodiment, the user portraits are effectively grouped, so that the user labels are conveniently put into the corresponding nodes of the decision tree.
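The merging of initial user groups can be sketched as below. The association degrees are made-up inputs, since the application does not give the calculation; the sketch only shows the thresholding and merging. It uses a union-find structure, so merging is transitive (if A merges with B and B with C, all three end up together), which is a design choice of this sketch rather than something stated in the text.

```python
def merge_groups(groups, association, threshold):
    """Merge any two groups whose association degree exceeds the threshold."""
    parent = {g: g for g in groups}

    def find(g):
        # Follow parent links to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), degree in association.items():
        if degree > threshold:
            parent[find(a)] = find(b)

    merged = {}
    for g in groups:
        merged.setdefault(find(g), []).append(g)
    return sorted(merged.values())


groups = ["product manager", "sales manager", "senior engineer", "junior engineer"]
# Hypothetical association degrees between pairs of initial user groups.
association = {
    ("product manager", "sales manager"): 0.9,
    ("senior engineer", "junior engineer"): 0.4,
    ("product manager", "senior engineer"): 0.2,
}
merged_groups = merge_groups(groups, association, threshold=0.8)
print(merged_groups)
```

With a threshold of 0.8, only the product manager and sales manager groups are combined, matching the business manager example in the text.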
In an embodiment, the counting the number of users in each user group, and determining the purity value of each category label according to the number of users includes:
acquiring entity names of users in the user group with the category labels, packaging the users belonging to the same entity name into a user group, and counting the number of the users in the user group;
the entity names can be extracted from the user groups by adopting a knowledge graph technology.
Calculating the proportion of the number of the users in each user group to the total number of the users in the user group, and substituting the proportions into a preset purity calculation formula to obtain the purity value of the user group, wherein the purity value calculation formula is as follows:
$$H = -\sum_{i} p(i)\,\log_2 p(i)$$
in the formula, H represents purity, and p (i) represents a ratio of the number of users of the ith user group to the total number of users in the user group.
In the embodiment, the user attribute of each user group can be effectively obtained by calculating the purity value, so that the dimension of original data screening is simplified.
In an embodiment, the determining, according to the purity value of each class label, a position of each class label in the decision tree to obtain the decision tree model includes:
acquiring a first class label corresponding to the maximum value in the purity values, and taking the first class label as a root node of the decision tree;
acquiring a second class label corresponding to the minimum value in the purity values, and taking the second class label as the minimum leaf node of the decision tree;
and arranging other category labels according to the purity values from large to small, and sequentially filling the labels between the root node and the minimum leaf node to obtain the decision tree model.
Specifically, if the purity values corresponding to any two of the category labels are consistent, the logical relationship between the category labels is obtained;
if the logical relation is AND, respectively writing the category labels into a plurality of leaf nodes at the same level;
if the logical relationship is "not", then the purity value of the class label is recalculated.
In one embodiment, after the user information in the original data is input to the root node of the decision tree model, the original data is screened step by step from the root node to the minimum leaf node of the decision tree model according to the limiting condition of each level of nodes, and the screened user information is summarized, the method further includes:
summarizing the information of the pending users who did not pass the minimum leaf node, and sending the information of the pending users to an auditing terminal for review;
specifically, the parameter value that failed in the user information that did not pass the minimum leaf node is sent to the auditing terminal. For example, suppose the passing condition is a premium greater than 100,000 and the recorded premium of client A is 30,000; client A's information is then sent to the auditing terminal for review. If the audit finds that client A pays a premium of 30,000 per year for 5 years, and that the 100,000 in the condition of the minimum leaf node refers to the total amount, client A is marked as a problem user and the premium is corrected to 150,000 (30,000 × 5).
receiving the audit result from the auditing terminal, re-inputting the corrected information in the audit result into the decision tree model for re-screening, and deleting the user information output by the minimum leaf node of the decision tree model.
The user information output by the minimum leaf node is the user information that does not meet the high-quality client condition of this scheme.
In this embodiment, erroneous information is corrected through the auditing terminal, which improves the accuracy with which the decision tree model screens user information.
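The audit example can be worked through in a few lines of Python. The amounts are read as 100,000 for the threshold and 30,000 per year for client A, which is an interpretation of the translated figures; the record layout and the names `passes` and `apply_audit` are hypothetical.

```python
THRESHOLD = 100_000  # passing condition at the minimum leaf node (total premium)


def passes(total_premium):
    """True when the premium satisfies the minimum leaf node's condition."""
    return total_premium > THRESHOLD


def apply_audit(record, years_paid):
    """Return a corrected copy of a pending user's record.

    ASSUMPTION: the auditor found that the stored value was an annual
    premium, while the condition is defined on the total amount.
    """
    corrected = dict(record)
    corrected["premium"] = record["premium"] * years_paid
    corrected["problem_user"] = True    # flagged, as in the example
    return corrected


client_a = {"name": "A", "premium": 30_000}
first_pass = passes(client_a["premium"])     # fails, so A goes to audit
fixed = apply_audit(client_a, years_paid=5)  # 30,000 x 5 = 150,000
second_pass = passes(fixed["premium"])       # re-screening after correction
```

Re-screening only the corrected records keeps the audit loop small; the text does not specify whether the whole data set is re-run.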
The technical features mentioned in any of the above embodiments or implementations also apply to the embodiment corresponding to fig. 3 in the present application, and similar details are not repeated below.
The decision tree based information screening method of the present application has been described above; the decision tree based information screening apparatus is described below.
Fig. 3 is a block diagram of a decision tree based information screening apparatus, which can be applied to decision tree based information screening. The information screening apparatus in this embodiment of the present application can implement the steps of the decision tree based information screening method executed in the embodiment corresponding to fig. 1. The functions of the apparatus can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
In one embodiment, a decision tree based information screening apparatus is provided, as shown in fig. 3, including the following modules:
the portrait generation module 10 is configured to acquire original data, extract keywords from the original data, and generate user portraits according to the keywords;
the label assigning module 20 is configured to classify the user portraits according to preset classification conditions, and generate user groups with category labels;
the purity value generation module 30 is configured to count the number of users in each user group, and determine the purity value of each category label according to the number of users;
the decision tree generation module 40 is configured to determine, according to the purity value of each category label, the position of each category label in the decision tree to obtain a decision tree model;
and the user screening module 50 is configured to input the user information in the original data to the root node of the decision tree model, screen the original data step by step from the root node to the minimum leaf node of the decision tree model according to the limiting condition of each level of nodes in the decision tree model, and summarize the screened user information.
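The step-by-step screening performed by the user screening module can be sketched as below. This is an illustration under stated assumptions: each tree level is modeled as a (label, predicate) pair ordered from the root to the minimum leaf node, a user who fails any level is treated as a pending user together with the label that failed, and the example conditions and field names are hypothetical.

```python
def screen(users, levels):
    """Screen users level by level, from the root to the minimum leaf.

    levels: list of (label, predicate) pairs, root first (hypothetical
    representation of each node's limiting condition).
    Returns (passed, pending), where pending holds (user, failed_label).
    """
    passed, pending = [], []
    for user in users:
        # First level whose limiting condition the user does not meet.
        failed = next((label for label, ok in levels if not ok(user)), None)
        if failed is None:
            passed.append(user)
        else:
            pending.append((user, failed))
    return passed, pending


# Hypothetical limiting conditions, for illustration only.
levels = [
    ("premium", lambda u: u["premium"] > 100_000),
    ("age", lambda u: 18 <= u["age"] <= 60),
]
users = [
    {"name": "A", "premium": 30_000, "age": 40},
    {"name": "B", "premium": 200_000, "age": 35},
]
passed, pending = screen(users, levels)
```

Recording the failed label alongside each pending user matches the later step in which the failed parameter value is sent to the auditing terminal.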
In one embodiment, the portrait generation module is further specifically configured to:
acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
traversing the dynamic data, obtaining user behavior information corresponding to the attribute keywords, and creating the user portrait according to the attribute keywords and the user behavior information.
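A minimal Python sketch of these portrait-building steps follows. The data shapes are assumptions made for illustration: static data is taken to be a text string, dynamic data a list of event dictionaries with `attribute` and `action` fields, and the keyword list a plain Python list rather than a database query. The function name `build_portrait` is hypothetical.

```python
def build_portrait(static_text, dynamic_events, attribute_keywords):
    """Create a user portrait from static and dynamic data.

    ASSUMPTIONS: static_text is free text, dynamic_events is a list of
    {"attribute": ..., "action": ...} dicts, and attribute_keywords
    stands in for the list prestored in the database.
    """
    # Extract the attribute keywords that occur in the static data.
    attributes = [kw for kw in attribute_keywords if kw in static_text]
    # Traverse the dynamic data for behavior tied to each keyword.
    behaviors = {
        kw: [e["action"] for e in dynamic_events if e["attribute"] == kw]
        for kw in attributes
    }
    return {"attributes": attributes, "behaviors": behaviors}


portrait = build_portrait(
    static_text="age 40, occupation engineer",
    dynamic_events=[
        {"attribute": "occupation", "action": "viewed group policy"},
        {"attribute": "hobby", "action": "clicked travel offer"},
    ],
    attribute_keywords=["age", "occupation", "income"],
)
```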
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the decision tree based information screening method in the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is provided, which when executed by one or more processors, cause the one or more processors to perform the steps of the decision tree based information screening method in the above embodiments. The storage medium may be a nonvolatile storage medium or a volatile storage medium, and the present application is not limited in particular.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described embodiments illustrate only some embodiments of the present application. They are described specifically and in detail, but are not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A decision tree based information screening method, characterized in that the method comprises the following steps:
acquiring original data, extracting keywords from the original data, and generating user portraits according to the keywords;
classifying the user portraits according to preset classification conditions to generate user groups with category labels;
counting the number of users in each user group, and determining the purity value of each category label according to the number of users;
determining the position of each category label in a decision tree according to the purity value of each category label to obtain a decision tree model;
and inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of nodes in the decision tree model, and summarizing the screened user information.
2. The method of claim 1, wherein the acquiring original data, extracting keywords from the original data, and generating user portraits according to the keywords comprises:
acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
traversing the dynamic data, obtaining user behavior information corresponding to the attribute keywords, and creating the user portrait according to the attribute keywords and the user behavior information.
3. The method of claim 1, wherein the classifying the user portraits according to preset classification conditions to generate user groups with category labels comprises:
inputting the sample data which accords with the preset classification condition into a preset support vector machine model as a training set for training to obtain a trained classification model;
inputting the user portrait into the trained classification model for classification to obtain a plurality of initial user groups;
calculating the association degree between the initial user groups, and if the association degree between any two initial user groups is greater than a preset association degree threshold value, combining the corresponding two initial user groups into one user group;
and obtaining a plurality of user groups with the category labels by taking the category names corresponding to the user groups as the category labels.
4. The method as claimed in claim 1, wherein the counting the number of users in each user group, and determining the purity value of each category label according to the number of users comprises:
acquiring entity names of users in the user group with the category label, packaging the users belonging to the same entity name into a user group, and counting the number of users in the user group;
calculating the proportion of the number of users in each user group to the total number of users in the labeled user group, and substituting the proportions into a preset purity calculation formula to obtain the purity value of the user group, wherein the purity value calculation formula is as follows:
Figure FDA0002412976820000021
in the formula, H represents the purity, and p(i) represents the ratio of the number of users in the i-th user group to the total number of users in the labeled user group.
5. The method according to any one of claims 1 to 4, wherein the determining the position of each category label in the decision tree according to the purity value of each category label to obtain the decision tree model comprises:
acquiring a first category label corresponding to the maximum of the purity values, and taking the first category label as a root node of the decision tree;
acquiring a second category label corresponding to the minimum of the purity values, and taking the second category label as the minimum leaf node of the decision tree;
and arranging the other category labels by purity value from large to small, and filling them in order between the root node and the minimum leaf node to obtain the decision tree model.
6. The method as claimed in claim 5, wherein after the inputting the user information in the original data to a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of nodes in the decision tree model, and summarizing the screened user information, the method further comprises:
summarizing the information of the pending users who did not pass the minimum leaf node, and sending the information of the pending users to an auditing terminal for review;
and receiving the audit result from the auditing terminal, re-inputting the corrected information in the audit result into the decision tree model for re-screening, and deleting the user information output by the minimum leaf node of the decision tree model.
7. A decision tree based information screening apparatus, comprising:
the portrait generation module is used for acquiring original data, extracting keywords from the original data, and generating user portraits according to the keywords;
the label assigning module is used for classifying the user portraits according to preset classification conditions to generate user groups with category labels;
the purity value generation module is used for counting the number of users in each user group and determining the purity value of each category label according to the number of users;
the decision tree generation module is used for determining the position of each category label in the decision tree according to the purity value of each category label to obtain a decision tree model;
and the user screening module is used for inputting the user information in the original data into a root node of the decision tree model, screening the original data step by step from the root node to a minimum leaf node of the decision tree model according to the limiting condition of each level of node in the decision tree model, and summarizing the screened user information.
8. The decision tree based information screening apparatus of claim 7, wherein the portrait generation module is further configured to:
acquiring original data, and dividing the original data into dynamic data and static data according to a preset classification rule;
acquiring an attribute keyword list prestored in a database, and extracting attribute keywords from the static data according to the attribute keyword list;
traversing the dynamic data, obtaining user behavior information corresponding to the attribute keywords, and creating the user portrait according to the attribute keywords and the user behavior information.
9. A decision tree based information screening device, being a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions, wherein the computer readable instructions, when executed by the processor, cause the processor to perform the decision tree based information screening method of any one of claims 1 to 6.
10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of decision tree based information screening of any one of claims 1 to 6.
CN202010182272.3A 2020-03-16 2020-03-16 Information screening method, device, equipment and storage medium based on decision tree Pending CN111444944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010182272.3A CN111444944A (en) 2020-03-16 2020-03-16 Information screening method, device, equipment and storage medium based on decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010182272.3A CN111444944A (en) 2020-03-16 2020-03-16 Information screening method, device, equipment and storage medium based on decision tree

Publications (1)

Publication Number Publication Date
CN111444944A true CN111444944A (en) 2020-07-24

Family

ID=71652333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010182272.3A Pending CN111444944A (en) 2020-03-16 2020-03-16 Information screening method, device, equipment and storage medium based on decision tree

Country Status (1)

Country Link
CN (1) CN111444944A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination