CN109885597B

CN109885597B - User grouping processing method and device based on machine learning and electronic terminal

Info

Publication number: CN109885597B
Application number: CN201910012154.5A
Authority: CN
Inventors: 孙闳绅; 金戈; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2023-05-30
Anticipated expiration: 2039-01-07
Also published as: CN109885597A

Abstract

Embodiments of the invention provide a machine learning-based user grouping processing method and device and an electronic terminal, and relate to the technical field of data processing. A plurality of desensitized user data are acquired after desensitization processing; the desensitized user data are then classified by a decision tree model in the Spark big data operation platform to obtain a classification processing result; based on the decision path of the decision tree and the classification processing result, group information of each node in the decision tree model is counted layer by layer; and a user data distribution tree is generated according to the decision path of the decision tree and the group information of each node in the decision tree model, and is stored. The technical scheme can therefore address the high labor cost and low grouping accuracy of the prior-art approach of grouping user portraits manually.

Description

User grouping processing method and device based on machine learning and electronic terminal

[ field of technology ]

The present invention relates to the field of data processing technologies, and in particular, to a machine learning-based user grouping processing method and apparatus, and an electronic terminal.

[ background Art ]

With the rapid development of the information age, a large amount of user information is generated every day. By analyzing this information effectively, each specific piece of information about a user is abstracted into labels, and the labels are used to make the user's image concrete, thereby constructing a user portrait.

Traditionally, a user portrait is generated by computing feature statistics for different users according to certain rules and then grouping the users manually. This approach is adequate when only a small amount of user information has to be grouped. In a big data environment, however, user information spans many dimensions, is complex, and is large in volume. Manual grouping is then time-consuming, laborious, and costly, and it is difficult to take every user's information fully into account, so the grouping accuracy is low.

[ invention ]

In view of the above, the embodiment of the invention provides a machine learning-based user grouping processing method, a machine learning-based user grouping processing device and an electronic terminal, which are used for solving the problems that the prior art is time-consuming and labor-consuming for user grouping and the user grouping accuracy is low.

In one aspect, an embodiment of the present invention provides a machine learning-based user grouping method, where the machine learning-based user grouping method includes:

Acquiring a plurality of desensitized user data after desensitization processing, wherein the desensitized user data comprises a user identifier and a user characteristic corresponding to the user identifier;

classifying the desensitized user data by utilizing a decision tree model in the Spark big data operation platform to obtain a classification processing result;

based on the decision path of the decision tree and the classification processing result, counting group information of each node in the decision tree model layer by layer, wherein the group information comprises the number of user identifiers, the user characteristics, and the user proportion of the user characteristics of each dimension;

and generating a user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model, and storing the user data distribution tree.

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which, when counting the number of user identifiers included in each node in the decision tree model, the step of counting, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result includes:

counting the number of user identifiers contained in each terminal node in the decision tree;

according to the decision path of the decision tree, determining the nodes connected to the same target node N_i as a group of first candidate nodes, so as to obtain a plurality of groups of first candidate nodes, wherein i is an integer greater than or equal to 1, and when i = 1, the target node N_i is the upper-level node adjacent to the end nodes;

for each group, obtaining the sum of the numbers of user identifiers included in all the first candidate nodes of the group, so as to obtain the number of user identifiers included in each target node N_i;

detecting whether each target node N_i has an upper-level node;

if at least one target node N_i is detected to have an upper-level node, determining, according to the decision path of the decision tree, all target nodes N_i connected to the same target node N_{i+1} as a group of second candidate nodes, so as to obtain a plurality of groups of second candidate nodes; and, for each group, obtaining the sum of the numbers of user identifiers included in all the second candidate nodes of the group, so as to obtain the number of user identifiers included in each target node N_{i+1};

if no target node N_i is detected to have an upper-level node, the counting of the number of user identifiers of every node in the decision tree is complete.

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which, when the user proportion of the user characteristics of each dimension is counted for each node in the decision tree model, the step of counting, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result includes:

Determining all upper nodes corresponding to each node by adopting a reverse search mode based on a decision path of the decision tree;

and calculating, from the number of user identifiers of each node and the number of user identifiers of each of its upper nodes, the quotient of the former by the latter, wherein the quotient is the user proportion.

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which, after generating a user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model, and storing the user data distribution tree, the method further includes:

acquiring a query request of a user data distribution tree, wherein the query request at least carries target user characteristics;

locating a target node where the target user characteristic is located according to the target user characteristic carried in the query request;

and outputting the group information corresponding to the target node.

In the foregoing aspect and any possible implementation manner, there is further provided an implementation manner, where the query request further carries a querier identifier, and before the outputting the group information corresponding to the target node, the method further includes:

Determining the query authority of the querier according to the querier identification carried in the query request;

when the query authority of the querier is determined to be the authority to query partial information, outputting the number of user identifiers corresponding to the target node;

and when the query authority of the querier is determined to be the authority to query all information, outputting the group information corresponding to the target node.

In a second aspect, an embodiment of the present invention provides a machine learning-based user grouping processing apparatus, including:

an acquisition unit, configured to acquire a plurality of desensitized user data after desensitization processing, wherein the desensitized user data comprises a user identifier and a user characteristic corresponding to the user identifier;

a model processing unit, configured to classify the desensitized user data by utilizing a decision tree model in the Spark big data operation platform to obtain a classification processing result;

a statistics unit, configured to count group information of each node in the decision tree model layer by layer based on the decision path of the decision tree and the classification processing result, wherein the group information comprises the number of user identifiers, the user characteristics, and the user proportion of the user characteristics of each dimension;

a generating unit, configured to generate a user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model;

and a storage unit, configured to store the user data distribution tree.

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which, when counting the number of user identifiers included in each node in the decision tree model, the statistics unit is configured to count, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result, including:

according to the decision path of the decision tree, determining the nodes connected to the same target node N_i as a group of first candidate nodes, so as to obtain a plurality of groups of first candidate nodes, wherein the target node N_i is the upper-level node adjacent to the end nodes;

detecting whether each target node N_i has an upper-level node;

if no target node N_i is detected to have an upper-level node, the counting of the number of user identifiers of every node in the decision tree is complete.

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which, when the user proportion of the user characteristics of each dimension is counted for each node in the decision tree model, the statistics unit is configured to count, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result, including:

In the aspect and any possible implementation manner described above, there is further provided an implementation manner in which the machine learning-based user grouping processing apparatus further includes a positioning unit and an output unit, and, after the generating unit generates the user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model and the storage unit stores the user data distribution tree,

the acquisition unit is further configured to acquire a query request of the user data distribution tree, where the query request at least carries a target user feature;

the positioning unit is used for positioning a target node where the target user characteristic is located according to the target user characteristic carried in the query request;

the output unit is used for outputting the group information corresponding to the target node.

In a third aspect, an embodiment of the present invention provides an electronic terminal, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where when the processor executes the computer program, the machine learning-based user grouping processing method in any one of the first aspects is implemented.

In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, on which a computer program is stored, the computer program implementing any one of the machine learning based user grouping processing methods in the first aspect when executed by a processor.

The technical scheme has the following beneficial effects:

In the scheme provided by the invention, the decision tree model in the Spark big data operation platform classifies each desensitized user data to obtain a classification result, and the group information of each node is then counted based on the decision path of the decision tree model and the classification result. The decision tree thus groups the full sample of user data automatically. This avoids the problems of manual grouping, namely that it is time-consuming and laborious, that the information of each user is difficult to count comprehensively, and that the grouping accuracy is consequently low, and it improves efficiency. In addition, because the group information of each node is counted and stored in tree form, each node only needs to store one user characteristic. This reduces data redundancy to a certain extent and compresses a large amount of original user data into a small user data distribution tree, so that the method can be executed on a single-machine system.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a machine learning-based user grouping processing method according to an embodiment of the present invention;

FIG. 2 is a flow chart of one implementation of step 106 of the machine learning-based user grouping processing method of FIG. 1, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a decision tree according to an embodiment of the present invention;

FIG. 4 is a flow chart of another implementation of step 106 of the machine learning-based user grouping processing method of FIG. 1, according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a machine learning-based user grouping processing device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic terminal according to an embodiment of the present invention.

[ detailed description ]

For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.

It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.

To address the high labor cost and low grouping accuracy of grouping user portraits manually in the prior art, the invention classifies massive user data with a decision tree model and, based on the decision path of the decision tree, counts for each node of the decision tree the number of users, the user characteristics, the user proportion of the user characteristics of each dimension, and so on, so as to achieve an optimal grouping of the full sample. Based on this idea, the invention provides a machine learning-based user grouping processing method.

Specifically, please refer to FIG. 1, which is a schematic flow chart of a machine learning-based user grouping processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

102. Acquire a plurality of desensitized user data after desensitization processing, wherein the desensitized user data comprises a user identifier and a user characteristic corresponding to the user identifier.

The user identifier uniquely identifies a user. It may be the user's identity card number or passport number after desensitization; alternatively, it may be a user ID number assigned to each user by the system.

The user characteristics describe attribute information of the user. A user typically has attribute information in many different dimensions, which may include, but is not limited to, age, gender, industry of employment, asset condition, education, household registration, and user behavior. So that a large number of users can be distributed as discretely as possible, enabling accurate and fine-grained grouping, the user characteristics comprise a series of attribute information of different dimensions.

In addition, desensitization processing refers to transforming the sensitive information in the data according to desensitization rules, so that sensitive private data is reliably protected. Where user security data or commercially sensitive data is involved, the real data is transformed, without violating system rules, so that it can be provided for test use; personal information such as identity card numbers, mobile phone numbers, card numbers, and client numbers all needs to be desensitized.
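The description does not specify a particular desensitization rule. As a minimal sketch, assuming a keyed hash for identifiers and character masking for phone numbers (the key, the field names, and the function names below are all hypothetical), such a transformation might look like this:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored outside the data set.
SECRET_KEY = b"example-key"


def pseudonymize_id(raw_id: str) -> str:
    """Replace an identifier with a keyed hash so equal inputs map to equal tokens."""
    digest = hmac.new(SECRET_KEY, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_phone(phone: str) -> str:
    """Keep the first three and last four characters and hide the rest."""
    if len(phone) <= 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def desensitize(record: dict) -> dict:
    """Return a new record in which the sensitive fields have been transformed."""
    return {
        "user_id": pseudonymize_id(record["id_number"]),
        "phone": mask_phone(record["phone"]),
        "features": dict(record["features"]),
    }
```

Because the keyed hash is deterministic, the same person always receives the same user identifier, which is what later counting of user identifiers per node relies on.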

104. Classify the desensitized user data by utilizing a decision tree model in the Spark big data operation platform to obtain a classification processing result.

The Spark big data operation platform supports both a classification decision tree algorithm and a decision tree regression algorithm, and these algorithms can be used to build a classification decision tree from data through machine learning training. In step 104, before the decision tree model in the Spark big data operation platform can classify the user characteristics and produce a classification result, the decision tree has to be built. To build it, desensitized user data is acquired and labeled as required; the user data and the corresponding labels form training samples; the training samples are input into the Spark big data operation platform; and the decision tree is constructed by the decision tree algorithm of the platform.

106. Based on the decision path of the decision tree and the classification processing result, count group information of each node in the decision tree model layer by layer, wherein the group information comprises the number of user identifiers, the user characteristics, and the user proportion of the user characteristics of each dimension.

The decision path is a path between two decision nodes; it decides to which next-level node the desensitized user data being processed is passed from the current node. A decision path from one node to the next is in effect a test of the data on one characteristic attribute. For example, data at the root node that has the characteristic attribute "owns a house and a car" enters the left child node, and data that does not have it enters the right child node. It follows that the data in each leaf node of the decision tree shares at least one common characteristic, so the users in each leaf node represent a group with the same characteristics. In the grouping process, a large number of users can therefore be grouped along multiple dimensions relatively quickly by means of the decision paths of the decision tree.
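How a decision path routes a user to a leaf node can be illustrated with a small hand-written tree. This is only a sketch in plain Python, not the Spark model itself; the node names, the age test, and the feature keys are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tree: each internal node tests one characteristic attribute and
# names the child taken when the test passes ("yes") or fails ("no").
TREE = {
    "root": {"test": lambda u: u["has_house_and_car"], "yes": "left", "no": "right"},
    "left": {"test": lambda u: u["age"] >= 40, "yes": "left_a", "no": "left_b"},
}


def route(features: dict, root: str = "root") -> list:
    """Follow the decision path from the root and return the visited node names."""
    path = [root]
    node = root
    while node in TREE:  # a node without an entry in TREE is a leaf node
        rule = TREE[node]
        node = rule["yes"] if rule["test"](features) else rule["no"]
        path.append(node)
    return path


def classify(users: dict) -> dict:
    """Map each leaf node to the user identifiers whose decision path ends there."""
    result = defaultdict(list)
    for user_id, features in users.items():
        result[route(features)[-1]].append(user_id)
    return dict(result)
```

Every user collected under the same leaf followed the same decision path, so the users in one leaf share the outcome of each test along that path.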

The specific implementation of counting, layer by layer, group information such as the number of user identifiers of each node and the user proportion of the user characteristics of each dimension, based on the decision path of the decision tree and the classification processing result, is described in detail below and is not elaborated here.

108. Generate a user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model, and store the user data distribution tree.

The user grouping data is stored in the form of a user data distribution tree. That is, the group information of each group is stored together with the relationships among the groups, which facilitates subsequent searching and analysis of the user groups.
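The storage format of the user data distribution tree is not specified. One possible sketch, assuming a nested dictionary serialized as JSON (the function names and the `children` and `group` keys are hypothetical), is:

```python
import json


def build_distribution_tree(children: dict, group_info: dict, node: str) -> dict:
    """Nest each node's group information under its parent, following the decision paths."""
    return {
        "node": node,
        "group": group_info[node],
        "children": [
            build_distribution_tree(children, group_info, child)
            for child in children.get(node, [])
        ],
    }


def save_tree(tree: dict, path: str) -> None:
    """Write the distribution tree to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(tree, fh, ensure_ascii=False, indent=2)


def load_tree(path: str) -> dict:
    """Read a distribution tree back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Only the per-node group information and the parent-child links are written, not the individual user records, which is why the stored tree can be far smaller than the original data.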

Optionally, the root node and the leaf nodes of the decision tree are connected, and during processing all desensitized user data in the root node is split into the nodes connected to it. Given this property of the decision tree, the number of user identifiers of each node can be counted accurately and quickly by searching in reverse along the decision paths of the decision tree. Based on this idea, as shown in FIG. 2, an implementation of step 106 is provided. When counting the number of user identifiers included in each node in the decision tree model, step 106 of counting, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result includes:

S11, counting the number of user identifiers contained in each end node in the decision tree.

S12, according to the decision path of the decision tree, determining the nodes connected to the same target node N_i as a group of first candidate nodes, so as to obtain a plurality of groups of first candidate nodes, wherein i is an integer greater than or equal to 1, and when i = 1, the target node N_i is the upper-level node adjacent to the end nodes.

S13, for each group, obtaining the sum of the numbers of user identifiers included in all the first candidate nodes of the group, so as to obtain the number of user identifiers included in each target node N_i.

S14, detecting whether each target node N_i has an upper-level node.

S15, if at least one target node N_i is detected to have an upper-level node, determining, according to the decision path of the decision tree, all target nodes N_i connected to the same target node N_{i+1} as a group of second candidate nodes, so as to obtain a plurality of groups of second candidate nodes; and, for each group, obtaining the sum of the numbers of user identifiers included in all the second candidate nodes of the group, so as to obtain the number of user identifiers included in each target node N_{i+1}.

Specifically, after the number of user identifiers included in each target node N_{i+1} is obtained in S15, let i = i + 1 and continue with S14, until it is determined that all nodes in the decision tree have been counted, or that N_i has no upper-level node, at which point the loop over steps S14 and S15 stops.

S16, if no target node N_i is detected to have an upper-level node, the counting of the number of user identifiers of every node in the decision tree is complete.

For example, consider the decision tree shown in FIG. 3, which includes two layers of decisions and 7 nodes in total. Node a is the source node, and nodes b and c are the two child nodes of node a; nodes d and e are the two child nodes of node b; nodes f and g are the two child nodes of node c. After the processing of step 104, the numbers of user identifiers included in the end nodes d, e, f, and g of the last layer are 7, 10, 15, and 8, respectively. The decision paths of the decision tree show that end nodes d and e are connected to the same node b, and end nodes f and g are connected to the same node c. End nodes d and e are therefore determined to be one group of candidate nodes, and end nodes f and g another. The number of user identifiers of node b, to which d and e are connected, is thus 7 + 10 = 17, and the number of user identifiers of node c, to which f and g are connected, is 15 + 8 = 23. Further, nodes b and c are both connected to node a, so the number of user identifiers of node a is 17 + 23 = 40. Node a is the source node and has no upper-level node, so the counting of the number of user identifiers included in each node in the decision tree is complete.
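The layer-by-layer counting of S11 to S16 can be sketched in a few lines for the seven-node tree of this example. The `PARENT` mapping and the function name are assumptions made for illustration, and the sketch assumes, as in FIG. 3, that all end nodes lie on the same layer.

```python
# Upper-level node of each node in the seven-node example; the source node "a" has none.
PARENT = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}


def count_layer_by_layer(end_counts: dict, parent: dict) -> dict:
    """Propagate user-identifier counts from the end nodes up to the source node."""
    counts = dict(end_counts)   # S11: counts of the end nodes
    current = set(end_counts)   # nodes whose counts were obtained in the last round
    while True:
        groups = {}             # S12/S15: group nodes that share an upper-level node
        for node in current:
            if node in parent:  # S14: does this node have an upper-level node?
                groups.setdefault(parent[node], []).append(node)
        if not groups:          # S16: no upper-level node remains, counting is complete
            return counts
        for upper, members in groups.items():  # S13/S15: sum the counts of each group
            counts[upper] = sum(counts[m] for m in members)
        current = set(groups)


counts = count_layer_by_layer({"d": 7, "e": 10, "f": 15, "g": 8}, PARENT)
print(counts["b"], counts["c"], counts["a"])  # 17 23 40
```

Each pass of the loop handles one layer, so a tree with k layers of decisions needs k passes, and no individual user record is touched after the end nodes have been counted.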

Optionally, the user proportion of the user characteristics of each dimension can likewise be counted for each node by searching in reverse along the decision paths of the decision tree. Based on this, as shown in FIG. 4, another implementation of step 106 is provided. That is, step 106 of counting, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result may include:

S21, determining all upper nodes corresponding to each node by searching in reverse along the decision path of the decision tree.

S22, calculating, from the number of user identifiers of each node and the number of user identifiers of each of its upper nodes, the quotient of the former by the latter, wherein the quotient is the user proportion.

When the decision tree processes desensitized user data, decisions are made layer by layer downward from the source node to determine the node to which each user is assigned. Each decision tests whether the user's characteristic in a certain dimension meets a condition, and the user is assigned to the corresponding node according to the result. The user proportion of the user characteristics of each dimension can therefore be determined by calculating the quotient of the number of user identifiers of each node and the number of user identifiers of each of its upper nodes.
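The counting and proportion steps can be summarized compactly. The notation is introduced here for illustration only and does not appear in the original: c(n) is the number of user identifiers at node n, and r(n, u) is the user proportion of node n relative to one of its upper nodes u. The first relation applies to every node that is not an end node.

```latex
c(n) = \sum_{m \,\in\, \mathrm{children}(n)} c(m),
\qquad
r(n, u) = \frac{c(n)}{c(u)} \quad \text{for every upper node } u \text{ of } n .
```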

For example, referring again to the decision tree shown in FIG. 3, the calculation of the user proportion is described in detail below, continuing the above example of counting user identifiers. First, assume that the decision from node a to nodes b and c is based on gender: male users enter node b and female users enter node c. The decision from node b to nodes d and e is based on occupation: users in software development enter node d, and all others enter node e. The decision from node c to nodes f and g is also based on occupation: users in accounting enter node f, and all others enter node g. Under these assumptions, the reverse search determines that the upper nodes of d are b and a, the upper nodes of e are also b and a, and the upper nodes of f and of g are c and a. Node d contains 7 user identifiers, node b contains 17, and node a contains 40. The quotient of d and b is 7/17 and the quotient of d and a is 7/40, so in the desensitized user data, men engaged in software development account for 7/17 of all men and 7/40 of all users. Node e contains 10 user identifiers; the quotient of e and b is 10/17 and the quotient of e and a is 10/40, so men engaged in other work account for 10/17 of all men and 10/40 of all users.
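Steps S21 and S22 can be sketched for this example as follows. The `PARENT` and `COUNTS` tables restate the figures from the text; the function names are hypothetical.

```python
from fractions import Fraction

# Upper-level node and user-identifier count of each node in the seven-node example.
PARENT = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}
COUNTS = {"a": 40, "b": 17, "c": 23, "d": 7, "e": 10, "f": 15, "g": 8}


def upper_nodes(node: str, parent: dict) -> list:
    """S21: search in reverse along the decision path and collect every upper node."""
    found = []
    while node in parent:
        node = parent[node]
        found.append(node)
    return found


def user_proportions(node: str, parent: dict, counts: dict) -> dict:
    """S22: divide the node's count by the count of each of its upper nodes."""
    return {
        upper: Fraction(counts[node], counts[upper])
        for upper in upper_nodes(node, parent)
    }


print(user_proportions("d", PARENT, COUNTS))
```

For node d this yields 7/17 relative to b and 7/40 relative to a. Note that `Fraction` reduces its result, so the 10/40 quoted for node e is reported as 1/4.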

Optionally, the purpose of grouping users is to make it easier to analyze the people in the same group. To quickly determine the group information of the user group to be analyzed, a target user characteristic of that group can be set first, and the group having the target user characteristic can then be located in the user data distribution tree. That is, after the user data distribution tree is generated according to the decision path of the decision tree and the group information of each node in the decision tree model, and is stored, the method may further include: acquiring a query request for the user data distribution tree, wherein the query request at least carries target user characteristics; locating the target node where the target user characteristics are located according to the target user characteristics carried in the query request; and outputting the group information corresponding to the target node. When locating the target node, a reverse search may be used, starting from the last-layer nodes; alternatively, a forward search may be used, starting from the source node; or, to increase search speed and save time, an intermediate search may be used, in which the nodes of a middle layer of the user data distribution tree serve as starting points and the search proceeds toward both the source node and the last-layer nodes. The invention does not specifically limit this.

For example, for the decision tree shown in fig. 3, assume that query request 1 carries the target user feature "male". Node b is located by the reverse search mode or the forward search mode, and the group information of node b is obtained and output: male, with a user proportion of 17/40. Assume that query request 2 carries the target user features "female" and "engaged in accounting work". Node f is located by the reverse search mode or the forward search mode, and the group information of node f is obtained and output: female and engaged in accounting work, where women engaged in accounting account for 15/23 of all women and 15/40 of all users.
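The forward search mode used in the two query examples can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the TREE layout, the feature keys "gender" and "job", the value "other", and the function names locate_forward and group_info are all hypothetical, and the count for node g (8) is inferred from the example.

```python
from collections import deque
from fractions import Fraction

# Hypothetical in-memory form of the user data distribution tree of fig. 3.
TREE = {
    "a": {"features": {}, "count": 40, "children": ["b", "c"]},
    "b": {"features": {"gender": "male"}, "count": 17, "children": ["d", "e"]},
    "c": {"features": {"gender": "female"}, "count": 23, "children": ["f", "g"]},
    "d": {"features": {"gender": "male", "job": "software development"},
          "count": 7, "children": []},
    "e": {"features": {"gender": "male", "job": "other"},
          "count": 10, "children": []},
    "f": {"features": {"gender": "female", "job": "accounting"},
          "count": 15, "children": []},
    "g": {"features": {"gender": "female", "job": "other"},
          "count": 8, "children": []},
}


def locate_forward(target_features, root="a"):
    """Forward search: start at the source node and descend only into
    children whose features are all consistent with the target features."""
    queue = deque([root])
    while queue:
        name = queue.popleft()
        node = TREE[name]
        if node["features"] == target_features:
            return name
        for child in node["children"]:
            child_features = TREE[child]["features"]
            if all(target_features.get(k) == v for k, v in child_features.items()):
                queue.append(child)
    return None  # no node carries exactly the requested features


def group_info(name, root="a"):
    """Group information of a node: features, count, and share of all users."""
    node = TREE[name]
    return {
        "features": node["features"],
        "count": node["count"],
        "share_of_total": Fraction(node["count"], TREE[root]["count"]),
    }
```

With this layout, a request carrying {"gender": "male"} is resolved to node b, and a request carrying {"gender": "female", "job": "accounting"} is resolved to node f. A reverse or intermediate search would differ only in the starting layer.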

Optionally, to ensure information security, several different query permissions may be set, for example a permission to query partial information and a permission to query all information. On this basis, before the group information corresponding to the target node is output, the query permission of the querier is determined first, and the corresponding data is output according to that permission. Specifically: the querier identifier carried in the query request is parsed from the query request and compared with the identifiers in a prestored configuration table of identifiers and query permissions; the target identifier that matches the querier identifier is selected from the configuration table, and the query permission corresponding to the target identifier is obtained as the query permission of the querier. When the query permission of the querier is the permission to query partial information, the number of user identifications corresponding to the target node is output; when it is the permission to query all information, the group information corresponding to the target node is output. In addition, if no identifier matching the querier identifier is found in the prestored configuration table, the querier has no permission to query the user data distribution tree; in this case, either no group information is output or a warning of an illegal query is output.
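The permission check described above can be sketched as follows. The configuration table PERMISSIONS, the querier identifiers, the permission labels "partial" and "full", the GROUP_INFO store, and the function answer_query are hypothetical names chosen for this sketch; the embodiment does not prescribe any particular data layout.

```python
# Hypothetical prestored configuration table: querier identifier -> permission.
PERMISSIONS = {
    "analyst-01": "partial",  # may query partial information only
    "admin-01": "full",       # may query all information
}

# Hypothetical stored group information for one node of fig. 3.
GROUP_INFO = {
    "b": {"features": {"gender": "male"}, "count": 17, "share_of_total": "17/40"},
}


def answer_query(querier_id, target_node):
    """Return the data that the querier is permitted to see for a node."""
    permission = PERMISSIONS.get(querier_id)
    if permission is None:
        # No matching identifier: output only a warning of an illegal query.
        return {"warning": "illegal query"}
    info = GROUP_INFO[target_node]
    if permission == "partial":
        # Partial permission: only the number of user identifications.
        return {"count": info["count"]}
    # Full permission: the complete group information of the node.
    return dict(info)
```

Returning a warning for an unknown identifier is one of the two behaviors the text allows; returning nothing at all would be the other.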

The querier identifier may be identity information of the querier, such as an identity card number, fingerprint information, iris information, or face information; alternatively, the querier identifier may be an ID number of the querier, the querier identifier being a piece of information obtained after processing by a token authentication mechanism.

The method may be executed by a machine learning-based user grouping processing device, which may be an application on the local terminal, or a functional unit such as a plug-in or a software development kit (Software Development Kit, SDK) in an application on the local terminal; the embodiment of the present invention does not specifically limit this.

It will be appreciated that the application may be an application program (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited by the embodiment of the present invention.

Based on the user grouping processing method based on machine learning provided by the above embodiment, the embodiment of the present invention further provides an apparatus embodiment for implementing each step and method in the above method embodiment.

Fig. 5 is a functional block diagram of an apparatus according to an embodiment of the invention. As shown in fig. 5, the machine learning-based user grouping processing apparatus includes an acquisition unit 21, a model processing unit 22, a statistics unit 23, a generation unit 24, and a storage unit 25. The acquisition unit 21 is configured to acquire a plurality of desensitized user data after desensitization processing, where the desensitized user data includes a user identification and user features corresponding to the user identification. The model processing unit 22 is configured to classify the desensitized user data by using a decision tree model in the Spark big data operation platform to obtain a classification processing result. The statistics unit 23 is configured to count, layer by layer, group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result, where the group information includes the number of user identifications, the user features, and the user proportion of the user features in each dimension. The generation unit 24 is configured to generate a user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model. The storage unit 25 is configured to store the user data distribution tree.

Optionally, in the embodiment of the present invention, when counting the number of user identifications contained in each node in the decision tree model, the statistics unit 23 counts the group information of each node layer by layer, based on the decision path of the decision tree and the classification processing result, as follows. First, the number of user identifications contained in each terminal node of the decision tree is counted. Then, according to the decision path of the decision tree, all terminal nodes connected to the same target node N_i are determined as one group of first candidate nodes, yielding a plurality of groups of first candidate nodes, where i is an integer greater than or equal to 1 and, when i = 1, the target node N_i is the upper-level node adjacent to the terminal nodes. The numbers of user identifications included in all first candidate nodes of the same group are summed to obtain the number of user identifications included in each target node N_i. It is then detected whether each target node N_i has an upper-level node. If at least one target node N_i has an upper-level node, all target nodes N_i connected to the same target node N_(i+1) are determined, according to the decision path of the decision tree, as one group of second candidate nodes, yielding a plurality of groups of second candidate nodes, and the numbers of user identifications included in all second candidate nodes of the same group are summed to obtain the number of user identifications included in each target node N_(i+1). If no target node N_i has an upper-level node, the counting of user identifications for every node in the decision tree is complete.
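The layer-by-layer counting performed by the statistics unit can be sketched as follows. This is a simplified illustration that assumes all terminal nodes lie at the same depth, as they do in fig. 3; the function name count_layer_by_layer and the child-to-parent map are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict


def count_layer_by_layer(parent, leaf_counts):
    """Propagate user-identification counts from terminal nodes upward.

    parent:      maps each node to its upper-level node (source node absent).
    leaf_counts: number of user identifications in each terminal node.
    Assumes every terminal node is at the same depth.
    """
    counts = dict(leaf_counts)
    current = list(leaf_counts)  # start from the terminal nodes
    while True:
        # Group the current layer by the target node they are connected to.
        groups = defaultdict(list)
        for node in current:
            if node in parent:
                groups[parent[node]].append(node)
        if not groups:
            # No node in this layer has an upper-level node: counting is done.
            return counts
        # Sum each group of candidate nodes to obtain the target node's count.
        for target, members in groups.items():
            counts[target] = sum(counts[m] for m in members)
        current = list(groups)  # the target nodes become the next layer
```

For the tree of fig. 3, calling the function with terminal counts d = 7, e = 10, f = 15 and g = 8 (g being inferred from the example) yields 17 for node b, 23 for node c and 40 for node a. A tree with terminal nodes at different depths would need the nodes ordered by depth before summing.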

Optionally, in the apparatus provided in the embodiment of the present invention, when counting the user proportion of the user features of each dimension for each node in the decision tree model, the statistics unit 23 counts the group information of each node layer by layer, based on the decision path of the decision tree and the classification processing result, as follows: based on the decision path of the decision tree, all upper-level nodes corresponding to each node are determined by a reverse search; then the quotient of the number of user identifications of each node and the number of user identifications of each of its upper-level nodes is calculated, and this quotient is the user proportion.

Optionally, the machine learning-based user grouping processing apparatus provided by the embodiment of the present invention further includes a positioning unit and an output unit (not shown in fig. 5). After the generation unit 24 generates the user data distribution tree according to the decision path of the decision tree and the group information of each node in the decision tree model, and the storage unit 25 stores the user data distribution tree, the acquisition unit 21 is further configured to acquire a query request for the user data distribution tree, where the query request carries at least a target user feature; the positioning unit is configured to locate, according to the target user feature carried in the query request, the target node where the target user feature is located; and the output unit is configured to output the group information corresponding to the target node.

Since each unit in the present embodiment is capable of executing the above-described machine learning-based user grouping processing method, a relevant description of the machine learning-based user grouping processing method may be referred to for a part not described in detail in the present embodiment.

Fig. 6 is a schematic structural diagram of an embodiment of an electronic terminal of the present application. The electronic terminal may include a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the machine learning-based user grouping processing method provided in the embodiment of the present application may be implemented.

The electronic terminal may include, but is not limited to, smart electronic devices such as a smart phone, a tablet computer, or a wearable smart device, and in addition, the electronic terminal may also be a server, and the specific form of the electronic terminal is not limited in this embodiment.

Fig. 6 shows a block diagram of an exemplary electronic terminal 30 suitable for use in implementing embodiments of the present application. The electronic terminal 30 shown in fig. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.

As shown in fig. 6, the electronic terminal 30 is in the form of a general purpose computing device. The components of the electronic terminal 30 may include, but are not limited to: one or more processors or processing units 310, a system memory 320, and a bus 330 that connects the different system components (including the system memory 320 and the processing units 310).

Bus 330 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (hereinafter: ISA) bus, the Micro Channel Architecture (hereinafter: MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (hereinafter: VESA) local bus, and the Peripheral Component Interconnect (hereinafter: PCI) bus.

Electronic terminal 30 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic terminal 30 and includes both volatile and non-volatile media, removable and non-removable media.

The system memory 320 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 321 and/or cache memory 322. The electronic terminal 30 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 323 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 330 through one or more data medium interfaces. Memory 320 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.

A program/utility 325 having a set (at least one) of program modules 324 may be stored in, for example, memory 320, such program modules 324 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 324 generally perform the functions and/or methods in the embodiments described herein.

The electronic terminal 30 may also be in communication with one or more external devices 40 (e.g., keyboard, pointing device, display 410, etc.), one or more devices that enable a user to interact with the electronic terminal 30, and/or any device (e.g., network card, modem, etc.) that enables the electronic terminal 30 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 340. Also, the electronic terminal 30 may communicate with one or more networks, such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN) and/or a public network, such as the Internet, via the network adapter 350. As shown in fig. 6, the network adapter 350 communicates with other modules of the electronic terminal 30 via the bus 330. It should be appreciated that although not shown in fig. 6, other hardware and/or software modules may be used in connection with the electronic terminal 30, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

The processing unit 310 executes various functional applications and data processing by running programs stored in the system memory 320, for example, implementing the machine learning-based user grouping processing method provided in the embodiments of the present application.

The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the user grouping processing method based on machine learning provided by the embodiment of the application can be realized.

Any combination of one or more computer readable media may be utilized as the above-described computer readable storage media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory; EPROM) or flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network; hereinafter: LAN) or a wide area network (Wide Area Network; hereinafter: WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. The user grouping processing method based on machine learning is characterized by comprising the following steps of:

based on the decision path of the decision tree and the classification processing result, calculating group information of each node in the decision tree model layer by layer, wherein the group information comprises the number of user identifications and the user proportion of the user characteristics of each dimension;

generating a user data distribution tree according to the decision path of the decision tree and the group information of each node in each decision tree model, and storing the user data distribution tree;

when counting the number of user identifications contained in each node in the decision tree model, the layer-by-layer counting of the group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result comprises the following steps:

determining all the terminal nodes connected with the same target node Ni as a group of first candidate nodes according to the decision path of the decision tree to obtain a plurality of groups of first candidate nodes, wherein i is an integer greater than or equal to 1, and when i=1, the target node Ni is a superior node adjacent to the terminal node;

respectively obtaining the sum of the number of the user identifications included in all the first candidate nodes in the same group to obtain the number of the user identifications included in each target node Ni;

detecting whether each target node Ni has a previous node;

if at least one target node Ni is detected to have a previous node, determining all target nodes Ni connected with the same target node Ni+1 as a group of candidate nodes according to a decision tree path of the decision tree, and obtaining a plurality of groups of second candidate nodes; the sum of the number of the user identifications included in all the second candidate nodes in the same group is obtained respectively to obtain the number of the user identifications included in each target node Ni+1;

if each target node Ni does not detect the previous node, indicating that the statistics of the number of user identifiers of each node in the decision tree is completed;

when counting the user proportion of the user features of each dimension for each node in the decision tree model, the layer-by-layer counting of the group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result comprises the following steps:

2. The method of claim 1, wherein after generating a user data distribution tree from the decision path of the decision tree and the cluster information of each node in each decision tree model, and saving the user data distribution tree, the method further comprises:

and outputting the group information corresponding to the target node.

3. The method according to claim 2, wherein the query request further carries a querier identifier, and before the outputting the group information corresponding to the target node, the method further includes:

4. A machine learning-based user grouping processing apparatus, the machine learning-based user grouping processing apparatus comprising:

a statistics unit, configured to count, layer by layer, the group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result, wherein the group information comprises the number of user identifications and the user proportion of the user features of each dimension;

a storage unit, configured to store the user data distribution tree;

when counting the number of user identifications contained in each node in the decision tree model, the counting unit is used for counting the group information of each node in the decision tree model layer by layer based on the decision path of the decision tree and the classification processing result, and comprises the following steps:

detecting whether each target node Ni has a previous node;

when counting the user proportion of the user features of each dimension for each node in the decision tree model, the statistics unit being configured to count, layer by layer, the group information of each node in the decision tree model based on the decision path of the decision tree and the classification processing result comprises the following steps:

5. The apparatus of claim 4, wherein the machine learning based user grouping processing apparatus further comprises a positioning unit and an output unit, the generating unit is configured to generate a user data distribution tree based on the decision path of the decision tree and the group information of each node in each decision tree model, and the storage unit is configured to store the user data distribution tree,

6. An electronic terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-3 when executing the computer program.