CN112396428B - User portrait data-based customer group classification management method and device

Info

Publication number: CN112396428B
Application number: CN202011225923.9A
Authority: CN
Inventors: 于扬
Original assignee: Beijing Analysys Think Tank Network Technology Co ltd
Current assignee: Beijing Analysys Digital Intelligence Technology Co ltd
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2023-04-07
Anticipated expiration: 2040-11-05
Also published as: CN112396428A

Abstract

The embodiment of the invention provides a customer group division method and device based on user portrait data. The method comprises: acquiring user portrait data stored in Kudu, HDFS, or Hive storage; calculating the behavior data, the attribute data, and the tag data according to logical operation conditions and factor operation conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and then completing and normalizing the user portrait data to obtain feature data that meets a preset format; and performing a matching operation between the feature data and a pre-established feature library, and then dividing the target users into corresponding customer groups. The invention integrates the behavior, attribute, and tag data related to a user according to the user ID, and stores the behavior, attribute, and tag data based on the respective characteristics of Kudu, HDFS, and Hive, thereby providing efficient data query performance. This solves the problem that customer group division in current scenarios uses a single portrait dimension, which makes it difficult to improve division accuracy.

Description

User portrait data-based customer group classification management method and device

Technical Field

The embodiment of the invention relates to the technical field of data classification, in particular to a customer group classification management method and device based on user portrait data.

Background

With the rapid development of the internet, the user base has grown significantly and user demands have become more diverse and complex. To provide better-matched products, services, and content to users with different characteristics, users must be grouped and analyzed effectively. In the current market, user groups are mainly divided by configuring rules over collected customer data, with users divided by manually selecting different dimensions and indicators. Such approaches are limited by the business expertise of the operator and cannot segment users accurately from perspectives that are more detailed or difficult to discern manually. For the user group division scenario, a more intelligent and simpler way of providing this service is needed.

At present, however, this scenario is handled mainly by business personnel who manually configure rules and divide users based on their understanding of the users and on the collected user attributes. The following problems make it difficult to divide users effectively. First, the approach depends too heavily on the business experience of the operator and requires repeated attempts to determine the final division rule. Second, manual division can only divide users at a coarse granularity, and it is difficult to find differences between users at a finer granularity for customer group division. Third, the understanding of the data is limited, so user information cannot be fully utilized, and hidden factors that could distinguish users are not included in the rules.

Disclosure of Invention

In view of the defects of customer group division systems in the prior art, the embodiment of the invention provides a customer group division method and device based on user portrait data, which divide customers in different ways for users with different capability levels and help users quickly understand the differences and characteristics among the divided customer groups. In terms of customer data, the system supports the use of behavior records generated by customers, collected customer information, customer tags, and the like. In terms of customer group division, the system supports the configuration of custom behavior and attribute rules, supports automatic division of the selected target customer group at different levels by using supervised and unsupervised algorithms, and presents the differences and commonalities among the users of different customer groups for reference. The specific technical scheme is as follows:

the embodiment of the invention provides a customer group division method based on user portrait data, which comprises the following steps: acquiring user portrait data stored in Kudu, HDFS, or Hive storage; wherein the user portrait data includes behavior data, attribute data, and tag data; the behavior data includes: user ID, behavior occurrence time, and behavior content;

taking the behavior data, the attribute data and the label data as input conditions, and calculating the user portrait data according to logical operation conditions and factor operation conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operation on the user portrait data to obtain feature data meeting a preset format; the factor operation condition comprises a numerical value type factor, a character type factor and a time type factor; the characteristic data comprises behavior characteristic data, attribute characteristic data and label characteristic data;

and after a matching operation is carried out between the feature data and a pre-established feature library, the target users are divided into corresponding customer groups.

Further, the method also comprises the step of scoring the principal components of the target user in different customer groups by adopting a principal component analysis algorithm, and finishing the evaluation of the customer groups according to the scores.

Furthermore, partitions are created for the user portrait data according to behavior occurrence time, and dynamic bucketed storage is used when the number of behaviors in a daily partition is greater than a preset number.

Further, the behavior data, the attribute data and the tag data are used as input conditions, and the user portrait data are calculated according to logical operation conditions and factor operation conditions to obtain a target user; the target user ID is associated with the user portrait data according to a preset time period, and completion and normalization operations are performed on the user portrait data to obtain feature data meeting a preset format; this specifically comprises the following steps:

taking the behavior data, the attribute data and the tag data as input conditions, and logically screening the user portrait data according to logical operation conditions and by adopting a minimum screening principle;

respectively carrying out the operation of a numerical value type factor, a character type factor and a time type factor on the user portrait data of the target user subjected to the logic screening to obtain the user portrait data subjected to the factor operation screening;

associating the user portrait data subjected to factor operation screening with the target user ID according to a time period;

and performing completion and normalization operation on the associated data fields to obtain characteristic data meeting the preset format.

Further, in the missing-value processing part, a KNN filling algorithm is adopted for data completion; field normalization is performed with a linear function normalization algorithm, which linearly converts the user portrait data into the range [0,1], after which distance measurement and covariance calculation are performed; when the data do not follow a normal distribution, normalization is performed through mean absolute deviation standardization, logarithmic transformation, decimal scaling, or a sigmoid function.

Further, the matching operation of the feature data and a pre-established feature library comprises the following steps:

when the behavior feature data of the target user is matched with the behavior features in the feature library, if the extracted behavior features contain the features of the feature library, the matching can be judged to be successful; otherwise, judging that the matching is unsuccessful; when the extracted user attribute features are matched with the attribute features in the feature library, if the extracted attribute features contain the features in the feature library, the matching can be judged to be successful; otherwise, judging that the matching is unsuccessful; when the extracted user tag features are matched with the tag features in the feature library, if the extracted tag features contain the features in the feature library, the matching can be judged to be successful; otherwise, the matching is judged to be unsuccessful.

Another aspect of the present application provides a customer group classification apparatus based on user portrait data, including:

a data integration module for obtaining user portrait data stored in Kudu, HDFS, or Hive storage; wherein the user portrait data comprises behavior data, attribute data, and tag data; the behavior data includes: user ID, behavior occurrence time, and behavior content;

the characteristic extraction module is used for taking the behavior data, the attribute data and the tag data as input conditions, and calculating the user portrait data according to logical operation conditions and factor operation conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operation on the user portrait data to obtain feature data meeting a preset format; the factor operation condition comprises a numerical value type factor, a character type factor and a time type factor; the characteristic data comprises behavior characteristic data, attribute characteristic data and label characteristic data;

and the customer group division module is used for carrying out a matching operation between the feature data and a pre-established feature library and then dividing the target user into corresponding customer groups.

Further, the apparatus includes a customer group evaluation module, which is used for scoring the principal components of the target user in different customer groups by adopting a principal component analysis algorithm and completing the customer group evaluation according to the scores.

Further, the feature extraction module further includes:

the logic screening module is used for taking the behavior data, the attribute data and the tag data as input conditions and carrying out logic screening on the user portrait data according to logic operation conditions and by adopting a minimum screening principle;

the factor screening module is used for respectively carrying out the operation of a numerical value type factor, a character type factor and a time type factor on the user portrait data of the target user subjected to the logic screening to obtain the user portrait data subjected to the factor operation screening;

the association module is used for associating the user portrait data subjected to factor operation screening with the target user ID according to a time period;

and the completion and normalization module is used for performing completion and normalization operations on the associated data fields to obtain the characteristic data meeting the preset format.

The embodiment of the invention provides a method and a device for dividing customer groups based on user portrait data, comprising the following steps: acquiring user portrait data stored in Kudu, HDFS, or Hive storage; taking the behavior data, the attribute data and the tag data as input conditions, and calculating the user portrait data according to logical operation conditions and factor operation conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operations on the user portrait data to obtain feature data meeting a preset format; and after a matching operation is carried out between the feature data and a pre-established feature library, dividing the target users into corresponding customer groups. The invention integrates the behavior, attribute and tag data related to the user according to the user ID, stores the behavior, attribute and tag data separately based on the characteristics of Kudu, HDFS, and Hive, and provides efficient data query performance by using reasonable partitioning and bucketing strategies. This solves the problem that customer group division in current scenarios uses a single portrait dimension, which makes it difficult to improve division accuracy.

Furthermore, through feature extraction and customer group division, the invention performs missing-value processing and normalization on the screened target customer group and the integrated portrait data (behaviors, attributes, and tags), and divides customer groups programmatically by using a classification model algorithm in combination with preset customer group feature rules, thereby solving the problem that current customer group division depends mainly on personal experience and can hardly use complete customer portrait data for in-depth division.

Furthermore, the invention uses the completed customer group division result, combined with the integrated user portrait data, to identify the characteristics of and differences between different customer groups. Significant differences are identified quickly by principal component analysis, and the customer groups are scored and evaluated quantitatively. This solves the problem that customer groups currently cannot be evaluated quantitatively after manual division, so their characteristics and differences cannot be described accurately.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, ratios, sizes, and the like shown in the present specification are intended only to accompany the contents disclosed in the specification so that those skilled in the art can understand and read them. They do not limit the conditions under which the present invention can be implemented and therefore carry no substantive technical significance. Any modification of the structures, change of the ratio relationships, or adjustment of the sizes that does not affect the efficacy and the achievable purpose of the present invention shall still fall within the scope covered by the technical contents disclosed in the present invention.

FIG. 1 is a flowchart of a customer group division method based on user portrait data according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a customer group division apparatus based on user portrait data according to an embodiment of the present invention.

Detailed Description

The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. The described embodiments are merely some of the embodiments of the invention, and they are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Referring to FIG. 1, a flowchart of a preferred embodiment of a customer group division method based on user portrait data according to an embodiment of the present application is shown, where the method includes the steps of: acquiring user portrait data stored in Kudu, HDFS, or Hive storage; wherein the user portrait data comprises behavior data, attribute data, and tag data; the behavior data includes: user ID, behavior occurrence time, and behavior content;

The method further comprises the step of scoring the principal components of the target user in different customer groups by adopting a principal component analysis algorithm, and completing the evaluation of the customer groups according to the scores. In this technical scheme, behavior, attribute, and tag features are extracted respectively from the behavior data, attribute data, and tag data generated by the user, so that user data can be used more comprehensively and efficiently to divide customer groups and the accuracy of customer group division is improved. At the same time, the corresponding rules no longer need to be configured manually, which reduces the cost of manual participation. Finally, a brief insight evaluation is provided for the divided customer groups, which helps the user understand the characteristics of and differences between the customer groups more quickly and intuitively, reduces the difficulty of subsequent marketing and operation work, and improves the final effect.

In a specific implementation manner of the invention, the method further comprises creating partitions for the user portrait data according to behavior occurrence time, and using dynamic bucketed storage when the number of behaviors in a daily partition is greater than a preset number. The user data is stored in Kudu, HDFS, and Hive. The user behavior data is stored in columnar form and consists of three elements: user ID, behavior occurrence time, and behavior content. Because different behaviors have different attribute fields, the system stores the behavior data in Kudu, in which fields are easier to extend. To improve the efficiency of association queries and feature extraction on behavior data, a partition is created every day according to behavior occurrence time, and a dynamic bucketing design is needed when the number of behaviors in a daily partition exceeds 1 billion. The design works as follows: for two tables (such as an order table and an order-amount table) that are bucketed on the same behavior field (such as the commodity ID) with the same number of buckets, when a join is performed on the commodity ID, rows with the same commodity ID in both tables fall into buckets with the same bucket ID, so the join and the aggregation can be computed independently for each bucket (compare the partition process of MapReduce). In this way, as soon as the calculation for one bucket is completed, the memory occupied by that bucket can be released, so memory usage can be limited by controlling the number of buckets processed in parallel. The theoretical memory usage is: optimized memory footprint = (original memory footprint / number of buckets in the table) × number of buckets processed in parallel.
The data storage uses the user ID as the unique primary key, and the behavior, attribute, and tag data of the user are finally associated through this primary key when modeling is applied.
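
The bucketed join and the memory estimate described above can be illustrated with a minimal Python sketch. It is not the Kudu or Hive implementation: the function names (`bucket_of`, `bucketed_join`, `estimated_memory`), the CRC32-based bucket assignment, and the sample field names are assumptions chosen for illustration, and the buckets are processed one at a time rather than in parallel.

```python
import zlib
from collections import defaultdict

def bucket_of(key: str, num_buckets: int) -> int:
    """Deterministic bucket index for a join key (illustrative hash choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def bucketize(rows, key_field, num_buckets):
    """Group rows by the bucket index of their join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_field], num_buckets)].append(row)
    return buckets

def bucketed_join(left, right, key_field, num_buckets):
    """Join two tables bucket by bucket; equal keys share a bucket index."""
    left_b = bucketize(left, key_field, num_buckets)
    right_b = bucketize(right, key_field, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only bucket b of each table is needed for this step, so its
        # lookup index can be discarded before the next bucket is handled.
        index = defaultdict(list)
        for r in right_b.get(b, []):
            index[r[key_field]].append(r)
        for l in left_b.get(b, []):
            for r in index.get(l[key_field], []):
                joined.append({**l, **r})
    return joined

def estimated_memory(original_memory, num_buckets, parallel_buckets):
    """Theoretical footprint: (original / bucket count) * parallel buckets."""
    return original_memory / num_buckets * parallel_buckets
```

For example, under these assumptions a table that would need 64 GB unbucketed, split into 16 buckets with 4 buckets processed at a time, is estimated at `estimated_memory(64.0, 16, 4)` GB, that is, a quarter of the original footprint.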

In the specific implementation manner of the invention, when the characteristics of the user portrait data are extracted, the behavior data, the attribute data and the tag data are used as input conditions, and the user portrait data are logically screened according to logical operation conditions and by adopting a minimum screening principle; respectively carrying out the operation of a numerical value type factor, a character type factor and a time type factor on the user portrait data of the target user subjected to the logic screening to obtain the user portrait data subjected to the factor operation screening; associating the user portrait data subjected to factor operation screening with the target user ID according to a time period; and performing completion and normalization operation on the associated data fields to obtain characteristic data meeting the preset format.

Specifically, target users are screened according to a specified rule; the user behavior, attribute, and tag data in the data integration module are associated by user ID and used as features; feature training is carried out after missing-value processing, normalization, and similar processing; matching is then performed against an existing customer group rule base, and the result is output to the customer group division module to complete the customer group division model. First, with user behaviors, attributes, and tags as input conditions, set operations are performed according to the configured operation conditions to select the target population for customer group division. In the implementation, the set operation is divided into two parts, logical conditions and factors. The logical conditions support AND and NOT relationships as well as unlimited nesting, so logical screening can be performed through combinations among arbitrary groups. It should be noted that the implementation should follow a minimum screening principle, that is, AND relationships are supported within a group, and AND and NOT relationships are supported between groups. This ensures that the range of target users narrows step by step as logical relationships are added during screening, which ensures the usability of the program.
In the factor part, operations and comparisons between factors can be performed according to the data types stored by the data integration module. For numeric factors, the supported calculation logic includes greater than, less than, greater than or equal to, less than or equal to, equal to, not equal to, open interval, closed interval, half-closed interval, has value, no value, and the like. For character factors, the supported calculation logic includes equal, not equal, contains, does not contain, length, row repetition count, and the like. For time factors, the supported calculation logic includes absolute time, relative time, and the like. If the factors are mainly of non-numeric and time types, the data can be stored as bitmaps, which further improves the efficiency of calculation and comparison.
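
The screening logic above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the rule format (a list of groups, each with `conditions` and an optional `negate` flag), the operator names, and the functions `eval_factor`, `group_matches`, and `screen` are all hypothetical, and only a subset of the listed numeric and character operators is covered (time factors are omitted).

```python
import operator

# Comparison operators for numeric factors (a subset of those listed above).
NUMERIC_OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}

def eval_factor(user, field, op, value):
    """Evaluate one factor condition against a user record (a dict)."""
    actual = user.get(field)
    if op == "has_value":
        return actual is not None
    if op == "no_value":
        return actual is None
    if actual is None:
        return False  # a missing field cannot satisfy a comparison
    if op == "contains":
        return value in actual
    if op == "not_contains":
        return value not in actual
    if op == "closed_interval":
        low, high = value
        return low <= actual <= high
    return NUMERIC_OPS[op](actual, value)

def group_matches(user, group):
    """Conditions inside a group are combined with AND."""
    return all(eval_factor(user, *cond) for cond in group["conditions"])

def screen(users, groups):
    """Combine groups with AND; a group flagged 'negate' is applied as NOT.

    Every added group can only keep or shrink the selection, which mirrors
    the minimum screening principle described in the text.
    """
    return [
        user for user in users
        if all(group_matches(user, g) != g.get("negate", False) for g in groups)
    ]
```

Because groups are only ever combined with AND (optionally negated), adding a group never enlarges the result set, which is the narrowing property the minimum screening principle is meant to guarantee.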

After the target population screening is completed, the system needs to associate user information according to the target user ID. By default, full feature extraction is performed without restriction: using the user IDs in the result of the target population screening, all behavior records, attribute data, and tag data related to those users in the data integration module are associated according to the specified time period. Completion and normalization operations are then performed on the associated data fields. In the missing-value processing part, a KNN filling algorithm is adopted for data completion, that is, missing values are filled from nearest-neighbor data: KNN is used to find the k nearest records, and the average of those k values is filled in. Dimensions whose missing-value proportion exceeds 80% are deleted column-wise by default. In the normalization part, field normalization is performed by default with a linear function normalization algorithm, which linearly converts the original data into the range [0,1]. Where distance measurement or covariance calculation is involved, or where the data do not follow a normal distribution, normalization can instead be performed by mean absolute deviation standardization, logarithmic transformation, decimal scaling, or a sigmoid function.
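
The completion and normalization steps can be sketched in plain Python as below. The sketch makes several assumptions that the text does not specify: missing values are represented as `None`, neighbor distance is a root-mean-square difference over the fields both records have, and the function names (`drop_sparse_columns`, `knn_fill`, `min_max_normalize`) and the default `k` are illustrative only.

```python
import math

def drop_sparse_columns(rows, max_missing_ratio=0.8):
    """Delete columns whose share of missing (None) values exceeds the limit."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if sum(r[j] is None for r in rows) / len(rows) <= max_missing_ratio]
    return [[r[j] for j in keep] for r in rows], keep

def distance(a, b):
    """RMS difference over fields present in both records (an assumption)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_fill(rows, k=2):
    """Fill each missing value with the mean of its k nearest neighbors."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors are other records that have a value in column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def min_max_normalize(rows):
    """Linearly rescale every column to [0, 1] (requires no missing values)."""
    out_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        out_cols.append([0.0 if span == 0 else (v - lo) / span for v in col])
    return [list(r) for r in zip(*out_cols)]
```

A typical order of use under these assumptions is `drop_sparse_columns`, then `knn_fill`, then `min_max_normalize`, so that normalization only ever sees completed columns.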

The extracted features are used to classify users into the corresponding classes. Specifically, the behavior, attribute, and tag features extracted for a user are matched against the features recorded in a previously created feature rule base; the final matching degree is obtained by combining the input weight coefficients of the behavior, attribute, and tag features; and the user is classified into the designated class. The feature rule base should contain a record for each type of user group, including feature rules for behaviors, attributes, and tags.

For example, in the embodiment of the present invention, the feature rule base includes different user groups such as white-collar workers, the "tall, rich and handsome" group, family groups, two-dimensional (ACG) enthusiasts, and students. All of these user groups consist of users accumulated in the actual business process, and the rule for each user group is generated from the behavior, attribute, and tag features associated with it. By default, the system performs the weighted calculation over behaviors, attributes, and tags with a 1:1:1 weight distribution, and it supports adjusting the matching algorithm with custom input weights. When the extracted user behavior features are matched with the behavior features in the feature library, the matching is judged successful if the extracted behavior features contain the features in the feature library; otherwise, the matching is judged unsuccessful. When the extracted user attribute features are matched with the attribute features in the feature library, the matching is judged successful if the extracted attribute features contain the features in the feature library; otherwise, the matching is judged unsuccessful. When the extracted user tag features are matched with the tag features in the feature library, the matching is judged successful if the extracted tag features contain the features in the feature library; otherwise, the matching is judged unsuccessful. If, during feature matching, the behavior, attribute, and tag features of some users have no matching relation with the existing features of the feature rule base, the system by default divides those users into three groups by unsupervised means and adds the features of these groups to the feature rule base as rules.
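
The containment-based matching with default 1:1:1 weights can be sketched as follows. The data layout (each user and each rule as a dict of three feature sets), the normalization of the matching degree to the range 0 to 1, and the names `dimension_matches`, `match_degree`, and `assign_group` are assumptions made for illustration; the unsupervised fallback for unmatched users is not shown.

```python
DIMENSIONS = ("behavior", "attribute", "tag")

def dimension_matches(user_features: set, rule_features: set) -> bool:
    """A dimension matches when the user's features contain all rule features."""
    return rule_features <= user_features

def match_degree(user, rule, weights=(1.0, 1.0, 1.0)):
    """Weighted share of matched dimensions; default weights are 1:1:1."""
    matched = sum(w for d, w in zip(DIMENSIONS, weights)
                  if dimension_matches(user[d], rule[d]))
    return matched / sum(weights)

def assign_group(user, rule_base, weights=(1.0, 1.0, 1.0)):
    """Return the best-matching group name and its matching degree.

    Returns (None, 0.0) when no dimension of any rule matches; the text
    handles that case with an unsupervised grouping step not sketched here.
    """
    best_name, best_degree = None, 0.0
    for name, rule in rule_base.items():
        degree = match_degree(user, rule, weights)
        if degree > best_degree:
            best_name, best_degree = name, degree
    return best_name, best_degree
```

Custom weights change only the relative influence of the three dimensions; for instance, `weights=(2.0, 1.0, 1.0)` would let a behavior match count twice as much as an attribute or tag match.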

The invention also provides a preferred implementation, in which the three kinds of features are directly fused and summarized when the behavior, attribute, and tag features are extracted, yielding the final weighted feature value of the user. The obtained features are then input into a pre-trained classification model to classify the user directly. This greatly reduces the complexity of feature calculation and makes the calculation logic clearer. Combining the above steps, the users of the target customer group are divided into the corresponding classes, and the classification results are stored in the Hive database for further application and analysis.

For the divided customer groups, the groups are evaluated on the portrait dimensions that show significant differences, so that the characteristics of and differences between the customer groups produced by the model can be understood more intuitively. The concrete implementation steps are as follows:

After the system receives a request, it obtains from Hive the user ID details of the specified customer group or groups according to the input customer group IDs. For example, if a request to compare a family-of-three customer group with a two-dimensional customer group is received in the implementation case of the invention, the system loads the user ID details of the obtained customer group IDs into memory and matches the behavior, attribute, and tag data in Kudu and Hive according to the user IDs.

The matched behavior, attribute, and tag data are first filtered by means of principal component analysis, excluding factors whose missing-value ratio exceeds 90% and factors that are not main influences. The principal component analysis results of the two customer groups are then compared, factors whose difference between the groups is within 10% are filtered out in a second pass, and the final factor result is retained.
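
The two filtering passes can be illustrated with the short Python sketch below. The text does not define how the 10% difference is measured, so the sketch assumes a relative difference between per-factor scores of the two customer groups; the function names (`first_pass`, `second_pass`), the input layout, and the use of `None` for missing values are likewise hypothetical.

```python
def first_pass(columns, max_missing_ratio=0.9):
    """Drop factors whose share of missing (None) values exceeds the limit.

    `columns` maps a factor name to its list of observed values.
    """
    return {name: values for name, values in columns.items()
            if sum(v is None for v in values) / len(values) <= max_missing_ratio}

def second_pass(scores_a, scores_b, min_relative_diff=0.10):
    """Keep only factors whose scores differ noticeably between two groups.

    The relative difference |a - b| / max(|a|, |b|) is an assumed metric.
    """
    kept = {}
    for factor in scores_a.keys() & scores_b.keys():
        a, b = scores_a[factor], scores_b[factor]
        denom = max(abs(a), abs(b))
        relative = 0.0 if denom == 0 else abs(a - b) / denom
        if relative > min_relative_diff:
            kept[factor] = (a, b)
    return kept
```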

When principal component analysis is performed, the matched behavior, attribute, and tag data are first standardized, and a correlation matrix or covariance matrix is then calculated; the eigenvalues and eigenvectors of the correlation matrix are calculated; the cumulative contribution ratio is calculated (a cumulative contribution ratio above 85% is generally required); and the principal component scores are calculated from the coefficients, that is, after each sample is standardized, its score on each principal component is calculated.

After the score calculation is completed, the result is output to the system front-end interface to complete the evaluation of the customer groups.
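
The principal component steps above (standardize, correlation matrix, eigen-decomposition, cumulative contribution ratio, scores) can be sketched in dependency-free Python. This is one possible realization, not the one used by the system: the eigen-decomposition uses the cyclic Jacobi method as a stand-in for whatever solver is actually used, the function names are hypothetical, and the input is assumed to be complete numeric data with no constant columns.

```python
import math

def standardize(rows):
    """Z-score each column (assumes no column is constant)."""
    z_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        z_cols.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*z_cols)]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues (descending) and eigenvectors of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # rotate columns p and q of a and of v
                    for k in range(n):
                        mkp, mkq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):  # rotate rows p and q of a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    return [a[i][i] for i in order], [[v[k][i] for k in range(n)] for i in order]

def pca_scores(rows, min_cumulative=0.85):
    """Return eigenvalues, number of retained components, and sample scores."""
    z = standardize(rows)
    values, vectors = jacobi_eigen(correlation_matrix(z))
    total, cumulative, kept = sum(values), 0.0, 0
    for value in values:
        kept += 1
        cumulative += value / total
        if cumulative >= min_cumulative:
            break
    scores = [[sum(x * w for x, w in zip(row, vectors[c])) for c in range(kept)]
              for row in z]
    return values, kept, scores
```

The sign of each eigenvector, and therefore of each score column, is arbitrary, so comparisons between customer groups should rely on magnitudes or on a fixed sign convention.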

The embodiment of the invention provides a user portrait data-based customer group division method and device, comprising the following steps: acquiring user portrait data stored in Kudu, HDFS, or Hive storage; taking the behavior data, the attribute data and the tag data as input conditions, and calculating the user portrait data according to logical operation conditions and factor operation conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operations on the user portrait data to obtain feature data meeting a preset format; and after a matching operation is carried out between the feature data and a pre-established feature library, dividing the target users into corresponding customer groups. The invention integrates the behavior, attribute and tag data related to the user according to the user ID, stores the behavior, attribute and tag data separately based on the characteristics of Kudu, HDFS, and Hive, and provides efficient data query performance by using reasonable partitioning and bucketing strategies. This solves the problem that customer group division in current scenarios uses a single portrait dimension, which makes it difficult to improve division accuracy.

Furthermore, the invention uses the completed customer group division result, combined with the integrated user portrait data, to identify the characteristics of and differences between different customer groups. Significant differences are identified quickly by principal component analysis, and the customer groups are scored and evaluated quantitatively. This solves the problem that customer groups currently cannot be evaluated quantitatively after manual division, so their characteristics and differences cannot be described accurately.

Referring to FIG. 2, a schematic structural diagram of a customer group division device based on user portrait data according to an embodiment of the present invention is shown, and the device includes:

the feature extraction module is used for calculating the user portrait data according to a logical operation condition and a factor operation condition by taking the behavior data, the attribute data and the tag data as input conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operations on the user portrait data to obtain feature data meeting a preset format; the factor operation condition comprises a numerical-type factor, a character-type factor and a time-type factor; the feature data comprises behavior feature data, attribute feature data and tag feature data;
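By way of non-limiting illustration only, the combination of logical operation conditions with numerical-type, character-type and time-type factors can be sketched as follows. The field names, the operator vocabulary and the function names (`numeric_factor`, `character_factor`, `time_factor`, `select_target_users`) are hypothetical and are not taken from the embodiment.

```python
from datetime import date

NUMERIC_OPS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
}
CHARACTER_OPS = {
    "equals": lambda a, b: a == b,
    "contains": lambda a, b: b in a,
}


def numeric_factor(field, op, threshold):
    """Numerical-type factor, e.g. purchase_count >= 3."""
    return lambda user: field in user and NUMERIC_OPS[op](user[field], threshold)


def character_factor(field, op, text):
    """Character-type factor, e.g. city equals "Beijing"."""
    return lambda user: field in user and CHARACTER_OPS[op](user[field], text)


def time_factor(field, start, end):
    """Time-type factor: the field must fall inside [start, end]."""
    return lambda user: field in user and start <= user[field] <= end


def select_target_users(users, predicates, logic="and"):
    """Combine factor predicates with a logical AND or OR; return user IDs."""
    combine = all if logic == "and" else any
    return [u["user_id"] for u in users if combine(p(u) for p in predicates)]
```

A user record that lacks a field simply fails the corresponding factor, so incomplete portraits are filtered out rather than raising an error.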

The feature extraction module further includes:

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A customer group division method based on user portrait data, characterized by comprising the following steps: acquiring user portrait data stored in Kudu, HDFS or Hive storage; wherein the user portrait data comprises behavior data, attribute data and tag data; the behavior data includes: user ID, behavior occurrence time and behavior content;

after a matching operation is carried out on the feature data and a pre-built feature library, the target users are divided into corresponding customer groups;

the method further comprises scoring the principal components of the target users in different customer groups by using a principal component analysis algorithm, and completing the customer group evaluation according to the scores;

creating partitions for the user portrait data according to behavior occurrence time, and performing dynamic bucketed storage when the number of behaviors in the current day's partition is greater than a preset number;

calculating the user portrait data according to a logical operation condition and a numerical operation condition by taking the behavior data, the attribute data and the tag data as input conditions to obtain a target user; associating the target user ID with the user portrait data according to a preset time period, and performing completion and normalization operations on the user portrait data to obtain feature data meeting a preset format; which specifically comprises the following steps: taking the behavior data, the attribute data and the tag data as input conditions, and logically screening the user portrait data according to the logical operation conditions by adopting a minimum screening principle; performing numerical-type, character-type and time-type factor operations respectively on the logically screened user portrait data of the target user to obtain user portrait data screened by factor operation; associating the user portrait data screened by factor operation with the target user ID according to the time period; and completing and normalizing the associated data fields to obtain feature data meeting the preset format;

in the missing-value processing part, a KNN imputation algorithm is adopted for data completion; a linear function normalization algorithm is used for field normalization, converting the user portrait data linearly into the range [0,1], after which distance measurement and covariance calculation are performed; when the data do not follow a normal distribution, normalization is carried out through mean absolute deviation standardization, logarithmic transformation, decimal scaling and sigmoid functions.
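By way of non-limiting illustration only, and without limiting the scope of the claim, KNN-based completion followed by linear (min-max) normalization into [0,1] can be sketched as follows. The choice of `k`, the distance (root mean square difference over the fields both rows have) and the function names `knn_impute` and `min_max_normalize` are assumptions of this sketch, not features recited above.

```python
import math


def knn_impute(rows, k=2):
    """Fill each missing value (None) with the mean of that field over the
    k nearest rows that have it. Distances use only fields present in both rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(rows):
                if other_index == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # otherwise the value stays None
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def min_max_normalize(rows):
    """Linearly map every column to [0, 1]; a constant column maps to 0.0.
    Assumes no None values remain."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)]
            for row in rows]
```

Imputing before normalizing means the column minimum and maximum are computed over complete columns, so every field ends up on the same [0,1] scale before any distance or covariance calculation.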

2. The method of claim 1, wherein matching the feature data with a pre-built feature library comprises:

3. A device for dividing customer groups based on user portrait data, comprising:

the customer group division module is used for dividing the target users into corresponding customer groups after a matching operation is carried out on the feature data and a pre-established feature library;

the customer group evaluation module is used for scoring the principal components of the target users in different customer groups by using a principal component analysis algorithm, and completing the customer group evaluation according to the scores;

creating partitions for the user portrait data according to behavior occurrence time, and performing dynamic bucketed storage when the number of behaviors in the current day's partition is greater than a preset number;

the feature extraction module further comprises:

the completion and normalization module is used for performing completion and normalization operations on the associated data fields to obtain feature data meeting a preset format;