CN113836370A

CN113836370A - User group classification method and device, storage medium and computer equipment

Info

Publication number: CN113836370A
Application number: CN202111412279.0A
Authority: CN
Inventors: 陶景龙; 王启凡; 魏国富; 殷钱安; 余贤喆; 周晓勇; 梁淑云; 刘胜; 马影
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2021-12-24
Anticipated expiration: 2041-11-25
Also published as: CN113836370B; WO2023092646A1

Abstract

The invention discloses a user group classification method and device, a storage medium, and computer equipment. The method comprises the following steps: acquiring behavior data of a user group, and preprocessing the behavior data to obtain a behavior sequence data set with the user name of each user as the main object; extracting frequent behavior instruction combinations from the behavior sequence data set by using an association analysis algorithm and counting their frequencies to obtain a frequent instruction combination feature table; calculating sequence matching scores and inter-sequence similarity scores among the behavior sequences in the behavior sequence data set by using a sequence alignment algorithm to obtain a sequence similarity feature table; counting the frequency of the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table; and classifying the frequent instruction combination feature table, the sequence similarity feature table, and the behavior instruction frequency feature table by using a semi-supervised classification algorithm to obtain user groups of different categories, thereby improving classification efficiency.

Description

User group classification method and device, storage medium and computer equipment

Technical Field

The invention relates to the technical field of big data processing, in particular to a user group classification method, a user group classification device, a storage medium and computer equipment.

Background

User group classification is an especially important step for industries that operate around their users. For platforms with very large user bases, such as e-commerce, public resource management, and information security management, grouping individual users is both difficult and important. Compared with the traditional approach of constructing features from user attributes, classifying user groups with the users' operation behaviors as the raw features is more novel and more effective. Once the user group has been divided according to operation behavior, the resulting classification data can support downstream work such as accurate recommendation, updating and retention, and group management.

In the prior art, most group classification methods based on user operation behavior attach labels to the data set according to business logic, using attributes such as basic user attributes, user behavior tracks, and users' social connections, and then classify the user group with a supervised machine learning algorithm. However, such methods cannot be applied where no social relationship exists between users and user operations leave no behavior track, and labeling the user group is labor-intensive and inefficient. As a result, although these methods appear effective, their practical application scenarios are very limited, their labor cost is high, and their model training efficiency is low.

Disclosure of Invention

In view of this, the present application provides a user group classification method, device, storage medium and computer device, and mainly aims to solve the technical problems in the prior art that the application scenario of the user group classification method is limited, the required labor cost is high, and the model training efficiency is low.

According to a first aspect of the present invention, there is provided a method for classifying a user group, the method comprising:

acquiring behavior data of a user group, and preprocessing the behavior data of the user group to obtain a behavior sequence dataset which takes the user name of each user as a main object, wherein each user name corresponds to a behavior sequence, and each behavior sequence comprises at least one behavior instruction;

extracting frequent behavior instruction combinations from the behavior sequence data set by using an association analysis algorithm and counting their frequencies to obtain a frequent instruction combination feature table;

calculating sequence matching scores and inter-sequence similarity scores among all behavior sequences in the behavior sequence data set by using a sequence alignment algorithm to obtain a sequence similarity feature table;

carrying out frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table;

and classifying and analyzing the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table by adopting a semi-supervised classification algorithm to obtain user groups with different categories.

According to a second aspect of the present invention, there is provided an apparatus for classifying a user group, the apparatus comprising:

the user data acquisition module is used for acquiring behavior data of a user group and preprocessing the behavior data of the user group to obtain a behavior sequence data set taking the user name of each user as a main object, wherein each user name corresponds to one behavior sequence, and each behavior sequence comprises at least one behavior instruction;

the frequent item feature extraction module is used for extracting frequent behavior instruction combinations from the behavior sequence data set by using an association analysis algorithm and counting their frequencies to obtain a frequent instruction combination feature table;

the similarity feature extraction module is used for calculating sequence matching scores and inter-sequence similarity scores among all behavior sequences in the behavior sequence data set by using a sequence alignment algorithm to obtain a sequence similarity feature table;

the instruction frequency feature extraction module is used for carrying out frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table;

and the user group classification module is used for classifying and analyzing the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table by adopting a semi-supervised classification algorithm to obtain user groups with different categories.

According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method of classifying a user population.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method of classifying a user group when executing the program.

According to the user group classification method and device, storage medium, and computer equipment provided by the invention, the behavior habit attributes of the users are mined by extracting, and counting the frequency of, each operation behavior and each frequent operation behavior combination of the user group. By calculating the sequence matching scores and inter-sequence similarity scores among the behavior sequences in the user group, the strength of the potential connection between each user and the user group can be quantified, which compensates for the lack of behavior relation attributes among users who have no social relationship. Because the method mines the behavior habit attributes, behavior relation attributes, and potential connection attributes of the users in the user group, it can be applied widely in scenarios where no social relationship exists between users and user operations leave no behavior track, which expands the application range of user group classification. In addition, by adopting a semi-supervised classification algorithm, the method reduces the workload of adding classification labels to the user group and effectively improves both the training efficiency of the user group classification model and the classification efficiency of the user group.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described in detail below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flowchart illustrating a method for classifying a user group according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a sample behavior sequence dataset according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a sample frequent instruction combination feature table according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a sample sequence similarity feature table provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a sample behavior instruction frequency characteristic table according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a sample classification result of a user group according to an embodiment of the present invention;

FIG. 7 is a scatter plot diagram illustrating a classification result of a user group according to an embodiment of the present invention;

FIG. 8 is a flowchart illustrating a method for classifying user groups according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram illustrating a classification apparatus for a user group according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In one embodiment, as shown in FIG. 1, a method for classifying a user group is provided. The method is described here as applied to a computer device such as a server, and includes the following steps:

101. Acquire behavior data of the user group, and preprocess the behavior data of the user group to obtain a behavior sequence data set with the user name of each user as the main object.

The behavior data of the user group refers to data on the operation behaviors of multiple users (usually a large number of users) in a system or platform that has users as its operation carrier, obtained by analyzing registration information, log information, and the like. An operation behavior is an operation instruction triggered by a user at an operation time point, for example logging in, browsing the main page, browsing a sub-page, interacting with a component in a page, or placing an order for a commodity. In this embodiment, to facilitate data processing, each operation instruction triggered by a user may be converted into an instruction code; for example, the "login" instruction may be converted into the instruction code "h", and the "browse home page" instruction into the instruction code "f".

Specifically, the computer device may obtain the behavior data of the user group to be processed from the data management center of a system or platform. Here, the user group mainly refers to multiple users registered on the same system or platform, and the behavior data mainly includes the user name of each user, the behavior instructions of each user, and the operation time of each behavior instruction. The computer device may then perform preprocessing operations such as data cleaning and data processing on the obtained behavior data, encode each behavior instruction, and sort the encoded behavior instructions by operation time to form the behavior sequence of each user. Finally, the computer device may list the behavior sequences of all users in a data table with the user name of each user as the main object, forming the behavior sequence data set of the user group.

In this embodiment, the behavior sequence data set includes at least two fields: the user name and the behavior sequence corresponding to that user name. Because this embodiment uses a semi-supervised classification algorithm to classify the user group, the classification labels of the user group may be incomplete; that is, some users in the behavior sequence data set have a classification label and the others do not.

102. Extract frequent behavior instruction combinations from the behavior sequence data set by using an association analysis algorithm and count their frequencies to obtain a frequent instruction combination feature table.

An association analysis algorithm is an unsupervised learning algorithm for finding associations among the data in a data set. It can discover relationships within large-scale data, for example frequent item sets (sets of items that often appear together) and association rules (which suggest that a strong relationship may exist between two items). Common association analysis algorithms include the Apriori algorithm and the FP-growth algorithm.

Specifically, the computer device may find the frequent item sets in the behavior sequence data set by using an association analysis algorithm such as the Apriori algorithm or the FP-growth algorithm, then count the frequency with which each frequent item appears in the behavior sequence data set, and finally form a frequent instruction combination feature table with the user name and the frequent items as field names. In this embodiment, a frequent item is specifically a frequent behavior instruction combination, that is, a set of behavior instructions that frequently appear together in the behavior sequence data set. For example, the "login" instruction and the "browse home page" instruction typically appear one after the other; if the "login" instruction is encoded as "h" and the "browse home page" instruction as "f", then "hf" is a frequent behavior instruction combination. The association analysis algorithm finds all frequent behavior instruction combinations in the behavior sequence data set, and counting the frequency of each combination in each behavior sequence then yields the frequent instruction combination feature table. Through the frequency of frequent behavior instruction combinations, this embodiment can mine the daily behavior habits of each user and the overall daily behavior trend of the user group, which provides a strong basis for classifying user groups that have no social relationships. It should be noted that a frequent behavior instruction combination consists of at least two behavior instructions that appear together, and different frequent behavior instruction combinations may have different lengths.

103. Calculate sequence matching scores and inter-sequence similarity scores among the behavior sequences in the behavior sequence data set by using a sequence alignment algorithm to obtain a sequence similarity feature table.

A sequence alignment algorithm is an algorithm for mining the similarity between each pair of behavior sequences in a data set. In general, a sequence alignment algorithm describes the similarity between sequences with two indexes: identity and similarity. Sequence alignment algorithms are mainly divided into global sequence alignment algorithms and local sequence alignment algorithms; common examples include the Needleman-Wunsch algorithm, the Smith-Waterman algorithm, the FASTA algorithm, and the BLAST algorithm.

Specifically, the computer device may use a global sequence alignment algorithm and/or a local sequence alignment algorithm to calculate the identity and the similarity between each behavior sequence in the behavior sequence data set and the other behavior sequences, where the identity may be expressed as a sequence matching score array and the similarity as an inter-sequence similarity score array. The computer device may then calculate the maximum value, minimum value, average value, standard deviation, and variance of each sequence matching score array and each inter-sequence similarity score array, quantifying the similarity features between sequences to form a sequence similarity feature table. Through the sequence similarity features, this embodiment can mine the behavior relations and potential connections among user behaviors, which provides another strong basis for classifying user groups that have no social relationships and no behavior tracks. It should be noted that, in this embodiment, either the global or the local sequence alignment algorithm may be used alone to calculate the identity and similarity between each behavior sequence and the other behavior sequences. Alternatively, both may be used together to calculate the global identity and global similarity as well as the local identity and local similarity, so as to improve the accuracy of sequence alignment.

104. Carry out frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table.

Specifically, the computer device may identify each distinct behavior instruction in the behavior sequence data set by data processing such as deduplication, and then count how often each behavior instruction occurs in the behavior sequence data set to form a behavior instruction frequency feature table with the user name and the behavior instructions as field names. Through the frequency of behavior instructions, this embodiment can mine the behavior inertia of each user and the overall behavior inertia of the user group, which provides a strong basis for classifying user groups that have no social relationships.

105. Classify and analyze the frequent instruction combination feature table, the sequence similarity feature table, and the behavior instruction frequency feature table by using a semi-supervised classification algorithm to obtain user groups of different categories.

A semi-supervised classification algorithm obtains an initial model from labeled training data, uses the initial model to predict the unlabeled training data, and then iteratively retrains the model on the prediction results to obtain a data classification result. The procedure is as follows: first, a model is trained on the existing labeled training data and used to predict the unlabeled data; next, the portion of the unlabeled data that the model predicts with higher confidence is added to the training set together with the labels assigned by the model; if the result meets the preset requirements, the current training set and model are output, and otherwise the model is retrained until the requirements are met. Common semi-supervised classification algorithms include the semi-supervised support vector machine (SVM) and the semi-supervised logistic regression (LR) model.

Specifically, the computer device may first train on the frequent instruction combination features, sequence similarity features, and behavior instruction frequency features of the users who have classification labels to obtain an initial classification model. It may then use the initial model to predict classification labels for the users who have none, and finally combine the behavior data of all users with their classification labels to retrain the initial model. This process is repeated iteratively until the model parameters and the classification results meet the preset requirements, yielding the user group classification model and the user groups of different categories. By adopting a semi-supervised classification algorithm, this embodiment removes a considerable part of the workload of adding classification labels to the user data, which improves the training efficiency of the user classification model and reduces labor cost.

It is understood that, after the behavior sequence data set with the user name as the main object is obtained, the order of generating the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table based on the behavior sequence data set may be adjusted according to the actual situation, that is, the order of the step 102, the step 103 and the step 104 may be adjusted according to the actual need, and the present embodiment is not limited specifically herein.

According to the user group classification method provided by this embodiment, the behavior habit attributes of the users are mined by extracting, and counting the frequency of, each operation behavior and each frequent operation behavior combination of the user group. By calculating the sequence matching scores and inter-sequence similarity scores among the behavior sequences in the user group, the strength of the potential connection between each user and the user group is quantified, which compensates for the lack of behavior relation attributes among users who have no social relationship. Because the method mines the behavior habit attributes, behavior relation attributes, and potential connection attributes of the users in the user group, it can be applied widely in scenarios where no social relationship exists between users and user operations leave no behavior track, which expands the application range of user group classification. In addition, by adopting a semi-supervised classification algorithm, the method reduces the workload of adding classification labels to the user group and effectively improves both the training efficiency of the user group classification model and the classification efficiency of the user group.

In an embodiment, step 101 may specifically include the following steps. First, the behavior data of the user group is obtained, where the behavior data includes the user name of each user, at least one behavior instruction of each user, and the operation time of each behavior instruction. Next, the behavior instructions of each user are encoded by using a preset character dictionary, and the encoded behavior instructions are sorted by operation time to obtain the behavior sequence of each user. Finally, a behavior sequence data set with the user name of each user as the main object is generated from the user names and the behavior sequences. In this embodiment, the behavior data of the user group further includes classification labels for some of the users; that is, some users in the user group have a classification label and the others do not, and a classification label field is accordingly provided in the behavior sequence data set. For example, FIG. 2 shows a sample of a behavior sequence data set. As shown in FIG. 2, account is the user name, such as "17185" or "17187"; opt_seq is the behavior sequence, such as "hhB" or "hbfhbbhbbbbbbbhbbf", in which each letter, such as "h" or "B", denotes one behavior instruction and the behavior instructions are arranged in time order; and label is the classification label, which holds a category value such as "1" or "2" for users who have a classification label and a special marker such as "NAN" for users who do not.
According to this embodiment, arranging the behavior data of the user group into a behavior sequence data set makes subsequent feature extraction and classification analysis convenient and improves data processing efficiency.
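
The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the record layout, the function name `build_behavior_sequences`, the dictionary contents (only "h" and "f" come from the description; "b" is hypothetical), and the choice to drop instructions missing from the dictionary as a form of data cleaning are all assumptions.

```python
from collections import defaultdict

# Hypothetical character dictionary: instruction name -> one-letter code.
# Only "h" (login) and "f" (browse home page) appear in the description.
CHAR_DICT = {"login": "h", "browse_home_page": "f", "browse_sub_page": "b"}


def build_behavior_sequences(records, char_dict=CHAR_DICT, labels=None):
    """Build rows of (account, opt_seq, label) from raw behavior records.

    records: iterable of (user_name, instruction, operation_time) tuples.
    labels:  optional dict of user_name -> classification label.
    """
    labels = labels or {}
    per_user = defaultdict(list)
    for account, instruction, op_time in records:
        if instruction not in char_dict:
            continue  # assumed cleaning rule: skip instructions with no code
        per_user[account].append((op_time, char_dict[instruction]))

    dataset = []
    for account in sorted(per_user):
        # Sort the encoded instructions by operation time.
        events = sorted(per_user[account], key=lambda event: event[0])
        opt_seq = "".join(code for _, code in events)
        dataset.append({
            "account": account,
            "opt_seq": opt_seq,
            "label": labels.get(account, "NAN"),  # "NAN" marks unlabeled users
        })
    return dataset


records = [
    ("17185", "browse_home_page", 20),
    ("17185", "login", 10),
    ("17187", "login", 5),
    ("17187", "browse_sub_page", 7),
    ("17187", "browse_home_page", 9),
    ("17187", "unknown_op", 8),
]
dataset = build_behavior_sequences(records, labels={"17185": "1"})
```

Each row carries the three fields named for FIG. 2 (account, opt_seq, label), so the later feature tables can all be joined on account.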

In an embodiment, step 102 may specifically include the following steps. First, the frequent behavior instruction combinations in the behavior sequence data set are extracted by using an association analysis algorithm to obtain a frequent instruction combination list containing a plurality of frequent behavior instruction combinations. Then, the frequency with which each frequent behavior instruction combination in the list appears in the behavior sequence data set is counted to obtain a frequent instruction combination feature table with the user name and the frequent behavior instruction combinations as field names. In this embodiment, the computer device may specifically use the FP-growth algorithm to extract the frequent instruction combinations of all behavior sequences in the behavior sequence data set, obtaining a list of frequent instruction combinations of different lengths. For example, FIG. 3 shows a sample of a frequent instruction combination feature table. As shown in FIG. 3, account is the user name, such as "17744.0" or "17763.0"; the other field names are frequent behavior instruction combinations, such as "FD" or "AC"; and the number under each frequent behavior instruction combination, such as "8", "16", or "9", is the frequency with which that combination occurs. In this embodiment, the frequent instruction combination feature table provides the frequency of frequent behavior instruction combinations as a feature. Through this feature, the daily behavior habits of each user and the overall daily behavior trend of the user group can be mined, which supports accurate classification of user groups that have no social relationships.
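
The two stages above (extract a list of frequent combinations, then count each one per user) can be sketched as follows. Note that this sketch does not implement FP-growth: it swaps in a brute-force enumeration that is only practical for small inputs. It also assumes that a combination is a contiguous run of instruction codes, as the "hf" example suggests, and that "frequent" means the combination occurs in at least a `min_support` fraction of the sequences. The function names and the `min_support` and `max_len` parameters are hypothetical.

```python
from collections import Counter


def count_overlapping(seq, pattern):
    """Count occurrences of pattern in seq, allowing overlaps."""
    width = len(pattern)
    return sum(1 for i in range(len(seq) - width + 1)
               if seq[i:i + width] == pattern)


def frequent_combinations(sequences, min_support=0.5, max_len=3):
    """Return contiguous combinations of 2..max_len instructions whose
    support (fraction of sequences containing them) is >= min_support."""
    support = Counter()
    for seq in sequences.values():
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        support.update(seen)  # each sequence counts once per combination
    threshold = min_support * len(sequences)
    return sorted(c for c, k in support.items() if k >= threshold)


def combination_feature_table(sequences, combos):
    """One row per user: account plus the count of each frequent combination."""
    return [{"account": account,
             **{c: count_overlapping(seq, c) for c in combos}}
            for account, seq in sequences.items()]


sequences = {"17744": "hfhfb", "17763": "hfb", "17801": "bbh"}
combos = frequent_combinations(sequences, min_support=0.6)
table = combination_feature_table(sequences, combos)
```

With a dedicated frequent-pattern library, only `frequent_combinations` would need to change; the per-user counting stage is independent of how the list is mined.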

In an embodiment, step 103 may specifically include the following steps. First, a global sequence matching score array and a global sequence similarity score array among the behavior sequences in the behavior sequence data set are calculated by a global sequence alignment algorithm, and the maximum value, minimum value, average value, standard deviation, and variance of each of the two arrays are calculated to obtain a global sequence similarity feature table. Next, a local sequence matching score array and a local sequence similarity score array among the behavior sequences are calculated by a local sequence alignment algorithm, and the maximum value, minimum value, average value, standard deviation, and variance of each of the two arrays are calculated to obtain a local sequence similarity feature table. Finally, with the user name of each user as the association field, the global sequence similarity feature table and the local sequence similarity feature table are joined and merged to obtain the sequence similarity feature table.
In this embodiment, the computer device may specifically use the Needleman-Wunsch global sequence alignment algorithm and the Smith-Waterman local sequence alignment algorithm to calculate, between the behavior sequence of each user and the behavior sequences of all other users, a global score (sequence matching score) array, a global percent identity (percentage of similarity between sequences) array, a local score array, and a local percent identity array. It then calculates the maximum value, minimum value, average value, standard deviation, and variance of each array to output the global sequence similarity feature table and the local sequence similarity feature table, and finally joins and merges the two tables on the user name field to obtain the sequence similarity feature table. For example, FIG. 4 shows a sample of a sequence similarity feature table. As shown in FIG. 4, account is the user name, such as "17744.0" or "17763.0", and the other field names are the maximum value, minimum value, average value, standard deviation, and variance of each array, such as "loc_score_min" or "loc_score_std". In this embodiment, the sequence similarity feature table provides sequence similarity as a feature. Through this feature, the behavior relations and potential connections among user behaviors can be mined, which improves the classification accuracy for user groups that have no social relationships and no behavior tracks.
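
The score arrays and their summary statistics can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring scheme (match +1, mismatch -1, linear gap -1) is not given in the description, the column prefixes `glo_score` and `loc_score` are illustrative names, and population (rather than sample) standard deviation and variance are used. Only the alignment scores are computed; the percent identity arrays would be summarized the same way but need an alignment traceback, which is omitted here.

```python
from statistics import mean, pstdev, pvariance

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring scheme


def global_score(a, b):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            cur.append(max(diag, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def local_score(a, b):
    """Smith-Waterman local alignment score (best-scoring local region)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            cur.append(max(0, diag, prev[j] + GAP, cur[j - 1] + GAP))
            best = max(best, cur[j])
        prev = cur
    return best


def summarize(prefix, values):
    """Maximum, minimum, mean, standard deviation and variance of one array."""
    return {f"{prefix}_max": max(values), f"{prefix}_min": min(values),
            f"{prefix}_mean": mean(values), f"{prefix}_std": pstdev(values),
            f"{prefix}_var": pvariance(values)}


def similarity_feature_table(sequences):
    """One row per user, comparing that user's sequence with all others."""
    table = []
    for account, seq in sequences.items():
        others = [s for other, s in sequences.items() if other != account]
        row = {"account": account}
        row.update(summarize("glo_score", [global_score(seq, s) for s in others]))
        row.update(summarize("loc_score", [local_score(seq, s) for s in others]))
        table.append(row)
    return table


table = similarity_feature_table({"u1": "hfb", "u2": "hb", "u3": "hfb"})
```

Comparing every user with every other user takes a number of alignments that grows quadratically with the number of users, so a large user group would likely need sampling or an optimized alignment library.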

In an embodiment, step 104 may specifically include the following steps. First, all behavior instructions in the behavior sequence data set are merged and deduplicated to obtain a behavior instruction list containing every distinct behavior instruction. Then, the frequency with which each behavior instruction in the list appears in the behavior sequence data set is counted to obtain a behavior instruction frequency feature table with the user name and the behavior instructions as field names. For example, FIG. 5 shows a sample of a behavior instruction frequency feature table. As shown in FIG. 5, account is the user name, such as "17744.0" or "17763.0"; the other field names are behavior instructions, such as "A", "B", or "C"; and the number under each behavior instruction, such as "0", "4", or "0", is the frequency with which that behavior instruction occurs. In this embodiment, the behavior instruction frequency feature table provides behavior instruction frequency as a feature. Through this feature, the behavior inertia of each user and the overall behavior inertia of the user group can be mined, which further improves the classification accuracy for user groups that have no social relationships.
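
The merge, deduplicate, and count procedure above can be sketched as follows. The function name and the sample sequences are hypothetical; the sketch assumes each character of a behavior sequence is one instruction code, as in the encoded sequences described earlier.

```python
from collections import Counter


def instruction_frequency_table(sequences):
    """Return (instruction list, rows of per-user instruction counts)."""
    # Merge all sequences and deduplicate to get the instruction list.
    instructions = sorted(set("".join(sequences.values())))
    table = []
    for account, seq in sequences.items():
        counts = Counter(seq)
        # Instructions a user never triggered get an explicit 0.
        table.append({"account": account,
                      **{code: counts.get(code, 0) for code in instructions}})
    return instructions, table


instructions, table = instruction_frequency_table(
    {"17744": "BBDBB", "17763": "ACA"})
```

Filling in zeros for unused instructions keeps every row the same width, which is what lets the table be joined with the other feature tables and fed to a classifier.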

In an embodiment, step 105 may specifically include the following steps: first, with the user name of each user as the association field, the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table are joined to obtain a feature integration data table; then, classification analysis is performed on the feature integration data table through a semi-supervised support vector machine algorithm to obtain a user group classification data table, from which user groups of different categories are obtained. For example, fig. 6 shows a sample user group classification data table, in which account refers to the user name, the other field names refer to features such as behavior instructions and frequent behavior instruction combinations, and label refers to the classification label. Further, the classification result can be observed more intuitively by plotting the user group classification data as a scatter diagram, as shown in fig. 7. In this embodiment, adopting a semi-supervised classification algorithm removes a considerable part of the workload of adding classification labels to user data, which improves the training efficiency of the user classification model and reduces labor cost.
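The join on the user name field can be sketched as follows. The source does not state the join type, so an inner join (only users present in every table are kept) is an assumption, as is the dictionary-of-rows representation.

```python
def merge_on_user(*tables: dict) -> dict:
    """Inner-join feature tables of the form {user_name: {field: value}}."""
    users = set(tables[0])
    for t in tables[1:]:
        users &= set(t)  # keep only users present in every table
    merged = {}
    for user in sorted(users):
        row = {}
        for t in tables:
            row.update(t[user])  # on a field-name clash the later table wins
        merged[user] = row
    return merged
```

Distinct column prefixes per table (for example per-feature-family prefixes) avoid the silent overwrite noted in the comment.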

In one embodiment, the user group comprises tagged users and untagged users, wherein the behavior data of the tagged users comprises a classification tag. The step 105 may specifically include the following steps: firstly, training a support vector machine model according to the characteristics of a labeled user in a characteristic integrated data table and the classification labels of the labeled users to obtain an initial user classification model, then inputting the characteristics of a non-labeled user in the characteristic integrated data table into the initial user classification model to obtain the classification labels of the non-labeled user, further optimizing the initial user classification model according to the characteristics of the non-labeled user in the characteristic integrated data table and the classification labels of the non-labeled user to obtain a user classification model, and finally inputting the characteristics of all users in the user group in the characteristic integrated data table into the user classification model to obtain user groups with different categories.
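The train, pseudo-label and retrain procedure described above can be sketched as follows. This is only an illustration under stated assumptions: it handles two classes encoded as -1 and +1, it uses a small linear support vector machine fitted by subgradient descent on the hinge loss rather than any particular library solver, and the hyperparameters and function names are hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - lr * lam) for wj in w]  # regularisation shrinkage
            if margin < 1:  # the sample violates the margin: step towards it
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(model, X):
    """Label each row by the sign of the decision function."""
    w, b = model
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]


def self_train(X_lab, y_lab, X_unlab):
    # 1. Train the initial model on the labeled users only.
    initial = train_linear_svm(X_lab, y_lab)
    # 2. Predict classification labels for the unlabeled users.
    pseudo = predict(initial, X_unlab)
    # 3. Optimize the model using the unlabeled users and their predicted labels.
    final = train_linear_svm(X_lab + X_unlab, y_lab + pseudo)
    # 4. Classify every user in the group with the optimized model.
    return final, predict(final, X_lab + X_unlab)
```

A single pseudo-labeling pass is shown for brevity; a transductive SVM would typically revisit the pseudo-labels iteratively, which this sketch does not attempt.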

Further, as a refinement and an extension of the specific implementation of the above embodiment, in order to fully explain the implementation process of the embodiment, a method for classifying user groups is provided, as shown in fig. 8, the method includes the following steps:

step 1, acquiring behavior data of a user group, wherein the behavior data comprises the user name of each user, the behavior instructions, the operation time of each behavior instruction, and group classification labels that are available for only some of the users;

step 2, cleaning and processing the data, which mainly comprises encoding the behavior instructions with a preset character dictionary and generating a behavior sequence data set with the user name as the main object;

step 3, frequent item set features: calculating and counting the frequent behavior items over the behavior sequence data of all users through the FP-Growth algorithm, and using the results as feature fields to obtain a data table D0;

step 4, sequence similarity features: calculating the sequence similarity among all user behavior sequences with the Needleman-Wunsch algorithm (global sequence alignment) and the Smith-Waterman algorithm (local sequence alignment), each of which yields a score (sequence matching score) array and a percentIdentity (percentage of similarity between sequences) array; calculating the maximum value, minimum value, average value, standard deviation and variance of each score array and percentIdentity array; and outputting these statistics as feature columns to obtain a data table D1;

step 5, counting the occurrence frequency of each instruction in the behavior sequences of all main objects, and using the frequencies as feature fields to obtain a data table D2;

step 6, performing feature engineering on the feature tables D0, D1 and D2, and arranging them into the model input format DX;

step 7, obtaining the user group classification by using the TSVM semi-supervised classification algorithm.

According to the user group classification method provided by this embodiment, global and local sequence similarity calculations are performed on the behavior data of the user group and condensed into statistical features, so that the potential connection between each user and all other users can be quantified; this compensates for the missing behavior relationship attributes among users without social contact and adds potential connection attributes between users. By performing frequency statistics on the behavior instructions and on the frequent behavior instruction combinations of the user group, the behavior habit attributes shared within the user group can be mined, which improves the accuracy of user group classification. Finally, using a semi-supervised classification algorithm reduces the work of manually adding labels, which improves the degree of automation and the operating efficiency of user group classification.

Further, as a specific implementation of the method shown in fig. 1 to fig. 8, the present embodiment provides a user group classification apparatus. As shown in fig. 9, the apparatus includes a user data acquisition module 21, a frequent item feature extraction module 22, a similarity feature extraction module 23, an instruction frequency feature extraction module 24 and a user group classification module 25.

The user data obtaining module 21 may be configured to obtain behavior data of a user group, and pre-process the behavior data of the user group to obtain a behavior sequence dataset in which a user name of each user is a main object, where each user name corresponds to one behavior sequence, and each behavior sequence includes at least one behavior instruction;

the frequent item feature extraction module 22 is configured to extract the frequent behavior instruction combinations in the behavior sequence data set and perform frequency statistics on them by using an association analysis algorithm, to obtain a frequent instruction combination feature table;

the similarity feature extraction module 23 is configured to calculate, through a sequence comparison algorithm, a sequence matching score and an inter-sequence similarity score between behavior sequences in the behavior sequence data set to obtain a sequence similarity feature table;

the instruction frequency characteristic extraction module 24 is configured to perform frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency characteristic table;

the user group classification module 25 may be configured to perform classification analysis on the frequent instruction combination feature table, the sequence similarity feature table, and the behavior instruction frequency feature table by using a semi-supervised classification algorithm, so as to obtain user groups with different categories.

In a specific application scenario, the user data obtaining module 21 is specifically configured to obtain behavior data of a user group, where the behavior data of the user group includes a user name of each user, at least one behavior instruction of each user, and an operation time of each behavior instruction; coding the behavior instruction of each user by using a preset character dictionary; sequencing the coded behavior instructions according to the operation time of the behavior instructions to obtain a behavior sequence of each user; and generating a behavior sequence data set taking the user name of each user as a main object according to the user name of each user and the behavior sequence of each user.
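The encode, sort and group steps performed by this module can be sketched as follows. The record layout `(user_name, instruction, operation_time)`, the function name, and the contents of the character dictionary are assumptions made for illustration only.

```python
from collections import defaultdict


def build_behavior_sequences(records, char_dict):
    """Turn raw (user_name, instruction, operation_time) records into
    one encoded behavior sequence string per user name."""
    per_user = defaultdict(list)
    for user, instruction, op_time in records:
        # Encode each instruction with the preset character dictionary.
        per_user[user].append((op_time, char_dict[instruction]))
    return {
        # Order each user's encoded instructions by operation time.
        user: "".join(code for _, code in sorted(events, key=lambda e: e[0]))
        for user, events in per_user.items()
    }
```

An instruction missing from the dictionary raises `KeyError` here; whether unknown instructions should instead be dropped or mapped to a placeholder code is not specified in the source.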

In a specific application scenario, the frequent item feature extraction module 22 is specifically configured to extract a frequent behavior instruction combination in the behavior sequence data set by using an association analysis algorithm, so as to obtain a frequent instruction combination list including a plurality of frequent behavior instruction combinations; and counting the frequency of each frequent behavior instruction combination in the frequent instruction combination list in the behavior sequence data set to obtain a frequent instruction combination feature table taking the user name and the frequent behavior instruction combination as field names.
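The two stages above (build the list of frequent combinations, then tabulate them per user) can be sketched as follows. To stay short, the sketch enumerates candidate combinations by brute force instead of using FP-Growth, so it only suits small instruction alphabets. Treating each user's sequence as one transaction, the `min_support` and `max_size` parameters, and recording a 0/1 presence flag per user are all assumptions; the source does not define how the per-user frequency is counted.

```python
from itertools import combinations


def frequent_combinations(sequences, min_support=2, max_size=3):
    """List instruction combinations contained in at least min_support sequences."""
    transactions = [set(seq) for seq in sequences.values()]
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            support = sum(1 for t in transactions if t.issuperset(combo))
            if support >= min_support:
                frequent.append(combo)
    return frequent


def combination_feature_table(sequences, frequent):
    """Return {user_name: {combination: 1 if the user's sequence contains it else 0}}."""
    return {
        user: {"".join(c): int(set(seq).issuperset(c)) for c in frequent}
        for user, seq in sequences.items()
    }
```

The resulting table has one row per user name and one column per frequent combination, which is the shape the text describes for the frequent instruction combination feature table.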

In a specific application scenario, the similarity feature extraction module 23 is specifically configured to calculate a global sequence matching score array and a global inter-sequence similarity score array between behavior sequences in the behavior sequence data set by using a global sequence comparison algorithm; respectively calculating the maximum value, the minimum value, the average value, the standard deviation and the variance of the global sequence matching score array and the global inter-sequence similarity score array to obtain a global sequence similarity feature table; calculating a local sequence matching score array and a local sequence similarity score array among all behavior sequences in the behavior sequence data set through a local sequence comparison algorithm; respectively calculating the maximum value, the minimum value, the average value, the standard deviation and the variance of the local sequence matching score array and the local sequence inter-similarity score array to obtain a local sequence similarity feature table; and taking the user name of each user as an association field, and performing association combination on the global sequence similarity feature table and the local sequence similarity feature table to obtain a sequence similarity feature table.

In a specific application scenario, the instruction frequency feature extraction module 24 may be specifically configured to perform merging and deduplication processing on all behavior instructions in the behavior sequence data set to obtain a behavior instruction list including all behavior instructions; and counting the frequency of each behavior instruction in the behavior instruction list in the behavior sequence data set to obtain a behavior instruction frequency characteristic table with the user name and the behavior instruction as field names.

In a specific application scenario, the user group classification module 25 is specifically configured to perform association and merging on the frequent instruction combination feature table, the sequence similarity feature table, and the behavior instruction frequency feature table by using a user name of each user as an association field to obtain a feature integration data table; and classifying and analyzing the feature integration data table through a semi-supervised support vector machine algorithm to obtain user groups with different categories.

In a specific application scenario, a user group comprises tagged users and non-tagged users, and behavior data of the tagged users comprises a classification tag; the user group classification module 25 is further specifically configured to train the support vector machine model according to the features of the tagged users in the feature integration data table and the classification tags of the tagged users, so as to obtain an initial user classification model; inputting the characteristics of the users without labels in the characteristic integrated data table into an initial user classification model to obtain the classification labels of the users without labels; optimizing the initial user classification model according to the characteristics of the non-label users in the characteristic integration data table and the classification labels of the non-label users to obtain a user classification model; and inputting the characteristics of all users in the user group in the characteristic integration data table into the user classification model to obtain the user groups with different categories.

It should be noted that other corresponding descriptions of the functional modules related to the classification device for a user group provided in this embodiment may refer to the corresponding descriptions in fig. 1 to fig. 8, and are not described herein again.

Based on the method shown in fig. 1 to 8, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the method for classifying a user group shown in fig. 1 to 8.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.

Based on the method shown in fig. 1 to 8 and the embodiment of the classification apparatus for a user group shown in fig. 9, in order to achieve the above object, the present embodiment further provides an entity device for classifying a user group, which may specifically be a personal computer, a server, a smart phone, a tablet computer, a smart watch, or other network devices, and the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the above-described method as shown in fig. 1 to 8.

Optionally, the entity device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.

Those skilled in the art will appreciate that the structure of the entity device for classifying a user group provided in the present embodiment does not constitute a limitation on the entity device, which may include more or fewer components, combine some components, or arrange the components differently.

The storage medium may further include an operating system and a network communication module. The operating system is a program for managing the hardware of the above-mentioned entity device and the software resources to be identified, and supports the operation of the information processing program and other software and/or programs to be identified. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. The method first acquires behavior data of a user group and preprocesses it to obtain a behavior sequence data set with the user name of each user as the main object. It then extracts the frequent behavior instruction combinations in the behavior sequence data set with an association analysis algorithm and performs frequency statistics on them to obtain a frequent instruction combination feature table; calculates sequence matching scores and inter-sequence similarity scores among the behavior sequences with a sequence alignment algorithm to obtain a sequence similarity feature table; and performs frequency statistics on the behavior instructions to obtain a behavior instruction frequency feature table. Finally, it performs classification analysis on the three feature tables with a semi-supervised classification algorithm to obtain user groups of different categories. Compared with the prior art, the method mines the behavior habit attributes, behavior relationship attributes and potential connection attributes of the users in the user group, so that it can be applied to scenarios in which there are neither social relationships among users nor behavior tracks of user operations, which expands the application range of the user group classification method. In addition, the method reduces the workload of adding classification labels to the user group, and effectively improves both the training efficiency of the user group classification model and the classification efficiency of the user group.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims

1. A method for classifying a user population, the method comprising:

acquiring behavior data of a user group, and preprocessing the behavior data of the user group to obtain a behavior sequence data set taking the user name of each user as a main object, wherein each user name corresponds to a behavior sequence, and each behavior sequence comprises at least one behavior instruction;

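extracting frequent behavior instruction combinations in the behavior sequence data set and carrying out frequency statistics on them by using an association analysis algorithm to obtain a frequent instruction combination feature table;

calculating sequence matching scores and inter-sequence similarity scores among the behavior sequences in the behavior sequence data set through a sequence alignment algorithm to obtain a sequence similarity feature table;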

performing frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table;

and carrying out classification analysis on the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table by adopting a semi-supervised classification algorithm to obtain user groups with different categories.

2. The method according to claim 1, wherein the acquiring the behavior data of the user group and preprocessing the behavior data of the user group to obtain a behavior sequence dataset with the user name of each user as a main object comprises:

acquiring behavior data of a user group, wherein the behavior data of the user group comprises a user name of each user, at least one behavior instruction of each user and operation time of each behavior instruction;

coding the behavior instruction of each user by using a preset character dictionary;

sequencing the coded behavior instructions according to the operation time of the behavior instructions to obtain a behavior sequence of each user;

and generating a behavior sequence data set taking the user name of each user as a main object according to the user name of each user and the behavior sequence of each user.

3. The method of claim 1, wherein the extracting and frequency counting the frequent behavior instruction combinations in the behavior sequence data set by using an association analysis algorithm to obtain a frequent instruction combination feature table comprises:

extracting frequent behavior instruction combinations in the behavior sequence data set by using an association analysis algorithm to obtain a frequent instruction combination list containing a plurality of frequent behavior instruction combinations;

and counting the frequency of each frequent behavior instruction combination in the frequent instruction combination list in the behavior sequence data set to obtain a frequent instruction combination feature table taking the user name and the frequent behavior instruction combination as field names.

4. The method of claim 1, wherein calculating a sequence match score and an inter-sequence similarity score between each behavior sequence in the behavior sequence data set by using a sequence alignment algorithm to obtain a sequence similarity feature table comprises:

calculating a global sequence matching score array and a global inter-sequence similarity score array among all behavior sequences in the behavior sequence data set through a global sequence comparison algorithm;

respectively calculating the maximum value, the minimum value, the average value, the standard deviation and the variance of the global sequence matching score array and the global inter-sequence similarity score array to obtain a global sequence similarity feature table;

calculating a local sequence matching score array and a local sequence similarity score array between each behavior sequence in the behavior sequence data set through a local sequence comparison algorithm;

respectively calculating the maximum value, the minimum value, the average value, the standard deviation and the variance of the local sequence matching score array and the local sequence inter-similarity score array to obtain a local sequence similarity feature table;

and taking the user name of each user as an association field, and associating and combining the global sequence similarity feature table and the local sequence similarity feature table to obtain a sequence similarity feature table.

5. The method of claim 1, wherein performing frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table comprises:

merging and de-duplicating all the behavior instructions in the behavior sequence data set to obtain a behavior instruction list containing all the behavior instructions;

and counting the frequency of each behavior instruction in the behavior instruction list in the behavior sequence data set to obtain a behavior instruction frequency characteristic table taking the user name and the behavior instruction as field names.

6. The method according to claim 1, wherein the classifying and analyzing the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table by using a semi-supervised classification algorithm to obtain user groups with different categories comprises:

taking the user name of each user as an association field, and performing association combination on the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table to obtain a feature integrated data table;

and carrying out classification analysis on the feature integration data table through a semi-supervised support vector machine algorithm to obtain user groups with different categories.

7. The method of claim 6, wherein the user group comprises tagged users and untagged users, and the behavior data of the tagged users comprises a category tag; then, the classifying and analyzing the feature integration data table through a semi-supervised support vector machine algorithm to obtain user groups with different categories, including:

training a support vector machine model according to the characteristics of the labeled users in the characteristic integration data table and the classification labels of the labeled users to obtain an initial user classification model;

inputting the characteristics of the label-free user in the characteristic integration data table into the initial user classification model to obtain the classification label of the label-free user;

optimizing the initial user classification model according to the characteristics of the label-free user in the characteristic integration data table and the classification label of the label-free user to obtain a user classification model;

and inputting the characteristics of all users in the user group in the characteristic integration data table into the user classification model to obtain user groups with different categories.

8. An apparatus for classifying a user population, the apparatus comprising:

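the user data acquisition module is used for acquiring behavior data of a user group and preprocessing the behavior data of the user group to obtain a behavior sequence dataset which takes the user name of each user as a main object, wherein each user name corresponds to one behavior sequence, and each behavior sequence comprises at least one behavior instruction;

the frequent item feature extraction module is used for extracting frequent behavior instruction combinations in the behavior sequence data set and carrying out frequency statistics on them by using an association analysis algorithm to obtain a frequent instruction combination feature table;

the similarity feature extraction module is used for calculating sequence matching scores and inter-sequence similarity scores among the behavior sequences in the behavior sequence data set through a sequence alignment algorithm to obtain a sequence similarity feature table;

the instruction frequency feature extraction module is used for performing frequency statistics on the behavior instructions in the behavior sequence data set to obtain a behavior instruction frequency feature table;

and the user group classification module is used for carrying out classification analysis on the frequent instruction combination feature table, the sequence similarity feature table and the behavior instruction frequency feature table by adopting a semi-supervised classification algorithm to obtain user groups with different categories.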

9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.