CN113656584A

CN113656584A - User classification method and device, electronic equipment and storage medium

Info

Publication number: CN113656584A
Application number: CN202110949830.9A
Authority: CN
Inventors: 黄莉莉
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2021-08-18
Filing date: 2021-08-18
Publication date: 2021-11-16

Abstract

The application discloses a user classification method and device, an electronic device and a storage medium. The method comprises: obtaining a word-class feature of each word class according to user text data, wherein the word classes are obtained by clustering keywords in the user text data; generating a word-class feature vector of each user according to the obtained word-class features and the relationship between each user and the word classes, wherein the word-class feature vector is an N-dimensional vector and N is the total number of word classes; obtaining a probability distribution of each user over the preset topics according to the word-class feature vector of each user; and, for each user, generating a probability distribution of the user over the preset business scenes according to a preset mapping between word classes and business scenes, the probability distribution of the word classes under each preset topic, and the probability distribution of the user over the preset topics.

Description

User classification method and device, electronic equipment and storage medium

Technical Field

The application belongs to the technical field of computers, and particularly relates to a user classification method and device, electronic equipment and a storage medium.

Background

As the number of internet users grows, refined operation has become the mainstream operating mode. How to classify users at a fine granularity and match each group with a tailored operation strategy, so that suitable products are pushed to the right users, is a problem that every product or operations team urgently needs to solve.

In the prior art, existing portrait labels or single keywords are mainly used as input, and a k-means method is used for classifying users.

However, some scenes have a rich vocabulary while others have few words, and synonyms, near-synonyms and newly coined words are common, so using keywords directly as input makes the final output unstable or latently biased. For example, when keywords from a hot scene or popular keywords are used as input, the final output leans toward the information those keywords represent; the output is unintentionally skewed, which makes the classification result inaccurate.

Disclosure of Invention

The embodiment of the application aims to provide a user classification method, a user classification device, electronic equipment and a storage medium, and can solve the problem that a classification result is inaccurate in the prior art.

In a first aspect, an embodiment of the present application provides a user classification method, where the method includes:

obtaining a word-class feature of each word class according to user text data, wherein the word classes are obtained by clustering keywords in the user text data;

generating a word-class feature vector of each user according to the obtained word-class features and the relationship between each user and the word classes, wherein the word-class feature vector is an N-dimensional vector, and N is the total number of word classes;

obtaining a probability distribution of each user over the preset topics according to the word-class feature vector of each user;

and, for each user, generating a probability distribution of the user over the preset business scenes according to a preset mapping between word classes and business scenes, the probability distribution of the word classes under each preset topic, and the probability distribution of the user over the preset topics.

In a second aspect, an embodiment of the present application provides an apparatus for classifying users, where the apparatus includes:

a first obtaining module, used for obtaining a word-class feature of each word class according to user text data, wherein the word classes are obtained by clustering keywords in the user text data;

a first generation module, used for generating a word-class feature vector of each user according to the obtained word-class features and the relationship between each user and the word classes, wherein the word-class feature vector is an N-dimensional vector, and N is the total number of word classes;

a second obtaining module, used for obtaining a probability distribution of each user over the preset topics according to the word-class feature vector of each user;

and a second generation module, used for generating, for each user, a probability distribution of the user over the preset business scenes according to a preset mapping between word classes and business scenes, the probability distribution of the word classes under each preset topic, and the probability distribution of the user over the preset topics.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In the embodiment of the application, for a user group to be classified, keywords are extracted from the user text data generated by the group and clustered into a plurality of word classes, and a word-class feature is generated for each word class. A word-class feature vector is then generated for each user, and the probability distribution of each user over the topics is obtained from that vector. Finally, the probability distribution of each user in the group over the business scenes is generated according to the mapping among topics, word classes and business scenes and the probability distribution of each user over the topics.

Compared with the prior art, the embodiment of the application groups keywords into word classes and uses the word class instead of a single word as the input feature when classifying the user group. This reduces the head effect of hot scenes or popular words and improves input stability and the contribution of long-tail information dimensions, thereby improving the accuracy of user classification. In addition, with the word class as the feature input unit, a low-dimensional mapping among topics, word classes and business scenes is established and the user group is divided into categories, so that the final understanding of users can start either from the user perspective or from the business-scene perspective. Users can thus be described automatically in an efficient, stable way that converges toward the business direction, quickly helping product operations understand their users.

Drawings

FIG. 1 is a flowchart of a user classification method provided in an embodiment of the present application;

FIG. 2 is a diagram of a first example of the calculation process of step 104 provided by the embodiment of the present application;

FIG. 3 is a diagram of a second example of the calculation process of step 104 provided by the embodiments of the present application;

FIG. 4 is a third exemplary diagram of the calculation process of step 104 provided by the embodiment of the present application;

FIG. 5 is a flowchart of another user classification method provided in an embodiment of the present application;

FIG. 6 is an exemplary diagram of a word cloud provided by an embodiment of the present application;

FIG. 7 is a flowchart of another user classification method provided in an embodiment of the present application;

FIG. 8 is a flowchart of another user classification method provided in an embodiment of the present application;

FIG. 9 is a block diagram illustrating a structure of a user classifying device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;

FIG. 11 is a hardware structure diagram of an electronic device implementing various embodiments of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.

The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar objects and not necessarily to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that embodiments of the application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second" and the like are usually of one type, and their number is not limited; e.g., the first object can be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.

As the number of internet users grows, refined operation has become a very popular concept, and how to quickly classify users and formulate corresponding business strategies is a problem that every product or operations team often faces. In most scenes, one common approach is to classify users by rules defined from one's own experience and then build an understanding of the users; a second common approach is simply to take existing labels or single keywords as input and classify the users with the k-means method, forming new labels as that understanding.

Dividing groups by manual business experience is time-consuming, and the resulting understanding of users is limited by the rules that experience produces, so new groups cannot be discovered heuristically. Using portrait labels or keywords as direct input removes the limitation of human experience, but has the following problems:

1) the output groups are not stable, because k-means selects the initial cluster centers at random and measures group similarity by Euclidean distance, and it is easily affected by outliers;

2) keywords used as direct input introduce considerable instability or latent bias: some scenes have a rich vocabulary while others have few words, and synonyms, near-synonyms and new words are common, so the final output leans toward the information those keywords carry and may not be objective;

3) the output groups may not correspond to the existing system of the business, and a business-scene mapping step is missing.

In order to solve the technical problem, embodiments of the present application provide a user classification method, apparatus, electronic device, and storage medium.

The user classification method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.

For ease of understanding, some concepts involved in the embodiments of the present application will be described first.

LDA (Latent Dirichlet Allocation) is a probabilistic topic model for text; it outputs a probability distribution over topics.

PCA (principal component analysis) is a method that uses an orthogonal transformation to convert observations of a set of possibly correlated variables into values of a set of linearly uncorrelated variables, which are called the principal components.

Information contribution refers to information gain, i.e., the degree to which the complexity (or uncertainty) of information is reduced under a given condition.

TGI (Target Group Index) is an index reflecting how strong or weak a target group is within a specific research scope; it is a statistical index giving a relative quantitative description.

Entity: in a knowledge graph, an entity is the integration of an ontology, instances and relationships. The ontology is what inherently exists, such as a person, an event or an object; an instance is a specific individual object; and a relationship covers all attributes of the entity, such as the portrait attributes of a person.

Scene originally refers to the conditions under which a demand arises, including but not limited to time, space and device environment. In the embodiment of the present application, these factors are folded into the demand type, such as food delivery, early childhood education, live competition, etc.

Next, a user classification method provided in the embodiment of the present application is described.

Fig. 1 is a flowchart of a user classification method provided in an embodiment of the present application, and as shown in fig. 1, the method may include the following steps: step 101, step 102, step 103 and step 104, wherein,

in step 101, a word-class feature of each word class is obtained according to user text data, wherein the word classes are obtained by clustering keywords in the user text data.

In the embodiment of the application, for a user group to be classified, the user text data generated by the behavior of each user in the group is first acquired; keywords are then extracted from the user text data and clustered into a plurality of word classes, and the word-class feature of each word class is obtained.

In practical applications, the user behavior may include: an act of opening the APP, an act of searching in the APP, an act of shopping in the APP, and so forth.

In the embodiment of the application, keywords can be extracted from the user text data in two steps: 1) performing data cleaning and then word segmentation on the user text data to obtain a plurality of words; 2) filtering the words according to a preset word filtering rule to obtain a plurality of keywords, wherein the preset word filtering rule comprises removing words without entity meaning, removing cold words and hot words, and removing words with small weights.

In the embodiment of the application, cleaning the user text data corrects identifiable errors in it; the cleaned text is then segmented, splitting long sentences into short words so that effective keywords can be extracted later. In practical applications, data cleaning may include checking data consistency and handling invalid and missing values.

In the embodiment of the present application, for the words obtained by word segmentation, words without entity meaning, such as stop words and modal particles, may be removed first. Hot words and cold words are then removed; they can be identified from the percentile distribution of the number of users who use each word. For example, words mentioned by 80% or more of the users are "hot words", and long-tail words used by very few users are "cold words". Finally, words with small weights are removed, since such words cannot represent user behavior, leaving the effective keywords; word weights can be obtained through TF-IDF or TextRank.
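The three filtering rules can be sketched as follows. This is a minimal illustration, not the method as implemented in the application: the stop-word list, the thresholds (`hot_ratio`, `cold_min_users`, `min_weight`), the function name `filter_keywords`, and the simplified TF-IDF variant (each user's words treated as one document) are all assumptions made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "oh"}  # hypothetical stop-word list

def filter_keywords(user_words, hot_ratio=0.8, cold_min_users=2, min_weight=0.1):
    """user_words: dict mapping user id -> list of segmented words."""
    n_users = len(user_words)
    # Number of distinct users who used each word.
    user_count = Counter(w for words in user_words.values() for w in set(words))
    # Total occurrences of each word across all users.
    term_freq = Counter(w for words in user_words.values() for w in words)
    total_terms = sum(term_freq.values())

    keywords = {}
    for word, users in user_count.items():
        if word in STOP_WORDS:
            continue  # rule 1: words without entity meaning
        if users / n_users >= hot_ratio:
            continue  # rule 2a: hot words (used by most users)
        if users < cold_min_users:
            continue  # rule 2b: cold words (used by very few users)
        # Simplified TF-IDF weight, with each user treated as one document.
        weight = (term_freq[word] / total_terms) * math.log(n_users / users)
        if weight < min_weight:
            continue  # rule 3: words with small weights
        keywords[word] = weight
    return keywords

demo = {
    "u1": ["the", "phone", "loan", "stroller"],
    "u2": ["phone", "loan", "stroller", "stroller"],
    "u3": ["phone", "bank"],
    "u4": ["phone", "the", "loan"],
    "u5": ["phone", "stroller"],
}
kept = filter_keywords(demo, min_weight=0.12)
```

In this toy data, "phone" is dropped as a hot word, "bank" as a cold word, and "loan" for its small weight, so only "stroller" remains.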

In the embodiment of the present application, clustering the extracted keywords, that is, grouping the keywords into word classes, can be implemented in two steps: 1) inputting the extracted keywords into a pre-trained word2vec model to obtain the corresponding word vectors; 2) processing the word vectors of the extracted keywords with a clustering algorithm to obtain a plurality of word classes.

In the embodiment of the application, the word2vec model can be trained offline in advance on relevant training data; when a new word comes online, the trained model can be applied to it directly for inference.

In the embodiment of the application, a k-means clustering algorithm can be used to process the word vectors of the extracted keywords to obtain a plurality of word classes.
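As a rough illustration of the clustering step, the sketch below runs a plain k-means over toy two-dimensional vectors that stand in for word2vec outputs. The vectors, the keyword names and the deterministic initialization (the first k vectors serve as initial centroids) are assumptions for the example; a real pipeline would take the vectors from the trained word2vec model.

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means on a list of equal-length float lists.

    Centroids start at the first k vectors (a deterministic simplification).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical word vectors standing in for word2vec output.
word_vectors = {
    "loan": [0.9, 0.1],
    "stroller": [0.1, 0.9],
    "bank": [0.8, 0.2],
    "diaper": [0.2, 0.8],
}
words = list(word_vectors)
labels = kmeans([word_vectors[w] for w in words], k=2)

# Group keywords by cluster label to form word classes.
word_classes = {}
for w, lab in zip(words, labels):
    word_classes.setdefault(lab, []).append(w)
```

Here the finance-related and childcare-related keywords end up in separate word classes.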

In the embodiment of the application, the word-class feature of each word class can be obtained in three steps: 1) generating a comprehensive feature of each keyword under each word class according to the user behavior features corresponding to that keyword in the user text data; 2) generating a word weight of each keyword under each word class according to the user behaviors corresponding to that keyword in the user text data; 3) for each word class, performing a weighted summation of the comprehensive features of the keywords under the word class with their corresponding word weights to obtain the word-class feature of the word class.

In the embodiment of the present application, generating the comprehensive feature of each keyword under each word class according to the user behavior features corresponding to that keyword in the user text data may specifically include the following steps:

processing the user behavior features corresponding to each keyword under each word class in the user text data to obtain a weight for each user behavior feature, and performing a weighted summation of the user behavior features with their weights to obtain the comprehensive feature of each keyword under each word class, wherein the user behavior features comprise: number of users, number of days, number of times and duration.

In practical applications, the user behavior features corresponding to a keyword can be processed with the PCA algorithm to obtain the weight of each user behavior feature of the keyword.
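One possible way to derive feature weights with PCA is sketched below. It is a simplification and an assumption, not the procedure stated in the application: only the first principal component is used (found by power iteration on the correlation matrix), and the normalized absolute loadings serve as weights. The function names and the toy behavior data are hypothetical.

```python
import math

def pca_feature_weights(rows, iters=200):
    """rows: one list per keyword, e.g. [users, days, times, duration]."""
    n, d = len(rows), len(rows[0])
    # Standardize each column to zero mean and unit variance.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Correlation matrix of the standardized features.
    corr = [[sum(zr[a] * zr[b] for zr in z) / n for b in range(d)] for a in range(d)]
    # Power iteration for the leading eigenvector (first principal component).
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Normalize the absolute loadings so that the weights sum to 1.
    total = sum(abs(x) for x in v)
    return [abs(x) / total for x in v]

def comprehensive_feature(row, weights):
    """Weighted sum of one keyword's behavior features."""
    return sum(x * w for x, w in zip(row, weights))

# Hypothetical behavior features per keyword: users, days, times, duration.
rows = [
    [120, 10, 300, 45.0],
    [80, 7, 150, 30.0],
    [20, 2, 40, 5.0],
    [60, 5, 110, 20.0],
]
weights = pca_feature_weights(rows)
features = [comprehensive_feature(r, weights) for r in rows]
```

In practice the features would probably be brought to a comparable scale before the weighted sum; the raw values are used here only to keep the example short.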

In the embodiment of the present application, generating the word weight of each keyword under each word class according to the user behaviors corresponding to that keyword in the user text data may specifically include the following steps:

processing the user behaviors corresponding to each keyword under each word class in the user text data together with a plurality of preset portrait labels to obtain the behavior preference, with respect to the portrait labels, of each keyword under each word class, and determining the behavior preference corresponding to each keyword as the word weight of that keyword.

In practical applications, the user behaviors corresponding to a keyword and the preset portrait labels can be processed through an attention mechanism to obtain the behavior preference of the keyword with respect to the portrait labels.

In one example, take word class 1, which contains keyword 1, keyword 2 and keyword 3. The word-class feature of word class 1 is then

$$\text{feature}(\text{word class } 1) = \sum_{i=1}^{3} \text{comprehensive feature}(\text{keyword } i) \times \text{word weight}(\text{keyword } i)$$

The specific process is as follows:

1) Generating the comprehensive features of keyword 1, keyword 2 and keyword 3

Words are generated by user behavior, and the behavior features include the number of users, days, times, duration and so on. Taking keyword 1 as an example, its four types of features are input into principal component analysis (PCA); the weight of each of the four features is calculated from the contribution of the principal components and the linear combination relating the principal component factors to the original features; the four features are then fused (that is, the features are weighted and summed with their weights), and the result is assigned to keyword 1 as its comprehensive feature. The comprehensive features of keyword 2 and keyword 3 are generated in the same way.

2) Generating the word weights of keyword 1, keyword 2 and keyword 3

A word class contains several words, which correspond to several user behaviors. Taking keyword 1 as an example, an attention mechanism is applied to the user behavior corresponding to keyword 1 together with the portrait labels to obtain a behavior preference, which is used as the word weight of keyword 1. The word weights of keyword 2 and keyword 3 are generated in the same way.

3) Generating the word-class feature of word class 1

$$\text{feature}(\text{word class } 1) = \text{comprehensive feature}(\text{keyword } 1) \times \text{word weight}(\text{keyword } 1) + \text{comprehensive feature}(\text{keyword } 2) \times \text{word weight}(\text{keyword } 2) + \text{comprehensive feature}(\text{keyword } 3) \times \text{word weight}(\text{keyword } 3)$$
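The weighted summation in this example can be written in a few lines. The numeric values below are invented for illustration, and the function name `word_class_feature` is hypothetical; the comprehensive features and word weights would come from the PCA and attention steps described above.

```python
def word_class_feature(comprehensive, word_weights):
    """Sum of comprehensive feature x word weight over a word class's keywords."""
    return sum(comprehensive[k] * word_weights[k] for k in comprehensive)

# Hypothetical values for the three keywords of word class 1.
comprehensive = {"keyword1": 2.0, "keyword2": 1.0, "keyword3": 4.0}
word_weights = {"keyword1": 0.5, "keyword2": 0.3, "keyword3": 0.2}

# 2.0 * 0.5 + 1.0 * 0.3 + 4.0 * 0.2
feature_1 = word_class_feature(comprehensive, word_weights)
```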

Therefore, in the embodiment of the application, using the word class instead of a single keyword as the input feature reduces the head effect of hot scenes or popular words and improves input stability and the contribution of long-tail information dimensions. In addition, the weight of a keyword depends not only on frequency: duration and the number of active days are also considered, PCA is used to extract the weights, and the final weighted combination gives the basic feature, which carries more information than pure frequency input.

In step 102, a word-class feature vector of each user is generated according to the obtained word-class features and the relationship between each user and the word classes, wherein the word-class feature vector is an N-dimensional vector, and N is the total number of word classes.

In one example, take a single user A, from whose text data several keywords belonging to word class 1, word class 2 and word class 3 are extracted. Assume that 100 word classes are generated in step 101: word class 1, word class 2, …, word class 100. The word-class feature vector of user A is then a 100-dimensional vector, specifically (feature of word class 1, feature of word class 2, feature of word class 3, 0, …, 0) or (feature of word class 1, feature of word class 2, feature of word class 3, default, …, default).

In one example, the user group to be classified includes 200 users; the word-class feature vectors of the 200 users, that is, 200 word-class feature vectors, are then generated through step 102.
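A minimal sketch of building the N-dimensional vector for user A follows. The feature values and the helper name `user_vector` are assumptions for illustration; word classes are numbered 1 to N, and classes the user has no keywords in receive a default value (0 here).

```python
def user_vector(user_classes, class_features, n_classes, default=0.0):
    """Build the N-dimensional word-class feature vector of one user.

    user_classes: set of word-class numbers (1..n_classes) found in the user's text.
    class_features: dict mapping word-class number -> word-class feature.
    """
    return [
        class_features[c] if c in user_classes else default
        for c in range(1, n_classes + 1)
    ]

# Hypothetical word-class features; only classes 1 to 3 matter for user A.
class_features = {c: 1.0 for c in range(1, 101)}
class_features.update({1: 2.1, 2: 0.7, 3: 1.4})

vec_a = user_vector({1, 2, 3}, class_features, n_classes=100)
```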

In step 103, a probability distribution of each user over the preset topics is obtained according to the word-class feature vector of each user.

In the embodiment of the application, the word-class feature vector of each user can be input into a preset probabilistic topic model to obtain the probability distribution of each user over the topics.

In the embodiment of the present application, the probabilistic topic model may be an LDA topic model; the input of the model is the word-class feature vector of a single user, and the output is the probability distribution of that user over the topics.

In the embodiment of the application, the LDA topic model may be trained in advance on relevant training data.

In one example, the output of the LDA topic model corresponds to six topics (topic 1 through topic 6), and the user group to be classified includes 200 users; the probability distributions of the 200 users over the six topics are then obtained through step 103. For one user, for example, the probability distribution over the six topics is [0.1, 0.2, 0.05, 0.5, 0.1, 0.05].

In step 104, for each user, a probability distribution of the user over the preset business scenes is generated according to a preset mapping between word classes and business scenes, the probability distribution of the word classes under each preset topic, and the probability distribution of the user over the preset topics.

In the embodiment of the application, the probability that a single user belongs to each business scene can be calculated and output according to the mapping topic <-> word class <-> business scene, giving an understanding of the user's core business-scene demands.

In practical applications, the business scenes may include: maternity and infant care, early childhood education, shopping, travel, navigation, job hunting, and the like.

In the embodiment of the present application, the probability distribution of the word classes under each preset topic may be the conjugate matrix of word classes and topics given by the preset probabilistic topic model in step 103.

In an embodiment provided by the present application, the step 104 may specifically include the following steps (not shown in the figure): step 1041, step 1042, and step 1043, wherein,

in step 1041, for each user, the probability distribution of the word classes under each preset topic is multiplied by the probability distribution of the user over the preset topics to obtain the probability distribution of the user over the word classes under each preset topic.

In step 1042, the probability distribution of the user over the preset topics under each preset business scene is generated according to the mapping between the preset word classes and the preset business scenes and the probability distribution of the user over the word classes under each preset topic.

In step 1043, the probability distribution of the user over the preset topics under each preset business scene is multiplied by the probability distribution of the user over the preset topics to generate the probability distribution of the user over the preset business scenes.

In one example, the detailed process may be as follows:

1) pre-establishing mapping relation between word classes and service scenes

The mapping can be determined by pre-training and early-stage configuration. Business personnel first specify the definition and description of each business scene. Then, based on these scene descriptions, the word class most similar to each business scene is suggested automatically, according to graph relation embeddings and the average similarity between the descriptive words and the word set of each word class, forming a suggested business scene <-> word class relation. Business personnel can then check and correct the suggestions manually if necessary, which improves the efficiency of establishing the mapping. For example: word class 1 (lottery; financial management test; bank; loan) <-> financial management scene.

2) Computing business scene probabilities

FIG. 2 shows the overall computing pipeline, which consists mainly of three components. On the left of FIG. 2 is the pre-constructed mapping between word classes and business scenes, used as a dimension table. In the middle of FIG. 2 is the conjugate matrix of word classes and topics output by LDA in step 103, used as a result dimension table: the vertical direction represents word classes, with the row number being the word-class number; the horizontal direction represents topics, with the column number being the topic number; and each value comes from the multinomial distribution of a topic over the word classes it generates. On the right of FIG. 2 is the probability distribution, output by LDA, of a single user over the topics.

First, the probability distribution of a single user over the word classes under each topic is calculated from the probability distribution of the user over the topics and the conjugate matrix of word classes and topics given by the probabilistic topic model.

For example, if the probability of a single user A for topic 1 is 0.1 and the multinomial distribution of topic 1 over the word classes is [0.3, 0.1, 0.25, 0.15, 0.2], then the probability distribution of user A over the word classes under topic 1 is 0.1 × [0.3, 0.1, 0.25, 0.15, 0.2] = [0.03, 0.01, 0.025, 0.015, 0.02]. The other topics are handled in the same way, finally giving the distribution matrix of user A over topics and generated word classes, as shown in FIG. 3.

Then, according to the mapping between word classes and business scenes and the result shown in FIG. 3, the values are merged and aggregated into the probability distribution of the single user over the business scenes. As shown in FIG. 4, the word classes are first mapped to their business-scene numbers according to the mapping (left part of FIG. 4); the word classes are then grouped by business scene and summed column-wise, giving the probability distribution of the user over the topics under each business scene (middle part of FIG. 4); finally, the values in each row are weighted and summed according to the user's topic probability vector (converted into a column vector), giving the probability distribution of the user over the business scenes (right part of FIG. 4).
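The three-stage computation of step 104 can be sketched as below. The function name, the second topic's distribution, and the scene labels are assumptions for illustration; the first topic reuses the numbers from the example of user A. The matrix is stored here with one row per topic (the transpose of the layout described for FIG. 2), which does not change the arithmetic. Following the steps as described, the user's topic probabilities are applied twice, so the resulting values are not guaranteed to sum to 1.

```python
def scene_distribution(theta, phi, class_to_scene):
    """Compute a user's score for each business scene.

    theta[t]: the user's probability for topic t.
    phi[t][c]: probability of word class c under topic t.
    class_to_scene[c]: business scene mapped to word class c.
    """
    scenes = sorted(set(class_to_scene))
    n_topics = len(theta)
    # Step 1041: user-specific probability of each word class under each topic.
    user_class = [[theta[t] * p for p in phi[t]] for t in range(n_topics)]
    # Step 1042: group word classes by scene and sum, separately for each topic.
    topic_scene = [
        {s: sum(v for c, v in enumerate(row) if class_to_scene[c] == s) for s in scenes}
        for row in user_class
    ]
    # Step 1043: weight each topic's scene totals by the user's topic probability.
    return {
        s: sum(theta[t] * topic_scene[t][s] for t in range(n_topics))
        for s in scenes
    }

theta = [0.1, 0.9]                   # user A's topic probabilities (topic 2 is hypothetical)
phi = [
    [0.3, 0.1, 0.25, 0.15, 0.2],     # topic 1, as in the example above
    [0.1, 0.4, 0.1, 0.3, 0.1],       # topic 2, hypothetical
]
class_to_scene = ["finance", "finance", "parenting", "parenting", "parenting"]

scores = scene_distribution(theta, phi, class_to_scene)
```

With these numbers the finance scene scores 0.1 × 0.04 + 0.9 × 0.45 and the parenting scene scores 0.1 × 0.06 + 0.9 × 0.45.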

As can be seen from the above embodiment, for a user group to be classified, keywords are extracted from the user text data generated by the group and clustered into a plurality of word classes, and a word-class feature is generated for each word class. A word-class feature vector is then generated for each user, and the probability distribution of each user over the topics is obtained from that vector. Finally, the probability distribution of each user in the group over the business scenes is generated according to the mapping among topics, word classes and business scenes and the probability distribution of each user over the topics. Compared with the prior art, the embodiment of the application groups keywords into word classes and uses the word class instead of a single word as the input feature when classifying the user group. This reduces the head effect of hot scenes or popular words and improves input stability and the contribution of long-tail information dimensions, thereby improving the accuracy of user classification. In addition, with the word class as the feature input unit, a low-dimensional mapping among topics, word classes and business scenes is established and the user group is divided into categories, so that the final understanding of users can start either from the user perspective or from the business-scene perspective. Users can thus be described automatically in an efficient, stable way that converges toward the business direction, quickly helping product operations understand their users.

In another embodiment of the present application, document suggestions may be output based on the top keywords that the crowd focuses on and on the business scenario requirements, so as to help operators move from understanding in the user dimension to implementation in the business dimension. As shown in fig. 5, the user classification method provided by the embodiment of the present application may further include the following steps after step 104 of the embodiment shown in fig. 1: step 501, step 502, step 503 and step 504, wherein,

in step 501, for each part of speech under each topic, multiplying the word weight of each keyword under the part of speech by the probability distribution of the corresponding part of speech under the topic to obtain the probability distribution of each keyword under the part of speech under the topic.

In step 502, the user coverage and click rate of each keyword under the topic are obtained.

In step 503, for each keyword under the topic, multiplying the probability of the keyword under the topic, the user coverage rate and the click rate to obtain the recommendation coefficient of the keyword under the topic.

In step 504, the business document suggestions containing the corresponding keywords are output in descending order according to the recommendation coefficients of the keywords.
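Steps 501 to 504 can be illustrated with a short Python sketch. The function name `rank_keywords`, the dictionary field names and the sample values are hypothetical; how word weights, user coverage and click rate are measured is not specified here and is simply taken as given input.

```python
def rank_keywords(keywords, class_prob_under_topic):
    """Rank keywords of one topic by recommendation coefficient (illustrative only).

    keywords: list of dicts with hypothetical fields
              text, word_class, word_weight, coverage, click_rate.
    class_prob_under_topic: {word_class: probability of that class under the topic}.
    """
    scored = []
    for kw in keywords:
        # Step 501: keyword probability = word weight x class probability under the topic.
        kw_prob = kw["word_weight"] * class_prob_under_topic[kw["word_class"]]
        # Step 503: recommendation coefficient = probability x coverage x click rate.
        coeff = kw_prob * kw["coverage"] * kw["click_rate"]
        scored.append((kw["text"], coeff))
    # Step 504: output in descending order of the recommendation coefficient.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data for a single topic (all values hypothetical).
class_prob = {"baby_care": 0.6, "early_education": 0.4}
keywords = [
    {"text": "stroller", "word_class": "baby_care",
     "word_weight": 0.5, "coverage": 0.2, "click_rate": 0.1},
    {"text": "picture book", "word_class": "early_education",
     "word_weight": 0.9, "coverage": 0.5, "click_rate": 0.2},
    {"text": "diapers", "word_class": "baby_care",
     "word_weight": 0.5, "coverage": 0.4, "click_rate": 0.25},
]

ranking = rank_keywords(keywords, class_prob)
top_keywords = [text for text, _ in ranking]
```

The ordered keyword list would then drive which document suggestions are shown first.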

In one example, the user classification results are presented as follows:

1) core scenes: maternity and infant care, early childhood education;

2) automatic crowd description (high TGI, high proportion):

woman (66.14%, 132)

24-27 years old (67.52%, 1157)

Married (79.81%, 249), with children (64.03%, 127),

cell phones with preference for 700-;

3) the top keywords of the top scenes are shown as a word cloud, as shown in fig. 6.

Therefore, in the embodiment of the application, document suggestions can be output based on the top keywords that the crowd focuses on and on the service scene requirements, which helps operators move from understanding in the user dimension to implementation in the service dimension.

In another embodiment of the present application, after crowd clustering, a description of the key differences between crowds may be output automatically according to the importance (information contribution and TGI) of the keywords and the portrait labels, so as to improve the efficiency with which product operators understand users. As shown in fig. 7, the user classification method provided by the embodiment of the present application may further include the following steps after step 104 in the embodiment shown in fig. 1: step 701, step 702 and step 703, wherein,

in step 701, a predefined portrait label set is obtained, wherein the portrait label set comprises a plurality of portrait labels.

In step 702, the correlation between each service scene and each portrait label in the portrait label set is calculated according to the probability distribution of the user about each preset service scene.

In the embodiment of the present application, when calculating the correlation, the pearson correlation coefficient is used if the label is a numerical type (for example, age), and the rank correlation coefficient is used if the label is a discrete type (for example, gender).

In step 703, the portrait labels with high correlation are used as the crowd description information of the user.

In the embodiment of the present application, high correlation means that the correlation is greater than a preset threshold; that is, a portrait label with high correlation is a portrait label whose correlation is greater than the preset threshold.
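Steps 702 and 703 can be illustrated with a minimal Python sketch. The text only says "rank correlation coefficient" for discrete labels; the sketch assumes Spearman's coefficient (Pearson applied to average ranks), and assumes discrete labels are encoded as numbers. The function names, label names, sample values and the threshold of 0.5 are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Assumed choice of rank correlation: Pearson on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def select_labels(scene_probs, labels, threshold):
    """Keep the labels whose correlation with the scene exceeds the threshold."""
    selected = []
    for name, (kind, values) in labels.items():
        corr = pearson(scene_probs, values) if kind == "numeric" else spearman(scene_probs, values)
        if corr > threshold:
            selected.append(name)
    return selected


# Toy data: one scene's probability for four users (all values hypothetical).
scene_probs = [0.9, 0.8, 0.2, 0.1]
labels = {
    "age": ("numeric", [26, 25, 40, 45]),
    "has_children": ("discrete", [1, 1, 0, 0]),
}
crowd_description = select_labels(scene_probs, labels, threshold=0.5)
```

In this toy data only the discrete label passes the threshold, since age is negatively correlated with the scene probability.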

Therefore, in the embodiment of the application, after crowd clustering, a description of the key differences between crowds can be output automatically according to the importance (information contribution and TGI) of the keywords and the portrait labels, which improves the efficiency with which product operators understand users.

In another embodiment of the present application, when users who may care about a newly launched product interest need to be found, a scene association search may be performed based on knowledge graph entities, and the expanded words are then located in the corresponding parts of speech, so as to find seed users for the new interest. As shown in fig. 8, the user classification method provided by the embodiment of the present application may further include the following steps after step 104 of the embodiment shown in fig. 1: step 801, step 802, step 803, and step 804, wherein,

in step 801, for a newly launched interest, the related keywords of the new interest are obtained, and the related keywords are classified to obtain the corresponding related parts of speech.

In the embodiment of the present application, the new interest may be a supplementary service given to the user, such as a mobile phone case, a screen protector, or another value-added product.

In the embodiment of the application, when the related keywords of the new interest are obtained, keywords entered by business personnel for the new interest of the business product may be used directly as the related keywords, or the related keywords may be obtained through word expansion, for example, by retrieving related words from a knowledge graph based on scene association and entity relationships.

In step 802, a part-of-speech feature vector of the new interest is generated according to the generated part-of-speech feature and the related part-of-speech, and a probability distribution of the new interest with respect to each topic is obtained according to the part-of-speech feature vector of the new interest.

In the embodiment of the application, the part-of-speech feature vector of the new interest can be input into the probability topic model for processing, so that the probability distribution of the new interest on each topic is obtained.

In step 803, a head topic to which the new interest belongs is determined according to the probability distribution of the new interest on each topic, wherein the probability value of the head topic is the maximum.

In step 804, according to the probability distribution of the users about each topic, a user group which is the same as the head topic of the new interest among the existing users is determined, and the user group is determined as the seed user of the new interest.
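Steps 803 and 804 can be illustrated with a short Python sketch. The function names `head_topic` and `seed_users`, the user identifiers and the probabilities are hypothetical; the topic distributions are assumed to have already been produced by the probabilistic topic model.

```python
def head_topic(topic_probs):
    """Index of the topic with the largest probability (step 803)."""
    return max(range(len(topic_probs)), key=lambda t: topic_probs[t])


def seed_users(interest_topic_probs, user_topic_probs):
    """Existing users whose head topic equals the new interest's head topic (step 804)."""
    target = head_topic(interest_topic_probs)
    return [
        user for user, probs in user_topic_probs.items()
        if head_topic(probs) == target
    ]


# Toy data: 3 topics, 3 existing users (all values hypothetical).
new_interest = [0.1, 0.7, 0.2]
users = {
    "u1": [0.2, 0.5, 0.3],
    "u2": [0.6, 0.3, 0.1],
    "u3": [0.1, 0.8, 0.1],
}
seeds = seed_users(new_interest, users)
```

Matching only on the single most probable topic is the simplest reading of the text; ties between topics are resolved here by taking the first maximum, which is an assumption of the sketch.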

Therefore, in the embodiment of the application, a function for selecting the target group of a new interest is added. When a service launches a new interest for which no corresponding portrait label exists, this function can be selected so that, on the premise of perceiving the new user, the new service is helped to find initially matching seed users and to initialize the push and reach service for the new interest.

It should be noted that the user classification method provided in the embodiment of the present application may be executed by a user classification device, or by a control module in the user classification device that is used for executing the user classification method. In the embodiment of the present application, the user classification device is described by taking the case in which the user classification device executes the user classification method as an example.

Fig. 9 is a block diagram illustrating a structure of a user classifying device according to an embodiment of the present application, and as shown in fig. 9, the user classifying device 900 may include: a first obtaining module 901, a first generating module 902, a second obtaining module 903, and a second generating module 904, wherein,

a first obtaining module 901, configured to obtain a part-of-speech feature of each part-of-speech according to user text data, where the part-of-speech is obtained based on keyword clustering in the user text data;

a first generating module 902, configured to generate a part-of-speech feature vector of each user according to the generated part-of-speech features and a relationship between each user and a part-of-speech, where the part-of-speech feature vector is an N-dimensional vector, and N is a total number of parts-of-speech;

a second obtaining module 903, configured to obtain, according to the part-of-speech feature vector of each user, probability distribution of each user with respect to each preset topic;

a second generating module 904, configured to generate, for each user, a probability distribution of the user about each preset service scene according to a mapping relationship between a preset part of speech and each preset service scene, a probability distribution of the part of speech under each preset topic, and a probability distribution of the user about each preset topic.

Optionally, as an embodiment, the second generating module 904 may include:

the first generation submodule is used for multiplying the probability distribution of the part of speech under each preset theme with the probability distribution of the user about each preset theme to obtain the probability distribution of the user about each part of speech under each preset theme;

the second generation submodule is used for generating the probability distribution of the user about each preset theme under each preset business scene according to the mapping relation between the preset part of speech and each preset business scene and the probability distribution of the user about each part of speech under each preset theme;

and the third generation submodule is used for multiplying the probability distribution of the user about the preset subjects under the preset service scenes with the probability distribution of the user about the preset subjects to generate the probability distribution of the user about the preset service scenes.

Optionally, as an embodiment, the first obtaining module 901 may include:

a fourth generation sub-module, configured to process user behavior features corresponding to the keywords in each part of speech in the user text data, to obtain weights of the user behavior features corresponding to the keywords in each part of speech, and perform weighted summation on the user behavior features corresponding to the keywords in each part of speech and the corresponding weights, to obtain comprehensive features of the keywords in each part of speech, where the user behavior features include: the number of users, days, times and duration;

a fifth generation submodule, configured to process user behaviors corresponding to the keywords in each part of speech in the user text data and a plurality of preset portrait tags, to obtain behavior preferences of the portrait tags corresponding to the keywords in each part of speech, and determine the behavior preferences corresponding to the keywords as word weights corresponding to the keywords;

and the sixth generation submodule is used for carrying out weighted summation on the comprehensive characteristics of the keywords under the part of speech and the corresponding word weight for each part of speech to obtain the part of speech characteristics of the part of speech.
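The two weighted sums performed by the fourth and sixth generation submodules can be illustrated with a minimal Python sketch. The way the behavior-feature weights and the word weights are derived is not restated here, so both are taken as given inputs; the function names, field names and values are hypothetical, and the behavior features are assumed to be normalized to the range 0 to 1.

```python
def comprehensive_feature(behavior, feature_weights):
    """Weighted sum of one keyword's behavior features (users, days, times, duration)."""
    return sum(feature_weights[name] * value for name, value in behavior.items())


def word_class_feature(keywords, feature_weights):
    """Weighted sum of the keywords' comprehensive features using their word weights."""
    return sum(
        kw["word_weight"] * comprehensive_feature(kw["behavior"], feature_weights)
        for kw in keywords
    )


# Hypothetical weights for the four behavior features named in the text.
feature_weights = {"users": 0.4, "days": 0.3, "times": 0.2, "duration": 0.1}

# Two keywords of one word class (all values hypothetical, features pre-normalized).
keywords_in_class = [
    {"word_weight": 0.6,
     "behavior": {"users": 0.5, "days": 0.5, "times": 0.5, "duration": 0.5}},
    {"word_weight": 0.5,
     "behavior": {"users": 1.0, "days": 0.0, "times": 0.0, "duration": 0.0}},
]

class_feature = word_class_feature(keywords_in_class, feature_weights)
```

Computing one such value per word class gives the entries from which each user's N-dimensional feature vector is built.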

Optionally, as an embodiment, the user classifying device 900 may further include:

a third generation module, configured to, for each part of speech under each topic, multiply a word weight of each keyword under the part of speech by a probability distribution of a corresponding part of speech under the topic, so as to obtain a probability distribution of each keyword under the part of speech under the topic;

the third acquisition module is used for acquiring the user coverage rate and click rate of each keyword under the theme;

the fourth generation module is used for multiplying the probability of each keyword under the theme, the user coverage rate and the click rate to obtain the recommendation coefficient of the keyword under the theme;

and the fifth generation module is used for outputting the business document suggestions containing the corresponding keywords in descending order of the recommendation coefficients of the keywords.

Optionally, as an embodiment, the user classifying device 900 may further include:

the fourth acquisition module is used for acquiring a preset portrait label set, wherein the portrait label set comprises a plurality of portrait labels;

the calculation module is used for calculating the correlation between each service scene and each portrait label in the portrait label set according to the probability distribution of the user about each preset service scene;

and the determining module is used for taking the portrait labels with high correlation as the crowd description information of the user.

The user classification device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The user classification device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited in this respect.

The user classification device provided in the embodiment of the present application can implement each process implemented in the embodiment of the method in fig. 1, and is not described here again to avoid repetition.

Optionally, as shown in fig. 10, an electronic device 1000 is further provided in this embodiment of the present application, and includes a processor 1001, a memory 1002, and a program or an instruction stored in the memory 1002 and executable on the processor 1001, where the program or the instruction is executed by the processor 1001 to implement each process of the user classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.

Fig. 11 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application. The electronic device 1100 includes, but is not limited to: a radio frequency unit 1101, a network module 1102, an audio output unit 1103, an input unit 1104, a sensor 1105, a display unit 1106, a user input unit 1107, an interface unit 1108, a memory 1109, a processor 1110, and the like.

Those skilled in the art will appreciate that the electronic device 1100 may further include a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 1110 via a power management system, so that charging, discharging, and power consumption are managed via the power management system. The electronic device structure shown in fig. 11 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than those shown, combine some components, or arrange the components differently; details are not described here.

The processor 1110 is configured to obtain a part-of-speech feature of each part of speech according to user text data, where the part of speech is obtained based on keyword clustering in the user text data; generate a part-of-speech feature vector of each user according to the generated part-of-speech features and the relationship between each user and the part of speech, wherein the part-of-speech feature vector is an N-dimensional vector, and N is the total number of the parts of speech; acquire the probability distribution of each user about each preset topic according to the part-of-speech feature vector of each user; and, for each user, generate the probability distribution of the user about each preset service scene according to the mapping relation between the preset part of speech and each preset service scene, the probability distribution of the part of speech under each preset topic and the probability distribution of the user about each preset topic.

Therefore, in the embodiment of the application, the keywords are clustered into parts of speech, and the part of speech, rather than a single word, is used as the input feature for classifying the user group. This reduces the head effect of hot scenes or popular words and improves input stability and the contribution of long-tail information dimensions, thereby improving the accuracy of user classification. In addition, by taking the part of speech as the feature input unit, a low-dimensional mapping among topics, parts of speech and service scenes is established and the user group is divided into categories, so that users can be understood starting from either the user perspective or the service scene perspective. Users can thus be described automatically in an efficient and stable manner that converges toward the business direction, which helps product operators understand users quickly.

Optionally, as an embodiment, the processor 1110 is further configured to, for each user, multiply the probability distribution of the part of speech under each preset topic by the probability distribution of the user about each preset topic to obtain the probability distribution of the user about each part of speech under each preset topic;

generating probability distribution of each preset theme of the user under each preset service scene according to the mapping relation between the preset part of speech and each preset service scene and the probability distribution of each part of speech of the user under each preset theme;

and multiplying the probability distribution of the preset subjects of the user under the preset service scenes with the probability distribution of the user under the preset subjects to generate the probability distribution of the user under the preset service scenes.

Optionally, as an embodiment, the processor 1110 is further configured to process user behavior features corresponding to each keyword under each part of speech in the user text data to obtain weights of the user behavior features corresponding to each keyword under each part of speech, and perform weighted summation on the user behavior features corresponding to each keyword under each part of speech and the corresponding weights to obtain comprehensive features of each keyword under each part of speech, where the user behavior features include: the number of users, days, times and duration;

processing user behaviors corresponding to each keyword under each part of speech in user text data and a plurality of preset portrait labels to obtain behavior preferences of the portrait labels corresponding to each keyword under each part of speech, and determining the behavior preferences corresponding to each keyword as word weights of the corresponding keywords;

and for each part of speech, carrying out weighted summation on the comprehensive characteristics of each keyword under the part of speech and the corresponding word weight to obtain the part of speech characteristics of the part of speech.

Optionally, as an embodiment, the processor 1110 is further configured to, for each part of speech under each topic, multiply a word weight of each keyword under the part of speech by a probability distribution of a corresponding part of speech under the topic, so as to obtain a probability distribution of each keyword under the part of speech under the topic;

acquiring user coverage rate and click rate of each keyword under the theme;

for each keyword under the theme, multiplying the probability of the keyword under the theme, the user coverage rate and the click rate to obtain a recommendation coefficient of the keyword under the theme;

and outputting the business document suggestions containing the corresponding keywords in descending order of the recommendation coefficients of the keywords.

Optionally, as an embodiment, the processor 1110 is further configured to obtain a preset portrait label set, where the portrait label set includes a plurality of portrait labels;

calculating the correlation between each service scene and each portrait label in the portrait label set according to the probability distribution of the user about each preset service scene;

and using the portrait label with high relevance as the crowd description information of the user.

It should be understood that in the embodiment of the present application, the input unit 1104 may include a Graphics Processing Unit (GPU) 11041 and a microphone 11042, and the graphics processor 11041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1106 may include a display panel 11061, and the display panel 11061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1107 includes a touch panel 11071 and other input devices 11072. The touch panel 11071 is also called a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 11072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 1109 may be used for storing software programs and various data, including, but not limited to, application programs and an operating system. The processor 1110 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1110.

The electronic device 1100 is capable of implementing the processes implemented by the electronic device in the foregoing embodiments, and details are not repeated here to avoid repetition.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the user classification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the user classification method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here. It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for classifying a user, the method comprising:

2. The method according to claim 1, wherein for each user, generating the probability distribution of the user about the preset service scenes according to the mapping relationship between the preset parts of speech and the preset service scenes, the probability distribution of the parts of speech under the preset topics, and the probability distribution of the user about the preset topics comprises:

for each user, multiplying the probability distribution of the part of speech under each preset theme with the probability distribution of the user about each preset theme to obtain the probability distribution of the user about each part of speech under each preset theme;

3. The method according to claim 1 or 2, wherein the obtaining of the part-of-speech feature of each part-of-speech based on the user text data comprises:

processing user behavior characteristics corresponding to each keyword under each part of speech in user text data to obtain weights of the user behavior characteristics corresponding to each keyword under each part of speech, and performing weighted summation on the user behavior characteristics corresponding to each keyword under each part of speech and the corresponding weights to obtain comprehensive characteristics of each keyword under each part of speech, wherein the user behavior characteristics comprise: the number of users, days, times and duration;

4. The method of claim 3, wherein after generating the probability distribution of the user with respect to the preset business scenarios, further comprising:

for each part of speech under each theme, multiplying the word weight of each keyword under the part of speech by the probability distribution of the corresponding part of speech under the theme to obtain the probability distribution of each keyword under the part of speech under the theme;

acquiring user coverage rate and click rate of each keyword under the theme;

5. The method of claim 1, wherein after generating the probability distribution of the user with respect to the preset business scenarios, the method further comprises:

acquiring a preset portrait label set, wherein the portrait label set comprises a plurality of portrait labels;

6. An apparatus for classifying a user, the apparatus comprising:

7. The apparatus of claim 6, wherein the second generating module comprises:

8. The apparatus of claim 6 or 7, wherein the first obtaining module comprises:

9. The apparatus of claim 8, further comprising:

10. The apparatus of claim 6, further comprising:

11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the user classification method according to any one of claims 1 to 5.

12. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the user classification method according to any one of claims 1 to 5.