Detailed Description
For a better understanding of the technical solutions, the technical solutions of the embodiments of the present specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of the present specification and the specific features in those embodiments are detailed descriptions of the technical solutions of the present specification, not limitations of them, and that the embodiments and the technical features in them may be combined with each other provided there is no conflict.
For a better understanding of the embodiments of the present specification, the relevant terms are explained as follows.
User tag: an abstract classification and summary of a certain characteristic of a particular population or class of objects, such as "gender", "occupation" or "age".
User portrait: a combination of a plurality of tags; a specific instance is formed by a plurality of tag values. For example, a particular user may be an instance of a user portrait with the tag values "male", "20 years old", "college" and "basketball".
User annotation data: the remark information that users participating in a social network attach to other users.
With the rapid development of the internet, social networks have grown ever larger and huge amounts of data have accumulated in them. A major function of a social network is making friends. To remember another party or to describe some of that party's characteristics, people participating in the social network often add remarks to the other party, and this remark information is the so-called user annotation data. For example, a user may add a note such as "teacher" or "head teacher" to the remark information of his or her own teacher in order to remember that person. The inventor finds that the information contained in user annotation data is extremely broad and is of great help in identifying user identity. Therefore, the embodiments of the present specification provide a user tag mining method that diffuses the population having a certain tag on the basis of user annotation data, and can then evaluate the accuracy of the tag.
Referring to fig. 1, a schematic diagram of the implementation principle of the user tag mining method provided in the embodiments of the present specification is shown. Fig. 1 shows user annotation data 10 and a user tag mining device 20. The device 20 matches seed keywords corresponding to a specific tag (hereinafter referred to as the "target tag") in the user annotation data 10 to determine a batch of seed users, and performs word segmentation on the user annotation data of the seed users, so that more keywords (hereinafter referred to as "specific keywords") are diffused from the seed keywords. The device 20 then matches the specific keywords in the user annotation data, which diffuses more target users from the seed users, and finally regards the target users as the user group having the target tag, thereby completing tag mining for the target users. Although fig. 1 does not show the user annotation data 10 as being provided within the user tag mining device 20, it is to be understood that the user annotation data 10 may be obtained from a network or the like, or may be stored in the device 20; this is not limited here.
In a first aspect, an embodiment of the present specification provides a user tag mining method. Referring to fig. 2, the method includes steps S201 to S203.
S201: Determine at least one seed keyword of the target tag, and match the seed keyword in the user annotation data to determine seed users.
The target tag may be understood as a preset tag to be mined. For example, if a user group of teachers is to be mined, "teacher" is the target tag.
The user annotation data is the remark information that users participating in the social network attach to other users. It is not fixed, but is continuously updated or extended as users operate. For example, after a user adds a microblog friend, the user annotates the friend with information related to him or her: ID11*3311 paneseur.
It can be known from experience that, if a teacher tag is to be mined, teacher users are likely to be annotated with keywords such as "teacher" and "head teacher" in the user annotation data. Such empirically determined keywords related to the target tag are referred to as seed keywords.
Matching is performed over a large amount of user annotation data according to the seed keywords, and the users so determined are called seed users. In an alternative, the process of determining the seed users includes: matching the seed keywords in the user annotation data to determine initial seed users whose user annotation data contains the seed keywords; and counting the number of times each initial seed user matches the seed keywords, and screening out the initial seed users whose count is greater than a threshold as the seed users.
For example, the seed keywords "teacher" and "head teacher" of the teacher tag are matched against the user annotation data, and the matched IDs and keywords are retained. Suppose the user annotation data is as follows:
ID11*3311 Wang Xiaodi teacher
ID11*5638 Limazu head teacher
ID11*2314 Zhang Fugui hot pot restaurant owner
ID11*3567 Li Ming insurance sales
ID11*1598 Sun Zhiqiang physical education teacher
ID11*3034 instructor
ID11*7856 principal
ID11*5673 property management Li
The user annotation data is matched against the keywords "teacher" and "head teacher", and the matched information is retained; that is, the information of the initial seed users is: ID11*3311 teacher, ID11*5638 head teacher, and ID11*1598 teacher.
The same user ID may be annotated by many people, and all of these annotations are stored in the user annotation data. Therefore, the number of times each initial seed user's ID matches the seed keywords can be counted, and the IDs whose count is greater than a threshold k are taken as seed users. For example, suppose the counted numbers of matches are: ID11*3311, 25 times; ID11*5638, 13 times; and ID11*1598, 5 times. Then ID11*3311 and ID11*5638 match the seed keywords more than k times (assuming k is 10) and can be used as seed users.
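As a purely illustrative sketch of step S201, the matching and counting could be written as follows. The sample records, the function name `find_seed_users`, the use of plain substring matching, and the rule that each remark counts at most once are assumptions made for this example and are not prescribed by the specification.

```python
from collections import Counter

# Hypothetical sample: (annotated user ID, remark text) pairs. One ID may
# appear many times because many people can annotate the same user.
annotations = [
    ("ID11*3311", "Wang Xiaodi teacher"),
    ("ID11*3311", "teacher Wang"),
    ("ID11*3311", "head teacher"),
    ("ID11*5638", "Limazu head teacher"),
    ("ID11*5638", "math teacher"),
    ("ID11*1598", "physical education teacher"),
    ("ID11*2314", "hot pot restaurant owner"),
]


def find_seed_users(annotations, seed_keywords, k):
    """Return the IDs whose remarks match a seed keyword more than k times."""
    match_counts = Counter()
    for user_id, remark in annotations:
        # Assumption: a remark counts once even if it contains several keywords.
        if any(keyword in remark for keyword in seed_keywords):
            match_counts[user_id] += 1
    return {user_id for user_id, count in match_counts.items() if count > k}


seed_users = find_seed_users(annotations, ["teacher", "head teacher"], k=1)
print(sorted(seed_users))  # ['ID11*3311', 'ID11*5638']
```

With the toy threshold k = 1, the user annotated only once as a teacher is dropped, which mirrors the way the threshold in the example above removes ID11*1598.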
S202: Perform high-frequency word statistics on the user annotation data of the seed users, and determine and screen out a plurality of specific keywords.
The seed users are the users considered to have the target tag, so analyzing the words contained in the user annotation data of these users yields more tag information. Because the seed keywords determined in step S201 are obtained empirically, they are inevitably incomplete; in this step, more comprehensive keywords can be obtained from the annotation data of the seed users.
In an alternative, the plurality of specific keywords are determined and screened out as follows:
performing word segmentation on the user annotation data of the seed users, counting the word frequency of each word, and retaining the high-frequency words whose word frequency is greater than a threshold; and eliminating from the high-frequency words the words irrelevant to the target tag to obtain the specific keywords.
Continuing the above example, take the seed users (ID11*3311 teacher, ID11*5638 head teacher): word segmentation is performed on the user annotation data of the seed users, and the word frequency of each word is counted. For example, the word frequency statistics obtained are: teacher, 2000 times; head teacher, 1500 times; director, 600 times; instructor, 500 times; principal, 400 times; client, 300 times; mom, 200 times; and so on. Since the teacher tag is being mined, the words related to teachers are teacher, head teacher, director, instructor and principal, while irrelevant words such as client and mom are eliminated. The specific keywords obtained are "teacher, head teacher, instructor, principal".
It can be seen that the specific keywords outnumber the seed keywords and include the seed keywords; that is, keyword diffusion for the target tag is achieved by using the user annotation data.
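The word-frequency statistics and screening of step S202 might be sketched as below. This is only an illustration: whitespace splitting stands in for the word segmentation processing (real annotation text, for example Chinese, would need a proper segmenter), and the function names, the sample remarks and the manually supplied `unrelated_words` set are hypothetical.

```python
from collections import Counter


def high_frequency_words(remarks, min_count):
    """Count whitespace-separated tokens; keep those seen more than min_count times."""
    counts = Counter()
    for remark in remarks:
        counts.update(remark.lower().split())
    return {word: n for word, n in counts.items() if n > min_count}


def specific_keywords(frequent_words, unrelated_words):
    """Drop the high-frequency words judged unrelated to the target tag."""
    return sorted(w for w in frequent_words if w not in unrelated_words)


# Hypothetical remarks attached to the seed users.
seed_user_remarks = [
    "teacher wang", "teacher", "instructor", "teacher client",
    "instructor", "client", "principal", "principal", "mom",
]
frequent = high_frequency_words(seed_user_remarks, min_count=1)
keywords = specific_keywords(frequent, unrelated_words={"client"})
print(keywords)  # ['instructor', 'principal', 'teacher']
```

In this toy data "mom" and "wang" fall below the frequency threshold, while "client" is frequent but is removed as unrelated to the teacher tag.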
S203: Match the specific keywords in the user annotation data to determine target users.
In an alternative, determining the target users includes: matching the specific keywords in the user annotation data to determine initial target users whose user annotation data contains the specific keywords; and counting the number of times each initial target user matches the specific keywords, and screening out the initial target users whose count is greater than a threshold as the target users.
For example, the specific keywords "teacher, head teacher, instructor, principal" are matched against the user annotation data, and the matched IDs and keywords are retained. Suppose the user annotation data is as follows:
ID11*3311 Wang Xiaodi teacher
ID11*5638 Limazu head teacher
ID11*2314 Zhang Fugui hot pot restaurant owner
ID11*3567 Li Ming insurance sales
ID11*1598 Sun Zhiqiang physical education teacher
ID11*3034 instructor
ID11*7856 principal
ID11*5673 property management Li
The user annotation data is matched against the keywords "teacher, head teacher, instructor, principal", and the matched information is retained; that is, the information of the initial target users is: ID11*3311 teacher, ID11*5638 head teacher, ID11*1598 teacher, ID11*3034 instructor, and ID11*7856 principal.
Then the number of times each initial target user's ID matches the specific keywords is counted, and the IDs whose count is greater than a threshold d are taken as target users. For example, suppose the counted numbers of matches are: ID11*3311, 50 times; ID11*5638, 23 times; ID11*1598, 15 times; ID11*3034, 20 times; and ID11*7856, 2 times. Then ID11*3311, ID11*5638, ID11*1598 and ID11*3034 match the specific keywords more than d times (assuming d is 3) and are determined as the target users.
Thus, because the preceding steps diffuse a larger number of specific keywords from the seed keywords, matching the specific keywords in the user annotation data yields more target users than seed users. Diffusion from the seed users to the target users is thereby achieved, and tag mining for these target users is completed.
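To make the diffusion from seed users to target users concrete, steps S201 to S203 can be chained in a small sketch. The helper names `users_matching` and `expand_keywords`, the single-letter user IDs and all thresholds are assumptions chosen for the example; substring matching and whitespace tokenization again stand in for whatever matching and word segmentation an implementation actually uses.

```python
from collections import Counter


def users_matching(annotations, keywords, threshold):
    """IDs whose remarks match any keyword more than `threshold` times."""
    counts = Counter(uid for uid, remark in annotations
                     if any(kw in remark for kw in keywords))
    return {uid for uid, n in counts.items() if n > threshold}


def expand_keywords(annotations, seed_users, min_count, unrelated):
    """Frequent tokens in the seed users' remarks, minus unrelated words."""
    counts = Counter(token for uid, remark in annotations if uid in seed_users
                     for token in remark.split())
    return {w for w, n in counts.items() if n > min_count and w not in unrelated}


# Hypothetical annotation records: (annotated user ID, remark text).
annotations = [
    ("A", "teacher"), ("A", "teacher"), ("A", "instructor"), ("A", "client"),
    ("B", "teacher"), ("B", "teacher"), ("B", "instructor"), ("B", "client"),
    ("C", "instructor"), ("C", "instructor"),
    ("D", "courier"), ("D", "courier"),
]
seed_users = users_matching(annotations, {"teacher"}, threshold=1)            # S201
keywords = expand_keywords(annotations, seed_users, 1, unrelated={"client"})  # S202
target_users = users_matching(annotations, keywords, threshold=1)             # S203
print(sorted(seed_users), sorted(keywords), sorted(target_users))
```

User C is never annotated with the seed keyword "teacher", yet is reached through the diffused keyword "instructor", so the target users form a strict superset of the seed users in this toy data.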
In the above example, for simplicity, only the user annotation data of a limited number of user IDs is used for explanation. In an actual scenario the amount of user annotation data is often huge, for example covering hundreds of thousands to tens of millions of user IDs. With such a large amount of user annotation data, diffusion of the population having the target tag can be achieved and user tag mining can be completed, with efficiency and accuracy greatly improved compared with conventional methods.
In the embodiments of the present specification, only a few seed keywords need to be provided from manual experience. Seed users are mined with the seed keywords, more specific keywords are obtained from the user annotation data of the seed users, the specific keywords are matched against the user annotation data, and people with mutually exclusive identities are then filtered out, so that the target users are obtained and population diffusion is achieved. The method is highly efficient: target users can be mined quickly with little manual intervention, a large target population can be mined, and user tags can be determined quickly in batches, which effectively addresses the low efficiency of tag mining.
In a second aspect, referring to fig. 3, based on the same inventive concept, an embodiment of the present specification provides a user tag mining method. Compared with the embodiment of fig. 2, the embodiment corresponding to fig. 3 further evaluates, after the target users are determined, the accuracy with which each target user has the target tag, so that the accuracy of tag mining can be further improved.
Fig. 3 includes steps 1-11.
1. Determining the seed keywords
The seed keywords may be obtained from human experience. For example, to mine the teacher tag, teachers may be annotated as "teacher", "head teacher" or the like in the user annotation data; to mine the medical staff tag, such users may be annotated as "doctor", "TCM", "physician" or the like. These words are used as the seed keywords.
2. Matching the seed keywords
The seed keywords from step 1 are matched against the user annotation data, and the matched user IDs and keywords are retained. Suppose the user annotation data is as follows:
ID11*3311 Wang Xiaodi teacher
ID11*5638 Limazu head teacher
ID11*2314 Zhang Fugui hot pot restaurant owner
ID11*3567 Li Ming insurance sales
ID11*1598 Sun Zhiqiang physical education teacher
ID11*3034 instructor
ID11*7856 principal
ID11*5673 property management Li
The user annotation data is matched against the keywords "teacher" and "head teacher", and the matched information is retained; that is, the information of the initial seed users is: ID11*3311 teacher, ID11*5638 head teacher, and ID11*1598 teacher.
3. Counting word frequency and obtaining seed users
The same user ID may be annotated by many people, and all of these annotations are stored in the user annotation data. Therefore, the number of times each initial seed user's ID matches the seed keywords can be counted, and the IDs whose count is greater than a threshold k are taken as seed users. For example, suppose the counted numbers of matches are: ID11*3311, 25 times; ID11*5638, 13 times; and ID11*1598, 5 times. Then ID11*3311 and ID11*5638 match the seed keywords more than k times (assuming k is 10) and can be used as seed users.
4. Word segmentation processing and high-frequency word statistics
Word segmentation is performed on the user annotation data of the seed users obtained in step 3, and the word frequency of each word is counted. For example, the word frequency statistics obtained are: teacher, 2000 times; head teacher, 1500 times; director, 600 times; instructor, 500 times; principal, 400 times; client, 300 times; mom, 200 times; and so on.
5. Screening for specific keywords
Since the teacher tag is being mined, the words related to teachers are teacher, head teacher, director, instructor and principal, while irrelevant words such as client and mom are eliminated. The specific keywords obtained are "teacher, head teacher, instructor, principal".
6. Matching specific keywords and obtaining matched population
The specific keywords obtained in step 5 are matched against the user annotation data (in a manner similar to step 2), the number of times each user ID matches the specific keywords is counted (in a manner similar to step 3), and the user IDs whose count is greater than a threshold (which may be 3 to 5) are taken as the matched population (the initial target users).
7. Filtering out people with mutually exclusive identities
People with mutually exclusive identities, such as couriers, truck drivers and self-employed individuals, are filtered out of the matched population, because these occupations are mutually exclusive with teaching. The population remaining after filtering is the target population (the target users).
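The specification does not state how a mutually exclusive identity is detected. One possible reading, sketched here purely as an assumption, is that a matched user is dropped when any of that user's remarks contains a keyword for a conflicting occupation. The function name, the `exclusive_keywords` set and the sample remarks are hypothetical.

```python
def filter_mutually_exclusive(matched_users, remarks_by_user, exclusive_keywords):
    """Drop users whose remarks mention an identity that conflicts with the target tag."""
    kept = set()
    for user_id in matched_users:
        remarks = remarks_by_user.get(user_id, [])
        # Assumption: a single conflicting remark is enough to exclude the user.
        conflicts = any(kw in remark for remark in remarks for kw in exclusive_keywords)
        if not conflicts:
            kept.add(user_id)
    return kept


remarks_by_user = {
    "A": ["teacher", "head teacher"],
    "B": ["teacher", "courier"],  # also annotated as a courier
    "C": ["instructor"],
}
target_users = filter_mutually_exclusive(
    {"A", "B", "C"}, remarks_by_user, exclusive_keywords={"courier", "truck driver"}
)
print(sorted(target_users))  # ['A', 'C']
```

A stricter or looser rule, for example requiring the conflicting identity to be matched more often than the target tag, would fit the same step equally well.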
8. Selecting positive and negative samples
Among the target users obtained in step 7, the user IDs ranked in the top m% by word frequency (for example, m may be 10 to 30) are taken as positive samples. Here the word frequency is taken over all of the words: for example, the per-word counts (teacher, 2000 times; head teacher, 1500 times; director, 600 times; instructor, 500 times; principal, 400 times) are summed, and the user IDs ranked in the top m% of that total are taken as positive samples. User IDs numbering n times the number of positive samples (n may be 5 to 10) are then randomly selected as negative samples, for example by network crawling or from a network cloud disk. The negative samples must not contain any positive samples; for example, users whose annotation data contains no specific keyword are determined as negative samples.
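A minimal sketch of this sample selection is given below, assuming that each target user already has a total match count over all specific keywords and that a pool of users with no specific-keyword match is available. The function name `select_samples`, the fixed random seed and the toy counts are illustrative assumptions.

```python
import math
import random


def select_samples(match_counts, candidate_negatives, m_percent, n_ratio, seed=0):
    """Top m% of target users by match count become positives; n_ratio times
    as many negatives are sampled from users with no specific-keyword match."""
    ranked = sorted(match_counts, key=lambda uid: (-match_counts[uid], uid))
    n_pos = max(1, math.ceil(len(ranked) * m_percent / 100))
    positives = ranked[:n_pos]
    # Negatives must never overlap with positives.
    pool = sorted(set(candidate_negatives) - set(positives))
    n_neg = min(len(pool), n_pos * n_ratio)
    negatives = random.Random(seed).sample(pool, n_neg)
    return positives, negatives


# Hypothetical total match counts per target user.
match_counts = {"A": 50, "B": 23, "C": 20, "D": 15, "E": 4}
no_match_users = [f"N{i}" for i in range(20)]
positives, negatives = select_samples(match_counts, no_match_users, m_percent=20, n_ratio=5)
print(positives, len(negatives))  # ['A'] 5
```

Fixing the seed only makes the toy run repeatable; in practice the negative sampling would normally be left random.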
9. Building a feature library
A feature library is constructed from the basic data accumulated on the network. The feature library contains multi-dimensional information about each user, such as basic information (age, gender and the like), shopping information and behavior information. This information is the basis for mining user portrait tags and is input into the machine learning model as basic features.
The positive and negative samples are then joined with the feature library to form the training set of the machine learning model.
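Joining the samples with the feature library can be pictured as a lookup by user ID that yields a feature matrix and a label vector. In the sketch below the feature names (`age`, `is_female`, `book_purchases`), the dictionary-based library and the rule of skipping users missing from the library are all assumptions for illustration.

```python
def build_training_set(positives, negatives, feature_library, feature_names):
    """Join labeled user IDs with their feature rows; skip users missing from the library."""
    rows, labels = [], []
    for label, user_ids in ((1, positives), (0, negatives)):
        for user_id in user_ids:
            features = feature_library.get(user_id)
            if features is None:
                continue  # assumption: users without features are not used for training
            rows.append([float(features[name]) for name in feature_names])
            labels.append(label)
    return rows, labels


# Hypothetical feature library keyed by user ID.
feature_library = {
    "A": {"age": 35, "is_female": 1, "book_purchases": 12},
    "N1": {"age": 22, "is_female": 0, "book_purchases": 1},
    "N2": {"age": 41, "is_female": 0, "book_purchases": 0},
}
X, y = build_training_set(["A"], ["N1", "N2", "N3"], feature_library,
                          ["age", "is_female", "book_purchases"])
print(X, y)
```

Here "N3" has no entry in the library and is therefore left out, so the training set has three rows.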
10. Machine learning model training
The machine learning model is trained on the training set to obtain a trained model. Common machine learning models include random forests, GBDT (gradient boosting decision tree), XGBoost (extreme gradient boosting), logistic regression, LightGBM (light gradient boosting machine), DNN (deep neural network) models and the like; the embodiments of the present specification do not limit the specific algorithm of the model.
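Since the model type is left open, the following sketch uses logistic regression, one of the listed options, written in plain Python with batch gradient descent so that it needs no third-party library. The learning rate, epoch count, feature meanings and toy data are assumptions; a production system would more likely rely on an established library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(model, row):
    """Probability that the user described by `row` has the target tag."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


# Toy, pre-scaled features (hypothetical): [education_purchases, night_activity]
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
model = train_logistic_regression(X, y)
print(round(predict_proba(model, [0.85, 0.15]), 2))
```

The features are assumed to be scaled to a similar range beforehand; unscaled features such as raw age would call for normalization or a smaller learning rate.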
11. Scoring the tags
The machine learning model trained in step 10 is used to score the tags of the target users, so that a score for the target tag is obtained for each user and output as the result. The higher the score, the higher the accuracy of the tag. Users whose scores are below a threshold may further be removed; for example, only users with scores above a threshold (say 80 points) are retained.
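The scoring and thresholding step might look like the sketch below. It assumes the trained model exposes a probability in [0, 1] that is rescaled to a 0 to 100 score; the specification mentions a threshold of "say 80 points" but does not fix the scale. `toy_predict_proba` is a stand-in for a trained model, not a real one.

```python
def score_users(predict_proba, features_by_user, threshold=80.0):
    """Score each target user on a 0-100 scale and keep those scoring above threshold."""
    scores = {uid: 100.0 * predict_proba(row) for uid, row in features_by_user.items()}
    retained = {uid: s for uid, s in scores.items() if s > threshold}
    return scores, retained


# Stand-in for a trained model: any callable returning a probability in [0, 1].
def toy_predict_proba(row):
    return min(1.0, max(0.0, row[0]))


features_by_user = {"A": [0.95], "B": [0.60], "C": [0.85]}
scores, retained = score_users(toy_predict_proba, features_by_user)
print(sorted(retained))  # ['A', 'C']
```

Returning both the full score table and the retained subset keeps the raw scores available as the output result while still allowing low-scoring users to be removed.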
After the target users are mined in this embodiment, a score for each user with respect to the tag can be given through steps 8 to 11; the higher the score, the higher the accuracy. Because the user tags are scored by machine learning modeling, the approach is general, does not depend on manual experience, and can score each user's tag quickly. Accuracy is judged from the score, which effectively addresses the difficulty of extracting a higher-accuracy population from a tagged group.
In a third aspect, embodiments of the present specification further provide a user tag mining apparatus, referring to fig. 4, the apparatus includes:
a seed keyword determination unit 401, configured to determine at least one seed keyword of the target tag;
a seed user determining unit 402, configured to perform matching in the user annotation data according to the seed keyword to determine a seed user;
a specific keyword determining unit 403, configured to perform high-frequency word statistics on the user annotation data of the seed user, and determine and screen out a plurality of specific keywords; and
a target user determining unit 404, configured to perform matching in the user annotation data by using the specific keywords, and determine a target user.
In an alternative, the seed user determining unit 402 includes:
an initial seed user determination subunit 4021, configured to perform matching in the user annotation data by using the seed keyword, and determine an initial seed user whose user annotation data includes the seed keyword;
a seed user screening subunit 4022, configured to count the times that the initial seed users match the seed keywords, and screen out the initial seed users with the times greater than a threshold as the seed users.
In an alternative manner, the specific keyword determination unit 403 includes:
a high-frequency word counting subunit 4031, configured to perform word segmentation processing on the user tagging data of the seed user, count the word frequency of each word, and keep the high-frequency words with the word frequency greater than a threshold;
a specific keyword screening subunit 4032, configured to remove, from the high-frequency words, words that are not related to the target tag, to obtain the specific keyword.
In an alternative, the target user determination unit 404 includes:
an initial target user determination subunit 4041, configured to match the specific keyword in user tagging data, and determine an initial target user whose user tagging data includes the specific keyword;
a target user screening subunit 4042, configured to count the number of times that each initial target user matches the specific keyword, and screen out the initial target users whose count is greater than a threshold as the target users.
In an alternative, the apparatus further comprises:
a target user scoring unit 405, configured to score the accuracy of the target tag of the target user according to a machine learning model.
In an alternative, the apparatus further comprises a machine learning model training unit 406, the machine learning model training unit 406 comprising:
a positive and negative sample determination subunit 4061, configured to select, from the target users, users whose occurrence probability of the high-frequency words is greater than a threshold as positive samples, and determine users whose annotation data contains no specific keyword as negative samples;
a training set constructing subunit 4062, configured to join the positive samples and the negative samples with a pre-constructed feature library to obtain a training set; and
a training subunit 4063, configured to perform training on the training set based on a machine learning algorithm to obtain the machine learning model.
In an alternative mode, the number of the specific keywords is more than that of the seed keywords, and the specific keywords comprise the seed keywords; the number of the target users is more than that of the seed users, and the target users comprise the seed users.
In a fourth aspect, based on the same inventive concept as the user tag mining method in the foregoing embodiment, the present invention further provides a server, as shown in fig. 5, including a memory 504, a processor 502 and a computer program stored on the memory 504 and executable on the processor 502, wherein the processor 502 implements the steps of any one of the foregoing user tag mining methods when executing the program.
In fig. 5, a bus architecture (represented by bus 500) is shown. Bus 500 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 502, and memory, represented by memory 504. The bus 500 may also link together various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface 506 provides an interface between the bus 500 and the receiver 501 and transmitter 503. The receiver 501 and the transmitter 503 may be the same element, i.e. a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 502 is responsible for managing the bus 500 and general processing, and the memory 504 may be used for storing data used by the processor 502 in performing operations.
In a fifth aspect, based on the inventive concept of the user tag mining method in the foregoing embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of any one of the foregoing user tag mining methods.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.