CN110598091A - User tag mining method, device, server and readable storage medium - Google Patents

User tag mining method, device, server and readable storage medium

Info

Publication number
CN110598091A
Authority
CN
China
Prior art keywords
user
seed
users
target
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910735347.3A
Other languages
Chinese (zh)
Inventor
温亿明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910735347.3A priority Critical patent/CN110598091A/en
Publication of CN110598091A publication Critical patent/CN110598091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this specification provide a user tag mining method. Seed users are mined with a small number of seed keywords, more specific keywords are obtained from the seed users' annotation data, and the specific keywords are matched against the user annotation data and screened, so that a large number of target users are finally determined and the tag is diffused across the population.

Description

User tag mining method, device, server and readable storage medium
Technical Field
The embodiment of the specification relates to the technical field of internet, in particular to a user tag mining method, a user tag mining device, a user tag mining server and a readable storage medium.
Background
User portraits are widely used on the internet to characterize users accurately. A user portrait covers demographic attributes, social attributes, preference information, relationship information, location information, working status, and so on. User portraits are core internet data and are widely used for data analysis, risk control, and similar tasks. A user portrait typically contains a large number of tags, such as age, gender, and occupation. How to mine user tags quickly is therefore a technical problem to be solved.
Disclosure of Invention
The embodiment of the specification provides a user tag mining method, a user tag mining device, a server and a readable storage medium.
In a first aspect, an embodiment of the present specification provides a user tag mining method, including: determining at least one seed keyword of a target label, and matching the seed keyword in user labeling data to determine a seed user; carrying out high-frequency word statistics on the user labeling data of the seed user, and determining and screening out a plurality of specific keywords; and matching the specific keywords in the user labeling data to determine the target user.
In a second aspect, an embodiment of the present specification provides a user tag mining device, including: a seed keyword determining unit, used for determining at least one seed keyword of the target tag; a seed user determining unit, used for matching the seed keyword in user annotation data to determine a seed user; a specific keyword determining unit, used for carrying out high-frequency word statistics on the user annotation data of the seed user, and determining and screening out a plurality of specific keywords; and a target user determining unit, used for matching the specific keywords in the user annotation data to determine the target user.
In a third aspect, embodiments of the present specification provide a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of any one of the above methods when executing the program.
In a fourth aspect, the present specification provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of any of the above methods.
The embodiment of the specification has the following beneficial effects:
In the embodiments of this specification, only a few seed keywords need to be provided from manual experience. Seed users are mined with the seed keywords, more specific keywords are obtained from the seed users' annotation data, the specific keywords are matched against the user annotation data, and people with mutually exclusive identities are then filtered out to obtain the target users, which achieves population diffusion. The method is highly efficient: target users are mined quickly with little manual intervention, large target populations can be mined, and user tags can be determined quickly in batches, which addresses the low efficiency of tag mining.
Drawings
Fig. 1 is a schematic diagram illustrating an implementation principle of a user tag mining method provided in an embodiment of the present specification;
Fig. 2 is a flowchart of a user tag mining method provided in the first aspect of the embodiments of the present specification;
Fig. 3 is a flowchart of a user tag mining method provided in the second aspect of the embodiments of the present specification;
Fig. 4 is a schematic structural diagram of a user tag mining device provided in the third aspect of the embodiments of the present specification;
Fig. 5 is a schematic structural diagram of a server provided in the fourth aspect of the embodiments of the present specification.
Detailed Description
To aid understanding, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. The embodiments and their specific features are detailed illustrations of the technical solutions, not limitations on them, and the technical features of the embodiments may be combined with each other where they do not conflict.
For a better understanding of the embodiments of the present specification, the relevant terms are explained as follows.
User tag: an abstract classification and summary of a certain characteristic of a particular population or object, such as "gender", "occupation", or "age". User portrait: a combination of multiple tags, where a specific instance consists of multiple tag values; for example, one user's portrait might be "male", "20 years old", "college", "basketball". User annotation data: the remark information that users participating in a social network attach to other users.
With the rapid development of the internet, social networks have grown steadily and accumulated huge amounts of data. A major function of a social network is making friends. To remember another person or describe some of that person's characteristics, participants often add a remark to the other party, and this remark information is the user annotation data. For example, to remember their own teacher, a user may add remarks such as "teacher" or "head teacher" to that person. The inventor found that user annotation data covers an extremely wide range of information and is of great help in identifying users. The embodiments of this specification therefore provide a user tag mining method that diffuses the population holding a given tag based on user annotation data and can then evaluate tag accuracy.
Fig. 1 is a schematic diagram of the implementation principle of the user tag mining method provided in the embodiments of this specification. Fig. 1 shows user annotation data 10 and a user tag mining device 20. Working from the user annotation data, the device 20 matches the seed keywords of a specific tag (hereinafter the "target tag") to determine a batch of seed users. It then performs word segmentation on the seed users' annotation data, which diffuses the seed keywords into a larger set of keywords (hereinafter the "specific keywords"). Matching the specific keywords against more user annotation data diffuses the seed users into a larger set of target users. The target users are finally regarded as the user group holding the target tag, which completes the tag mining for those users. Although Fig. 1 does not show the user annotation data 10 inside the user tag mining device 20, the user annotation data 10 may be obtained from a network or stored in the device 20; this is not limited here.
In a first aspect, an embodiment of the present specification provides a user tag mining method, please refer to fig. 2, which includes steps S201 to S203.
S201: Determine at least one seed keyword of the target tag, and match the seed keyword in the user annotation data to determine seed users.
The target tag is the preset tag to be mined. For example, if the goal is to mine the group of users who are teachers, "teacher" is the target tag.
The user annotation data is the remark information that users participating in the social network attach to other users. It is not fixed; it is continuously updated and grows as users act. For example, after a user adds a microblog friend, the user attaches a remark about that friend to the friend's ID, such as ID 11*3311.
Experience suggests that, to mine the teacher tag, teacher users are likely to be annotated with keywords such as "teacher" and "head teacher" in the user annotation data. Keywords related to the target tag that are determined from experience in this way are called seed keywords.
The seed keywords are matched against a large volume of user annotation data, and the users determined in this way are called seed users. In an alternative, the process of determining the seed users includes: matching the seed keywords in the user annotation data to determine initial seed users whose annotation data contains a seed keyword; then counting the number of times each initial seed user matches the seed keywords, and screening out the initial seed users whose count is greater than a threshold as the seed users.
For example, the seed keywords "teacher" and "head teacher" of the teacher tag are matched against the user annotation data, and the matched IDs and keywords are retained. Suppose the user annotation data is as follows:
ID 11*3311  Wang Xiaodi, teacher
ID 11*5638  Limazu, head teacher
ID 11*2314  Zhang Fugui, hot pot restaurant owner
ID 11*3567  Li Ming, insurance sales
ID 11*1598  Sun Zhiqiang, physics teacher
ID 11*3034  instructor
ID 11*7856  principal
ID 11*5673  Li, property management
Matching the keywords "teacher" and "head teacher" against the user annotation data and retaining the matched information gives the initial seed users: ID 11*3311 (teacher), ID 11*5638 (head teacher), and ID 11*1598 (teacher).
The same user ID may be annotated by many people, and all of those annotations are stored in the user annotation data. The number of times each initial seed user's ID matches the seed keywords can therefore be counted, and the IDs whose count is greater than a threshold k are taken as seed users. For example, suppose the counts are: ID 11*3311, 25 times; ID 11*5638, 13 times; ID 11*1598, 5 times. IDs 11*3311 and 11*5638 match the keywords more than k times (assuming k = 10) and can be used as seed users.
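The seed user step above can be sketched in Python as follows. This is only an illustrative sketch: the record layout, the remark texts, the substring matching rule, and the function name `find_seed_users` are assumptions made for the example and are not specified by the text. The threshold is set to k = 1 only because the toy data set is tiny.

```python
from collections import Counter

# Hypothetical remark records: (annotated user ID, remark written by some other user).
# The same ID appears several times because many people may annotate one user.
annotations = [
    ("11*3311", "Wang Xiaodi teacher"),
    ("11*3311", "Teacher Wang"),
    ("11*3311", "math teacher"),
    ("11*5638", "Li head teacher"),
    ("11*5638", "head teacher of class 3"),
    ("11*1598", "Sun physics teacher"),
    ("11*2314", "Zhang hot pot restaurant owner"),
]

def find_seed_users(records, seed_keywords, k):
    """Return the IDs whose remarks match a seed keyword more than k times."""
    hits = Counter()
    for user_id, remark in records:
        text = remark.lower()
        # Assumption: a remark counts as one match if it contains any seed keyword.
        if any(keyword in text for keyword in seed_keywords):
            hits[user_id] += 1
    return {user_id for user_id, count in hits.items() if count > k}

seed_users = find_seed_users(annotations, ["teacher", "head teacher"], k=1)
print(sorted(seed_users))
```

With real data the threshold would be larger (the example in the text assumes k = 10), and the matching rule would depend on the language and tokenization of the remarks.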
S202: Perform high-frequency word statistics on the user annotation data of the seed users, and determine and screen out a plurality of specific keywords.
The seed users are the users considered to have the target tag, so analyzing the words contained in their annotation data yields more tag information. Because the seed keywords determined in step S201 come from experience, they are inevitably incomplete; this step derives a more comprehensive set of keywords from the annotation data of the seed users.
In an alternative, the process of determining and screening out a plurality of specific keywords is as follows:
perform word segmentation on the user annotation data of the seed users, count the frequency of each word, and retain the high-frequency words whose frequency is greater than a threshold; then remove the words unrelated to the target tag from the high-frequency words to obtain the specific keywords.
Continuing the above example with the seed users (ID 11*3311, teacher; ID 11*5638, head teacher), word segmentation is performed on the seed users' annotation data and the frequency of each word is counted. For example, the statistics might be: teacher 2000 times, head teacher 1500 times, director 600 times, instructor 500 times, principal 400 times, client 300 times, mom 200 times, and so on. Because the teacher tag is being mined, the teacher-related words are teacher, head teacher, director, instructor, and principal, while unrelated words such as client and mom are removed. The specific keywords obtained are "teacher, head teacher, instructor, principal".
It can be seen that the number of the specific keywords is greater than that of the seed keywords, and the specific keywords include the seed keywords, that is, the diffusion of the keywords is realized for the target tag by using the user labeling data.
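A minimal Python sketch of this keyword diffusion step is shown below. It assumes whitespace tokenization as a stand-in for real word segmentation and a hand-supplied set of irrelevant words in place of the manual relevance screening; the function name `specific_keywords` and the sample remarks are hypothetical.

```python
from collections import Counter

def specific_keywords(seed_remarks, min_freq, irrelevant):
    """Count word frequency over seed users' remarks and keep relevant frequent words."""
    freq = Counter()
    for remark in seed_remarks:
        # Assumption: whitespace splitting stands in for a real word segmenter.
        freq.update(remark.lower().split())
    high_freq = [word for word, count in freq.items() if count > min_freq]
    # Words unrelated to the target tag are removed; here the list is supplied by hand.
    return sorted(word for word in high_freq if word not in irrelevant)

# Hypothetical remarks collected for the seed users.
remarks = [
    "teacher wang", "wang teacher", "instructor wang", "teacher",
    "principal li", "principal", "client li", "client", "instructor",
]
keywords = specific_keywords(remarks, min_freq=1, irrelevant={"client", "wang", "li"})
print(keywords)
```

Chinese remarks would need a proper word segmenter before counting; the counting and filtering logic stays the same.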
S203: Match the specific keywords in the user annotation data to determine the target users.
In an alternative, determining the target users includes: matching the specific keywords in the user annotation data to determine initial target users whose annotation data contains a specific keyword; then counting the number of times each initial target user matches the specific keywords, and screening out the initial target users whose count is greater than a threshold as the target users.
For example, the specific keywords "teacher, head teacher, instructor, principal" are matched against the user annotation data, and the matched IDs and keywords are retained. Suppose the user annotation data is as follows:
ID 11*3311  Wang Xiaodi, teacher
ID 11*5638  Limazu, head teacher
ID 11*2314  Zhang Fugui, hot pot restaurant owner
ID 11*3567  Li Ming, insurance sales
ID 11*1598  Sun Zhiqiang, physics teacher
ID 11*3034  instructor
ID 11*7856  principal
ID 11*5673  Li, property management
Matching the keywords "teacher, head teacher, instructor, principal" against the user annotation data and retaining the matched information gives the initial target users: ID 11*3311 (teacher), ID 11*5638 (head teacher), ID 11*1598 (teacher), ID 11*3034 (instructor), and ID 11*7856 (principal).
The number of times each initial target user's ID matches the specific keywords is then counted, and the IDs whose count is greater than a threshold d are taken as target users. For example, suppose the counts are: ID 11*3311, 50 times; ID 11*5638, 23 times; ID 11*1598, 15 times; ID 11*3034, 20 times; ID 11*7856, 2 times. IDs 11*3311, 11*5638, 11*1598, and 11*3034 match the keywords more than d times (assuming d = 3) and are determined to be target users.
Because the earlier steps diffuse the seed keywords into a larger set of specific keywords, matching the specific keywords in the user annotation data yields more target users than seed users. This diffuses the seed users into the target users and completes the tag mining for those target users.
For simplicity, the above example uses the annotation data of only a handful of user IDs. In a real scenario the volume of user annotation data is usually huge, for example covering hundreds of thousands of user IDs or more, so diffusing the population holding the target tag from such data achieves user tag mining with much higher efficiency and accuracy than conventional methods.
In the embodiments of this specification, only a few seed keywords need to be provided from manual experience. Seed users are mined with the seed keywords, more specific keywords are obtained from the seed users' annotation data, the specific keywords are matched against the user annotation data, and people with mutually exclusive identities are then filtered out to obtain the target users, which achieves population diffusion. The method is highly efficient: target users are mined quickly with little manual intervention, large target populations can be mined, and user tags can be determined quickly in batches, which addresses the low efficiency of tag mining.
In a second aspect, referring to Fig. 3 and based on the same inventive concept, an embodiment of the present specification provides a user tag mining method. Compared with the embodiment of Fig. 2, the embodiment of Fig. 3 additionally evaluates, after the target users are determined, how accurately each target user holds the target tag, which further improves the accuracy of tag mining.
Fig. 3 includes steps 1-11.
1. Determine the seed keywords.
The seed keywords can be obtained from human experience. For example, to mine the teacher tag, teachers may be annotated as "teacher", "head teacher", and so on in the user annotation data; to mine the medical staff tag, such users may be annotated as "doctor", "TCM", "physician", and so on. These words are used as seed keywords.
2. Match the seed keywords.
Match the seed keywords from step 1 against the user annotation data, and retain the matched user IDs and keywords. Suppose the user annotation data is as follows:
ID 11*3311  Wang Xiaodi, teacher
ID 11*5638  Limazu, head teacher
ID 11*2314  Zhang Fugui, hot pot restaurant owner
ID 11*3567  Li Ming, insurance sales
ID 11*1598  Sun Zhiqiang, physics teacher
ID 11*3034  instructor
ID 11*7856  principal
ID 11*5673  Li, property management
Matching the keywords "teacher" and "head teacher" against the user annotation data and retaining the matched information gives the initial seed users: ID 11*3311 (teacher), ID 11*5638 (head teacher), and ID 11*1598 (teacher).
3. Count matches and obtain the seed users
The same user ID may be annotated by many people, and all of those annotations are stored in the user annotation data. The number of times each initial seed user's ID matches the seed keywords can therefore be counted, and the IDs whose count is greater than a threshold k are taken as seed users. For example, suppose the counts are: ID 11*3311, 25 times; ID 11*5638, 13 times; ID 11*1598, 5 times. IDs 11*3311 and 11*5638 match the keywords more than k times (assuming k = 10) and can be used as seed users.
4. Word segmentation and high-frequency word statistics
Perform word segmentation on the user annotation data of the seed users obtained in step 3, and count the frequency of each word. For example, the statistics might be: teacher 2000 times, head teacher 1500 times, director 600 times, instructor 500 times, principal 400 times, client 300 times, mom 200 times, and so on.
5. Screen the specific keywords
Because the teacher tag is being mined, the teacher-related words are teacher, head teacher, director, instructor, and principal, while unrelated words such as client and mom are removed. The specific keywords obtained are "teacher, head teacher, instructor, principal".
6. Match the specific keywords and obtain the matched population
Match the specific keywords obtained in step 5 against the user annotation data (as in step 2), count the number of times each user ID matches the specific keywords (as in step 3), and take the user IDs whose count is greater than a threshold l (l may be 3 to 5) as the matched population (the initial target users).
7. Filter out people with mutually exclusive identities
People with mutually exclusive identities, such as couriers, truck drivers and individual users, are filtered out of the matched population, because these occupations are mutually exclusive with being a teacher. The population that remains after filtering is the target population (the target users).
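The text does not say how a mutually exclusive identity is detected. One possible reading, sketched below purely as an assumption, is to drop any matched user whose remarks also contain a keyword for an exclusive occupation. The function name, the remark data, and the exclusive keyword list are all hypothetical.

```python
def filter_mutually_exclusive(matched_ids, remarks_by_user, exclusive_keywords):
    """Drop matched users whose remarks also mention a mutually exclusive identity."""
    kept = []
    for user_id in matched_ids:
        remarks = remarks_by_user.get(user_id, [])
        # Assumption: one remark containing an exclusive keyword disqualifies the user.
        conflicting = any(
            keyword in remark.lower()
            for remark in remarks
            for keyword in exclusive_keywords
        )
        if not conflicting:
            kept.append(user_id)
    return kept

# Hypothetical remarks for three users in the matched population.
remarks_by_user = {
    "11*3311": ["Wang Xiaodi teacher", "math teacher"],
    "11*3034": ["instructor", "driving instructor, truck driver"],
    "11*1598": ["Sun physics teacher"],
}
matched = ["11*3311", "11*3034", "11*1598"]
targets = filter_mutually_exclusive(matched, remarks_by_user, ["courier", "truck driver"])
print(targets)
```

A stricter or looser rule (for example, requiring several conflicting remarks) would fit the same interface.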
8. Select positive and negative samples
From the target users obtained in step 7, take the user IDs in the top m% by word frequency (m may be, for example, 10 to 30) as positive samples. The word frequency here is summed over all specific keywords; for example, the counts of teacher (2000 times), head teacher (1500 times), director (600 times), instructor (500 times), and principal (400 times) are added together, and the user IDs corresponding to the top m% of that summed count are taken as positive samples. Then randomly select user IDs numbering n times the positive samples (n may be 5 to 10) as negative samples, for example by web crawling or from a network cloud disk. The negative samples must not contain any positive sample; for example, users whose annotation data contains no specific keyword can be taken as negative samples.
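The sample selection can be illustrated with the following sketch. It assumes that each target user already has a summed match count over all specific keywords and that negatives are drawn from users with no match at all; the function name `select_samples`, the fixed random seed, and the toy counts are assumptions for the example.

```python
import math
import random

def select_samples(match_counts, all_user_ids, m_percent, n_ratio, seed=0):
    """Pick the top m% of matched users as positives and n times as many negatives."""
    # Rank matched users by summed keyword matches, highest first (ties broken by ID).
    ranked = sorted(match_counts, key=lambda uid: (-match_counts[uid], uid))
    n_pos = max(1, math.ceil(len(ranked) * m_percent / 100))
    positives = ranked[:n_pos]
    # Assumption: the negative pool is every user with no specific-keyword match,
    # so it cannot overlap with the positives.
    pool = sorted(uid for uid in all_user_ids if uid not in match_counts)
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(len(pool), n_pos * n_ratio))
    return positives, negatives

# Hypothetical summed match counts for five target users, plus 30 unmatched users.
counts = {"a": 50, "b": 23, "c": 15, "d": 20, "e": 5}
everyone = list(counts) + [f"x{i}" for i in range(30)]
pos, neg = select_samples(counts, everyone, m_percent=20, n_ratio=5)
print(pos, len(neg))
```

Drawing negatives only from unmatched users is one way to guarantee that the two sets stay disjoint.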
9. Build a feature library
Build a feature library from the data accumulated on the network. The feature library contains multi-dimensional information about each user, such as basic information (age, gender, and so on), shopping information, and behavior information. This information is the basis for mining user portrait tags and is fed into the machine learning model as basic features.
The positive and negative samples are then joined with the feature library to form the training set of the machine learning model.
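Joining the samples with the feature library amounts to a lookup by user ID. The sketch below assumes the feature library is a dictionary keyed by user ID and that missing feature values default to 0; the feature names and the function `build_training_set` are illustrative only.

```python
def build_training_set(positives, negatives, feature_library, feature_names):
    """Join labeled user IDs with their features to form (feature_vector, label) rows."""
    rows = []
    for label, user_ids in ((1, positives), (0, negatives)):
        for user_id in user_ids:
            features = feature_library.get(user_id)
            if features is None:
                continue  # Assumption: users absent from the library are skipped.
            vector = [float(features.get(name, 0.0)) for name in feature_names]
            rows.append((vector, label))
    return rows

# Hypothetical feature library with basic and shopping information.
feature_library = {
    "u1": {"age": 35, "is_female": 1, "monthly_orders": 4},
    "u2": {"age": 22, "is_female": 0, "monthly_orders": 9},
    "u3": {"age": 41, "is_female": 0},
}
names = ["age", "is_female", "monthly_orders"]
training_set = build_training_set(["u1"], ["u2", "u3", "u4"], feature_library, names)
print(training_set)
```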
10. Train the machine learning model
Train on the training set to obtain a trained model. Common machine learning models include random forest, GBDT (gradient boosting decision tree), XGBoost (extreme gradient boosting), logistic regression, LightGBM (light gradient boosting machine), and DNN (deep neural network) models; the embodiments of this specification do not limit the specific algorithm.
11. Score the tags.
Use the machine learning model trained in step 10 to score the target users, obtaining each user's score for the target tag, and output the scores as the result. The higher the score, the higher the accuracy of the tag. Users whose scores fall below a threshold can be further removed; for example, only users scoring above a threshold (say, 80 points) are retained.
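Steps 10 and 11 can be sketched with logistic regression, one of the model families listed above. The hand-written gradient descent below is a minimal illustration, not the method of the embodiment; the two binary features, the learning rate, the epoch count, and the 0 to 100 score scale (chosen to match the "80 points" example) are all assumptions.

```python
import math

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with full-batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def score(w, b, x):
    """Map the predicted probability to a 0-100 score."""
    return 100.0 * sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features: [bought_teaching_supplies, mostly_active_at_night]
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)

target_features = {"u1": [1, 0], "u2": [0, 1]}
scores = {uid: score(w, b, x) for uid, x in target_features.items()}
kept = [uid for uid, s in scores.items() if s > 80]  # keep users above 80 points
print(kept)
```

In practice a library implementation of any of the listed models would replace the hand-written training loop; the thresholding of scores stays the same.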
After the target users are mined in this embodiment, steps 8 to 11 give each user a score for the tag; the higher the score, the higher the accuracy. Scoring user tags by machine learning modeling is a very general approach that does not rely on manual experience, so each user's tag score can be obtained quickly and accuracy can be judged from the score. This addresses the difficulty of extracting a high-accuracy population from the tagged users.
In a third aspect, embodiments of the present specification further provide a user tag mining apparatus, referring to fig. 4, the apparatus includes:
a seed keyword determination unit 401, configured to determine at least one seed keyword of the target tag;
a seed user determining unit 402, configured to perform matching in the user annotation data according to the seed keyword to determine a seed user;
a specific keyword determining unit 403, configured to perform high-frequency word statistics on the user tagging data of the seed user, and determine and screen out a plurality of specific keywords;
and a target user determining unit 404, configured to perform matching in the user annotation data by using the specific keyword, and determine a target user.
In an alternative, the seed user determining unit 402 includes:
an initial seed user determination subunit 4021, configured to perform matching in the user annotation data by using the seed keyword, and determine an initial seed user whose user annotation data includes the seed keyword;
a seed user screening subunit 4022, configured to count the times that the initial seed users match the seed keywords, and screen out the initial seed users with the times greater than a threshold as the seed users.
In an alternative manner, the specific keyword determination unit 403 includes:
a high-frequency word counting subunit 4031, configured to perform word segmentation processing on the user tagging data of the seed user, count the word frequency of each word, and keep the high-frequency words with the word frequency greater than a threshold;
a specific keyword screening subunit 4032, configured to remove, from the high-frequency words, words that are not related to the target tag, to obtain the specific keyword.
In an alternative, the target user determination unit 404 includes:
an initial target user determination subunit 4041, configured to match the specific keyword in user tagging data, and determine an initial target user whose user tagging data includes the specific keyword;
a target user screening subunit 4042, configured to count the number of times that the initial target users match the specific keyword, and screen out the initial target users whose count is greater than a threshold as the target users.
In an alternative, the apparatus further comprises:
a target user scoring unit 405, configured to score the accuracy of the target tag of the target user according to a machine learning model.
In an alternative, the apparatus further comprises: a machine learning model training unit 406, the machine learning model training unit 406 comprising:
the positive and negative sample determination subunit 4061 is configured to select, from the target users, a user whose occurrence probability of the high-frequency word is greater than a threshold as a positive sample, and determine a user without a specific keyword in the annotation data as a negative sample;
a training set constructing subunit 4062, configured to splice the positive sample and the negative sample with a pre-constructed feature library to obtain a training set;
and the training subunit 4063 is configured to train the training set based on a machine learning algorithm to obtain the machine learning model.
In an alternative mode, the number of the specific keywords is more than that of the seed keywords, and the specific keywords comprise the seed keywords; the number of the target users is more than that of the seed users, and the target users comprise the seed users.
In a fourth aspect, based on the same inventive concept as the user tag mining method in the foregoing embodiment, the present invention further provides a server, as shown in fig. 5, including a memory 504, a processor 502 and a computer program stored on the memory 504 and executable on the processor 502, wherein the processor 502 implements the steps of any one of the foregoing user tag mining methods when executing the program.
Where in fig. 5 a bus architecture (represented by bus 500) is shown, bus 500 may include any number of interconnected buses and bridges, and bus 500 links together various circuits including one or more processors, represented by processor 502, and memory, represented by memory 504. The bus 500 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface 506 provides an interface between the bus 500 and the receiver 501 and transmitter 503. The receiver 501 and the transmitter 503 may be the same element, i.e. a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 502 is responsible for managing the bus 500 and general processing, and the memory 504 may be used for storing data used by the processor 502 in performing operations.
In a fifth aspect, based on the inventive concept of the user tag mining method in the foregoing embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of any one of the foregoing user tag mining methods.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.

Claims (16)

1. A user tag mining method, comprising:
determining at least one seed keyword of a target label, and matching the seed keyword in user labeling data to determine a seed user;
performing high-frequency word statistics on the user labeling data of the seed user, and determining and screening out a plurality of specific keywords; and
matching the specific keywords in the user labeling data to determine a target user.
2. The method of claim 1, wherein matching the seed keyword in the user labeling data to determine the seed user comprises:
matching the seed keyword in the user labeling data to determine initial seed users whose user labeling data contains the seed keyword; and
counting the number of times each initial seed user matches the seed keyword, and selecting the initial seed users whose count is larger than a threshold value as the seed users.
3. The method of claim 1, wherein performing high-frequency word statistics on the user labeling data of the seed user and determining and screening out a plurality of specific keywords comprises:
performing word segmentation on the user labeling data of the seed user, counting the word frequency of each word, and retaining high-frequency words whose word frequency is larger than a threshold value; and
removing words irrelevant to the target label from the high-frequency words to obtain the specific keywords.
4. The method of claim 1, wherein matching the specific keywords in the user labeling data to determine the target user comprises:
matching the specific keywords in the user labeling data to determine initial target users whose user labeling data contains the specific keywords; and
counting the number of times each initial target user matches the specific keywords, and selecting the initial target users whose count is larger than a threshold value as the target users.
5. The method of claim 1, further comprising:
scoring the target label accuracy of the target user according to a machine learning model.
6. The method of claim 5, further comprising training the machine learning model by:
selecting, from the target users, users whose high-frequency word occurrence probability is larger than a threshold value as positive samples, and determining users whose labeling data contains no specific keyword as negative samples;
splicing the positive samples and the negative samples with a pre-constructed feature library to obtain a training set; and
training on the training set based on a machine learning algorithm to obtain the machine learning model.
7. The method according to any one of claims 1 to 6, wherein:
the number of the specific keywords is greater than the number of the seed keywords, and the specific keywords comprise the seed keywords; and
the number of the target users is greater than the number of the seed users, and the target users comprise the seed users.
8. A user tag mining device, comprising:
a seed keyword determining unit, configured to determine at least one seed keyword of a target label;
a seed user determining unit, configured to match the seed keyword in user labeling data to determine a seed user;
a specific keyword determining unit, configured to perform high-frequency word statistics on the user labeling data of the seed user, and to determine and screen out a plurality of specific keywords; and
a target user determining unit, configured to match the specific keywords in the user labeling data to determine a target user.
9. The device of claim 8, wherein the seed user determining unit comprises:
an initial seed user determining subunit, configured to match the seed keyword in the user labeling data to determine initial seed users whose user labeling data contains the seed keyword; and
a seed user screening subunit, configured to count the number of times each initial seed user matches the seed keyword, and to select the initial seed users whose count is larger than a threshold value as the seed users.
10. The device of claim 8, wherein the specific keyword determining unit comprises:
a high-frequency word counting subunit, configured to perform word segmentation on the user labeling data of the seed user, count the word frequency of each word, and retain high-frequency words whose word frequency is larger than a threshold value; and
a specific keyword screening subunit, configured to remove words irrelevant to the target label from the high-frequency words to obtain the specific keywords.
11. The device of claim 8, wherein the target user determining unit comprises:
an initial target user determining subunit, configured to match the specific keywords in the user labeling data to determine initial target users whose user labeling data contains the specific keywords; and
a target user screening subunit, configured to count the number of times each initial target user matches the specific keywords, and to select the initial target users whose count is larger than a threshold value as the target users.
12. The device of claim 8, further comprising:
a target user scoring unit, configured to score the target label accuracy of the target user according to a machine learning model.
13. The device of claim 12, further comprising a machine learning model training unit, the machine learning model training unit comprising:
a positive and negative sample determining subunit, configured to select, from the target users, users whose high-frequency word occurrence probability is larger than a threshold value as positive samples, and to determine users whose labeling data contains no specific keyword as negative samples;
a training set constructing subunit, configured to splice the positive samples and the negative samples with a pre-constructed feature library to obtain a training set; and
a training subunit, configured to train on the training set based on a machine learning algorithm to obtain the machine learning model.
14. The device according to any one of claims 8 to 13, wherein:
the number of the specific keywords is greater than the number of the seed keywords, and the specific keywords comprise the seed keywords; and
the number of the target users is greater than the number of the seed users, and the target users comprise the seed users.
15. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 7 when executing the program.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
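
For illustration only, the keyword expansion recited in claims 1 to 4 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the sample data, the threshold values, and the hand-supplied list of irrelevant words are all hypothetical, and whitespace splitting stands in for the word segmentation step (Chinese text would need a real segmenter).

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for word segmentation.
    return text.lower().split()


def match_users(labeling_data, keywords, min_hits):
    """Return users whose labeling data matches the keywords more than min_hits times."""
    selected = set()
    for user, records in labeling_data.items():
        hits = sum(1 for rec in records for tok in tokenize(rec) if tok in keywords)
        if hits > min_hits:
            selected.add(user)
    return selected


def mine_specific_keywords(labeling_data, seed_users, min_freq, irrelevant):
    """Count word frequencies over the seed users' data and keep relevant high-frequency words."""
    counts = Counter()
    for user in seed_users:
        for rec in labeling_data[user]:
            counts.update(tokenize(rec))
    return {w for w, c in counts.items() if c > min_freq and w not in irrelevant}


# Hypothetical labeling data: user id -> list of labeled text records.
labeling_data = {
    "u1": ["baby stroller purchase", "baby diapers order", "stroller refund"],
    "u2": ["baby formula", "diapers bulk order", "baby stroller"],
    "u3": ["diapers and stroller", "stroller accessories"],
    "u4": ["office chair", "laptop order"],
}
seed_keywords = {"baby"}

# Step 1: seed keywords -> seed users (more than 1 match).
seed_users = match_users(labeling_data, seed_keywords, min_hits=1)

# Step 2: high-frequency words of the seed users, minus irrelevant words.
# The seed keywords are kept in the result (compare claim 7).
specific_keywords = mine_specific_keywords(
    labeling_data, seed_users, min_freq=1, irrelevant={"order"}
) | seed_keywords

# Step 3: specific keywords -> target users (more than 1 match).
target_users = match_users(labeling_data, specific_keywords, min_hits=1)
```

In this toy data, user u3 never mentions the seed keyword but is still reached through the expanded keyword set, which is the effect the expansion step is meant to produce. The scoring and training steps of claims 5 and 6 are not covered by this sketch.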

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910735347.3A CN110598091A (en) 2019-08-09 2019-08-09 User tag mining method, device, server and readable storage medium

Publications (1)

Publication Number Publication Date
CN110598091A true CN110598091A (en) 2019-12-20

Family

ID=68853804

Country Status (1)

Country Link
CN (1) CN110598091A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198992A (en) * 2020-01-07 2020-05-26 精硕科技(北京)股份有限公司 Identification method and identification device for mother and infant crowd, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191220
