CN111046300A - Method and device for determining crowd attributes of users - Google Patents

Method and device for determining crowd attributes of users

Info

Publication number
CN111046300A
Authority
CN
China
Prior art keywords
topic
user
cluster
users
topics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911299437.9A
Other languages
Chinese (zh)
Inventor
刘欣益
孙付伟
王政英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co ltd
Priority to CN201911299437.9A
Publication of CN111046300A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the disclosure provides a method and a device for determining user crowd attributes, relates to the technical field of internet, and is used for solving the problem that the user crowd attributes obtained based on a machine model are inaccurate in the prior art. The method comprises the following steps: establishing a topic network according to the topic data; clustering topics in the topic network to form topic clusters; determining a connection weight between the user and a topic in the topic network according to the user behavior data; constructing a relation graph between the user and the topic cluster according to the connection weight between the user and the topic in the topic network; and determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters.

Description

Method and device for determining crowd attributes of users
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for determining a crowd attribute of a user.
Background
If the content platform can know the crowd attributes (such as programmers and product managers) of the users, services such as accurate content recommendation and advertisement recommendation can be performed on the users. The demographic attributes of the user are therefore essential features for the content platform. However, most content platforms do not require the user to fill in relevant information, which makes it difficult to obtain the user's crowd attributes.
The prior art mainly determines crowd attributes through supervised learning on labeled data. For example, labeled training data is mined, and a model is built using a supervised machine learning algorithm such as gradient boosted decision trees (GBDT) or a supervised deep learning method. The model is then trained on the labels in the training data, and the trained model is applied to the full data to predict the corresponding crowd labels. However, because some sites do not require users to fill in personal information (such as profession), labeled data is very difficult to acquire and can only be roughly mined from signals such as users' browsing behavior. With this approach it is difficult to obtain a large amount of accurate labeled data, so it is difficult to train a highly accurate machine learning model, and the crowd attributes obtained from such a model are therefore inaccurate.
Disclosure of Invention
In view of the above, an object of the embodiments of the present disclosure is to provide a method and an apparatus for determining a user's crowd attribute, so as to solve the problem in the prior art that the user's crowd attribute obtained based on a machine model is inaccurate.
According to a first aspect of the present disclosure, there is provided a method of determining a demographic property of a user, comprising: establishing a topic network according to the topic data; clustering topics in the topic network to form topic clusters; determining a connection weight between a user and a topic in the topic network according to user behavior data; constructing a relation graph between users and topic clusters according to the connection weight between the users and the topics in the topic network; and determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters.
In one possible embodiment, wherein said establishing a topic network comprises: calculating the connection weight between any two topics; and constructing a topic network according to the connection weight between any two topics.
In one possible embodiment, wherein the clustering forms a topic cluster, comprising: inputting the topic pairs with connection relations in the topic network into a clustering algorithm, outputting topic clusters to which the topics belong, and removing abnormal topics from each topic cluster.
In one possible embodiment, wherein the removing of the abnormal topic from each topic cluster comprises: calculating the score of each topic in each topic cluster; and removing the topics with the scores less than or equal to the threshold value from the topic cluster.
In one possible embodiment, the determining the crowd attribute of the user according to the relationship graph between the user and the topic cluster includes: determining a user topic cluster list according to the relation graph between the user and the topic cluster; and determining the crowd attribute of the user according to the intersection number of the target topic cluster label and the topic cluster label in the user topic cluster list, wherein the target topic cluster label is searched in the topic cluster label list according to the target crowd.
In one possible embodiment, the determining a user topic cluster list according to the relationship graph between the user and the topic cluster includes: calculating the scores of the users and the topic clusters to which the users belong according to the relation graph between the users and the topic clusters; and sequencing the scores of the user and the topic cluster to which the user belongs to obtain a user topic cluster list.
In one possible embodiment, the determining the crowd attribute of the user according to the relationship graph between the user and the topic cluster includes: inputting the relation graph between the user and the topic cluster into a spectral clustering algorithm, and outputting a clustering result of the user and the topic cluster; and determining the crowd attributes of the users according to the clustering results of the users and the topic clusters.
According to a second aspect of the present disclosure, there is provided an apparatus for determining a demographic property of a user, comprising: a topic network establishing module configured to establish a topic network from the topic data; a topic cluster forming module configured to cluster topics in the topic network into topic clusters; a connection weight determination module configured to determine a connection weight between a user and a topic in the topic network from user behavior data; the relation graph building module is further configured to build a relation graph between users and the topic clusters according to the connection weight between the users and the topics in the topic network; and the crowd attribute determining module is configured to determine the crowd attributes of the users according to the relation graph between the users and the topic clusters.
In one possible embodiment, wherein the topic network establishment module is specifically configured to: calculating the connection weight between any two topics; and constructing a topic network according to the connection weight between any two topics.
In one possible embodiment, wherein the topic cluster forming module is specifically configured to: inputting the topic pairs with connection relations in the topic network into a clustering algorithm, outputting topic clusters to which the topics belong, and removing abnormal topics from each topic cluster.
In one possible embodiment, the apparatus further includes: a topic score calculation module configured to calculate the score of each topic in each topic cluster; and the topic cluster forming module is further configured to remove, from the topic cluster, topics whose score is less than or equal to a threshold value.
In one possible embodiment, the crowd-attribute determining module is specifically configured to: determining a user topic cluster list according to the relation graph between the user and the topic cluster; and determining the crowd attribute of the user according to the intersection number of the target topic cluster label and the topic cluster label in the user topic cluster list, wherein the target topic cluster label is searched in the topic cluster label list according to the target crowd.
In a possible embodiment, the crowd-attribute determining module is further specifically configured to: calculating the scores of the users and the topic clusters to which the users belong according to the relation graph between the users and the topic clusters; and sequencing the scores of the user and the topic cluster to which the user belongs to obtain a user topic cluster list.
In a possible embodiment, the crowd-attribute determining module is further specifically configured to: inputting the relation graph between the user and the topic cluster into a spectral clustering algorithm, and outputting a clustering result of the user and the topic cluster; and determining the crowd attributes of the users according to the clustering results of the users and the topic clusters.
According to a third aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory, wherein the memory stores instructions that, when executed, cause the processor to perform the method according to the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed, implement the method according to the first aspect of the present disclosure.
According to the method and the device for determining the crowd attributes of the users, firstly, a topic network is established according to topic data; clustering topics in the topic network to form topic clusters; then, determining the connection weight between the user and the topic in the topic network according to the user behavior data; constructing a relation graph between the user and the topic cluster according to the connection weight between the user and the topic in the topic network; and determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters. Because the scheme is based on an unsupervised algorithm, the division of the crowd can be realized without marked data as long as the user has a behavior in the station, and the accuracy of the result of dividing the crowd attribute of the user is improved.
In order to make the aforementioned and other objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the embodiments of the present disclosure will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present disclosure and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings may be obtained from the drawings without inventive effort.
FIG. 1 illustrates a flow of crowd division according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a method of determining a demographic property of a user in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a correspondence diagram of questions to topics in an embodiment of the present disclosure;
FIG. 4 illustrates a community relationship network diagram in an embodiment of the present disclosure;
FIG. 5 illustrates a relationship diagram between users and topic clusters in an embodiment of the disclosure;
FIG. 6 is a schematic diagram illustrating an apparatus for determining attributes of a user population according to an embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of clearly describing the technical solutions of the embodiments of the present invention, in the embodiments of the present invention, the words "first", "second", and the like are used to distinguish the same items or similar items with basically the same functions or actions, and those skilled in the art can understand that the words "first", "second", and the like do not limit the quantity and execution order.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
The term "comprises/comprising" when used herein refers to the presence of a feature, element or component, but does not preclude the presence or addition of one or more other features, elements or components.
The method for determining the crowd attributes of the users provided by the embodiment of the disclosure comprises the steps of firstly, establishing a topic network according to topic data; clustering topics in the topic network to form topic clusters; then, determining the connection weight between the user and the topic in the topic network according to the user behavior data; constructing a relation graph between the user and the topic cluster according to the connection weight between the user and the topic in the topic network; and determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters. Because the scheme is based on an unsupervised algorithm, the division of the crowd can be realized without marked data as long as the user has a behavior in the station, and the accuracy of the result of dividing the crowd attribute of the user is improved. Embodiments of the present disclosure and their advantages are described in detail below with reference to the accompanying drawings.
The prior art mainly determines crowd attributes through supervised learning on labeled data. For example, labeled training data is mined, and a model is built using a supervised machine learning algorithm such as gradient boosted decision trees (GBDT) or a supervised deep learning method. The model is then trained on the labels in the training data, and the trained model is applied to the full data to predict the corresponding crowd labels. However, labeled data is difficult to obtain because some sites do not require users to fill in personal information (e.g., profession), so it can only be roughly mined from signals such as users' browsing behavior. Because a large amount of accurate labeled data is hard to obtain, it is difficult to train a highly accurate machine learning model, and the crowd attributes obtained from such a model are inaccurate.
In order to solve the above problem, the present disclosure provides a crowd division flow. As shown in FIG. 1, the flow mainly includes the following: on-site topic network construction -> topic network clustering -> construction of the bipartite graph of topic clusters and users -> crowd division based on the bipartite graph. A specific implementation process is described below with reference to FIG. 1.
FIG. 2 illustrates a flow chart of a method of determining a demographic property of a user in an embodiment of the present disclosure. As shown in fig. 2, the method includes:
201. Establishing a topic network according to the topic data.
Illustratively, the topic data described above includes topic tags used to describe questions. Topics are typically tightly bound to questions, and a question is usually bound to more than one topic. When a user binds topics to a question, the topics most related to the content are usually chosen, so topics appearing under the same question usually have some relationship to one another. They may belong to the same category, or they may stand in a parent-child relationship in the topic hierarchy. For example, consider the question: "What are the classical target tracking algorithms currently available in computer vision?" The corresponding topics may include the following tags: machine learning, image recognition, computer vision, machine vision, and deep learning. For details, refer to the correspondence diagram between questions and topics shown in FIG. 3. In computer-vision-related questions, topics such as machine vision, machine learning, and deep learning generally appear together. In this example, machine learning and deep learning should be parent topics of computer vision. Therefore, a topic network can be constructed from topic co-occurrence under the same question.
Preferably, the step 201 includes the following steps: calculating the connection weight between any two topics; and constructing a topic network according to the connection weight between any two topics.
For example, calculating the connection weight between topic a and topic B may be as follows:
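$$w(A \to B) = \frac{|Q(A) \cap Q(B)|}{|Q(A)|} \qquad \text{(Formula I)}$$
where $Q(X)$ denotes the set of questions containing topic $X$.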
The weight of topic A -> topic B is defined as the proportion of questions containing both topic A and topic B among all questions containing topic A. According to this definition, A -> B and B -> A may have different connection weights. The co-occurrence relationships of all topics on the site are mined according to this formula, and the connection weights are calculated. In some embodiments, after edges with too low a weight are filtered out, the directed graph is converted into an undirected graph, which is the final topic network. For example, the final topic network may contain as many as 60,000 topics and 300,000 edges.
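The weight definition above can be illustrated with a minimal Python sketch. The function name `build_topic_network`, the `min_weight` cutoff, and the rule for merging the two directions into one undirected edge (keep the edge if either direction survives filtering, using the larger weight) are assumptions made for illustration; the text does not specify how the directed graph is converted.

```python
from collections import Counter
from itertools import combinations

def build_topic_network(question_topics, min_weight=0.1):
    """Return (directed weights, undirected edges) from per-question topic lists."""
    topic_count = Counter()   # number of questions containing each topic
    pair_count = Counter()    # number of questions containing both topics of a pair
    for topics in question_topics:
        unique = sorted(set(topics))
        topic_count.update(unique)
        pair_count.update(combinations(unique, 2))

    # w(A -> B) = |questions with A and B| / |questions with A|
    directed = {}
    for (a, b), both in pair_count.items():
        directed[(a, b)] = both / topic_count[a]
        directed[(b, a)] = both / topic_count[b]

    # Assumed conversion rule: drop low-weight directions, then keep one
    # undirected edge per pair with the larger surviving weight.
    undirected = {}
    for (a, b), w in directed.items():
        if w >= min_weight:
            key = tuple(sorted((a, b)))
            undirected[key] = max(w, undirected.get(key, 0.0))
    return directed, undirected

# Toy data: each inner list holds the topics bound to one question.
questions = [["cv", "ml"], ["cv", "ml", "dl"], ["ml"], ["ml", "dl"]]
directed, undirected = build_topic_network(questions, min_weight=0.6)
```

In this toy data every question containing `cv` also contains `ml`, while only half of the `ml` questions contain `cv`, so the two directions receive different weights, as the definition predicts.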
202. Clustering the topics in the topic network to form topic clusters.
Illustratively, the step 202 includes the following steps: the topic pairs having connection relations in the topic network (for example, an undirected graph of the topic network is obtained in step 201) are input into a clustering algorithm, topic clusters to which the topics belong are output, and abnormal topics are removed from each topic cluster.
Optionally, after the topic network is obtained, it is clustered using an existing network clustering (e.g., community discovery) algorithm. Many such algorithms exist and can be applied directly, but the chosen algorithm needs to satisfy the following requirements: 1. the algorithm iterates quickly; 2. the same topic can be assigned to more than one cluster; 3. the number of topics per cluster is distributed fairly evenly. To satisfy these requirements, the present disclosure selects the BigCLAM algorithm. The algorithm assumes that a network can be generated from a bipartite graph of community affiliations; that is, given all the nodes of the network and the communities they belong to, a network can be generated. Conversely, given a network, the affiliations between its nodes and the communities can be inferred.
FIG. 4 shows a community relationship network diagram according to an embodiment of the disclosure. The circular nodes in the upper half of the graph represent communities and the square nodes in the lower half represent nodes in the network. The corresponding community relationships can be inferred from the existing topic network, which yields the clustering result for the topics. A community is a topic cluster. As shown in FIG. 4, the square nodes are clustered into three communities A, B, and C, and each square node belongs to each community with probability $p_A$, $p_B$, and $p_C$, respectively. On this basis, the BigCLAM algorithm is used to obtain the topic clusters: its input is the topic pairs having a connection relationship, and its output is the clusters to which each topic belongs, that is, a list of topic clusters each formed by a plurality of topics. For example, the algorithm completes the clustering of the above 60,000 topics and 300,000 edges in 10 seconds, finally yielding 2,500 topic clusters.
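As background on the generative assumption, the published BigCLAM model gives each node a non-negative affiliation strength per community and links two nodes u and v with probability 1 - exp(-F_u · F_v). The Python sketch below only illustrates that link probability and overlapping membership; the affiliation values and the membership threshold are made up for illustration, and fitting the strengths to a real topic network (the part the algorithm actually optimizes) is omitted.

```python
import math

def edge_probability(f_u, f_v):
    """BigCLAM link probability between two nodes: 1 - exp(-F_u . F_v)."""
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1.0 - math.exp(-dot)

def communities_of(f_u, threshold):
    """Indices of communities whose strength reaches the (hypothetical) threshold."""
    return [c for c, strength in enumerate(f_u) if strength >= threshold]

# Hypothetical affiliation strengths of three topics over communities A, B, C.
affiliations = {
    "machine learning": [1.5, 0.0, 0.9],
    "deep learning": [1.2, 0.0, 0.0],
    "pet": [0.0, 1.4, 0.0],
}

p_related = edge_probability(affiliations["machine learning"], affiliations["deep learning"])
p_unrelated = edge_probability(affiliations["machine learning"], affiliations["pet"])
ml_communities = communities_of(affiliations["machine learning"], threshold=0.5)
```

Here "machine learning" is assigned to two communities at once, which corresponds to requirement 2 above (the same topic may fall into different clusters), and topics that share no community have zero link probability.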
For example, after the topics in the topic network are clustered by the above algorithm, a topic cluster may contain abnormal topics that do not belong to its category, and these outliers need to be removed. Optionally, removing the abnormal topics from each topic cluster may specifically include: calculating the score of each topic in each topic cluster; and removing the topics whose scores are less than or equal to the threshold value from the topic cluster. The topic scores are calculated as follows:
Given a topic cluster category C, the score of any topic $t_i$ in C is:
[Formula II: equation image not reproduced in the source text]
In Formula II, $t_i$ in the numerator is any topic in the topic cluster category C, $t_i$ and $t_j$ in the denominator are any two topics in the topic cluster category C, and n is the number of topics in the topic cluster category C.
Optionally, before removal, the topics in a cluster may be sorted by score in ascending order, and the topics whose scores do not exceed the threshold are then removed from the category according to the sorted result.
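The sort-then-filter step can be sketched in Python as follows. Because the scoring formula itself is not reproduced in this text, the sketch takes precomputed scores as input; the score values, the threshold, and the function name `remove_abnormal_topics` are hypothetical.

```python
def remove_abnormal_topics(topic_scores, threshold):
    """Sort topics by ascending score and split them at the threshold.

    Topics with score <= threshold are treated as abnormal and removed.
    Returns (kept, removed), both ordered by ascending score.
    """
    ranked = sorted(topic_scores.items(), key=lambda item: item[1])
    removed = [topic for topic, score in ranked if score <= threshold]
    kept = [topic for topic, score in ranked if score > threshold]
    return kept, removed

# Hypothetical scores for the topics of one cluster.
cluster_scores = {
    "computer vision": 0.31,
    "machine vision": 0.28,
    "image recognition": 0.33,
    "cooking": 0.02,
}
kept, removed = remove_abnormal_topics(cluster_scores, threshold=0.05)
```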
203. Determining a connection weight between the user and a topic in the topic network from the user behavior data.
By way of example, the user behavior data described above may include, but is not limited to: topics and questions that the user follows, answers and articles that the user has liked, answers and articles that the user has collected, and answers and articles that the user has created. The connection weight between a user and a topic in the topic network is the sum of the weights of all behaviors that the user has generated with respect to that topic.
At present, interaction between a user and topics arises mainly through the user behaviors above. Since all topics, questions, answers, and articles can be mapped to the topic dimension, the interaction between a user and the various kinds of content on the site can be converted into interaction between the user and topics, and the frequency of that interaction is then used to determine the connection weight. In particular, different behaviors may be assigned different behavior weights. For example, a user who actively creates content on a topic can be considered to have a strong interest in it, so the creation behavior may be weighted higher than the other behaviors. Suppose the weight of the creation behavior is set to 6, the collection behavior to 5, the like behavior to 4, and the follow behavior to 3. A certain user A has created 2 articles related to machine learning and 1 related to deep learning; collected 2 pieces of content related to deep learning and 1 related to pets; liked 1 article related to deep learning and 2 related to machine learning; and followed 3 pet-related items and 1 machine-learning-related item. Then the connection weight between user A and the machine learning topic is 6 × 2 + 4 × 2 + 3 × 1 = 23, the connection weight between user A and the deep learning topic is 6 × 1 + 5 × 2 + 4 × 1 = 20, and the connection weight between user A and the pet topic is 5 × 1 + 3 × 3 = 14.
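The arithmetic for user A can be reproduced with a short Python sketch. The behavior weights and counts come from the example above; the behavior keys (`create`, `collect`, `like`, `follow`) and the `(behavior, topic, count)` record layout are illustrative choices, not something the text prescribes.

```python
from collections import defaultdict

# Behavior weights taken from the example in the text.
BEHAVIOR_WEIGHTS = {"create": 6, "collect": 5, "like": 4, "follow": 3}

def user_topic_weights(behaviors):
    """Sum behavior_weight * count per topic for one user."""
    weights = defaultdict(int)
    for behavior, topic, count in behaviors:
        weights[topic] += BEHAVIOR_WEIGHTS[behavior] * count
    return dict(weights)

# User A's behavior counts from the example: (behavior, topic, count).
user_a_behaviors = [
    ("create", "machine learning", 2),
    ("create", "deep learning", 1),
    ("collect", "deep learning", 2),
    ("collect", "pet", 1),
    ("like", "deep learning", 1),
    ("like", "machine learning", 2),
    ("follow", "pet", 3),
    ("follow", "machine learning", 1),
]
user_a_weights = user_topic_weights(user_a_behaviors)
```

The resulting dictionary holds 23 for machine learning, 20 for deep learning, and 14 for pet, matching the worked figures.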
204. Constructing a relationship graph between the user and the topic clusters according to the connection weight between the user and the topics in the topic network.
Illustratively, the topics are connected, and are clustered into topic clusters through a clustering algorithm. There is a connection between the user and the topic, so the connection relationship between the user and the topic cluster can be calculated in a weighted manner, and thus a relationship diagram between the user and the topic cluster can be obtained, specifically referring to the content shown in fig. 5. The connection relation between the user and the topic cluster refers to the score of the user and the topic cluster.
205. Determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters.
Illustratively, step 205 may be implemented in a rule-based manner, which specifically includes the following: determining a user topic cluster list according to the relationship graph between the user and the topic clusters; and determining the crowd attribute of the user according to the number of labels in the intersection between the target topic cluster labels and the topic cluster labels in the user topic cluster list, wherein the target topic cluster labels are looked up in the topic cluster label list according to the target crowd. Optionally, the target crowd may be set as required, for example as a gaming crowd or a programmer crowd. The topic cluster label list is obtained from the topic clusters output by the clustering algorithm when the topic clusters are formed.
For example, the determining the user topic cluster list according to the relationship graph between the user and the topic cluster specifically includes the following contents: calculating the scores of the users and the topic clusters to which the users belong according to a relation graph between the users and the topic clusters; and sequencing the scores of the user and the topic cluster to which the user belongs to obtain a user topic cluster list.
As shown in fig. 5, scores of the user and the topic cluster to which the user belongs may be calculated according to a relationship diagram between the user and the topic cluster, and then the scores are sorted to obtain a user topic cluster list, which may specifically refer to the right part of content in fig. 5. The user topic cluster list includes the corresponding relationship among the users, the topic clusters to which the users belong and the scores, for example: user 1 scores 1 for category 1 and 2 for category 2, respectively; user 2 has a score of 1 for category 1 and a score of 2 for category 2, respectively.
For example, the score of the user and the topic cluster to which the user belongs can be calculated by the following formula:
[Formula III: equation image not reproduced in the source text]
In Formula III, $C_i$ is topic i of topic cluster C, and n is the number of topics in topic cluster C. Some topic clusters contain many topics and others contain few. A topic cluster containing more topics has a relatively high probability of interacting with the user, which inflates the score between the user and that topic cluster. Conversely, a user may be highly interested in a topic cluster that contains few topics, for example by interacting with every topic in the cluster; such a topic cluster should naturally score higher. Therefore, the influence of the size of the topic cluster (i.e., the number of topics it contains) needs to be eliminated when calculating the score between the user and the topic cluster.
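One simple way to remove the effect of cluster size, shown here purely as an assumption because the score formula is not reproduced in this text, is to average the user's topic connection weights over the n topics of the cluster. The Python sketch below uses that mean normalization; the cluster names and their topic memberships are hypothetical, while the user weights reuse the earlier user A example.

```python
def user_cluster_scores(user_topic_weight, clusters):
    """Score each topic cluster for one user and return a ranked list.

    Assumed scoring rule: mean connection weight over the cluster's topics,
    so that large clusters do not score higher merely because of their size.
    """
    scores = {}
    for name, topics in clusters.items():
        total = sum(user_topic_weight.get(topic, 0) for topic in topics)
        if total > 0:  # keep only clusters the user has interacted with
            scores[name] = total / len(topics)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Connection weights of user A from the earlier worked example.
user_a = {"machine learning": 23, "deep learning": 20, "pet": 14}
# Hypothetical topic clusters.
clusters = {
    "ai": ["machine learning", "deep learning"],
    "animals": ["pet", "cat", "dog"],
    "sports": ["football"],
}
ranked = user_cluster_scores(user_a, clusters)
```

Under this assumption the sorted result plays the role of the user topic cluster list: the `ai` cluster ranks first because the user interacts with all of its topics, while `animals` is diluted by the two topics the user never touched.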
A specific example is given below to describe how, in step 205, the crowd attribute of the user is determined according to the number of labels in the intersection between the target topic cluster labels and the topic cluster labels in the user topic cluster list.
For example, suppose a site has 10 users and 6 topic clusters. The topic cluster label list is: label1: game; label2: game; label3: programmer; label4: programmer; label5: programmer; label6: pet. The user topic cluster list is: user 1: [label1, label2, label3]; user 2: [label3, label4, label5]; user 3: [label1, label6]; user 4: [label1, label3, label4, label5]; user 5: [label4, label5, label6]; ...; user 10: [label3]. The current requirement is to select the people who like games, which can be done as follows:
Step 1: view the topic cluster label list and find the game-related labels, which are [label1, label2].
Step 2: view the user topic cluster list according to the labels found, and determine the intersection between the topic cluster labels in each user's list and the game-related labels. The users whose intersection count is ≥ 1 are user 1, user 3 and user 4. Therefore, user 1, user 3 and user 4 are given the crowd label 'hobby games', which determines the crowd attribute of these users.
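The two steps of this example can be sketched in a few lines of Python. The data are taken from the example; the function name `select_crowd` and the `min_overlap` parameter are hypothetical, and the sketch keeps every user with at least one matching label, which yields users 1, 3 and 4 as in the example.

```python
# Topic cluster label list from the example (label id -> label text).
cluster_labels = {
    "label1": "game", "label2": "game",
    "label3": "programmer", "label4": "programmer", "label5": "programmer",
    "label6": "pet",
}

# User topic cluster list; users 6 to 9 are elided in the example and omitted here.
user_clusters = {
    "user 1": ["label1", "label2", "label3"],
    "user 2": ["label3", "label4", "label5"],
    "user 3": ["label1", "label6"],
    "user 4": ["label1", "label3", "label4", "label5"],
    "user 5": ["label4", "label5", "label6"],
    "user 10": ["label3"],
}


def select_crowd(target, cluster_labels, user_clusters, min_overlap=1):
    """Return the users whose cluster labels overlap the target labels."""
    # Step 1: find the labels related to the target crowd.
    target_labels = {label for label, text in cluster_labels.items() if text == target}
    # Step 2: keep users whose intersection with the target labels is large enough.
    return [
        user
        for user, labels in user_clusters.items()
        if len(target_labels & set(labels)) >= min_overlap
    ]


game_users = select_crowd("game", cluster_labels, user_clusters)
```

Raising `min_overlap` tightens the rule; with a value of 2 only user 1, who holds both game-related labels, would be selected.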
Illustratively, step 105 may also be implemented with an algorithm-based approach, which specifically includes: inputting the relationship graph between users and topic clusters into a spectral clustering algorithm and outputting a clustering result of the users and the topic clusters; and determining the crowd attributes of the users according to that clustering result.
When an algorithm is used to determine the crowd attributes of users, two considerations apply: 1. because the total number of users on the site is huge, the clustering algorithm needs good time efficiency and space efficiency; 2. in order to attach crowd labels to the clustering result, the clustering algorithm must be able to cluster the users and the topic clusters at the same time. Based on these two considerations, a spectral clustering algorithm was selected. Spectral clustering is an algorithm that evolved from graph theory and is widely used in clustering. Its main idea is to treat all data as points in space that can be connected by edges, with the weight of each edge determined by the distance between the points it connects. The graph formed by all the data is then cut so that the sum of edge weights between different subgraphs is as low as possible and the sum of edge weights within each subgraph is as high as possible, which achieves the purpose of clustering. Applying spectral clustering to the relationship graph in the present application requires some modification of the algorithm because of the large data volume. In actual implementation, the relationship graph between users and topic clusters over the full data reaches the ten-million level, so the algorithm has to be developed with big data tools such as Spark. Accordingly, a Spark-based bipartite graph clustering algorithm was implemented during development, and Spark's built-in sparse matrix multiplication was improved. Spectral clustering of the relationship graph data between all users and the topic clusters can then be completed within one hour, which is fast.
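The idea of clustering users and topic clusters at the same time can be illustrated on a toy bipartite graph. The sketch below is not the Spark implementation described above; it is a small dense NumPy version of a standard bipartite spectral partition (normalize the weight matrix by row and column degrees, take the second pair of singular vectors, and split by sign), limited to two groups. The function name and the sample matrix are hypothetical.

```python
import numpy as np


def bipartite_spectral_bipartition(weights):
    """Split users (rows) and topic clusters (columns) into two joint groups.

    Assumes every row and every column has a positive total weight.
    """
    a = np.asarray(weights, dtype=float)
    row_deg = a.sum(axis=1)
    col_deg = a.sum(axis=0)
    # Degree-normalized matrix D1^{-1/2} A D2^{-1/2}.
    a_norm = a / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    u, _, vt = np.linalg.svd(a_norm, full_matrices=False)
    # The first singular pair is trivial; the second pair carries the cut.
    row_embed = u[:, 1] / np.sqrt(row_deg)
    col_embed = vt[1, :] / np.sqrt(col_deg)
    return (row_embed > 0).astype(int), (col_embed > 0).astype(int)


# Hypothetical user-by-topic-cluster scores: users 0 and 1 mostly interact
# with clusters 0 and 1, users 2 and 3 mostly with cluster 2.
scores = [
    [5.0, 4.0, 0.1],
    [4.0, 5.0, 0.1],
    [0.1, 0.1, 5.0],
    [0.1, 0.2, 4.0],
]
user_groups, cluster_groups = bipartite_spectral_bipartition(scores)
```

Because users and topic clusters receive group ids from the same singular pair, each user group comes with the topic clusters that characterize it, which is what allows a crowd label to be attached to the result. More than two groups would typically require several singular vectors followed by a k-means step, and data at the scale described above would require a distributed sparse implementation.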
The improved Spark-based clustering algorithm is more efficient and can cluster the full data in a short time. Compared with a traditional supervised scheme of training and prediction, it reduces the time overhead by nearly 80 percent. The crowd attributes of many users can be obtained directly from the spectral clustering algorithm, so users can be labeled, and one user can receive multiple labels, which improves both the accuracy and the operation rate.
In summary, the rule-based and algorithm-based methods of dividing users by crowd attribute in this disclosure provide a solution for rapidly mining a single crowd and also meet the requirement of dividing users into crowds in batches. No model needs to be trained repeatedly to predict data, so various requirements can be responded to quickly.
An apparatus for determining the crowd attribute of a user provided by the embodiment of the present disclosure will be described below based on the related description in the embodiment of the method for determining the crowd attribute of a user corresponding to fig. 2. For technical terms, concepts, and the like in the following embodiments that relate to the above embodiments, reference may be made to the description of the above embodiments.
Fig. 6 is a schematic structural diagram of an apparatus for determining a crowd attribute of a user according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes: a topic network establishing module 61, a topic cluster forming module 62, a connection weight determining module 63, a relationship graph building module 64 and a crowd attribute determining module 65, wherein: the topic network establishing module 61 is configured to establish a topic network from the topic data; the topic cluster forming module 62 is configured to cluster topics in the topic network into topic clusters; the connection weight determining module 63 is configured to determine a connection weight between the user and a topic in the topic network from the user behavior data; the relationship graph building module 64 is configured to build a relationship graph between the users and the topic clusters according to the connection weights between the users and the topics in the topic network; and the crowd attribute determining module 65 is configured to determine the crowd attributes of the users according to the relationship graph between the users and the topic clusters.
Illustratively, the topic network establishing module 61 is specifically configured to: calculating the connection weight between any two topics; and constructing a topic network according to the connection weight between any two topics.
For example, calculating the connection weight between topic a and topic B may be as follows:
Figure BDA0002321491180000121
The weight of topic A -> topic B is defined as the proportion of questions containing both topic A and topic B among all questions containing topic A. Under this definition, A -> B and B -> A can have different connection weights. The co-occurrence relations of topics across the whole site are mined according to this formula, and the connection weights are calculated. After edges whose weights are too low are filtered out, the directed graph is converted into an undirected graph, which is the final topic network. For example, the final topic network may contain up to 6 million topics and 30 million edges.
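The weight definition and the filtering step can be sketched as follows. The directed weight follows the definition in the text (questions containing both topics divided by questions containing the source topic). How the directed graph is converted into an undirected one is not specified, so the sketch assumes that an undirected edge is kept when at least one direction survives the filter and that it takes the larger of the surviving weights. All names and the sample questions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def directed_weights(questions):
    """Map (a, b) to the share of questions containing a that also contain b."""
    topic_count = Counter()
    pair_count = Counter()
    for topics in questions:
        unique = sorted(set(topics))
        topic_count.update(unique)
        pair_count.update(combinations(unique, 2))
    weights = {}
    for (a, b), both in pair_count.items():
        weights[(a, b)] = both / topic_count[a]
        weights[(b, a)] = both / topic_count[b]
    return weights


def to_undirected(weights, min_weight):
    """Drop low-weight directed edges, then merge the two directions.

    Assumption: the merged edge keeps the larger surviving weight.
    """
    edges = {}
    for (a, b), w in weights.items():
        if w < min_weight:
            continue
        key = tuple(sorted((a, b)))
        edges[key] = max(w, edges.get(key, 0.0))
    return edges


# Hypothetical questions, each tagged with a set of topics.
questions = [{"A", "B"}, {"B"}, {"B", "C"}, {"A", "B"}]
weights = directed_weights(questions)
network = to_undirected(weights, min_weight=0.3)
```

In this sample, topic A always appears together with topic B, while B appears with A in only half of its questions, so the two directions receive different weights.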
Illustratively, the topic cluster formation module 62 is specifically configured to: inputting topic pairs with connection relations in the topic network into a clustering algorithm, outputting topic clusters to which the topics belong, and removing abnormal topics from each topic cluster.
For example, after the topics in the topic network are clustered by the above algorithm, each category inevitably contains some abnormal topics that do not belong to it, and these outliers need to be removed. Optionally, the topic cluster forming module 62 removes abnormal topics from each topic cluster through the following steps: calculating the score of each topic in each topic cluster; and removing the topics whose scores are less than or equal to the threshold from the topic cluster. The score of each topic in a topic cluster is calculated as follows:
given a topic cluster category C, for any topic t_i in it, the score is calculated as follows:
Figure BDA0002321491180000122
wherein, in formula five, t_i in the numerator is any topic in the topic cluster category C, t_i and t_j in the denominator are any two topics in the topic cluster category C, and n is the number of topics in the topic cluster category C.
Optionally, before the abnormal topics are removed, the topics may be sorted by score in ascending order, and the topics whose scores are smaller than the threshold are then removed from the category according to the sorted result.
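The sort-then-remove step can be sketched as below. The topic scores are taken as given rather than computed, since the score formula is not reproduced in this text. The cutoff uses "less than or equal to the threshold", following the steps listed for module 62. The function name, sample scores and threshold are hypothetical.

```python
def remove_abnormal_topics(topic_scores, threshold):
    """Sort topics by score (ascending) and split them at the threshold."""
    ranked = sorted(topic_scores.items(), key=lambda item: item[1])
    removed = [topic for topic, score in ranked if score <= threshold]
    kept = [topic for topic, score in ranked if score > threshold]
    return kept, removed


# Hypothetical scores for one topic cluster about programming.
scores = {"python": 0.42, "java": 0.35, "gardening": 0.03, "compilers": 0.20}
kept, removed = remove_abnormal_topics(scores, threshold=0.05)
```

Sorting first places the removal candidates at the head of the list, which makes it easy to inspect the lowest-scoring topics when choosing a threshold.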
Optionally, the apparatus further comprises: a topic score calculation module 66 configured to calculate a score for each topic in each cluster of topics; and the topic cluster formation module 62 is further configured to cull topics having scores less than or equal to a threshold from the topic cluster.
Illustratively, the crowd attribute determination module 65 is specifically configured to: determining a user topic cluster list according to a relation graph between a user and a topic cluster; and determining the crowd attribute of the user according to the intersection number of the target topic cluster label and the topic cluster label in the user topic cluster list, wherein the target topic cluster label is searched in the topic cluster label list according to the target crowd.
Illustratively, the crowd attribute determination module 65 is further specifically configured to: calculate the scores between the users and the topic clusters to which the users belong according to the relationship graph between the users and the topic clusters; and sort the scores between the users and the topic clusters to which the users belong to obtain a user topic cluster list.
As shown in fig. 5, the score between each user and each topic cluster to which the user belongs may be calculated from the relationship graph between users and topic clusters, and the scores are then sorted to obtain a user topic cluster list (see the right part of fig. 5). The user topic cluster list records the correspondence among users, the topic clusters to which they belong, and the scores. For example: user 1 has a score of 1 for category 1 and a score of 2 for category 2; user 2 has a score of 1 for category 1 and a score of 2 for category 2.
For example, the score of the user and the topic cluster to which the user belongs can be calculated by the following formula:
Figure BDA0002321491180000131
wherein, in formula six above, C_i is topic i of topic cluster C, and n is the number of topics in topic cluster C. Topic clusters differ in size: some contain many topics and some contain few. A topic cluster containing more topics has a relatively high probability of interacting with the user, which inflates the score between the user and that topic cluster. Conversely, a user may be strongly interested in a topic cluster that contains few topics, for example by interacting with every topic in the cluster, and such a topic cluster should naturally score higher. Therefore, the influence of topic cluster size (i.e. the number of topics in the topic cluster) needs to be eliminated when calculating the score between the user and the topic cluster.
Illustratively, the crowd attribute determination module 65 is further specifically configured to: inputting a relation graph between the user and the topic cluster into a spectral clustering algorithm, and outputting a clustering result of the user and the topic cluster; and determining the crowd attributes of the users according to the clustering results of the users and the topic clusters.
In summary, through the rule-based and algorithm-based methods of dividing users by crowd attribute, the crowd attribute determining module 65 can provide a solution for quickly mining a single crowd and can also handle the requirement of dividing users into crowds in batches. No model needs to be trained repeatedly to predict data, so various requirements can be responded to quickly.
The apparatus for determining the crowd attributes of users provided by the embodiment of the present disclosure first establishes a topic network according to topic data; clusters topics in the topic network to form topic clusters; then determines the connection weight between the user and the topics in the topic network according to the user behavior data; constructs a relationship graph between the users and the topic clusters according to the connection weights between the users and the topics in the topic network; and determines the crowd attributes of the users according to the relationship graph between the users and the topic clusters. Because the scheme is based on an unsupervised algorithm, crowd division can be achieved without labeled data as long as the user has behavior on the site, and the accuracy of the crowd attribute division result is improved.
Fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure. As shown in fig. 7, the electronic device includes: a processor (CPU) 701, a memory (ROM) 702, and a computer program stored on the memory and executable on the processor, the CPU 701 implementing the method shown in fig. 2 when executing the program. The CPU 701 can perform various appropriate actions and processes in accordance with a program stored in the read-only memory (ROM) 702 or a program loaded from the storage section 708 into the random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read from it is installed into the storage section 708 as necessary.
The disclosed embodiments provide a computer storage medium comprising computer instructions that, when executed on a computer, cause the computer to perform the method flow as described above. By way of example, computer-readable storage media can be any available media that can be accessed by a computer or a data storage device, such as a server, data center, etc., that includes one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, only the division of the functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. The specific working processes of the system, the device and the unit described above can refer to the corresponding processes in the foregoing method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Claims (10)

1. A method of determining a crowd attribute of a user, comprising:
establishing a topic network according to the topic data;
clustering topics in the topic network to form topic clusters;
determining a connection weight between a user and a topic in the topic network according to user behavior data;
constructing a relation graph between users and topic clusters according to the connection weight between the users and the topics in the topic network; and
determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters.
2. The method of claim 1, wherein the establishing a topic network comprises:
calculating the connection weight between any two topics; and
constructing a topic network according to the connection weight between any two topics.
3. The method of claim 1, wherein the clustering forms a topic cluster comprising:
inputting the topic pairs with connection relations in the topic network into a clustering algorithm, outputting topic clusters to which the topics belong, and removing abnormal topics from each topic cluster.
4. The method of claim 3, wherein the removing of the outlier topic from each topic cluster comprises:
calculating the score of each topic in each topic cluster; and
removing the topics with the scores less than or equal to a threshold value from the topic cluster.
5. The method of claim 1, wherein the determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters comprises:
determining a user topic cluster list according to the relation graph between the user and the topic cluster; and
determining the crowd attribute of the user according to the intersection number of the target topic cluster label and the topic cluster label in the user topic cluster list, wherein the target topic cluster label is searched in the topic cluster label list according to the target crowd.
6. The method of claim 5, wherein the determining a user topic cluster list from a relationship graph between the user and a topic cluster comprises:
calculating the scores of the users and the topic clusters to which the users belong according to the relation graph between the users and the topic clusters; and
sorting the scores of the user and the topic cluster to which the user belongs to obtain a user topic cluster list.
7. The method of claim 1, wherein the determining the crowd attributes of the users according to the relationship graph between the users and the topic clusters comprises:
inputting the relation graph between the user and the topic cluster into a spectral clustering algorithm, and outputting a clustering result of the user and the topic cluster; and
determining the crowd attributes of the users according to the clustering results of the users and the topic clusters.
8. An apparatus for determining a crowd attribute of a user, comprising:
a topic network establishing module configured to establish a topic network from the topic data;
a topic cluster forming module configured to cluster topics in the topic network into topic clusters;
a connection weight determination module configured to determine a connection weight between a user and a topic in the topic network from user behavior data;
a relationship graph building module configured to build a relationship graph between users and the topic clusters according to the connection weight between the users and the topics in the topic network; and
a crowd attribute determining module configured to determine the crowd attributes of the users according to the relationship graph between the users and the topic clusters.
9. An electronic device, comprising:
a processor; and
a memory storing instructions that, when executed, cause the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium storing instructions that, when executed, implement the method of any of claims 1 to 7.
CN201911299437.9A 2019-12-17 2019-12-17 Method and device for determining crowd attributes of users Pending CN111046300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911299437.9A CN111046300A (en) 2019-12-17 2019-12-17 Method and device for determining crowd attributes of users

Publications (1)

Publication Number Publication Date
CN111046300A true CN111046300A (en) 2020-04-21

Family

ID=70235156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911299437.9A Pending CN111046300A (en) 2019-12-17 2019-12-17 Method and device for determining crowd attributes of users

Country Status (1)

Country Link
CN (1) CN111046300A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084422A (en) * 2020-08-31 2020-12-15 腾讯科技(深圳)有限公司 Intelligent processing method and device for account data
CN112084422B (en) * 2020-08-31 2024-05-10 腾讯科技(深圳)有限公司 Account data intelligent processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008121872A1 (en) * 2007-03-30 2008-10-09 Amazon Technologies, Inc. Cluster-based assessment of user interests
CN106055617A (en) * 2016-05-26 2016-10-26 乐视控股(北京)有限公司 Data pushing method and device
CN106126669A (en) * 2016-06-28 2016-11-16 北京邮电大学 User collaborative based on label filters content recommendation method and device
CN107885778A (en) * 2017-10-12 2018-04-06 浙江工业大学 A kind of personalized recommendation method based on dynamic point of proximity spectral clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
贺超波 等: "在线社交网络挖掘综述" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination