CN112307746A - Social network user search intention processing system based on user aggregation topic model - Google Patents


Info

Publication number
CN112307746A
CN112307746A
Authority
CN
China
Prior art keywords
user
distribution
word
search intention
intention
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN202011344972.4A
Other languages
Chinese (zh)
Other versions
CN112307746B (en)
Inventor
石磊
费廷伟
崔斌
段正轩
潘菁菁
Current Assignee (the listed assignees may be inaccurate)
Beijing Jinghang Computing Communication Research Institute
Original Assignee
Beijing Jinghang Computing Communication Research Institute
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Jinghang Computing Communication Research Institute filed Critical Beijing Jinghang Computing Communication Research Institute
Priority to CN202011344972.4A
Publication of CN112307746A
Application granted
Publication of CN112307746B
Status: Active

Classifications

    • G06F 40/216: Parsing using statistical methods (handling natural language data)
    • G06F 16/3344: Query execution using natural language analysis (information retrieval of unstructured textual data)
    • G06F 16/35: Clustering; Classification
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06Q 50/01: Social networking

Abstract

The invention relates to a social network user search intention processing system based on a user aggregation topic model, which comprises: an online social network data acquisition module for acquiring network data from a social network online; a data preprocessing module for cleaning the network data to form a network data set; and a search intention acquisition module for establishing an online social network user aggregation topic model based on Dirichlet distributions and Gibbs sampling and processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention, and for aggregating the user intentions based on the user search intention distribution and the followee search intention distribution to obtain the final social network user search intention. The system alleviates the sparsity of social network contexts, constructs a weighted representation of the user intention, realizes the processing of social network users' search intentions, and improves the users' search experience.

Description

Social network user search intention processing system based on user aggregation topic model
Technical Field
The invention belongs to the technical field of networks, and particularly relates to a social network user search intention processing system based on a user aggregation topic model.
Background
A social network provides a lightweight, rapid communication environment in which users propagate and share news events, daily chat, and their life and work status. When a user searches for relevant content in a social network, the system is expected to return the desired results and make recommendations based on the user's search intent. Existing research on processing social network users' search intentions mainly focuses on topic-model-based methods, user-clustering-based methods, and methods that model a user's search intention comprehensively from the user's private data.
Conventional topic models are designed to model the semantic information of standard news documents or other long documents; applied to social network contexts, which are semantically sparse and lack word co-occurrence information, they cannot process the user's search intention well. Methods that model the search intention from private data such as search histories, access logs, and click histories are also a current research hotspot, but they require specific data and depend heavily on that private data, which is difficult for researchers to obtain; moreover, they ignore the relationships among social network words and the effect of user attributes on understanding search intention, so they cannot be applied universally to understanding social network users' search intentions. Clustering methods likewise do not consider the associations between words in a social network context and neglect the influence of common words on processing the user's search intention.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a social network user search intention processing system based on a user aggregation topic model that solves the problems in current user intention processing.
The invention discloses a social network user search intention processing system based on a user aggregation topic model, which comprises:
the online social network data acquisition module is used for acquiring, by web crawling, network data including user information, followee information, and the user's online social content texts from a social network;
the data preprocessing module is used for carrying out data cleaning on the network data to form a network data set;
the search intention acquisition module is used for establishing an online social network user aggregation topic model based on Dirichlet distributions and Gibbs sampling, and processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention; and for aggregating the user intentions based on the user search intention distribution and the followee search intention distribution to obtain the final social network user search intention.
Further, the search intention acquisition module comprises a topic model submodule, a prior parameter construction submodule, and an intention aggregation submodule.
The topic model submodule comprises a topic-common word distribution model, a topic-word pair distribution model, a user-search intention distribution model, a user-followee search intention distribution model, and a user-classification model, and is used for processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention.
The prior parameter construction submodule is used for constructing priors for the hyperparameters of the topic-word pair distribution model.
The intention aggregation submodule is used for aggregating user intentions based on the user search intention distribution and the followee search intention distribution.
In the topic model submodule,
the network data set is processed based on the user-search intention distribution model to obtain the user search intention distribution;
the network data set is processed based on the user-followee search intention distribution model to obtain the followee search intention distribution;
and the network data set is processed based on the topic-common word distribution model, the topic-word pair distribution model, and the user-classification model to obtain the word distribution of the user search intention.
Further, the topic-common word distribution model conforms to a Dirichlet distribution with a first hyperparameter μ;
in the topic-word pair distribution model, one word of the pair (w_i, w_j) conforms to a Dirichlet distribution with a second hyperparameter γ_i, and the other word w_j conforms to a Dirichlet distribution with a third hyperparameter γ_j;
the user-search intention distribution model conforms to a Dirichlet distribution with a fourth hyperparameter α;
the user-followee search intention distribution model conforms to a Dirichlet distribution with a fifth hyperparameter β;
the user-classification model conforms to a Dirichlet distribution with a sixth hyperparameter η.
Further, the prior parameter construction submodule performs prior construction based on a recurrent neural network and the inverse document frequency to obtain the second hyperparameter γ_i and the third hyperparameter γ_j.
Further, the prior parameter construction submodule comprises a recurrent neural network (RNN) module, an inverse document frequency module, a word pair set building module, and a parameter construction module.
The RNN module is used for learning the words in the documents collected in the network data set through the recurrent neural network to obtain the association probability of two associated words.
The inverse document frequency module is used for measuring each word with the inverse document frequency IDF_{w_i} = log(|M| / |{m_l ∈ M : w_i ∈ m_l}|), where |M| denotes the total number of documents in the data set and |{m_l ∈ M : w_i ∈ m_l}| denotes the number of documents in which the word w_i occurs.
The word pair set building module is used for building and extracting the word pair set C = {C_1, C_2, …, C_w, …, C_N} based on the output results of the RNN module and the inverse document frequency module, where each C_w = (w_i, w_j) is selected according to the inverse document frequencies IDF_{w_i} and IDF_{w_j} and the association probability o_t of w_i and w_j learned by the RNN (selection formula given only as an image in the source), and N is the total number of word pairs.
The parameter construction module is used for constructing the second hyperparameter γ_i and the third hyperparameter γ_j from o_t and the inverse document frequencies, scaled by a preset positive number (construction formulas given only as images in the source).
Further, the hidden layer activation function of the recurrent neural network in the RNN module is the sigmoid function, and the output layer activation function is the softmax function.
Further, for each word pair C_w ∈ C of the word pair set in the topic model submodule:
1) using the user search intention distribution θ_u output by the user-search intention distribution model as the parameter of a multinomial distribution, sample the intention assignment of the word pair: z_{u,C_w} ~ Multi(θ_u), where Multi denotes a multinomial distribution, z_{u,C_w} denotes the user's intention assignment, u denotes the user, and C_w denotes a word pair;
2) using the followee search intention distribution ϑ_u output by the user-followee search intention distribution model as the parameter of a multinomial distribution, sample the intention assignment of the word pair: z_{e,C_w} ~ Multi(ϑ_u), where z_{e,C_w} denotes the intention assignment of the user's followee and e denotes a followee;
3) for each word in the word pair set C:
using the user classification distribution τ_u output by the user-classification model as the parameter of a Bernoulli distribution, sample the binary switch variable x ~ Bern(τ_u), where Bern denotes a Bernoulli distribution;
if x = 0, use the common-word distribution φ_{z,b} output by the topic-common word distribution model as the parameter of a multinomial distribution and sample the two words w_i, w_j ~ Multi(φ_{z,b});
if x = 1, use the word distributions φ_{z,1} and φ_{z,2} output by the topic-word pair distribution model as the parameters of multinomial distributions and sample one word w_i ~ Multi(φ_{z,1}) and the other word w_j ~ Multi(φ_{z,2}).
Further, in the topic model submodule, Gibbs sampling is adopted to iteratively sample the established social network user aggregation topic model to obtain the user search intention distribution, the intention distribution of the user's followees, and the word distribution of the user.
Further, after Gibbs iterative sampling, the topic model submodule outputs:
the user search intention distribution θ_{u,k} = (n_{u,k} + α) / (n_u + K·α);
the intention distribution of the user's followees ϑ_{u,k} = (s_{u,k} + β) / (s_u + K·β);
and the word distribution of the user search intention φ_k = [φ_{k,v_1}, φ_{k,v_2}, …, φ_{k,v_i}, …, φ_{k,v_n}];
where n_{u,k} is the number of topic words of the user, n_u the total number of words, s_u the number of word pairs assigned over all topics, s_{u,k} the number of word pairs assigned to topic k among the topics of the user's followees, and K the number of topics in the data set;
φ_{k,v_i} = (n_{k,v_i} + γ) / (n_k + V·γ), where n_{k,v_i} is the number of times the word v_i in the word pair set C is assigned to the topic word, n_k the total number of topic-word assignments in the word pair set C, V the number of all words in the document, α and β the fourth and fifth hyperparameters, and γ the second hyperparameter γ_i or the third hyperparameter γ_j.
Further, the intention aggregation submodule aggregates the intentions as Ω = π·θ_u + (1 − π)·ϑ_u to obtain the weight Ω of the user search intention, which is used to represent the user's search intention; here θ_u is the search intention distribution of all users, ϑ_u the search intention distribution of all the users' followees, and π a weight parameter.
The invention can realize at least one of the following beneficial effects:
the method aims at the problems that the current mainstream social network user search intention processing method needs specific privacy data and does not have universality;
the user search intention distribution is obtained by constructing a social network user aggregation topic model, the problem of sparsity of social network context is solved, modeling subject words and common words are distinguished, and social network word relation learning is realized; and (4) considering the user search intention distribution and the attention person intention distribution, constructing a user intention weight representation, and realizing the understanding and mining of the search intention of the social network user.
The social network user intention processing method can effectively understand and mine the search intention of the user under the condition that no available access log such as search history, click log and other data exists, and the performance is obviously improved.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic diagram of the connections of the social network user search intention processing system in this embodiment;
FIG. 2 is a representation of the online social network user aggregation topic model in this embodiment;
FIG. 3 is a structural diagram of the Elman RNN network in this embodiment.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof and, together with the embodiments of the invention, serve to explain the principles of the invention.
This embodiment discloses a social network user search intention processing system based on a user aggregation topic model, as shown in FIG. 1, comprising:
the online social network data acquisition module, used for acquiring, by web crawling, network data including user information, followee information, and the user's online social content texts from a social network.
specifically, the online social network data acquisition module crawls data in an online social network through web crawler software, for example, crawls the data of the Xinlang microblog; the crawled data comprises information of microblog users, information of followers of the microblog users and online social content text information issued by the microblog users on a microblog.
The data preprocessing module is used for carrying out data cleaning on the network data to form a network data set;
Specifically, the data preprocessing module cleans and processes the crawled data: it deletes erroneous and redundant data and empty words without specific content, keeping only the backbone of the microblog content, to form the network data set.
the data preprocessing module comprises an extraction unit, a word segmentation unit and a classification storage unit;
The extraction unit is used for extracting the user information, the user followee information, and the user text content from the network data, and for removing garbled information from the text content.
The word segmentation unit is used for segmenting the cleaned text content into words, deleting erroneous and redundant words and empty words without specific content so that only the backbone of the microblog content is retained, and deleting very short texts without specific meaning such as "like" and "applause".
The classification storage unit is used for classifying and storing the user data, the followee data, and the social content data to form the microblog text set M = {m_1, m_2, …, m_l, …, m_N}, the microblog user set U = {u_1, u_2, …}, and the topic set Z = {z_1, z_2, …}.
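As a rough illustration of the preprocessing described above, the following Python sketch cleans raw posts and drops empty or meaningless short texts. The stop-word list, the regular expressions, and the minimum token count are illustrative assumptions, not values fixed by the patent.

```python
import re

# Hypothetical stop words and minimum token count (not specified by the patent).
STOP_WORDS = {"like", "applause", "the", "a"}
MIN_TOKENS = 2

def clean_post(text):
    """Strip URLs, @-mentions and hashtags, tokenize, and drop stop words."""
    text = re.sub(r"https?://\S+|@\S+|#\S+", " ", text)
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]
    return tokens if len(tokens) >= MIN_TOKENS else []  # drop very short texts

corpus = ["Check this out https://t.co/x great talk on topic models @user",
          "like"]
dataset = [toks for toks in (clean_post(p) for p in corpus) if toks]
```

The cleaned token lists can then be stored per user, per followee, and per post, mirroring the classification storage unit.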
The search intention acquisition module is used for establishing an online social network user aggregation topic model based on Dirichlet distributions and Gibbs sampling, and processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention; and for aggregating the user intentions based on the user search intention distribution and the followee search intention distribution to obtain the final social network user search intention.
Specifically, the search intention acquisition module comprises a topic model submodule, a prior parameter construction submodule and an intention aggregation submodule;
The topic model submodule comprises a topic-common word distribution model, a topic-word pair distribution model, a user-search intention distribution model, a user-followee search intention distribution model, and a user-classification model, and is used for processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention.
The prior parameter construction submodule is used for constructing priors for the hyperparameters of the topic-word pair distribution model.
The intention aggregation submodule is used for aggregating user intentions based on the user search intention distribution and the followee search intention distribution.
In the topic model submodule,
the network data set is processed based on the user-search intention distribution model to obtain the user search intention distribution;
the network data set is processed based on the user-followee search intention distribution model to obtain the followee search intention distribution;
and the network data set is processed based on the topic-common word distribution model, the topic-word pair distribution model, and the user-classification model to obtain the word distribution of the user search intention.
More specifically:
the topic-common word distribution model conforms to a Dirichlet distribution with the first hyperparameter μ; that is, for each topic z, the common-word distribution of the microblog is φ_{z,b} ~ Dir(μ), where b denotes a common word;
in the topic-word pair distribution model, one word of the pair (w_i, w_j) conforms to a Dirichlet distribution with the second hyperparameter γ_i and the other word w_j to a Dirichlet distribution with the third hyperparameter γ_j; that is, for each topic z, one word distribution of the microblog word pair is φ_{z,1} ~ Dir(γ_i) and the other is φ_{z,2} ~ Dir(γ_j);
the user-search intention distribution model conforms to a Dirichlet distribution with the fourth hyperparameter α; that is, for each user u, the user search intention distribution is θ_u ~ Dir(α);
the user-followee search intention distribution model conforms to a Dirichlet distribution with the fifth hyperparameter β; that is, for each user u, the followee search intention distribution is ϑ_u ~ Dir(β);
the user-classification model conforms to a Dirichlet distribution with the sixth hyperparameter η; that is, for each user u, the user's classification distribution is τ_u ~ Dir(η).
In particular, the online social network user aggregation topic model is represented as shown in FIG. 2.
The first hyperparameter μ, the fourth hyperparameter α, the fifth hyperparameter β, and the sixth hyperparameter η can take conventional Dirichlet hyperparameter values, such as 0.1 or 0.01.
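Under the Dirichlet assumptions above, the model's parameter distributions can be initialized as draws from symmetric Dirichlet priors. The NumPy sketch below uses illustrative sizes K and V and, for simplicity, treats γ_i and γ_j as plain constants, although the patent constructs them from RNN and IDF priors.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 10, 500          # number of topics / vocabulary size (illustrative)
mu, alpha, beta, eta = 0.1, 0.1, 0.01, 0.1   # symmetric hyperparameters as suggested
gamma_i = gamma_j = 0.1  # placeholders; the patent builds these from RNN/IDF priors

phi_zb = rng.dirichlet([mu] * V, size=K)        # topic -> common-word distributions
phi_z1 = rng.dirichlet([gamma_i] * V, size=K)   # topic -> first word of a pair
phi_z2 = rng.dirichlet([gamma_j] * V, size=K)   # topic -> second word of a pair
theta_u = rng.dirichlet([alpha] * K)            # user -> search intention
vartheta_u = rng.dirichlet([beta] * K)          # user -> followee search intention
tau_u = rng.dirichlet([eta] * 2)                # user -> classification (switch prior)
```

Each draw is a valid probability vector, which is what the multinomial and Bernoulli sampling steps described later consume.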
The prior parameter construction submodule performs prior construction based on the recurrent neural network and the inverse document frequency to obtain the second hyperparameter γ_i and the third hyperparameter γ_j, so that the model learns more coherent user search intentions.
Specifically, the prior parameter construction sub-module comprises a Recurrent Neural Network (RNN) module, an inverse document frequency module, a word pair set construction module and a parameter construction module;
the recurrent neural network RNN module is used for learning words in the documents collected in the network data set through the recurrent neural network RNN to obtain the association probability of two associated words;
preferably, a network structure for learning relationships between words using Elman RNN is shown in fig. 3.
In FIG. 3, w_t ∈ R^T denotes the vector of the current word, where T is the vector size; h_t denotes the hidden unit and o_t the output unit at time t; x_t = [w_t, h_{t−1}] denotes the input layer, where h_{t−1} is the hidden state at the previous time step. The hidden unit and the output unit are computed by equations (1) and (2):
H_t = δ(U·x_t)  (1)
o_t = g(V·H_t)  (2)
where U and V are the parameter matrices, and δ(·) denotes the sigmoid function computed as in equation (3):
δ(x) = 1 / (1 + e^{−x})  (3)
g(·) is the softmax function computed as in equation (4):
g(z_m) = e^{z_m} / Σ_k e^{z_k}  (4)
In the output, o_t represents the association probability of the word pair (w_{j,1}, w_{j,2}), expressed by equation (5):
o_t = P(w_{j,2} | w_{j,1}, h_{t−1})  (5)
that is, o_t denotes the probability of w_{j,2} occurring given w_{j,1}. Because the hidden units H_t and H_{t−1} retain all previous words, the association of previous words with the current word can be learned through the properties of the recurrent neural network.
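Equations (1)-(5) describe a standard Elman forward step. A minimal NumPy sketch follows, with randomly initialized matrices standing in for the trained parameters U and V, and illustrative dimensions:

```python
import numpy as np

def sigmoid(x):                      # equation (3)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):                      # equation (4), shifted for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_step(w_t, h_prev, U, V):
    """One Elman step: x_t = [w_t, h_{t-1}], H_t = sigmoid(U x_t), o_t = softmax(V H_t)."""
    x_t = np.concatenate([w_t, h_prev])   # input layer, equation-style concatenation
    H_t = sigmoid(U @ x_t)                # equation (1)
    o_t = softmax(V @ H_t)                # equation (2): distribution over next words
    return H_t, o_t

# Illustrative sizes: T-dim word vectors, H hidden units, a vocabulary of Vw words.
T, H, Vw = 8, 16, 100
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(H, T + H))
V = rng.normal(scale=0.1, size=(Vw, H))
h, o = elman_step(rng.normal(size=T), np.zeros(H), U, V)
```

The entry of o corresponding to a candidate word plays the role of the association probability o_t in equation (5).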
The inverse document frequency module is used for measuring each word with the inverse document frequency IDF_{w_i} = log(|M| / |{m_l ∈ M : w_i ∈ m_l}|), where |M| denotes the total number of documents in the data set and |{m_l ∈ M : w_i ∈ m_l}| denotes the number of documents in which the word w_i occurs.
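The inverse document frequency above can be computed directly. A small sketch in which documents are token lists:

```python
import math

def inverse_document_frequency(word, documents):
    """IDF_w = log(|M| / |{m in M : w in m}|); documents are lists of tokens."""
    df = sum(1 for doc in documents if word in doc)  # documents containing the word
    return math.log(len(documents) / df) if df else 0.0

docs = [["topic", "model"], ["topic", "search"], ["user", "search"]]
```

Words appearing in fewer documents receive larger IDF values, so rare, informative words are favored when building word pairs.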
The word pair set building module is used for building and extracting the word pair set C = {C_1, C_2, …, C_w, …, C_N} based on the output results of the RNN module and the inverse document frequency module, where each C_w = (w_i, w_j) is selected according to the inverse document frequencies IDF_{w_i} and IDF_{w_j} and the association probability o_t of the words w_i and w_j learned by the RNN (selection formula given only as an image in the source), and N is the total number of word pairs.
The parameter construction module is used for constructing the second hyperparameter γ_i and the third hyperparameter γ_j from o_t and the inverse document frequencies, scaled by a preset positive number (construction formulas given only as images in the source).
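Since the exact selection formula is not reproduced in the text, the following sketch approximates the word-pair extraction with simple thresholds on the association probability o_t and the two IDF values; the threshold values and the thresholded form itself are assumptions.

```python
import math

def build_word_pairs(assoc_prob, idf, threshold=0.5, idf_min=0.1):
    """Keep a pair (wi, wj) when the RNN association probability o_t and both
    inverse document frequencies pass thresholds. The exact selection formula
    is given only as an image in the patent; this thresholded form is an
    assumption for illustration."""
    pairs = []
    for (wi, wj), o_t in assoc_prob.items():
        if o_t >= threshold and idf.get(wi, 0) >= idf_min and idf.get(wj, 0) >= idf_min:
            pairs.append((wi, wj))
    return pairs

idf = {"topic": math.log(3 / 2), "model": math.log(3), "the": 0.0}
assoc = {("topic", "model"): 0.8, ("the", "model"): 0.9, ("topic", "the"): 0.2}
C = build_word_pairs(assoc, idf)
```

Here a strongly associated pair of rare words survives, while pairs involving the common word "the" or a weak association are discarded.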
More specifically, for each word pair C_w ∈ C in the topic model submodule, the user intention assignment of the word pair, the intention assignment of the user's followees, and the multinomial sampling of each word in the pair are performed as follows:
1) using the user search intention distribution θ_u output by the user-search intention distribution model as the parameter of a multinomial distribution, sample the intention assignment of the word pair: z_{u,C_w} ~ Multi(θ_u), where Multi denotes a multinomial distribution, z_{u,C_w} denotes the user's intention assignment, u denotes the user, and C_w denotes a word pair;
2) using the followee search intention distribution ϑ_u output by the user-followee search intention distribution model as the parameter of a multinomial distribution, sample the intention assignment of the word pair: z_{e,C_w} ~ Multi(ϑ_u), where z_{e,C_w} denotes the intention assignment of the user's followee and e denotes a followee;
3) for each word in the word pair set C:
using the user classification distribution τ_u output by the user-classification model as the parameter of a Bernoulli distribution, sample the binary switch variable x ~ Bern(τ_u), where Bern denotes a Bernoulli distribution;
if x = 0, use the common-word distribution φ_{z,b} output by the topic-common word distribution model as the parameter of a multinomial distribution and sample the two words w_i, w_j ~ Multi(φ_{z,b});
if x = 1, use the word distributions φ_{z,1} and φ_{z,2} output by the topic-word pair distribution model as the parameters of multinomial distributions and sample one word w_i ~ Multi(φ_{z,1}) and the other word w_j ~ Multi(φ_{z,2}).
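The generative steps 1)-3) can be sketched as a single sampling routine. All the distributions here are random Dirichlet draws with illustrative hyperparameters rather than learned quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
K, Vw = 5, 50                             # topics / vocabulary (illustrative)
theta_u = rng.dirichlet([0.1] * K)        # user search intention distribution
vartheta_u = rng.dirichlet([0.01] * K)    # followee search intention distribution
tau_u = rng.dirichlet([0.1] * 2)          # user classification distribution
phi_zb = rng.dirichlet([0.1] * Vw, size=K)  # topic -> common words
phi_z1 = rng.dirichlet([0.1] * Vw, size=K)  # topic -> first word of a pair
phi_z2 = rng.dirichlet([0.1] * Vw, size=K)  # topic -> second word of a pair

def sample_word_pair():
    z_u = rng.choice(K, p=theta_u)        # 1) intention assignment of the pair
    z_e = rng.choice(K, p=vartheta_u)     # 2) followee intention assignment
    x = rng.random() < tau_u[1]           # 3) binary switch x ~ Bern(tau_u)
    if not x:                             # x = 0: both words from the common-word dist.
        wi = rng.choice(Vw, p=phi_zb[z_u]); wj = rng.choice(Vw, p=phi_zb[z_u])
    else:                                 # x = 1: one word from each pair distribution
        wi = rng.choice(Vw, p=phi_z1[z_u]); wj = rng.choice(Vw, p=phi_z2[z_u])
    return z_u, z_e, x, wi, wj

z_u, z_e, x, wi, wj = sample_word_pair()
```

Repeating this routine over all word pairs generates a synthetic corpus from the model, which is the generative counterpart of the Gibbs inference described next.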
Further, in the topic model submodule, Gibbs sampling is adopted to iteratively sample the established social network user aggregation topic model to obtain the user search intention distribution, the intention distribution of the user's followees, and the word distribution of the user.
Unknown parameters in the social network User Aggregation Topic Model (UATM) are derived using Gibbs sampling, whose core is the iterative sampling of hidden variables from prior estimates. During sampling, the user search intention distribution θ_u, the followee search intention distribution ϑ_u, and the user's word distribution φ are integrated out, and the microblog set M, the topic Z, and the switch variable x are sampled iteratively; the topic Z is sampled according to equation (6) (given only as an image in the source).
For all users, n_{u,b} denotes the number of common words and n_{u,k} the number of topic words; n_{v,b} is the number of times the word v is assigned as a common word, n_{k,v} the number of times a word pair in C is assigned to the topic word, and n_{u,k} the number of microblogs assigned to the topic Z. Note that n_u = n_{u,b} + n_{u,k}, and s_{u,z} is the number of word pairs assigned to the topics of the user's followees.
The hidden variables can be derived from equation (6), where Γ(x) denotes the gamma function and π is the weight parameter that adjusts the weighted representation of the user's own and the followees' search intentions. From the joint distribution and the chain rule, the conditional probability distribution of equation (7) is obtained (given only as an image in the source), where −i denotes counts excluding the i-th microblog, Φ is the set of all users' search intention distributions, Θ the set of all followees' search intention distributions, and Ψ the set of word distributions in the data set.
After the conditional probability distribution is obtained, the topic z_{d,i} is sampled directly by the chain rule, and deriving the switch variable x yields equations (8) and (9) (given only as images in the source), where −j denotes counts excluding the j-th word and w_i denotes the i-th word in the microblog document.
In the initial state of gibbs sampling, the hidden variables are sampled according to equations (8) and (9). After sufficient iterations are completed, the user search intention distribution, the intention distribution of the user attendees and the word distribution of the user output by the topic model submodule are shown as formulas (10), (11), (12) and (13):
[formulas (10)–(13) — rendered as images in the original]
Based on equation (11) and equation (12), the word distribution of the user search intention is obtained, as shown in equation (14):

φ_k = [φ_{k,v1}, φ_{k,v2}, …, φ_{k,vi}, …, φ_{k,vn}]    (14)
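Formulas (10)–(13) appear as images in the original publication. As a concrete illustration of the kind of estimator such models use, here is the standard Dirichlet-smoothed posterior mean, (count + prior) / (total + dimension × prior) — an assumed form for illustration, not taken verbatim from the patent:

```python
import numpy as np

def topic_distribution(counts, prior):
    """Dirichlet-smoothed posterior mean over K outcomes:
    (n_k + prior) / (sum_k n_k + K * prior)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + prior) / (counts.sum() + len(counts) * prior)

# e.g. a user whose word pairs were assigned to 3 topics 4, 1 and 0 times (alpha = 0.5):
theta_u = topic_distribution([4, 1, 0], 0.5)

# word distribution of one topic over a 4-word vocabulary (gamma = 0.1):
phi_k = topic_distribution([7, 2, 1, 0], 0.1)
```

The prior keeps every topic and every word at non-zero probability, which is what lets sparse microblog text still yield usable distributions.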
Specifically, the intention aggregation sub-module constructs a weight representation Ω of the user search intention from the user's own search intention and the search intentions of the user's followees, so as to jointly mine the user's search intention; the calculation is shown in formula (15):
Ω = π·θ_u + (1 − π)·θ̂_u    (15)

where θ_u is the search intention distribution of all users, θ̂_u is the search intention distribution of all the users' followees, and π is the weight parameter.
The final search intention of the social network user is obtained from the user search intention distribution given by the aggregation formula (15). An operator of the social network can then serve online social content according to the user's search intention and its word distribution, shortening the user's search time and improving the user experience.
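A minimal sketch of the intention aggregation step follows. It assumes the convex-combination form Ω = π·θ_u + (1 − π)·θ̂ with the followee distributions averaged; both the combination form and the averaging are assumptions for illustration:

```python
import numpy as np

def aggregate_intent(theta_user, theta_followees, pi):
    """Weighted aggregation of the user's own intent distribution with
    the averaged intent distributions of the users they follow.
    pi in [0, 1] trades off self evidence vs. followee evidence."""
    theta_user = np.asarray(theta_user, dtype=float)
    followee_mean = np.mean(np.asarray(theta_followees, dtype=float), axis=0)
    return pi * theta_user + (1.0 - pi) * followee_mean

omega = aggregate_intent([0.6, 0.3, 0.1],
                         [[0.2, 0.5, 0.3],
                          [0.4, 0.4, 0.2]],
                         pi=0.7)
# omega is again a probability distribution over the K topics
```

Because both inputs are probability distributions and π lies in [0, 1], the aggregated Ω is itself a valid distribution, so the top-weighted topic can be read off directly as the user's dominant search intention.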
In summary, the embodiments address the problem that existing social network user search intention processing methods rely on specific private data and therefore lack generality. A social network user aggregation topic model is constructed to obtain the user search intention distribution, which alleviates the sparsity of social network context; subject words and common words are modeled separately, realizing word relation learning on the social network. A user intention weight representation is built from the user's own search intention distribution and the intention distributions of the users they follow, so that the search intention of a social network user can be effectively understood and mined even without access logs such as search histories or click logs, with significantly improved performance.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A social network user search intent processing system based on a user aggregate topic model, comprising:
the online social network data acquisition module is used for collecting, by means of a crawler, network data from the online social network, including user information, followee information, and the user's online social content text;
the data preprocessing module is used for carrying out data cleaning on the network data to form a network data set;
the search intention acquisition module is used for establishing an online social network user aggregation topic model based on Dirichlet distributions and Gibbs sampling, and for processing the network data set to obtain the user search intention distribution, the followee search intention distribution, and the word distribution of the user search intention; and for aggregating user intentions based on the user search intention distribution and the followee search intention distribution to obtain the final social network user search intention.
2. The social network user search intention processing system of claim 1, wherein the search intention acquisition module comprises a topic model sub-module, a prior parameter construction sub-module, and an intention aggregation sub-module;
the topic model submodule comprises a topic-common word distribution model, a topic-word pair distribution model, a user-search intention distribution model, a user-attention person search intention distribution model and a user-classification model and is used for processing a network data set to obtain user search intention distribution, attention person search intention distribution and word distribution of user search intention;
the prior parameter construction sub-module is used for performing prior construction on the hyper-parameters in the topic-word pair distribution model;
the intention aggregation submodule is used for carrying out user intention aggregation based on the user search intention distribution and the attention person search intention distribution;
in the topic model sub-module,
processing the network data set based on the user-search intention distribution model to obtain user search intention distribution;
processing the network data set based on the user-attendee search intention distribution model to obtain the followee search intention distribution;
and processing the network data set based on the topic-common word distribution model, the topic-word pair distribution model and the user-classification model to obtain the word distribution of the user search intention.
3. The social network user search intent processing system of claim 2,
the topic-ordinary word distribution model conforms to a Dirichlet distribution containing a first hyper-parameter μ;
in the topic-word pair distribution model, one word w_i of a word pair (w_i, w_j) conforms to a Dirichlet distribution with a second hyper-parameter γ_i, and the other word w_j conforms to a Dirichlet distribution with a third hyper-parameter γ_j;
the user-search intention distribution model conforms to a dirichlet distribution containing a fourth hyperparameter α;
the user-attendee search intention distribution model conforms to a dirichlet distribution containing a fifth hyperparameter β;
the user-classification model conforms to a dirichlet distribution that includes a sixth hyperparameter η.
4. The social network user search intention processing system of claim 3, wherein the prior parameter construction sub-module derives the second hyper-parameter γ_i and the third hyper-parameter γ_j by performing prior construction based on a recurrent neural network and the inverse document frequency.
5. The social network user search intention processing system of claim 4, wherein the a priori parameter construction sub-module comprises a Recurrent Neural Network (RNN) module, an inverse document frequency module, a word pair set construction module, and a parameter construction module;
the recurrent neural network RNN module is used for learning words in the documents collected in the network data set through the recurrent neural network RNN to obtain the association probability of two associated words;
the inverse document frequency module is used for measuring the occurrence frequency of each word using the inverse document frequency

IDF_{w_i} = log( |M| / |{m_l ∈ M : w_i ∈ m_l}| )

where |M| denotes the total number of documents in the dataset and |{m_l ∈ M : w_i ∈ m_l}| denotes the number of documents containing the word w_i;
the word pair set building module is used for building and extracting a word pair set C = {C_1, C_2, …, C_w, …, C_N} based on the output results of the recurrent neural network RNN module and the inverse document frequency module;

wherein [formula — rendered as image in the original], IDF_{w_i} is the inverse document frequency of the word w_i, IDF_{w_j} is the inverse document frequency of the word w_j, o_t is the association probability of the associated words w_i and w_j learned by the recurrent neural network RNN, and N is the total number of word pairs;
a parameter construction module for constructing the second hyper-parameter γ_i and the third hyper-parameter γ_j [construction formulas rendered as images in the original], wherein the constant appearing in the formulas is a preset positive number.
6. The social network user search intention processing system of claim 5, wherein the hidden-layer activation function of the recurrent neural network in the recurrent neural network RNN module is a sigmoid function, and the output-layer activation function is a softmax function.
7. The social network user search intent processing system of claim 5, wherein, in the topic model sub-module, for each word pair C_w ∈ C of the word pair set:

1) with the user search intention distribution θ_u output by the user-search intention distribution model as the parameter of a multinomial distribution, the intention assignment of the word pair is sampled: z_{u,Cw} ~ Multi(θ_u), wherein Multi denotes a multinomial distribution, z_{u,Cw} denotes the intention assignment of the user, u denotes the user, and C_w denotes the word pair;
2) with the followee search intention distribution θ̂_e output by the user-attendee search intention distribution model as the parameter of a multinomial distribution, the intention assignment of the word pair is sampled: z_{e,Cw} ~ Multi(θ̂_e), wherein z_{e,Cw} denotes the intention assignment of the user's followee and e denotes a followee;
3) for each word of the word pair C_w:

with the user class distribution τ_u output by the user-classification model as the parameter of a Bernoulli distribution, a binary switch variable is sampled: x ~ Bern(τ_u), wherein Bern denotes the Bernoulli distribution;

if x = 0, with the common word distribution φ_{z,b} output by the topic-common word distribution model as the parameter of a multinomial distribution, the two words are sampled separately: w_i, w_j ~ Multi(φ_{z,b});

if x = 1, with the word distributions φ_{z,1} and φ_{z,2} output by the topic-word pair distribution model as the parameters of multinomial distributions, one word w_i ~ Multi(φ_{z,1}) and the other word w_j ~ Multi(φ_{z,2}) are sampled.
8. The system of any one of claims 1 to 7, wherein, in the topic model submodule, Gibbs sampling is used to iteratively sample the established social network user aggregation topic model, so as to obtain user search intention distribution, intention distribution of user attendees, and word distribution of the user.
9. The social network user search intention processing system of claim 8, wherein the topic model sub-module outputs after Gibbs sampling iterative sampling:
user search intent distribution
Figure FDA0002799627870000041
Intent distribution of user followers
Figure FDA0002799627870000042
Word distribution of user search intentk=[φk,v1k,v2,,…,φk,vi,…,φk,vn];
wherein n_{u,k} denotes the number of subject words, n_u the total number of words, s_u the number of word pairs assigned to all topics of the user, s_{u,k} the number of word pairs assigned to the topics of the user's followees, and K the number of topics in the dataset;
φ_{k,vi} = (n_{k,vi} + γ) / (n_k + V·γ)

wherein n_{k,vi} represents the number of times the word v_i in the word pair set C is assigned as a subject word; n_k represents the total number of subject word assignments in the word pairs C; V represents the number of all words in the document; α and β are the fourth and fifth hyper-parameters; γ is the second hyper-parameter γ_i or the third hyper-parameter γ_j.
10. The social network user search intent processing system of claim 9, wherein the intent aggregation sub-module obtains the weight Ω of the user search intention, which is used to express the search intention of the user, by the aggregation formula

Ω = π·θ_u + (1 − π)·θ̂_u

wherein θ_u is the search intention distribution of all users, θ̂_u is the search intention distribution of all the users' followees, and π is the weight parameter.
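The inverse document frequency and word-pair prior construction of claim 5 can be sketched as follows. The IDF formula matches the claim; the way the association probability o_t and the IDF values combine into γ_i and γ_j is an assumption (a simple scaled product), since the construction formulas are rendered as images in the original:

```python
import math

def idf(word, documents):
    """IDF_w = log(|M| / |{m in M : w in m}|), per the formula in claim 5."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else 0.0

def word_pair_priors(w_i, w_j, o_t, documents, scale=1.0):
    """Assumed construction: each gamma scales with the RNN association
    probability o_t and the word's own inverse document frequency."""
    return (scale * o_t * idf(w_i, documents),
            scale * o_t * idf(w_j, documents))

# toy corpus of three documents represented as word sets
docs = [{"weather", "rain"}, {"rain", "umbrella"}, {"stock", "price"}]
rain_idf = idf("rain", docs)  # log(3/2)
```

This reproduces the intended behavior: rare but strongly associated word pairs receive large γ values, so the topic-word pair Dirichlet prior favors informative subject words over frequent common words.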
CN202011344972.4A 2020-11-25 2020-11-25 Social network user search intention processing system based on user aggregation topic model Active CN112307746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011344972.4A CN112307746B (en) 2020-11-25 2020-11-25 Social network user search intention processing system based on user aggregation topic model

Publications (2)

Publication Number Publication Date
CN112307746A true CN112307746A (en) 2021-02-02
CN112307746B CN112307746B (en) 2021-08-17

Family

ID=74487813


Country Status (1)

Country Link
CN (1) CN112307746B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144854A1 (en) * 2011-12-06 2013-06-06 Microsoft Corporation Modeling actions for entity-centric search
CN105830065A (en) * 2013-12-19 2016-08-03 脸谱公司 Generating recommended search queries on online social networks
US20180032930A1 (en) * 2015-10-07 2018-02-01 0934781 B.C. Ltd System and method to Generate Queries for a Business Database
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108921413A (en) * 2018-06-22 2018-11-30 郑州大学 A kind of social networks degree of belief calculation method based on user intention

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130144854A1 (en) * 2011-12-06 2013-06-06 Microsoft Corporation Modeling actions for entity-centric search
US20170351772A1 (en) * 2011-12-06 2017-12-07 Microsoft Technology Licensing, Llc Modeling actions for entity-centric search
CN105830065A (en) * 2013-12-19 2016-08-03 脸谱公司 Generating recommended search queries on online social networks
US20180032930A1 (en) * 2015-10-07 2018-02-01 0934781 B.C. Ltd System and method to Generate Queries for a Business Database
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN108921413A (en) * 2018-06-22 2018-11-30 郑州大学 A kind of social networks degree of belief calculation method based on user intention

Also Published As

Publication number Publication date
CN112307746B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
Priyadarshini et al. A novel LSTM–CNN–grid search-based deep neural network for sentiment analysis
US10891321B2 (en) Systems and methods for performing a computer-implemented prior art search
Xiaomei et al. Microblog sentiment analysis with weak dependency connections
CN110704640A (en) Representation learning method and device of knowledge graph
Wu et al. Personalized microblog sentiment classification via multi-task learning
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
CN110807101A (en) Scientific and technical literature big data classification method
CN109992784B (en) Heterogeneous network construction and distance measurement method fusing multi-mode information
Dinh et al. A proposal of deep learning model for classifying user interests on social networks
Devika et al. A semantic graph-based keyword extraction model using ranking method on big social data
Liu Research on deep learning-based algorithm and model for personalized recommendation of resources
Savelev et al. The high-level overview of social media content search engine
Wu et al. A novel topic clustering algorithm based on graph neural network for question topic diversity
CN116843162B (en) Contradiction reconciliation scheme recommendation and scoring system and method
Yarushkina et al. Intelligent instrumentation for opinion mining in social media
Wang et al. Emotional contagion-based social sentiment mining in social networks by introducing network communities
Qiu et al. CLDA: An effective topic model for mining user interest preference under big data background
CN112231476A (en) Improved graph neural network scientific and technical literature big data classification method
CN112307746B (en) Social network user search intention processing system based on user aggregation topic model
CN111859955A (en) Public opinion data analysis model based on deep learning
Huang Research on sentiment classification of tourist destinations based on convolutional neural network
Kamel et al. Robust sentiment fusion on distribution of news
Jasim et al. Analyzing Social Media Sentiment: Twitter as a Case Study
CN112364260A (en) Social network user intention processing method
Dritsas et al. Aspect-based community detection of cultural heritage streaming data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant