CN112446220A - Short text aggregation method based on dynamic semantic modeling - Google Patents

Short text aggregation method based on dynamic semantic modeling

Info

Publication number
CN112446220A
CN112446220A (application CN202011479885.XA)
Authority
CN
China
Prior art keywords
distribution
topic
word
short text
aggregation
Prior art date
Legal status
Pending
Application number
CN202011479885.XA
Other languages
Chinese (zh)
Inventor
石磊
崔斌
尹领昌
邹蕾
娄东东
李婷
马语菡
刘波
王岩
Current Assignee
Beijing Jinghang Computing Communication Research Institute
Original Assignee
Beijing Jinghang Computing Communication Research Institute
Priority date
Filing date
Publication date
Application filed by Beijing Jinghang Computing Communication Research Institute
Priority to CN202011479885.XA
Publication of CN112446220A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a short text aggregation method based on dynamic semantic modeling, which comprises the following steps: acquiring the short text data to be aggregated on time slices of a set interval, and preprocessing the data to form a data set; capturing the multinomial distributions of topics and of words in the data set by building a dynamic self-aggregation topic model on each time slice; inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice once the sampling converges; and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, thereby adaptively aggregating the short texts. The method automatically aggregates short texts into standard long documents, so that more coherent topics can be captured and the sparsity problem of short texts is alleviated; it needs no heuristic pre-processing or post-processing techniques, the model is simple, and processing efficiency is high.

Description

Short text aggregation method based on dynamic semantic modeling
Technical Field
The invention belongs to the technical field of short text aggregation, and particularly relates to a short text aggregation method based on dynamic semantic modeling.
Background
Short text semantic modeling applies certain operations and processing to massive short-text data so that short-text topics can be modeled automatically, without additional pre-processing or post-processing, and the topics a short text may contain can be inferred accurately. Dynamic topic modeling further requires that the method itself support streaming data, model the temporal attributes of topics, and dynamically infer the current topics from the previous topics and the newly acquired data. In particular, in the current social media environment, a large amount of short-text data with time attributes is generated every day, so a method is needed that supports dynamic topic modeling and that can not only process standard long-text data but also overcome the data-sparsity problem of short-text topic modeling. Conventional topic-model approaches are designed to model long-text data statically, and are less effective both at short-text topic modeling and at modeling the dynamics of topics.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a short text aggregation method based on dynamic semantic modeling, so as to solve the problem of short text aggregation.
The invention discloses a short text aggregation method based on dynamic semantic modeling, which comprises the following steps:
acquiring the short text data to be aggregated on time slices of a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and finally counting the topic distribution and word distribution on each time slice when the sampling converges;
and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
Further, in the dynamic self-aggregation topic model, Dirichlet priors are adopted to construct the topic persistence accuracy α_t and the word persistence accuracy β_t.
Further, the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges; here θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends.
The word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1: the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges, where φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
Further, the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts and S is the assignment of the aggregated documents; the set of aggregated documents is D;
at the current time slice t, the dynamic self-aggregation topic model modeling process is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text R; sample the topic assignment of short text R from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j; θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment; φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
Further, word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original); the formalized representation of the construction of word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}.
Further, the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation;
and counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Further, the iterative sampling of the aggregation assignment S of the short texts and of the topic assignment k proceeds as follows:
according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
Further, the new topic persistence accuracy α_t and the new word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original); where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
Further, Gibbs sampling is adopted to derive the multinomial distributions in the dynamic self-aggregation topic model, obtaining the topic distribution θ_t and the word distribution φ_t on time slice t (the expressions are given only as images in the original).
Further, the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k.
The invention can realize at least one of the following beneficial effects:
1) The dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more coherent topics can be captured and the sparsity problem of short texts is alleviated; no heuristic pre-processing or post-processing techniques are needed, the model is simple, and processing efficiency is high.
2) The novel Gibbs sampling algorithm can derive the unknown variables quickly and effectively, and computes the topic distribution and word distribution from the sampling results, thereby realizing topic modeling.
3) The method can model topics dynamically and effectively: the current topics are inferred using the previously captured topics and the newly arrived data as prior knowledge, realizing dynamic modeling of topics and effectively overcoming the limitation that conventional topic modeling can only perform inference on static data.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of a short text aggregation method based on dynamic semantic modeling in this embodiment;
fig. 2 is a schematic representation diagram of the dynamic self-aggregation topic model in this embodiment.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
Typically, the topics of a social network change dynamically across time slices. The time slices can be formalized as {…, t-2, t-1, t, …}, and the interval of a time slice can be set to one day, one week, one month, and so on. In order to adaptively aggregate short texts into long documents, this embodiment discloses a short text aggregation method based on dynamic semantic modeling, which, as shown in fig. 1, comprises the following steps:
step S1, acquiring short text data from the social network on time slices of a set interval, and performing data preprocessing to form a data set;
step S2, capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
step S3, inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice when the sampling converges;
step S4, calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
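As an illustration of the time slicing used in step S1, the following is a minimal Python sketch; the function name and the (timestamp, text) data layout are assumptions for the example, not part of the patent.

    from collections import defaultdict
    from datetime import timedelta

    def slice_by_interval(posts, interval=timedelta(days=1)):
        """Group (timestamp, text) pairs into consecutive time slices of a set interval."""
        slices = defaultdict(list)
        t0 = min(ts for ts, _ in posts)
        for ts, text in posts:
            slices[int((ts - t0) / interval)].append(text)
        # return the slices in chronological order; empty slices are skipped here
        return [slices[i] for i in sorted(slices)]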
Specifically, in step S1, the short text data crawled from the social network is preprocessed: repeated texts and short texts with fewer than 3 words are deleted; the texts are segmented into words, stop words are deleted, and words occurring fewer than 8 times are deleted; a data set for aggregation is thus obtained.
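A minimal sketch of this preprocessing in Python; the tokenizer choice (jieba) and the stop-word list are assumptions for illustration, not prescribed by the patent.

    from collections import Counter
    import jieba  # one possible Chinese word-segmentation library

    STOPWORDS = {"的", "了", "是"}  # placeholder stop-word list

    def preprocess(raw_texts):
        texts = list(dict.fromkeys(raw_texts))  # delete repeated texts
        docs = [[w for w in jieba.lcut(t) if w.strip() and w not in STOPWORDS]
                for t in texts]
        freq = Counter(w for doc in docs for w in doc)
        docs = [[w for w in doc if freq[w] >= 8] for doc in docs]  # drop words seen < 8 times
        return [doc for doc in docs if len(doc) >= 3]              # drop texts with < 3 words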
Specifically, in step S2, capturing the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words by building a dynamic self-aggregation topic model on each time slice comprises the following sub-steps:
1) extracting word pairs;
the embodiment of the present invention uses the co-occurrence of word pairs, rather than the word co-occurrence of traditional approaches, to generate topics. Each word pair contains two unordered words, and word pairs are extracted independently from the same topic. The formalized representation of the construction of the word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}; word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original).
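A sketch of the extraction in Python, under the assumption that the image formula takes the unordered pairs within each window of three consecutive words; this window reading is an assumption, since only the i ≠ j pair construction is stated explicitly above.

    def extract_word_pairs(words, window=3):
        """Collect unordered word pairs from each window of consecutive words."""
        pairs = set()
        for start in range(max(1, len(words) - window + 1)):
            win = words[start:start + window]
            for i in range(len(win)):
                for j in range(i + 1, len(win)):
                    if win[i] != win[j]:  # i != j in the formal definition
                        pairs.add(tuple(sorted((win[i], win[j]))))
        return sorted(pairs)

    # e.g. extract_word_pairs(["dynamic", "topic", "model", "stream"])
    # yields the pairs of the windows (dynamic, topic, model) and (topic, model, stream)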
2) in the dynamic self-aggregation topic model, constructing the topic persistence accuracy α_t and the word persistence accuracy β_t;
because, in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and word distribution φ_{t-1,v} of the previous time slice, the topic persistence accuracy α_t and the word persistence accuracy β_t are constructed with Dirichlet priors.
Specifically, the topic persistence accuracy α_t represents the persistence of a topic, i.e., the significance of topic k at the current time slice t compared with the topic at the previous time slice t-1; the current topic distribution θ_{t,k} depends on the previous topic distribution θ_{t-1,k}; the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges; θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends.
The word persistence accuracy β_t represents the persistence of a word, i.e., the persistence with which word w is assigned to topic k at the current time slice t compared with the previous time slice t-1; the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1; the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges; φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
3) establishing the dynamic self-aggregation topic model.
In order to realize self-aggregation of short texts, the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts and S is the assignment of the aggregated documents, a hidden variable representing the assignment relation between the aggregated documents and the short texts; the set of aggregated documents is D.
At the current time slice t, the modeling process of the dynamic self-aggregation topic model is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text; sample the topic assignment of the short text from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j;
θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment; φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
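To make the generative story concrete, here is a NumPy sketch of one forward pass; the array shapes and the fixed number of word pairs per document are assumptions made for illustration only.

    import numpy as np

    def generate_time_slice(theta_prev, phi_prev, alpha_t, beta_t, n_pairs, rng):
        """One pass of the generative process: theta_prev is (D, K), phi_prev is (K, V)."""
        D, K = theta_prev.shape
        V = phi_prev.shape[1]
        # phi_{t,k} ~ Dirichlet(beta_t * phi_{t-1,k}) for each topic k
        phi_t = np.stack([rng.dirichlet(beta_t * phi_prev[k] + 1e-12) for k in range(K)])
        # theta_{t,d} ~ Dirichlet(alpha_t * theta_{t-1,d}) for each aggregated document d
        theta_t = np.stack([rng.dirichlet(alpha_t * theta_prev[d] + 1e-12) for d in range(D)])
        corpus = []
        for d in range(D):
            doc = []
            for _ in range(n_pairs):
                z = rng.choice(K, p=theta_t[d])               # topic of the word pair
                w_i, w_j = rng.choice(V, size=2, p=phi_t[z])  # the two words of the pair
                doc.append((z, w_i, w_j))
            corpus.append(doc)
        return theta_t, phi_t, corpus

    # usage: theta_t, phi_t, corpus = generate_time_slice(..., rng=np.random.default_rng(0))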
Specifically, the representation of the dynamic self-aggregation topic model in this embodiment is shown in fig. 2.
In step S2, the dynamic self-aggregation topic model contains the hidden variable S; in order to compute the hidden variable, this embodiment adopts Gibbs sampling to derive the multinomial distributions in the dynamic self-aggregation topic model.
Specifically, in step S3, the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
1) at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
2) obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
specifically, iteratively sampling the aggregation assignment S of the short texts and the topic assignment k comprises:
(1) according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
(2) according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
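Since the two conditional distributions appear only as images, the sketch below shows just the bookkeeping skeleton of one Gibbs sweep; the counts object and the two conditional-probability callables are hypothetical stand-ins for the patent's formulas.

    import numpy as np

    def gibbs_sweep(short_texts, s, z, counts, cond_s, cond_k, rng):
        """One sweep: resample the document assignment s[R] of every short text R,
        then the topic assignment z[R][m] of every word pair; cond_s and cond_k
        return the (unnormalized) conditional probabilities of the model."""
        for R, pairs in enumerate(short_texts):
            counts.remove_short_text(R, s[R], z[R])   # the "short text R removed" counts
            p = np.asarray(cond_s(R, pairs, counts), dtype=float)
            s[R] = rng.choice(len(p), p=p / p.sum())
            counts.add_short_text(R, s[R], z[R])
        for R, pairs in enumerate(short_texts):
            for m, pair in enumerate(pairs):
                counts.remove_pair(s[R], pair, z[R][m])   # the "excluding k_dm" counts
                q = np.asarray(cond_k(s[R], pair, counts), dtype=float)
                z[R][m] = rng.choice(len(q), p=q / q.sum())
                counts.add_pair(s[R], pair, z[R][m])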
3) computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation.
Specifically, the new topic persistence accuracy α_t and word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original), making the topic persistence accuracy and the word persistence accuracy more accurate;
where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
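The exact maximum-likelihood updates are shown only as images in the original. As one common way to implement such an estimate, the sketch below performs a Minka-style digamma fixed-point step for a Dirichlet precision parameter with a fixed mean; whether this matches the patent's exact update is an assumption.

    import numpy as np
    from scipy.special import digamma

    def precision_step(a_old, prior_mean, counts):
        """One fixed-point step for a precision a with fixed mean rows.
        counts: (D, K) array of N_{t,d,k}; prior_mean: (D, K) rows of theta_{t-1,d}."""
        num = (prior_mean * (digamma(counts + a_old * prior_mean)
                             - digamma(a_old * prior_mean))).sum()
        den = (digamma(counts.sum(axis=1) + a_old) - digamma(a_old)).sum()
        return a_old * num / den  # iterate until convergence in practice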
4) counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Specifically, the topic distribution θ_t and the word distribution φ_t on time slice t are derived (the expressions are given only as images in the original).
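The expressions for θ_t and φ_t are images in the original; the sketch below uses the standard smoothed-count form that this family of models typically implies, which is an assumption here.

    import numpy as np

    def point_estimates(N_dk, N_kv, theta_prev, phi_prev, alpha_t, beta_t):
        """N_dk: (D, K) pair counts per document; N_kv: (K, V) pair counts per topic."""
        theta_t = N_dk + alpha_t * theta_prev
        theta_t /= theta_t.sum(axis=1, keepdims=True)   # topic distribution on slice t
        phi_t = N_kv + beta_t * phi_prev
        phi_t /= phi_t.sum(axis=1, keepdims=True)       # word distribution on slice t
        return theta_t, phi_t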
In step S4, the probability of each aggregated document being related to each topic is calculated according to the topic distribution and word distribution on each time slice deduced in step S3, and the short texts are adaptively aggregated.
Specifically, the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k (the expressions for the two probabilities and for their ratio are given only as images in the original).
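A sketch of the final aggregation rule as stated above; representing the two probabilities as per-topic vectors and aggregating a short text into the document with the best ratio is one possible reading, used here only for illustration.

    import numpy as np

    def choose_document(p_text_k, p_doc_k, eps=1e-12):
        """p_text_k: length-K per-topic probabilities of the short text;
        p_doc_k: (D, K) per-topic probabilities of each aggregated document.
        Score = p(text assigned to k) / p(document assigned to k)."""
        ratio = p_text_k[None, :] / (p_doc_k + eps)
        return int(np.argmax(ratio.max(axis=1)))  # best document over its best topic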
in summary, the short text aggregation method based on dynamic semantic modeling of the embodiment has the following effects:
1) the dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more consistent topics can be captured, the problem of sparsity of the short texts is solved, heuristic pre-processing or post-processing technologies are not needed, the model is simple, and the processing efficiency is high.
2) The novel Gibbs sampling algorithm can rapidly and effectively derive unknown variables, and calculates topic distribution and word distribution through sampling results, so as to realize topic modeling.
3) The method can effectively dynamically model the theme, the current theme is deduced by using the previously captured theme and newly arrived data as prior knowledge, further dynamic modeling of the theme is realized, and the problem that the traditional theme modeling can only be deduced based on static data is effectively solved.
The above description is only of the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of the present invention.

Claims (10)

1. A short text aggregation method based on dynamic semantic modeling, characterized by comprising the following steps:
acquiring the short text data to be aggregated on time slices of a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and finally counting the topic distribution and word distribution on each time slice when the sampling converges;
and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
2. The short text aggregation method based on dynamic semantic modeling according to claim 1, characterized in that, in the dynamic self-aggregation topic model, Dirichlet priors are adopted to construct the topic persistence accuracy α_t and the word persistence accuracy β_t.
3. The short text aggregation method based on dynamic semantic modeling according to claim 2, characterized in that the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges, where θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends;
the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1: the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges, where φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
4. The short text aggregation method based on dynamic semantic modeling according to claim 3, characterized in that
the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts, S is the assignment of the aggregated documents, and the set of aggregated documents is D;
at the current time slice t, the modeling process of the dynamic self-aggregation topic model is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text R; sample the topic assignment of short text R from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j, θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment, and φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
5. The short text aggregation method based on dynamic semantic modeling according to claim 4, characterized in that
word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original);
the formalized representation of the construction of word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}.
6. The short text aggregation method based on dynamic semantic modeling according to claim 5, characterized in that the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation;
and counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
7. The short text aggregation method based on dynamic semantic modeling according to claim 6, characterized in that the iterative sampling of the aggregation assignment S of the short texts and of the topic assignment k comprises:
according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
8. The short text aggregation method based on dynamic semantic modeling according to claim 7, characterized in that the new topic persistence accuracy α_t and the new word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original), where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
9. The short text aggregation method based on dynamic semantic modeling according to claim 1, characterized in that the multinomial distributions in the dynamic self-aggregation topic model are derived by Gibbs sampling, obtaining the topic distribution θ_t and the word distribution φ_t on time slice t (the expressions are given only as images in the original).
10. The short text aggregation method based on dynamic semantic modeling according to claim 9, characterized in that the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k.
CN202011479885.XA 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling Pending CN112446220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011479885.XA CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011479885.XA CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Publications (1)

Publication Number Publication Date
CN112446220A 2021-03-05

Family

ID=74739113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011479885.XA Pending CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Country Status (1)

Country Link
CN (1) CN112446220A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110343A1 (en) * 2014-10-21 2016-04-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
李雷 (Li Lei) et al.: "Topic mining based on the U_BTM model in social networks", Application Research of Computers *
牛亚男 (Niu Yanan): "Research on a probabilistic model for short text clustering with the ability to learn word discriminative power", Application Research of Computers *
石磊 (Shi Lei) et al.: "Dynamic Topic Modeling via Self-aggregation for Short Text Streams", Peer-to-Peer Networking and Applications *
石磊 (Shi Lei) et al.: "Bursty topic discovery in social networks based on RNN and topic model", Journal on Communications *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210305)