CN112446220A - Short text aggregation method based on dynamic semantic modeling - Google Patents

Short text aggregation method based on dynamic semantic modeling

Info

Publication number
CN112446220A
CN112446220A (application CN202011479885.XA)
Authority
CN
China
Prior art keywords
distribution
topic
word
short text
aggregation
Prior art date
Legal status
Pending
Application number
CN202011479885.XA
Other languages
Chinese (zh)
Inventor
石磊
崔斌
尹领昌
邹蕾
娄东东
李婷
马语菡
刘波
王岩
Current Assignee
Beijing Jinghang Computing Communication Research Institute
Original Assignee
Beijing Jinghang Computing Communication Research Institute
Priority date
Filing date
Publication date
Application filed by Beijing Jinghang Computing Communication Research Institute
Priority to CN202011479885.XA
Publication of CN112446220A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a short text aggregation method based on dynamic semantic modeling, which comprises the following steps: acquiring the short text data to be aggregated on time slices of a set interval, and preprocessing the data to form a data set; capturing the multinomial distributions of topics and of words in the data set by building a dynamic self-aggregation topic model on each time slice; inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice once the sampling converges; and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, thereby adaptively aggregating the short texts. The method automatically aggregates short texts into standard long documents, so that more coherent topics can be captured and the sparsity problem of short texts is alleviated; it needs no heuristic pre-processing or post-processing techniques, the model is simple, and processing efficiency is high.

Description

Short text aggregation method based on dynamic semantic modeling
Technical Field
The invention belongs to the technical field of short text aggregation, and particularly relates to a short text aggregation method based on dynamic semantic modeling.
Background
Short text semantic modeling applies certain operations and processing to massive short-text data so that short-text topics can be modeled automatically, without additional pre-processing or post-processing, and the topics a short text may contain can be inferred accurately. Dynamic topic modeling further requires that the method itself support streaming data, model the temporal attributes of topics, and dynamically infer the current topics from the previous topics and the newly acquired data. In particular, in the current social media environment, a large amount of short-text data with time attributes is generated every day, so a method is needed that supports dynamic topic modeling and that can not only process standard long-text data but also overcome the data-sparsity problem of short-text topic modeling. Conventional topic-model approaches are designed to model long-text data statically, and are less effective both at short-text topic modeling and at modeling the dynamics of topics.
Disclosure of Invention
In view of the above analysis, the present invention aims to disclose a short text aggregation method based on dynamic semantic modeling, so as to solve the problem of short text aggregation.
The invention discloses a short text aggregation method based on dynamic semantic modeling, which comprises the following steps:
acquiring the short text data to be aggregated on time slices of a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and finally counting the topic distribution and word distribution on each time slice when the sampling converges;
and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
Further, in the dynamic self-aggregation topic model, Dirichlet priors are adopted to construct the topic persistence accuracy α_t and the word persistence accuracy β_t.
Further, the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges; here θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends.
The word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1: the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges, where φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
Further, the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts and S is the assignment of the aggregated documents; the set of aggregated documents is D;
at the current time slice t, the dynamic self-aggregation topic model modeling process is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text R; sample the topic assignment of short text R from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j; θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment; φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
Further, word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original); the formalized representation of the construction of word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}.
Further, the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation;
and counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Further, the iterative sampling of the aggregation assignment S of the short texts and of the topic assignment k proceeds as follows:
according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
Further, the new topic persistence accuracy α_t and the new word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original); where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
Further, Gibbs sampling is adopted to derive the multinomial distributions in the dynamic self-aggregation topic model, obtaining the topic distribution θ_t and the word distribution φ_t on time slice t (the expressions are given only as images in the original).
Further, the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k.
The invention can realize at least one of the following beneficial effects:
1) The dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more coherent topics can be captured and the sparsity problem of short texts is alleviated; no heuristic pre-processing or post-processing techniques are needed, the model is simple, and processing efficiency is high.
2) The novel Gibbs sampling algorithm can derive the unknown variables quickly and effectively, and computes the topic distribution and word distribution from the sampling results, thereby realizing topic modeling.
3) The method can model topics dynamically and effectively: the current topics are inferred using the previously captured topics and the newly arrived data as prior knowledge, realizing dynamic modeling of topics and effectively overcoming the limitation that conventional topic modeling can only perform inference on static data.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of a short text aggregation method based on dynamic semantic modeling in this embodiment;
fig. 2 is a schematic representation diagram of the dynamic self-aggregation topic model in this embodiment.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
Typically, the topics of a social network change dynamically across time slices. The time slices can be formalized as {…, t-2, t-1, t, …}, and the interval of a time slice can be set to one day, one week, one month, and so on. In order to adaptively aggregate short texts into long documents, this embodiment discloses a short text aggregation method based on dynamic semantic modeling, which, as shown in fig. 1, comprises the following steps:
step S1, acquiring short text data from the social network on time slices of a set interval, and performing data preprocessing to form a data set;
step S2, capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
step S3, inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and counting the topic distribution and word distribution on each time slice when the sampling converges;
step S4, calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
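As an illustration of the time slicing used in step S1, the following is a minimal Python sketch; the function name and the (timestamp, text) data layout are assumptions for the example, not part of the patent.

    from collections import defaultdict
    from datetime import timedelta

    def slice_by_interval(posts, interval=timedelta(days=1)):
        """Group (timestamp, text) pairs into consecutive time slices of a set interval."""
        slices = defaultdict(list)
        t0 = min(ts for ts, _ in posts)
        for ts, text in posts:
            slices[int((ts - t0) / interval)].append(text)
        # return the slices in chronological order; empty slices are skipped here
        return [slices[i] for i in sorted(slices)]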
Specifically, in step S1, the short text data crawled from the social network is preprocessed: repeated texts and short texts with fewer than 3 words are deleted; the texts are segmented into words, stop words are deleted, and words occurring fewer than 8 times are deleted; a data set for aggregation is thus obtained.
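A minimal sketch of this preprocessing in Python; the tokenizer choice (jieba) and the stop-word list are assumptions for illustration, not prescribed by the patent.

    from collections import Counter
    import jieba  # one possible Chinese word-segmentation library

    STOPWORDS = {"的", "了", "是"}  # placeholder stop-word list

    def preprocess(raw_texts):
        texts = list(dict.fromkeys(raw_texts))  # delete repeated texts
        docs = [[w for w in jieba.lcut(t) if w.strip() and w not in STOPWORDS]
                for t in texts]
        freq = Counter(w for doc in docs for w in doc)
        docs = [[w for w in doc if freq[w] >= 8] for doc in docs]  # drop words seen < 8 times
        return [doc for doc in docs if len(doc) >= 3]              # drop texts with < 3 words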
Specifically, in step S2, capturing the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words by building a dynamic self-aggregation topic model on each time slice comprises the following sub-steps:
1) extracting word pairs;
the embodiment of the present invention uses the co-occurrence of word pairs, rather than the word co-occurrence of traditional approaches, to generate topics. Each word pair contains two unordered words, and word pairs are extracted independently from the same topic. The formalized representation of the construction of the word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}; word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original).
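A sketch of the extraction in Python, under the assumption that the image formula takes the unordered pairs within each window of three consecutive words; this window reading is an assumption, since only the i ≠ j pair construction is stated explicitly above.

    def extract_word_pairs(words, window=3):
        """Collect unordered word pairs from each window of consecutive words."""
        pairs = set()
        for start in range(max(1, len(words) - window + 1)):
            win = words[start:start + window]
            for i in range(len(win)):
                for j in range(i + 1, len(win)):
                    if win[i] != win[j]:  # i != j in the formal definition
                        pairs.add(tuple(sorted((win[i], win[j]))))
        return sorted(pairs)

    # e.g. extract_word_pairs(["dynamic", "topic", "model", "stream"])
    # yields the pairs of the windows (dynamic, topic, model) and (topic, model, stream)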
2) in the dynamic self-aggregation topic model, constructing the topic persistence accuracy α_t and the word persistence accuracy β_t;
because, in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and word distribution φ_{t-1,v} of the previous time slice, the topic persistence accuracy α_t and the word persistence accuracy β_t are constructed with Dirichlet priors.
Specifically, the topic persistence accuracy α_t represents the persistence of a topic, i.e., the significance of topic k at the current time slice t compared with the topic at the previous time slice t-1; the current topic distribution θ_{t,k} depends on the previous topic distribution θ_{t-1,k}; the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges; θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends.
The word persistence accuracy β_t represents the persistence of a word, i.e., the persistence with which word w is assigned to topic k at the current time slice t compared with the previous time slice t-1; the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1; the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges; φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
3) establishing the dynamic self-aggregation topic model.
In order to realize self-aggregation of short texts, the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts and S is the assignment of the aggregated documents, a hidden variable representing the assignment relation between the aggregated documents and the short texts; the set of aggregated documents is D.
At the current time slice t, the modeling process of the dynamic self-aggregation topic model is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text; sample the topic assignment of the short text from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j;
θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment; φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
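To make the generative story concrete, here is a NumPy sketch of one forward pass; the array shapes and the fixed number of word pairs per document are assumptions made for illustration only.

    import numpy as np

    def generate_time_slice(theta_prev, phi_prev, alpha_t, beta_t, n_pairs, rng):
        """One pass of the generative process: theta_prev is (D, K), phi_prev is (K, V)."""
        D, K = theta_prev.shape
        V = phi_prev.shape[1]
        # phi_{t,k} ~ Dirichlet(beta_t * phi_{t-1,k}) for each topic k
        phi_t = np.stack([rng.dirichlet(beta_t * phi_prev[k] + 1e-12) for k in range(K)])
        # theta_{t,d} ~ Dirichlet(alpha_t * theta_{t-1,d}) for each aggregated document d
        theta_t = np.stack([rng.dirichlet(alpha_t * theta_prev[d] + 1e-12) for d in range(D)])
        corpus = []
        for d in range(D):
            doc = []
            for _ in range(n_pairs):
                z = rng.choice(K, p=theta_t[d])               # topic of the word pair
                w_i, w_j = rng.choice(V, size=2, p=phi_t[z])  # the two words of the pair
                doc.append((z, w_i, w_j))
            corpus.append(doc)
        return theta_t, phi_t, corpus

    # usage: theta_t, phi_t, corpus = generate_time_slice(..., rng=np.random.default_rng(0))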
Specifically, the representation of the dynamic self-aggregation topic model in this embodiment is shown in fig. 2.
In step S2, the dynamic self-aggregation topic model contains the hidden variable S; in order to compute the hidden variable, this embodiment adopts Gibbs sampling to derive the multinomial distributions in the dynamic self-aggregation topic model.
Specifically, in step S3, the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
1) at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
2) obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
specifically, iteratively sampling the aggregation assignment S of the short texts and the topic assignment k comprises:
(1) according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
(2) according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
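Since the two conditional distributions appear only as images, the sketch below shows just the bookkeeping skeleton of one Gibbs sweep; the counts object and the two conditional-probability callables are hypothetical stand-ins for the patent's formulas.

    import numpy as np

    def gibbs_sweep(short_texts, s, z, counts, cond_s, cond_k, rng):
        """One sweep: resample the document assignment s[R] of every short text R,
        then the topic assignment z[R][m] of every word pair; cond_s and cond_k
        return the (unnormalized) conditional probabilities of the model."""
        for R, pairs in enumerate(short_texts):
            counts.remove_short_text(R, s[R], z[R])   # the "short text R removed" counts
            p = np.asarray(cond_s(R, pairs, counts), dtype=float)
            s[R] = rng.choice(len(p), p=p / p.sum())
            counts.add_short_text(R, s[R], z[R])
        for R, pairs in enumerate(short_texts):
            for m, pair in enumerate(pairs):
                counts.remove_pair(s[R], pair, z[R][m])   # the "excluding k_dm" counts
                q = np.asarray(cond_k(s[R], pair, counts), dtype=float)
                z[R][m] = rng.choice(len(q), p=q / q.sum())
                counts.add_pair(s[R], pair, z[R][m])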
3) computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation.
Specifically, the new topic persistence accuracy α_t and word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original), making the topic persistence accuracy and the word persistence accuracy more accurate;
where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
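The exact maximum-likelihood updates are shown only as images in the original. As one common way to implement such an estimate, the sketch below performs a Minka-style digamma fixed-point step for a Dirichlet precision parameter with a fixed mean; whether this matches the patent's exact update is an assumption.

    import numpy as np
    from scipy.special import digamma

    def precision_step(a_old, prior_mean, counts):
        """One fixed-point step for a precision a with fixed mean rows.
        counts: (D, K) array of N_{t,d,k}; prior_mean: (D, K) rows of theta_{t-1,d}."""
        num = (prior_mean * (digamma(counts + a_old * prior_mean)
                             - digamma(a_old * prior_mean))).sum()
        den = (digamma(counts.sum(axis=1) + a_old) - digamma(a_old)).sum()
        return a_old * num / den  # iterate until convergence in practice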
4) counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
Specifically, the topic distribution θ_t and the word distribution φ_t on time slice t are derived (the expressions are given only as images in the original).
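The expressions for θ_t and φ_t are images in the original; the sketch below uses the standard smoothed-count form that this family of models typically implies, which is an assumption here.

    import numpy as np

    def point_estimates(N_dk, N_kv, theta_prev, phi_prev, alpha_t, beta_t):
        """N_dk: (D, K) pair counts per document; N_kv: (K, V) pair counts per topic."""
        theta_t = N_dk + alpha_t * theta_prev
        theta_t /= theta_t.sum(axis=1, keepdims=True)   # topic distribution on slice t
        phi_t = N_kv + beta_t * phi_prev
        phi_t /= phi_t.sum(axis=1, keepdims=True)       # word distribution on slice t
        return theta_t, phi_t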
In step S4, the probability of each aggregated document being related to each topic is calculated according to the topic distribution and word distribution on each time slice deduced in step S3, and the short texts are adaptively aggregated.
Specifically, the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k (the expressions for the two probabilities and for their ratio are given only as images in the original).
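A sketch of the final aggregation rule as stated above; representing the two probabilities as per-topic vectors and aggregating a short text into the document with the best ratio is one possible reading, used here only for illustration.

    import numpy as np

    def choose_document(p_text_k, p_doc_k, eps=1e-12):
        """p_text_k: length-K per-topic probabilities of the short text;
        p_doc_k: (D, K) per-topic probabilities of each aggregated document.
        Score = p(text assigned to k) / p(document assigned to k)."""
        ratio = p_text_k[None, :] / (p_doc_k + eps)
        return int(np.argmax(ratio.max(axis=1)))  # best document over its best topic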
in summary, the short text aggregation method based on dynamic semantic modeling of the embodiment has the following effects:
1) the dynamic topic modeling method models topics by automatically aggregating short texts into standard long documents, so that more consistent topics can be captured, the problem of sparsity of the short texts is solved, heuristic pre-processing or post-processing technologies are not needed, the model is simple, and the processing efficiency is high.
2) The novel Gibbs sampling algorithm can rapidly and effectively derive unknown variables, and calculates topic distribution and word distribution through sampling results, so as to realize topic modeling.
3) The method can effectively dynamically model the theme, the current theme is deduced by using the previously captured theme and newly arrived data as prior knowledge, further dynamic modeling of the theme is realized, and the problem that the traditional theme modeling can only be deduced based on static data is effectively solved.
The above description is only of the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of the present invention.

Claims (10)

1. A short text aggregation method based on dynamic semantic modeling, characterized by comprising the following steps:
acquiring the short text data to be aggregated on time slices of a set interval, and performing data preprocessing to form a data set;
capturing, on each time slice, the multinomial distribution θ_{t,k} of topics and the multinomial distribution φ_{t,v} of words in the data set by building a dynamic self-aggregation topic model; t is the time slice, k = 1,2,…,K, where K is the number of topics in the data set; v = 1,2,…,V, where V is the number of words in the data set;
in building the dynamic self-aggregation topic model, the multinomial topic distribution θ_{t,k} and the multinomial word distribution φ_{t,v} of a time slice depend on the multinomial topic distribution θ_{t-1,k} and the multinomial word distribution φ_{t-1,v} of the previous time slice;
inferring the multinomial distributions in the dynamic self-aggregation topic model by Gibbs sampling, and finally counting the topic distribution and word distribution on each time slice when the sampling converges;
and calculating the probability of a short-text aggregation relating to each topic according to the topic distribution and word distribution on each time slice, and adaptively aggregating the short texts.
2. The short text aggregation method based on dynamic semantic modeling according to claim 1, characterized in that, in the dynamic self-aggregation topic model, Dirichlet priors are adopted to construct the topic persistence accuracy α_t and the word persistence accuracy β_t.
3. The short text aggregation method based on dynamic semantic modeling according to claim 2, characterized in that the topic distribution is inferred using the Gibbs sampling algorithm, θ_{t,k} ~ Dirichlet(α_t θ_{t-1,k}), and the topic persistence accuracy α_t is obtained when the sampling converges, where θ_{t-1,k} is the previous topic distribution on which the current topic distribution θ_{t,k} depends;
the word distribution φ_t of the current time slice t is inferred from the word distribution φ_{t-1} of the previous time slice t-1: the word distribution is inferred using the Gibbs sampling algorithm, φ_{t,v} ~ Dirichlet(β_t φ_{t-1,v}), and the word persistence accuracy β_t is obtained when the sampling converges, where φ_{t-1,v} is the previous word distribution on which the current word distribution φ_{t,v} depends.
4. The short text aggregation method based on dynamic semantic modeling according to claim 3, characterized in that
the short text data in the data set is represented by {R, S}, where R denotes the unordered short texts, S is the assignment of the aggregated documents, and the set of aggregated documents is D;
at the current time slice t, the modeling process of the dynamic self-aggregation topic model is as follows:
1) for each topic k = 1,…,K in the text data set, use the learned word persistence accuracy β_t and the word distribution φ_{t-1} of the previous time slice t-1 to sample the word distribution of the current time slice t, φ_{t,k} ~ Dirichlet(β_t φ_{t-1,k}), where the word distribution φ_t is a multinomial distribution and Dirichlet denotes the Dirichlet distribution;
2) for each aggregated document d ∈ D in the text data set, use the learned topic persistence accuracy α_t and the previous topic distribution θ_{t-1} to sample the topic distribution θ_{t,d} ~ Dirichlet(α_t θ_{t-1,d}), where the topic distribution θ_t is a multinomial distribution;
3) generate topics using the co-occurrence of word pairs: extract word pairs B_w independently from the short text R; sample the topic assignment of short text R from a multinomial distribution, k_{Rn} ~ Multinomial(θ_{s,d}); and sample the words of word pair B_w from a multinomial distribution, w_i, w_j ~ Multinomial(φ_{z,d}), where z is the topic allocation of word w_i or w_j, θ_{s,d} is the topic distribution of aggregated document d based on the short-text assignment, and φ_{z,d} is the topic distribution of document d based on the word assignment and aggregation.
5. The short text aggregation method based on dynamic semantic modeling according to claim 4, characterized in that
word pairs are extracted over every three consecutive words w_i, w_j, w_l of the short text (the extraction formula is given only as an image in the original);
the formalized representation of the construction of word pairs is B_w = {(w_i, w_j) | w_i, w_j ∈ d, i ≠ j}.
6. The short text aggregation method based on dynamic semantic modeling according to claim 5, characterized in that the derivation of the multinomial distributions in the dynamic self-aggregation topic model using Gibbs sampling comprises:
at the initial time, first assigning the values θ_{0,k} = 1/K and φ_{0,v} = 1/V;
obtaining the conditional probability distributions through Gibbs sampling, and iteratively sampling the aggregated-document assignment S and the topic assignment k;
computing the new topic persistence accuracy α_t and word persistence accuracy β_t in conjunction with maximum likelihood estimation;
and counting the topic distribution and the word distribution on time slice t according to the new topic persistence accuracy α_t and word persistence accuracy β_t.
7. The short text aggregation method based on dynamic semantic modeling according to claim 6, characterized in that the iterative sampling of the aggregation assignment S of the short texts and of the topic assignment k comprises:
according to Gibbs sampling and the chain rule, the assignment S of a short text is sampled from a conditional distribution (the formula is given only as an image in the original);
where k denotes a topic; N_R denotes the total number of word pairs B_w in the short text R; N_{R,k} denotes the number of word pairs assigned to topic k in the short text R; N_{t,d,k} denotes the number of word pairs B_w assigned to topic k in aggregated document d; N_{t,d} denotes the total number of word pairs B_w in aggregated document d; a superscript ¬R marks a count computed with the short text R removed; and n denotes the current count;
according to Gibbs sampling and the chain rule, the topic assignment k_{dm} is sampled from a conditional distribution (the formula is given only as an image in the original), where dm denotes the position coordinate point at which the m-th word of aggregated document d is located;
w_i, w_j are the i-th and j-th words in aggregated document d; N_k denotes the total number of word pairs in topic k; N_{d,k} denotes the number of word pairs assigned to topic k in aggregated document d; a superscript ¬dm marks a count that excludes k_{dm}, including the totals of word w_i and of word w_j assigned to topic k; and V denotes the total number of all words.
8. The short text aggregation method based on dynamic semantic modeling according to claim 7, characterized in that the new topic persistence accuracy α_t and the new word persistence accuracy β_t are computed in conjunction with maximum likelihood estimation (the update formulas are given only as images in the original), where N_{k,v} denotes the number of word pairs assigned to word v in topic k, and Ψ denotes the digamma function.
9. The short text aggregation method based on dynamic semantic modeling according to claim 1, characterized in that the multinomial distributions in the dynamic self-aggregation topic model are derived by Gibbs sampling, obtaining the topic distribution θ_t and the word distribution φ_t on time slice t (the expressions are given only as images in the original).
10. The short text aggregation method based on dynamic semantic modeling according to claim 9, characterized in that the probability of the aggregated document d being related to topic k is the probability of the short text being assigned to topic k divided by the probability of the aggregated document being assigned to topic k.
CN202011479885.XA 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling Pending CN112446220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011479885.XA CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011479885.XA CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Publications (1)

Publication Number Publication Date
CN112446220A 2021-03-05

Family

ID=74739113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011479885.XA Pending CN112446220A (en) 2020-12-15 2020-12-15 Short text aggregation method based on dynamic semantic modeling

Country Status (1)

Country Link
CN (1) CN112446220A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110343A1 (en) * 2014-10-21 2016-04-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
李雷 (Li Lei) et al.: "Topic mining based on the U_BTM model in social networks", Application Research of Computers *
牛亚男 (Niu Yanan): "Research on a probabilistic model for short text clustering with the ability to learn word discriminative power", Application Research of Computers *
石磊 (Shi Lei) et al.: "Dynamic Topic Modeling via Self-aggregation for Short Text Streams", Peer-to-Peer Networking and Applications *
石磊 (Shi Lei) et al.: "Bursty topic discovery in social networks based on RNN and topic model", Journal on Communications *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210305)