CN109558546A - Method and device for generating a microblog topic representation model based on behavior analysis - Google Patents
- Publication number
- CN109558546A (application CN201811315209.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The present invention discloses a method and device for generating a microblog topic representation model based on behavior analysis. The method comprises: step S1, combining the documents that microblog users post, forward, and comment on to generate a user document set; step S2, generating a topic model from the user document set using the LDA model; step S3, calculating, for each lexical item of each topic, the behavior-analysis-based internal weight of the lexical item; step S4, calculating, for each lexical item in the user document set, the behavior-analysis-based external weight of the lexical item; step S5, calculating the behavior-analysis-based comprehensive weight of each lexical item under each topic from the obtained weights; and step S6, calculating the behavior-analysis-based topic representation model for each topic from the obtained comprehensive weights. By incorporating user behavior factors into the topic model, the invention can improve the accuracy of subsequent topic discovery, evolution analysis, and the like performed with the topic model.
Description
Technical field
The present invention relates to the technical field of microblog topic representation modeling, and in particular to a method and device for generating a microblog topic representation model based on behavior analysis.
Background
At present, the Internet has gradually developed into a ubiquitous platform for information dissemination and computing, and the social network services born from it have developed rapidly and become increasingly popular. More and more people use social platforms to discuss content, express opinions, and share information, which generates hundreds of millions of messages every day. Discovering new topics quickly and accurately at this data scale is vital for information recommendation, public opinion monitoring, and similar tasks. One of the basic research tasks of topic discovery is how to represent a topic: every topic discovery method is built on a specific topic representation model, and the same topic discovery method may perform very differently under different topic representation models. Research on topic representation models is therefore particularly important.
Since their appearance, topic models have become a mainstream technique in many fields, such as topic discovery, multi-document summarization, word sense identification and disambiguation, sentiment analysis, and information retrieval. These fields obtain topics by training a topic model. To make topics easier for users to understand, the question of how to choose a representative set of lexical items to represent a topic deserves more attention.
Formally, a topic is a multinomial distribution over lexical items, and each lexical item has an exact probability in each topic, so a topic can be represented by the set of the several, or ten or so, lexical items with the highest probability. As a simple example, the following table gives the distribution of the topics "sport", "news", and "amusement" over lexical items. If the set of the three highest-probability lexical items is chosen to represent each topic, then "sport" is represented by {champion, match, basketball}, "news" by {president, concert, champion}, and "amusement" by {concert, singer, champion}.
Table 1. Distribution of topics over lexical items
Topic | Basketball | Singer | President | News conference | Match | Champion | Concert | Election contest |
Sport | 0.2 | 0.02 | 0 | 0.08 | 0.3 | 0.4 | 0 | 0 |
News | 0.1 | 0.1 | 0.2 | 0 | 0.1 | 0.2 | 0.2 | 0.1 |
Amusement | 0 | 0.3 | 0 | 0.1 | 0.1 | 0.2 | 0.3 | 0 |
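For illustration only, the selection described for Table 1 can be sketched in Python as follows. The names `TERMS`, `TABLE_1`, and `top_terms` are hypothetical and not part of the invention; ties are assumed to be broken by column order.

```python
# Topic-lexical item probabilities copied from Table 1.
TERMS = ["basketball", "singer", "president", "news conference",
         "match", "champion", "concert", "election contest"]
TABLE_1 = {
    "sport":     [0.2, 0.02, 0.0, 0.08, 0.3, 0.4, 0.0, 0.0],
    "news":      [0.1, 0.1, 0.2, 0.0, 0.1, 0.2, 0.2, 0.1],
    "amusement": [0.0, 0.3, 0.0, 0.1, 0.1, 0.2, 0.3, 0.0],
}


def top_terms(probs, n=3):
    """Return the n lexical items with the highest probability.

    sorted() is stable, so ties keep the column order of Table 1
    (an assumption of this sketch).
    """
    ranked = sorted(zip(TERMS, probs), key=lambda pair: -pair[1])
    return [term for term, _ in ranked[:n]]


representation = {topic: top_terms(p) for topic, p in TABLE_1.items()}
```

In this example "champion" appears in the representation of all three topics, so a representation chosen by probability alone does little to distinguish one topic from another.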
The most commonly used topic representation model at present is the LDA (Latent Dirichlet Allocation) model. LDA is a topic model that assumes each document is composed of k topics and each topic has a fixed probability distribution over lexical items. LDA gives, in the form of probability distributions, the topic distribution of every document in the document set and the lexical item distribution of every topic. It is also an unsupervised learning algorithm: training requires no manually labeled training set, only the document set and the specified number of topics k.
Gibbs sampling is widely used for parameter estimation of the model because it is easy to understand and simple to implement. The parameter estimation process is briefly as follows. Initially, a topic is assigned at random to each lexical item in the text; then the number of occurrences of each lexical item under each topic and the number of lexical items of each topic in each document are counted. In each round, the topic assignment of the current lexical item is excluded, and the probability of assigning the current lexical item to each topic is estimated from the topic assignments of all other lexical items. Once the probability distribution of the current lexical item over all topics is obtained, a new topic is sampled for that lexical item from this distribution. The topics of all lexical items are updated repeatedly in the same way until the topic distribution of each document and the lexical item distribution of each topic converge, at which point iteration stops and the estimated parameters are output.
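The estimation process above corresponds to what is commonly called collapsed Gibbs sampling for LDA. A minimal Python sketch is given below as an illustration, not as the implementation used by the invention. The function name `gibbs_lda`, the Dirichlet hyperparameters `alpha` and `beta`, and the fixed iteration count (used here in place of a convergence test) are all assumptions of this sketch; documents are assumed to be lists of integer lexical item ids.

```python
import random


def gibbs_lda(docs, k, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of lexical item ids."""
    rng = random.Random(seed)
    # Initially assign a random topic to every lexical item occurrence.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    n_dk = [[0] * k for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(k)]  # lexical item counts per topic
    n_k = [0] * k                                # total occurrences per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d][t] += 1
            n_kw[t][w] += 1
            n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Exclude the current assignment from the counts.
                t = z[d][i]
                n_dk[d][t] -= 1
                n_kw[t][w] -= 1
                n_k[t] -= 1
                # Unnormalized probability of each topic given all others.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in range(k)
                ]
                # Sample a new topic and restore the counts.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][w] += 1
                n_k[t] += 1
    # Document-topic and topic-lexical item distributions.
    theta = [[(n_dk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[j][w] + beta) / (n_k[j] + vocab_size * beta)
            for w in range(vocab_size)] for j in range(k)]
    return theta, phi
```

Here `theta` plays the role of the topic distribution of each document and `phi` the lexical item distribution of each topic; a practical implementation would typically monitor convergence rather than run a fixed number of rounds.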
Therefore, with the LDA model and Gibbs sampling, the topic distribution of each document and the lexical item distribution of each topic can currently be obtained from a document set by an unsupervised learning algorithm, with no need for a manually labeled training set.
At present the LDA model is widely used in text mining, including text topic detection, text classification, and text similarity computation. Many improvements to the LDA model have also been proposed, mainly addressing the poor readability and weak semantic relevance of topic representations based on lexical item sets. Some related studies introduce external knowledge into the model to broaden the application scenarios of topic representation based on lexical item sets: Kitajima et al. take event factors into account and replace the lexical items in the LDA model with events or single verbs; Sridhar et al. incorporate phrase elements into the traditional topic model to represent topics; Wang et al., based on Wikipedia entries, map a topic to a vector over entries and use the readability of the entries to improve the readability of the topic. These works can, to some extent, remedy the poor readability and weak semantic relevance of representations based on lexical item sets, but they do not address how to choose lexical items in a topic that are more discriminative.
Existing topic representation models mainly target long and well-formed text; for example, the topic discovery tasks of TDT (Topic Detection and Tracking) mainly target news streams. Microblog text, on the one hand, is short and uses casual wording, which breaks the conventional assumption for long text that one document contains multiple topics. For example, Sina Weibo limits a post to 140 characters, and such a short document generally has only one topic. On the other hand, microblogs contain a great deal of user behavior information, such as forwarding and commenting, which is also valuable for identifying and representing topics, yet traditional topic representation methods do not take user behavior into account. Existing topic models therefore perform unsatisfactorily when used for microblog topic representation and need to be improved.
Summary of the invention
To overcome the above deficiencies of the prior art, one object of the present invention is to provide a method and device for generating a microblog topic representation model based on behavior analysis, so as to solve the problem that short microblog texts carry insufficient information for topic analysis.
Another object of the present invention is to provide a method and device for generating a microblog topic representation model based on behavior analysis which, by incorporating user behavior factors into the topic model, improve the accuracy of subsequent topic discovery, evolution analysis, and the like performed with the topic model.
In view of the above and other objects, the present invention proposes a method for generating a microblog topic representation model based on behavior analysis, comprising the following steps:
Step S1: combine the documents that microblog users post, forward, and comment on to generate a user document set;
Step S2: generate a topic model from the generated user document set using the LDA model;
Step S3: for each lexical item of each topic in the user document set, calculate the behavior-analysis-based internal weight of the lexical item;
Step S4: for each lexical item in the user document set, calculate the behavior-analysis-based external weight of the lexical item;
Step S5: from the behavior-analysis-based internal and external weights obtained in steps S3 and S4, calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic;
Step S6: from the obtained comprehensive weight of each lexical item under each topic, calculate the behavior-analysis-based topic representation model for each topic.
Preferably, the method further includes the following step after step S6:
Step S7: from the behavior-analysis-based topic representation model of each topic obtained in step S6 and the topic model obtained with LDA in step S2, calculate the behavior-analysis-based LDA topic representation for each topic to obtain the final topic representation model.
Preferably, step S2 further comprises:
Step S200: generate the document-topic model and the topic-lexical item model $\rho(\theta)_{LDA}$ from the user document set using the LDA model;
Step S201: for each document in the user document set, choose the topic with the highest probability in the document-topic model generated by the LDA model as the topic of the document.
Preferably, step S3 further comprises:
Step S300: calculate the internal weight $H(w,\theta,b)_{inside}$ of the lexical item separately for each behavior type;
Step S301: from the per-behavior internal weights $H(w,\theta,b)_{inside}$, calculate the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item.
Preferably, the behavior-analysis-based internal weight of the lexical item is calculated as follows:
$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$
where $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item w under topic θ for behavior type b; D(θ, b) denotes the set of documents under topic θ and behavior b; $TF_{wi}$ is the frequency of lexical item w in document $D_i$; $TF_w$ is the sum of the frequencies of lexical item w over all documents in the behavior document set D(θ, b); and σ, μ, τ are the weight factors of the different behaviors.
Preferably, step S4 further comprises:
Step S400: calculate the external weight $H(w,b)_{outside}$ of the lexical item separately for each behavior type;
Step S401: from the per-behavior external weights $H(w,b)_{outside}$, calculate the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item.
Preferably, the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item is calculated as follows:
$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$
where $H(w,b)_{outside}$ denotes the behavior external weight of lexical item w over all topics for behavior type b; D(b) denotes the set of documents under behavior b among all documents; k is the number of topics; $DF_{wj}$ denotes the number of documents containing lexical item w in the document set of topic j; $DF_w$ is the number of all documents in the corpus that contain lexical item w; and σ, μ, τ are the weight factors of the different behaviors.
Preferably, in step S5, the behavior-analysis-based comprehensive weight of each lexical item under each topic is calculated as follows:
To achieve the above objects, the present invention also provides a device for generating a microblog topic representation model based on behavior analysis, comprising:
a user document set generation unit, configured to combine the documents that microblog users post, forward, and comment on to generate a user document set;
an initial topic model generation unit, configured to generate a topic model from the generated user document set using the LDA model;
a behavior-analysis-based internal weight calculation unit, configured to calculate, for each lexical item of each topic in the user document set, the behavior-analysis-based internal weight of the lexical item;
a behavior-analysis-based external weight calculation unit, configured to calculate, for each lexical item in the user document set, the behavior-analysis-based external weight of the lexical item;
a comprehensive weight calculation unit, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the internal and external weights obtained by the behavior-analysis-based internal weight calculation unit and the behavior-analysis-based external weight calculation unit;
a behavior-analysis-based topic representation model calculation unit, configured to calculate the behavior-analysis-based topic representation model for each topic from the obtained comprehensive weight of each lexical item under each topic.
Preferably, the device further comprises:
a topic representation model generation unit, configured to calculate the behavior-analysis-based LDA topic representation for each topic from the behavior-analysis-based topic representation model of each topic obtained by the behavior-analysis-based topic representation model calculation unit and the topic model obtained with LDA by the initial topic model generation unit, so as to obtain the final topic representation model.
Compared with the prior art, the method and device for generating a microblog topic representation model based on behavior analysis of the present invention address the characteristics of short microblog texts. Forwarded and commented microblogs are processed together with the original post as one document. Of the multiple topics that the LDA model yields for a document, only one is chosen as the topic of the document when building the topic model. Based on the topic-lexical item distribution obtained by the LDA model, the distribution of lexical items within the different user behaviors is considered, and the user may set the impact factor of each user behavior on the topic model. User behavior factors are thereby incorporated into the topic model, making the resulting topic representation model more accurate and more targeted, and improving the accuracy of subsequent algorithms, such as topic discovery and evolution analysis, that use the topic model.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method for generating a microblog topic representation model based on behavior analysis of the present invention;
Fig. 2 is a system architecture diagram of the device for generating a microblog topic representation model based on behavior analysis of the present invention.
Detailed description
Embodiments of the present invention are described below through specific examples and with reference to the drawings; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other, different specific examples, and the details in this specification may be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.
Fig. 1 is a flow chart of the steps of the method for generating a microblog topic representation model based on behavior analysis of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S1: combine the documents that microblog users post, forward, and comment on to generate a user document set. In a specific embodiment of the invention, the documents that microblog users post, forward, and comment on are organized; a user's comment forms one document together with the original post, and every document carries a behavior label $b_i$, namely b1 (post), b2 (forward), or b3 (comment).
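As an illustration of step S1, a labeled document could be built as in the Python sketch below. The class `UserDocument`, the function `build_document`, and the choice to join the texts with a space are hypothetical and serve only to show one document per behavior with its label attached.

```python
from dataclasses import dataclass

# Behavior labels named in step S1.
BEHAVIORS = {"b1": "post", "b2": "forward", "b3": "comment"}


@dataclass
class UserDocument:
    text: str      # merged text treated as one document
    behavior: str  # behavior label: "b1", "b2" or "b3"


def build_document(behavior, own_text, original_text=""):
    """Merge the user's text with the original post for forwards and comments."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior label: {behavior}")
    # A plain post (b1) has no original post to merge with.
    parts = [own_text] if behavior == "b1" else [own_text, original_text]
    return UserDocument(" ".join(p for p in parts if p), behavior)
```

For example, `build_document("b3", "great game", "finals tonight")` yields a single document labeled b3 whose text contains both the comment and the original post.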
Step S2: generate a topic model from the generated user document set using the LDA model.
Specifically, step S2 further comprises:
Step S200: generate the "document-topic" model and the "topic-lexical item" model $\rho(\theta)_{LDA}$ from the user document set using the LDA model.
Step S201: for each document in the user document set, choose the topic with the highest probability in the document-topic model generated by the LDA model as the topic of the document. Note that, for short microblog texts, each document corresponds to only one topic.
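Step S201 amounts to taking the most probable topic of each document. The Python sketch below illustrates this under the assumption that the document-topic model is available as a list of probability rows; the function names are hypothetical. The grouping helper shows how documents might then be collected by topic and behavior label for the weight calculations.

```python
from collections import defaultdict


def dominant_topic(doc_topic_probs):
    """Index of the most probable topic in one row of the document-topic model."""
    return max(range(len(doc_topic_probs)), key=doc_topic_probs.__getitem__)


def assign_topics(theta):
    """Choose one topic per document, as in step S201."""
    return [dominant_topic(row) for row in theta]


def group_by_topic_and_behavior(doc_topics, behaviors):
    """Map (topic, behavior label) to the ids of the documents it covers."""
    groups = defaultdict(list)
    for doc_id, (topic, behavior) in enumerate(zip(doc_topics, behaviors)):
        groups[(topic, behavior)].append(doc_id)
    return dict(groups)
```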
Step S3: for each lexical item of each topic in the user document set, calculate the behavior-analysis-based internal weight of the lexical item.
Specifically, step S3 further comprises:
Step S300: calculate the internal weight $H(w,\theta,b)_{inside}$ of the lexical item separately for each behavior type.
The internal weight describes how uniformly a lexical item is distributed among the documents of a specific topic. In microblogs, different behaviors are considered to affect the internal weight differently, so the behavior internal weight of a lexical item is first discussed separately for each behavior type.
The behavior internal weight describes, within a specific topic, how uniformly a lexical item is distributed among the documents of a given behavior type. The more uniform the distribution, the better the lexical item represents the characteristics of that behavior under the specific topic. It is calculated by formula 1.
In formula 1, $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item w under topic θ for behavior type b; D(θ, b) denotes the set of documents under topic θ and behavior b; $TF_{wi}$ is the frequency of lexical item w in document $D_i$; and $TF_w$ is the sum of the frequencies of lexical item w over all documents in the behavior document set D(θ, b).
The larger the behavior internal weight, the more uniformly the lexical item is distributed under a given behavior of the specific topic, and so the better it represents the characteristics of that behavior under the specific topic. The best case is when the lexical item appears with the same frequency in every document of the behavior document set D(θ, b).
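Formula 1 is not reproduced in this text. Purely as an assumption, the Python sketch below uses a normalized Shannon entropy over the shares $TF_{wi}/TF_w$, which matches the properties described above: it is largest when the lexical item appears with the same frequency in every document of D(θ, b) and smallest when all occurrences fall in a single document. The function names and the normalization by the logarithm of the number of documents are hypothetical.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of the count shares, scaled to [0, 1].

    Assumed form only; the exact formula 1 is not given in the text.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    shares = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in shares) / math.log(len(counts))


def internal_weight(term_freqs):
    """term_freqs[i] is TF_wi, the frequency of w in document D_i of D(theta, b)."""
    return normalized_entropy(term_freqs)
```

Under this assumption, a lexical item occurring three times in each of three documents receives the maximum weight, while one whose occurrences are all concentrated in a single document receives the minimum.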
As formula 1 shows, $H(w,\theta,b)_{inside}$ must be calculated several times, once for each behavior under the specific topic. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of microblog behavior: posting, forwarding, and commenting.
Step S301: from the per-behavior internal weights $H(w,\theta,b)_{inside}$, calculate the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item.
The three types of microblog behavior (posting, forwarding, and commenting) are each assigned a different weight, which indicates how important that behavior is for representing the topic. The final behavior-analysis-based internal weight is therefore calculated as follows:
$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$ (formula 2)
where σ, μ, τ are the weight factors of the different behaviors. Since no other factors are considered here for the time being, the three weight factors sum to 1.
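Formula 2 is a weighted sum of the three per-behavior weights. The Python sketch below illustrates it; the function name `combine_behavior_weights` and the dictionary keyed by behavior label are hypothetical, and the check that the factors sum to 1 reflects the constraint stated above.

```python
def combine_behavior_weights(per_behavior, sigma, mu, tau):
    """Weighted sum of per-behavior weights for b1 (post), b2 (forward), b3 (comment)."""
    if abs(sigma + mu + tau - 1.0) > 1e-9:
        raise ValueError("sigma, mu and tau must sum to 1")
    return (sigma * per_behavior["b1"]
            + mu * per_behavior["b2"]
            + tau * per_behavior["b3"])
```

For example, with per-behavior internal weights 0.8, 0.5, and 0.2 and factors σ = 0.5, μ = 0.3, τ = 0.2, the combined weight is 0.4 + 0.15 + 0.04 = 0.59.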
Step S4: for each lexical item in the user document set, calculate the behavior-analysis-based external weight of the lexical item.
Specifically, step S4 further comprises:
Step S400: calculate the external weight $H(w,b)_{outside}$ of the lexical item separately for each behavior type.
The external weight describes how uniformly a lexical item is distributed over all topics; the more uniform the distribution, the less suitable the lexical item is for describing any one topic. In microblogs, different behaviors affect the external weight differently, so the behavior external weight of a lexical item is first discussed separately for each behavior type.
The behavior external weight describes, within a document set, how uniformly a lexical item is distributed among the documents of a given behavior type. The more uniform the distribution, the less suitable the lexical item is for representing any one topic. It is calculated by formula 3.
In formula 3, $H(w,b)_{outside}$ denotes the behavior external weight of lexical item w over all topics for behavior type b; D(b) denotes the set of documents under behavior b among all documents; k is the number of topics; $DF_{wj}$ denotes the number of documents containing lexical item w in the document set of topic j; and $DF_w$ is the number of all documents in the corpus that contain lexical item w. It follows from formula 3 that the larger the behavior external weight, the more uniformly lexical item w is distributed over the topics for that behavior; the worst case is when the number of documents containing w is the same under every topic for that behavior.
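Formula 3 is likewise not reproduced in this text. As an assumption only, the Python sketch below computes a Shannon entropy over the shares $DF_{wj}/DF_w$ of the k topics, normalized by log k, which agrees with the description: the value is largest when the same number of documents contain w under every topic. The function name is hypothetical, and $DF_w$ is taken here as the sum of the per-topic counts for the behavior in question.

```python
import math


def external_weight(doc_freqs_by_topic):
    """doc_freqs_by_topic[j] is DF_wj for topic j under one behavior b.

    Assumed entropy form; the exact formula 3 is not given in the text.
    """
    k = len(doc_freqs_by_topic)  # number of topics
    df_w = sum(doc_freqs_by_topic)
    if k < 2 or df_w == 0:
        return 0.0
    entropy = -sum((df / df_w) * math.log(df / df_w)
                   for df in doc_freqs_by_topic if df > 0)
    return entropy / math.log(k)
```

Under this assumption, a lexical item found in equally many documents of every topic scores highest and is the least suitable for describing any single topic, whereas one confined to a single topic scores lowest.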
As formula 3 shows, $H(w,b)_{outside}$ must be calculated several times, once for each behavior. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of microblog behavior: posting, forwarding, and commenting.
Step S401: from the per-behavior external weights $H(w,b)_{outside}$, calculate the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item.
The three types of microblog behavior (posting, forwarding, and commenting) are each assigned a different weight, which indicates how important that behavior is for representing the topic. The final behavior-analysis-based external weight is therefore calculated as follows:
$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$ (formula 4)
where σ, μ, τ are the weight factors of the different behaviors. Since no other factors are considered here for the time being, the three weight factors sum to 1.
Step S5: from the behavior-analysis-based internal and external weights obtained in steps S3 and S4, calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic.
The discussion of the internal and external weights in steps S3 and S4 shows that both weights play an important role in identifying typical lexical items. Therefore, on the basis of the topic model obtained by the LDA model, the present invention combines the internal and external weights of a lexical item to measure its comprehensive weight for a specified topic. The comprehensive weight is calculated by formula 5.
Step S6: from the obtained comprehensive weight of each lexical item under each topic, calculate the behavior-analysis-based topic representation model for each topic.
In a specific embodiment of the invention, the weights of the lexical items under a topic are normalized according to each lexical item's share of the weight for the specified topic, which gives the behavior-analysis-based topic representation shown in formula 6:
$\rho(\theta)_{behavior} = (\omega(w_1,\theta), \omega(w_2,\theta), \ldots, \omega(w_n,\theta))$ (formula 6)
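The normalization in step S6 can be sketched in Python as below. The text does not state which normalization is used, so dividing by the sum of the weights, so that the entries of the representation add up to 1, is an assumption of this sketch, as is the function name `behavior_representation`.

```python
def behavior_representation(weights):
    """Normalize the comprehensive weights of one topic so that they sum to 1.

    weights maps each lexical item to its comprehensive weight under the topic.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {term: w / total for term, w in weights.items()}
```

For example, comprehensive weights of 3.0 for "match" and 1.0 for "champion" give a representation of 0.75 and 0.25.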
Step S7: from the behavior-analysis-based topic representation model of each topic obtained in step S6 and the topic model obtained with LDA in step S2, calculate the behavior-analysis-based LDA topic representation for each topic to obtain the final topic representation model.
After the behavior-analysis-based topic representation model $\rho(\theta)_{behavior}$ has been calculated, it still has to be combined with the topic model $\rho(\theta)_{LDA}$ obtained with LDA in step S2. That is, the final topic representation model takes into account both the frequency with which a lexical item occurs in the topic and the result of the behavior analysis. The behavior-sensitive LDA topic representation of topic θ is obtained from $\rho(\theta)_{behavior}$ and $\rho(\theta)_{LDA}$ as shown in formula 7:
$\rho(\theta)_{BEH\text{-}LDA} = p \cdot \rho(\theta)_{LDA} + (1-p) \cdot \rho(\theta)_{behavior}$ (formula 7)
where p ∈ (0, 1) is a linear parameter that sets the relative weight of $\rho(\theta)_{LDA}$ and $\rho(\theta)_{behavior}$.
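Formula 7 is a linear interpolation of the two representations. A minimal Python sketch follows, assuming each representation is stored as a dictionary from lexical item to weight; the function name `beh_lda` is hypothetical, and treating a lexical item missing from one representation as having weight 0 is an assumption of this sketch.

```python
def beh_lda(rho_lda, rho_behavior, p):
    """Interpolate the LDA and behavior-based representations of one topic."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in the open interval (0, 1)")
    terms = set(rho_lda) | set(rho_behavior)
    return {t: p * rho_lda.get(t, 0.0) + (1 - p) * rho_behavior.get(t, 0.0)
            for t in terms}
```

A value of p close to 1 keeps the result near the LDA distribution, while a value close to 0 lets the behavior analysis dominate.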
Fig. 2 is a system architecture diagram of the device for generating a microblog topic representation model based on behavior analysis of the present invention. As shown in Fig. 2, the device comprises:
User document set generation unit 201, configured to combine the documents that microblog users post, forward, and comment on to generate a user document set. In a specific embodiment of the invention, the user document set generation unit 201 organizes the documents that microblog users post, forward, and comment on; a user's comment forms one document together with the original post, and every document carries a behavior label $b_i$, namely b1 (post), b2 (forward), or b3 (comment).
Initial topic model generation unit 202, configured to generate a topic model from the generated user document set using the LDA model.
Specifically, the initial topic model generation unit 202 further comprises:
Topic model generation module, configured to generate the "document-topic" model and the "topic-lexical item" model $\rho(\theta)_{LDA}$ from the user document set using the LDA model.
Document topic selection unit, configured to choose, for each document in the user document set, the topic with the highest probability in the document-topic model generated by the LDA model as the topic of the document. Note that, for short microblog texts, each document corresponds to only one topic.
Behavior-analysis-based internal weight calculation unit 203, configured to calculate, for each lexical item of each topic in the user document set, the behavior-analysis-based internal weight of the lexical item.
In a specific embodiment of the invention, the behavior-analysis-based internal weight calculation unit 203 is specifically configured to:
calculate the internal weight $H(w,\theta,b)_{inside}$ of the lexical item separately for each behavior type, and then calculate the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item from the per-behavior internal weights $H(w,\theta,b)_{inside}$, as follows:
The internal weight describes how uniformly a lexical item is distributed among the documents of a specific topic. In microblogs, different behaviors are considered to affect the internal weight differently. The behavior internal weight of a lexical item is therefore discussed first, separately for each behavior type.
The behavior internal weight describes how uniformly a lexical item is distributed among the documents of a given behavior type within a specific topic. The more uniform the distribution, the better the lexical item represents the characteristics of that behavior under the specific topic. It is calculated by the following formula:
where $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item $w$ under topic $\theta$ and behavior type $b$; $D(\theta,b)$ denotes the document set under topic $\theta$ and behavior $b$; $TF_{wi}$ is the frequency of lexical item $w$ in document $D_i$; and $TF_w$ is the sum of the frequencies of lexical item $w$ over all documents in the behavior document set $D(\theta,b)$.
The larger the behavior internal weight, the more uniformly the lexical item is distributed under a given behavior of the specific topic, and the better it represents the characteristics of that behavior under the topic. The best case is that the lexical item occurs with the same frequency in every document of the behavior document set $D(\theta,b)$.
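The formula for the behavior internal weight is not reproduced in this text. One reading consistent with the surrounding definitions (a value that grows with uniformity and peaks when every document has the same frequency) is a Shannon entropy over the shares $TF_{wi}/TF_w$. The sketch below assumes that entropy form and additionally normalizes by $\log|D(\theta,b)|$ so the result lies in [0, 1]; both the form and the normalization are assumptions, and the function name is hypothetical.

```python
import math


def behavior_internal_weight(tf_per_doc):
    """Entropy-style uniformity of one lexical item within D(theta, b).

    `tf_per_doc` lists TF_wi, the frequency of lexical item w in each
    document D_i of the set D(theta, b). Assumed form: Shannon entropy of
    TF_wi / TF_w, divided by log(n) to map it into [0, 1].
    """
    total = sum(tf_per_doc)  # TF_w
    n = len(tf_per_doc)
    if total == 0 or n < 2:
        return 0.0
    entropy = -sum(
        (tf / total) * math.log(tf / total) for tf in tf_per_doc if tf > 0
    )
    return entropy / math.log(n)
```

Under this assumption a lexical item that appears equally often in every document scores 1, and one that appears in a single document scores 0.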
As the above formula shows, $H(w,\theta,b)_{inside}$ must be computed several times, once for each behavior under a specific topic. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of behavior: posting, forwarding, and commenting on microblogs.
The three behavior types (posting, forwarding, and commenting) are each assigned a different weight, which expresses how important that behavior is for representing the topic. The final behavior-analysis-based internal weight is therefore calculated as follows:
$$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$$
where $\sigma$, $\mu$, and $\tau$ are the weight factors of the different behaviors. Because no other factors are considered here for the time being, the three weight factors sum to 1.
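The weighted combination over the three behaviors can be illustrated as below. This is a small sketch with hypothetical names; the check that the factors sum to 1 mirrors the constraint stated in the text, and the example factor values are arbitrary.

```python
def combine_behavior_weights(per_behavior, factors):
    """Weighted sum of per-behavior weights.

    `per_behavior` maps a behavior label (b1, b2, b3) to its weight H(..., b);
    `factors` maps the same labels to sigma, mu and tau, which must sum to 1.
    """
    if abs(sum(factors.values()) - 1.0) > 1e-9:
        raise ValueError("behavior weight factors must sum to 1")
    return sum(factors[b] * per_behavior[b] for b in factors)


# Arbitrary illustrative values: sigma = 0.5, mu = 0.3, tau = 0.2.
example = combine_behavior_weights(
    {"b1": 0.8, "b2": 0.5, "b3": 0.2},
    {"b1": 0.5, "b2": 0.3, "b3": 0.2},
)
```

The same helper applies unchanged to any per-behavior weight that is combined with the factors $\sigma$, $\mu$, and $\tau$.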
A behavior-analysis-based lexical item external weight calculation unit 204, configured to calculate the behavior-analysis-based external weight of each lexical item in the user document set.
In a specific embodiment of the invention, the behavior-analysis-based lexical item external weight calculation unit 204 is specifically configured to:
calculate, by behavior type, the external weight $H(w,b)_{outside}$ of the lexical item for each behavior, and then calculate the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item from those per-behavior external weights, as follows:
The external weight describes how uniformly a lexical item is distributed across all topics; the more uniform the distribution, the less suitable the lexical item is for describing any single topic. In microblogs, different behaviors affect the external weight differently. The behavior external weight of a lexical item is therefore discussed first, separately for each behavior type.
The behavior external weight describes how uniformly a lexical item is distributed among the documents of a given behavior type within a document set. The more uniform the distribution, the less suitable the lexical item is for representing any topic. It is calculated by the following formula:
where $H(w,b)_{outside}$ denotes the behavior external weight of lexical item $w$ across all topics under behavior type $b$; $D(b)$ denotes the set of all documents under behavior $b$; $k$ is the number of topics; $DF_{wj}$ is the number of documents containing lexical item $w$ in the document set of topic $j$; and $DF_w$ is the number of all documents in the corpus that contain lexical item $w$. According to the above formula, the larger the behavior external weight, the more uniformly lexical item $w$ is distributed across the behaviors of all topics; the worst case is that the number of documents containing lexical item $w$ is the same under every topic's behavior.
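As with the internal weight, the formula for the behavior external weight is not reproduced in this text. The sketch below assumes an entropy over the document-frequency shares $DF_{wj}/DF_w$ across the $k$ topics, normalized by $\log k$; this form is an assumption chosen to match the stated behavior (larger when more uniform, largest when every topic holds the same number of documents containing $w$). Representing each document as a set of lexical items, and taking $DF_w$ as the sum of the $DF_{wj}$ within one behavior, are also assumptions.

```python
import math


def behavior_external_weight(term, docs_by_topic):
    """Entropy-style spread of one lexical item across topics for one behavior.

    `docs_by_topic` has one entry per topic; each entry is a list of
    documents under behavior b, and each document is a set of lexical items.
    """
    # DF_wj: number of documents of topic j that contain the lexical item.
    df = [sum(1 for doc in docs if term in doc) for docs in docs_by_topic]
    total = sum(df)  # DF_w within this behavior (assumption)
    k = len(docs_by_topic)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((d / total) * math.log(d / total) for d in df if d > 0)
    return entropy / math.log(k)
```

A lexical item confined to one topic scores 0 here, which under this reading marks it as a strong candidate for describing that topic.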
As the formula shows, $H(w,b)_{outside}$ must be computed several times, once for each behavior. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of behavior: posting, forwarding, and commenting on microblogs.
The three behavior types (posting, forwarding, and commenting) are each assigned a different weight, which expresses how important that behavior is for representing the topic. The final behavior-analysis-based external weight is therefore calculated as follows:
$$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$$
where $\sigma$, $\mu$, and $\tau$ are the weight factors of the different behaviors. Because no other factors are considered here for the time being, the three weight factors sum to 1.
A comprehensive weight calculation unit 205, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight obtained by the internal weight calculation unit 203 and the lexical item external weight obtained by the external weight calculation unit 204.
The discussion of the internal and external weights for units 203 and 204 shows that both weights play an important role in measuring how typical a lexical item is. Therefore, on the basis of the topic model obtained by the LDA model, the present invention combines the internal weight and the external weight of a lexical item to measure its comprehensive weight for a specified topic. The calculation formula is as follows:
A behavior-analysis-based topic representation model calculation unit 206, configured to calculate, for each topic, the behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
In a specific embodiment of the invention, the behavior-analysis-based topic representation is given by the weight proportions of the different lexical items for the specified topic, that is, by the weights of the lexical items under the topic after normalization, as shown in the following formula:
$$\rho(\theta)_{behavior} = (\omega(w_1,\theta),\ \omega(w_2,\theta),\ \ldots,\ \omega(w_n,\theta))$$
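The normalization that yields the behavior-analysis-based topic representation can be sketched as below. Dividing each comprehensive weight by the sum over the topic is an assumption about what "normalization" means here, and the function name is hypothetical.

```python
def normalize_topic_weights(weights):
    """Scale the comprehensive weights of one topic so that they sum to 1.

    `weights` maps each lexical item w under the topic to its comprehensive
    weight; the result plays the role of omega(w, theta).
    """
    total = sum(weights.values())
    if total == 0:
        return {w: 0.0 for w in weights}
    return {w: value / total for w, value in weights.items()}
```

Sum normalization keeps the result on the same scale as a topic-lexical item probability distribution, which makes it directly comparable with the LDA output.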
A topic representation model generation unit 207, configured to calculate, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of that topic obtained by the calculation unit 206 and the topic model obtained by LDA, thereby obtaining the final topic representation model.
After the behavior-analysis-based topic representation model $\rho(\theta)_{behavior}$ has been calculated, it still has to be combined with the topic model $\rho(\theta)_{LDA}$ obtained by LDA. That is, the final topic representation model takes into account both the frequency with which lexical items occur in the topic and the behavior analysis. The distribution-sensitive LDA topic representation of topic $\theta$ is obtained from $\rho(\theta)_{behavior}$ and $\rho(\theta)_{LDA}$, as shown in the following formula:
$$\rho(\theta)_{BEH\text{-}LDA} = p \cdot \rho(\theta)_{LDA} + (1-p) \cdot \rho(\theta)_{behavior}$$
where $p \in (0,1)$ is a linear parameter that sets the relative weight of $\rho(\theta)_{LDA}$ and $\rho(\theta)_{behavior}$.
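The linear combination of the two representations can be illustrated as follows. This is a minimal sketch with hypothetical names; treating a lexical item that is missing from one representation as having weight 0 is an assumption.

```python
def blend_topic_representations(rho_lda, rho_behavior, p):
    """Compute p * rho_LDA + (1 - p) * rho_behavior for one topic.

    Both inputs map lexical items to their weight under the topic;
    p must lie strictly between 0 and 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in the open interval (0, 1)")
    terms = set(rho_lda) | set(rho_behavior)
    return {
        w: p * rho_lda.get(w, 0.0) + (1.0 - p) * rho_behavior.get(w, 0.0)
        for w in terms
    }
```

If both inputs sum to 1, so does the result, so the blended representation remains a distribution over lexical items.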
In summary, the behavior-analysis-based microblog topic representation model generating method and device of the present invention address the short-text nature of microblogs. Forwarded microblogs and their comments are processed together with the original post as a single document. From the multiple topics that the LDA model yields for a document, only one is chosen as the topic of the document when the topic model is built. Starting from the topic-lexical item distribution obtained by the LDA model, the distribution of lexical items within the different user behaviors is considered, and the user may set the impact factor of each user behavior on the topic model. User behavior is thereby incorporated into the topic model, which makes the resulting topic representation model more accurate and more targeted and improves the accuracy of subsequent algorithms that use the topic model, such as topic discovery and evolution analysis.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as set out in the claims.
Claims (10)
1. A behavior-analysis-based microblog topic representation model generating method, comprising the following steps:
Step S1: combining the documents that microblog users post, forward, and comment on into a user document set;
Step S2: generating a topic model from the generated user document set using the LDA model;
Step S3: calculating, for each lexical item of each topic in the user document set, a behavior-analysis-based lexical item internal weight;
Step S4: calculating, for each lexical item in the user document set, a behavior-analysis-based lexical item external weight;
Step S5: calculating the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight and the lexical item external weight obtained in steps S3 and S4;
Step S6: calculating, for each topic, a behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
2. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 1, characterized in that the method further comprises the following step after step S6:
Step S7: calculating, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of that topic obtained in step S6 and the topic model obtained using LDA in step S2, thereby obtaining the final topic representation model.
3. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 2, characterized in that step S2 further comprises:
Step S200: applying the LDA model to the user document set to generate a document-topic model and a topic-lexical item model $\rho(\theta)_{LDA}$;
Step S201: selecting, for each document in the user document set, the topic with the highest probability in the document-topic distribution generated by the LDA model as the topic of that document.
4. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 3, characterized in that step S3 further comprises:
Step S300: calculating, by behavior type, the internal weight $H(w,\theta,b)_{inside}$ of the lexical item for each behavior;
Step S301: calculating the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item from the internal weight $H(w,\theta,b)_{inside}$ of the lexical item for each behavior.
5. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 4, characterized in that the behavior-analysis-based lexical item internal weight is calculated as follows:
$$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$$
where $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item $w$ under topic $\theta$ and behavior type $b$; $D(\theta,b)$ denotes the document set under topic $\theta$ and behavior $b$; $TF_{wi}$ is the frequency of lexical item $w$ in document $D_i$; $TF_w$ is the sum of the frequencies of lexical item $w$ over all documents in the behavior document set $D(\theta,b)$; and $\sigma$, $\mu$, and $\tau$ are the weight factors of the different behaviors.
6. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 4, characterized in that step S4 further comprises:
Step S400: calculating, by behavior type, the external weight $H(w,b)_{outside}$ of the lexical item for each behavior;
Step S401: calculating the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item from the external weight $H(w,b)_{outside}$ of the lexical item for each behavior.
7. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 6, characterized in that the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item is calculated as follows:
$$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$$
where $H(w,b)_{outside}$ denotes the behavior external weight of lexical item $w$ across all topics under behavior type $b$; $D(b)$ denotes the set of all documents under behavior $b$; $k$ is the number of topics; $DF_{wj}$ is the number of documents containing lexical item $w$ in the document set of topic $j$; $DF_w$ is the number of all documents in the corpus that contain lexical item $w$; and $\sigma$, $\mu$, and $\tau$ are the weight factors of the different behaviors.
8. The behavior-analysis-based microblog topic representation model generating method as claimed in claim 6, characterized in that, in step S5, the behavior-analysis-based comprehensive weight of each lexical item under each topic is calculated as follows:
9. A behavior-analysis-based microblog topic representation model generating device, comprising:
a user document set generation unit, configured to combine the documents that microblog users post, forward, and comment on into a user document set;
an initial topic model generation unit, configured to generate a topic model from the generated user document set using the LDA model;
a behavior-analysis-based lexical item internal weight calculation unit, configured to calculate, for each lexical item of each topic in the user document set, a behavior-analysis-based lexical item internal weight;
a behavior-analysis-based lexical item external weight calculation unit, configured to calculate, for each lexical item in the user document set, a behavior-analysis-based lexical item external weight;
a comprehensive weight calculation unit, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight and the lexical item external weight obtained by the behavior-analysis-based lexical item internal weight calculation unit and the behavior-analysis-based lexical item external weight calculation unit;
a behavior-analysis-based topic representation model calculation unit, configured to calculate, for each topic, a behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
10. The behavior-analysis-based microblog topic representation model generating device as claimed in claim 9, characterized in that the device further comprises:
a topic representation model generation unit, configured to calculate, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of that topic obtained by the behavior-analysis-based topic representation model calculation unit and the topic model obtained using LDA by the initial topic model generation unit, thereby obtaining the final topic representation model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811315209.1A CN109558546A (en) | 2018-11-06 | 2018-11-06 | A kind of the microblog topic expression model generating method and device of Behavior-based control analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109558546A true CN109558546A (en) | 2019-04-02 |
Family
ID=65865953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811315209.1A Pending CN109558546A (en) | 2018-11-06 | 2018-11-06 | A kind of the microblog topic expression model generating method and device of Behavior-based control analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109558546A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101571853A (en) * | 2009-05-22 | 2009-11-04 | 哈尔滨工程大学 | Evolution analysis device and method for contents of network topics |
CN102708153A (en) * | 2012-04-18 | 2012-10-03 | 中国信息安全测评中心 | Self-adaption finding and predicting method and system for hot topics of online social network |
Non-Patent Citations (1)
Title |
---|
LU DENG: "Topic Detection Based on User Intention", 《2015 IEEE 15TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339762A (en) * | 2020-02-14 | 2020-06-26 | 广州大学 | Topic representation model construction method and device based on hybrid intelligence |
CN111339762B (en) * | 2020-02-14 | 2023-04-07 | 广州大学 | Topic representation model construction method and device based on hybrid intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190402 |