CN109558546A - Method and device for generating a microblog topic representation model based on behavior analysis - Google Patents

Method and device for generating a microblog topic representation model based on behavior analysis

Info

Publication number
CN109558546A
CN109558546A CN201811315209.1A CN201811315209A CN109558546A CN 109558546 A CN109558546 A CN 109558546A CN 201811315209 A CN201811315209 A CN 201811315209A CN 109558546 A CN109558546 A CN 109558546A
Authority
CN
China
Prior art keywords
topic
behavior
lexical item
based control
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811315209.1A
Other languages
Chinese (zh)
Inventor
韩伟红
李树栋
黄子中
方滨兴
贾焰
王乐
周斌
殷丽华
田志宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN201811315209.1A
Publication of CN109558546A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method and device for generating a microblog topic representation model based on behavior analysis. The method comprises: step S1, combining the documents that microblog users publish, forward and comment on into a user document set; step S2, generating a topic model from the user document set using an LDA model; step S3, calculating, for each lexical item of each topic, a behavior-analysis-based internal weight; step S4, calculating, for each lexical item in the user document set, a behavior-analysis-based external weight; step S5, calculating from the obtained weights a behavior-analysis-based comprehensive weight for each lexical item under each topic; and step S6, calculating, for each topic, a behavior-analysis-based topic representation model from the comprehensive weights. By incorporating user behavior factors into the topic model, the invention can improve the accuracy of subsequent tasks that use the topic model, such as topic discovery and evolution analysis.

Description

Method and device for generating a microblog topic representation model based on behavior analysis
Technical field
The present invention relates to the technical field of microblog topic representation models, and in particular to a method and device for generating a microblog topic representation model based on behavior analysis.
Background
The Internet has gradually developed into a ubiquitous platform for information dissemination and computing, and the social network services built on it have grown rapidly and become increasingly popular. More and more people use social platforms to discuss content, express opinions and share information, which produces hundreds of millions of messages every day. Discovering new topics quickly and accurately at this data scale is vital for information recommendation, public opinion monitoring and similar applications. One of the basic research tasks of topic discovery is how to represent a topic. Every topic discovery method is built on some specific topic representation model, and the same topic discovery method may perform very differently under different topic representation models, so research on topic representation models is particularly important.
Since their appearance, topic models have become a mainstream technique in many fields, including topic discovery, multi-document summarization, word sense identification and disambiguation, sentiment analysis, and information retrieval. These fields obtain topics by training a topic model. To make topics easier for users to understand, the question of how to choose a representative set of lexical items to represent a topic deserves more attention.
Formally, a topic is a multinomial distribution over lexical items: each lexical item has an exact probability in each topic, and a topic can be represented by the set of the several, or ten or so, lexical items with the highest probability. As a simple example, Table 1 gives the distributions of the topics "sports", "news" and "entertainment" over lexical items. If the three most probable lexical items are chosen to represent each topic, then "sports" is represented by {champion, match, basketball}, "news" by {president, concert, champion}, and "entertainment" by {concert, singer, champion}.
Table 1. Distribution of topics over lexical items
Topic          Basketball  Singer  President  News conference  Match  Champion  Concert  Election campaign
Sports         0.2         0.02    0          0.08             0.3    0.4       0        0
News           0.1         0.1     0.2        0                0.1    0.2       0.2      0.1
Entertainment  0           0.3     0          0.1              0.1    0.2       0.3      0
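The selection described for Table 1 can be sketched in a few lines of Python. This is only an illustration: the names `TERMS`, `TOPICS` and `top_terms` are hypothetical, and the probabilities are copied from the table. In the "news" row three lexical items tie at 0.2, so the order among them is arbitrary; the sketch keeps table order for ties.

```python
# Topic-over-lexical-item probabilities copied from Table 1.
TERMS = ["basketball", "singer", "president", "news conference",
         "match", "champion", "concert", "election campaign"]
TOPICS = {
    "sports":        [0.2, 0.02, 0.0, 0.08, 0.3, 0.4, 0.0, 0.0],
    "news":          [0.1, 0.1, 0.2, 0.0, 0.1, 0.2, 0.2, 0.1],
    "entertainment": [0.0, 0.3, 0.0, 0.1, 0.1, 0.2, 0.3, 0.0],
}


def top_terms(probs, n=3):
    """Return the n lexical items with the largest probability.

    sorted() is stable, so lexical items with equal probability keep
    their order of appearance in TERMS.
    """
    ranked = sorted(zip(TERMS, probs), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


for topic, probs in TOPICS.items():
    print(topic, top_terms(probs))
```

Note that "champion" appears in all three representations, which is the kind of weakly distinguishing lexical item that the weights introduced later are meant to penalize.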
The most commonly used topic representation model at present is the LDA model. LDA is a topic model which assumes that every document is composed of k topics and that each topic has a fixed probability distribution over lexical items. In the form of probability distributions, LDA gives the topic distribution of every document in a document set and the lexical item distribution of every topic. It is also an unsupervised learning algorithm: training requires no manually labeled training set, only the document set and the specified number of topics k.
Gibbs sampling is easy to understand and simple to implement, and is therefore widely used to estimate the parameters of the model. The estimation proceeds as follows. Initially, a topic is assigned at random to each lexical item in the text. The number of occurrences of each lexical item under each topic and the number of lexical items assigned to each topic in each document are then counted. In each round, the topic assignment of the current lexical item is excluded, and the probability that the current lexical item belongs to each topic is estimated from the topic assignments of all other lexical items. Once this probability distribution over all topics is obtained, a new topic is sampled for the lexical item from it. The topics of all lexical items are updated repeatedly in the same way until the topic distribution of every document and the lexical item distribution of every topic converge, at which point iteration stops and the estimated parameters are output.
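The estimation procedure above corresponds to the standard collapsed Gibbs sampler for LDA. The following is a minimal sketch, not the implementation used in the patent. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the fixed iteration count (used here in place of a convergence test) are all assumptions made for illustration.

```python
import random


def gibbs_lda(docs, k, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer lexical item ids.
    Returns (theta, phi): document-topic and topic-lexical-item distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * k for _ in docs]               # items of doc d assigned to topic t
    n_kw = [[0] * vocab_size for _ in range(k)]  # occurrences of item w under topic t
    n_k = [0] * k                                # total items assigned to topic t
    z = []                                       # topic of every item position

    # Initialisation: assign a random topic to every lexical item occurrence.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(k)
            z[d].append(t)
            n_dk[d][t] += 1
            n_kw[t][w] += 1
            n_k[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Exclude the current assignment from the counts.
                t = z[d][i]
                n_dk[d][t] -= 1
                n_kw[t][w] -= 1
                n_k[t] -= 1
                # Probability of each topic given all other assignments.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in range(k)
                ]
                # Sample a new topic and restore the counts.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][w] += 1
                n_k[t] += 1

    theta = [[(n_dk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[j][w] + beta) / (n_k[j] + vocab_size * beta)
            for w in range(vocab_size)] for j in range(k)]
    return theta, phi
```

A call such as `gibbs_lda([[0, 1, 0, 1], [2, 3, 2, 3]], k=2, vocab_size=4)` returns one probability row per document in `theta` and one per topic in `phi`; each row sums to 1.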
Therefore, with the LDA model and Gibbs sampling, the topic distribution of each document and the lexical item distribution of each topic can be obtained from a document set by an unsupervised learning algorithm, without a manually labeled training set.
In text mining, the LDA model is widely used for text topic detection, text classification and text similarity computation. Many improvements have been made to the LDA model, most of them aimed at the poor readability and weak semantic relevance of topic representations based on sets of lexical items. Some studies introduce external knowledge into the model to broaden the scenarios in which such representations are applicable: Kitajima et al. consider event factors and replace the lexical items in the LDA model with events or single verbs; Sridhar et al. incorporate phrase elements into the traditional topic model to represent topics; Wang et al., based on Wikipedia entries, map a topic to a vector over entries and use the readability of the entries to improve the readability of the topic. This work compensates to some extent for the poor readability and weak semantic relevance of representations based on sets of lexical items, but it does not address how to choose the lexical items in a topic that are more distinctive.
Existing topic representation models mainly target long and well-formed text. For example, the topic discovery task of TDT (Topic Detection and Tracking) is mainly oriented to news streams. Microblog text differs in two respects. On the one hand, it is short and uses informal wording, which undermines the conventional assumption, made for long text, that one document contains multiple topics. For example, Sina Weibo limits a post to 140 characters, and such a short text generally has only one topic. On the other hand, microblogs contain a great deal of user behavior information, such as forwarding and commenting. These behaviors are also valuable for identifying and representing the topic of a message, yet traditional topic representation methods do not take user behavior into account. Existing topic models therefore perform unsatisfactorily when used to represent microblog topics and need to be improved.
Summary of the invention
To overcome the above deficiencies of the prior art, one object of the present invention is to provide a method and device for generating a microblog topic representation model based on behavior analysis, so as to solve the problem that short microblog texts carry too little information for topic analysis.
Another object of the present invention is to provide a method and device for generating a microblog topic representation model based on behavior analysis that, by incorporating user behavior factors into the topic model, improves the accuracy of subsequent topic discovery, evolution analysis and similar tasks that use the topic model.
In view of the above and other objects, the present invention proposes a method for generating a microblog topic representation model based on behavior analysis, comprising the following steps:
Step S1: combine the documents that microblog users publish, forward and comment on into a user document set;
Step S2: generate a topic model from the generated user document set using an LDA model;
Step S3: for each lexical item of each topic in the user document set, calculate the behavior-analysis-based internal weight of the lexical item;
Step S4: for each lexical item in the user document set, calculate the behavior-analysis-based external weight of the lexical item;
Step S5: from the behavior-analysis-based internal and external weights obtained in steps S3 and S4, calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic;
Step S6: from the obtained comprehensive weights of the lexical items under each topic, calculate the behavior-analysis-based topic representation model of each topic.
Preferably, the method further comprises the following step after step S6:
Step S7: for each topic, calculate the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model obtained in step S6 and the topic model obtained with LDA in step S2, giving the final topic representation model.
Preferably, step S2 further comprises:
Step S200: from the user document set, generate a document-topic model and a topic-lexical item model $\rho(\theta)_{LDA}$ using the LDA model;
Step S201: for each document in the user document set, choose the topic with the highest probability in the document-topic distribution generated by the LDA model as the topic of the document.
Preferably, step S3 further comprises:
Step S300: calculate the internal weight $H(w,\theta,b)_{inside}$ of the lexical item separately for each behavior type;
Step S301: from the internal weights $H(w,\theta,b)_{inside}$ of the lexical item for each behavior, calculate the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item.
Preferably, the behavior-analysis-based internal weight of the lexical item is calculated as follows:
$$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$$
Here $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item $w$ under topic $\theta$ and behavior type $b$; $D(\theta,b)$ denotes the document set under topic $\theta$ and behavior $b$; $TF_{wi}$ is the frequency of lexical item $w$ in document $D_i$; $TF_w$ is the sum of the frequencies of lexical item $w$ over all documents in the behavior document set $D(\theta,b)$; and $\sigma$, $\mu$, $\tau$ are the weight factors of the different behaviors.
Preferably, step S4 further comprises:
Step S400: calculate the external weight $H(w,b)_{outside}$ of the lexical item separately for each behavior type;
Step S401: from the external weights $H(w,b)_{outside}$ of the lexical item for each behavior, calculate the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item.
Preferably, the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item is calculated as follows:
$$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$$
Here $H(w,b)_{outside}$ denotes the behavior external weight of lexical item $w$ over all topics under behavior type $b$; $D(b)$ denotes the set of all documents under behavior $b$; $k$ is the number of topics; $DF_{wj}$ is the number of documents containing lexical item $w$ in the document set of topic $j$; $DF_w$ is the number of all documents in the corpus that contain lexical item $w$; and $\sigma$, $\mu$, $\tau$ are the weight factors of the different behaviors.
Preferably, in step S5, the behavior-analysis-based comprehensive weight of each lexical item under each topic is calculated by a formula that is not reproduced in this text.
To achieve the above objects, the present invention also provides a device for generating a microblog topic representation model based on behavior analysis, comprising:
a user document set generation unit, configured to combine the documents that microblog users publish, forward and comment on into a user document set;
an initial topic model generation unit, configured to generate a topic model from the generated user document set using an LDA model;
a behavior-analysis-based lexical item internal weight calculation unit, configured to calculate, for each lexical item of each topic in the user document set, the behavior-analysis-based internal weight of the lexical item;
a behavior-analysis-based lexical item external weight calculation unit, configured to calculate, for each lexical item in the user document set, the behavior-analysis-based external weight of the lexical item;
a comprehensive weight calculation unit, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the internal and external weights obtained by the internal weight calculation unit and the external weight calculation unit;
a behavior-analysis-based topic representation model calculation unit, configured to calculate, for each topic, the behavior-analysis-based topic representation model from the obtained comprehensive weights of the lexical items under each topic.
Preferably, the device further comprises:
a topic representation model generation unit, configured to calculate, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model obtained by the topic representation model calculation unit and the topic model obtained with LDA by the initial topic model generation unit, giving the final topic representation model.
Compared with the prior art, the method and device of the present invention are tailored to the characteristics of short microblog text. Forwarded microblogs and comments are processed together with the original post as one document, and only one of the topics that the LDA model assigns to a document is chosen as the topic of that document when the topic model is built. Starting from the topic-lexical item distribution obtained by the LDA model, the invention considers how lexical items are distributed within the different user behaviors and lets the user set the influence factor of each behavior on the topic model. User behavior factors are thereby incorporated into the topic model, so that the resulting topic representation model is more accurate and more targeted, which improves the accuracy of subsequent algorithms that use the topic model, such as topic discovery and evolution analysis.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method of the present invention for generating a microblog topic representation model based on behavior analysis;
Fig. 2 is a system architecture diagram of the device of the present invention for generating a microblog topic representation model based on behavior analysis.
Detailed description of the embodiments
Embodiments of the present invention are described below through specific examples and with reference to the drawings. Those skilled in the art can readily understand further advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific examples, and the details in this specification can be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the invention.
Fig. 1 is a flow chart of the steps of the method of the present invention for generating a microblog topic representation model based on behavior analysis. As shown in Fig. 1, the method comprises the following steps:
Step S1: combine the documents that microblog users publish, forward and comment on into a user document set. In a specific embodiment of the invention, the documents that microblog users publish, forward and comment on are collated; a user's comment and the original post together form one document; and every document carries a behavior label $b_i$, namely $b_1$ (publish), $b_2$ (forward) or $b_3$ (comment).
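Step S1 can be pictured with a small sketch. Everything here is hypothetical: the `Post` record, its field names and `build_user_documents` are illustrative and do not come from the patent. The sketch also assumes that forwards, like comments, are merged with the original post, as the statement of advantages in the summary suggests.

```python
from dataclasses import dataclass
from typing import Optional

# Behavior labels as named in the description.
PUBLISH, FORWARD, COMMENT = "b1", "b2", "b3"


@dataclass
class Post:
    post_id: str
    text: str
    behavior: str                      # one of PUBLISH, FORWARD, COMMENT
    original_id: Optional[str] = None  # id of the original post, if any


def build_user_documents(posts):
    """Return a list of (document_text, behavior_label) pairs.

    A forward or comment is merged with the text of its original post
    when that post is available; a published post stands alone.
    """
    by_id = {p.post_id: p for p in posts}
    documents = []
    for p in posts:
        text = p.text
        original = by_id.get(p.original_id) if p.original_id else None
        if p.behavior in (FORWARD, COMMENT) and original is not None:
            text = original.text + " " + p.text
        documents.append((text, p.behavior))
    return documents
```

Merging a short comment with its original post gives the topic model more text per document, which is the stated motivation for step S1.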
Step S2: generate a topic model from the generated user document set using an LDA model.
Specifically, step S2 further comprises:
Step S200: from the user document set, generate a "document-topic" model and a "topic-lexical item" model $\rho(\theta)_{LDA}$ using the LDA model;
Step S201: for each document in the user document set, choose the topic with the highest probability in the document-topic distribution generated by the LDA model as the topic of the document. Note that for short microblog text, each document corresponds to only one topic.
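Step S201 amounts to taking an argmax over each row of the document-topic distribution. A minimal sketch follows, where `assign_document_topics` is a hypothetical name and `doc_topic` is assumed to be a list of per-document probability rows produced by any LDA implementation.

```python
def assign_document_topics(doc_topic):
    """Pick, for every document, the index of its most probable topic.

    doc_topic[d][t] is the probability of topic t in document d.
    If several topics tie, the lowest topic index is returned.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


doc_topic = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(assign_document_topics(doc_topic))
```

For the two example rows, the first document is assigned topic 1 and the second topic 0.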
Step S3: for each lexical item of each topic in the user document set, calculate the behavior-analysis-based internal weight of the lexical item.
Specifically, step S3 further comprises:
Step S300: calculate the internal weight $H(w,\theta,b)_{inside}$ of the lexical item separately for each behavior type.
The internal weight describes how uniformly a lexical item is distributed over the documents of a specific topic. In microblogs, different behaviors are considered to influence the internal weight differently, so the behavior internal weight of a lexical item is first discussed separately for each behavior type.
The behavior internal weight describes how uniformly, within a specific topic, a lexical item is distributed over the documents of a given behavior type. The more uniform the distribution, the better suited the lexical item is to representing the characteristics of that behavior under the specific topic. It is calculated by formula (1), which is not reproduced in this text.
In formula (1), $H(w,\theta,b)_{inside}$ denotes the behavior internal weight of lexical item $w$ under topic $\theta$ and behavior type $b$; $D(\theta,b)$ denotes the document set under topic $\theta$ and behavior $b$; $TF_{wi}$ is the frequency of lexical item $w$ in document $D_i$; and $TF_w$ is the sum of the frequencies of lexical item $w$ over all documents in the behavior document set $D(\theta,b)$.
The larger the behavior internal weight, the more uniformly the lexical item is distributed under that behavior of the specific topic, and the better it represents the characteristics of that behavior under the topic. The best case is when the lexical item occurs with the same frequency in every document of the behavior document set $D(\theta,b)$.
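Formula (1) itself is not available in this text. One measure that matches the description (built from the ratios $TF_{wi}/TF_w$, larger when the distribution is more uniform, and maximal when all frequencies are equal) is the Shannon entropy of those ratios. The sketch below assumes that form; it is an illustration consistent with the surrounding definitions, not a confirmed reproduction of the patent's formula, and the function name is hypothetical.

```python
import math


def behavior_internal_weight(term_freqs):
    """Entropy-style internal weight of one lexical item (assumed form).

    term_freqs[i] is TF_wi, the frequency of lexical item w in document
    D_i of the behavior document set D(theta, b). Their sum is TF_w.
    """
    total = sum(term_freqs)
    if total == 0:
        return 0.0  # the lexical item never occurs under this behavior
    entropy = 0.0
    for tf in term_freqs:
        if tf > 0:
            share = tf / total
            entropy -= share * math.log(share)
    return entropy


# Evenly spread over four documents versus concentrated in one document.
print(behavior_internal_weight([2, 2, 2, 2]))
print(behavior_internal_weight([8, 0, 0, 0]))
```

Under this assumption the evenly spread lexical item scores log 4 and the concentrated one scores 0, in line with the stated best case.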
As formula (1) shows, $H(w,\theta,b)_{inside}$ has to be calculated several times, once for each behavior under the specific topic. Real social networks have many types of behavior, such as publishing, forwarding, commenting and liking. To simplify the process, the present invention considers only three types of microblog behavior: publishing, forwarding and commenting.
Step S301: from the internal weights $H(w,\theta,b)_{inside}$ of the lexical item for each behavior, calculate the behavior-analysis-based internal weight $H(w,\theta)_{inside}$ of the lexical item.
The three behavior types (publishing, forwarding and commenting) are each given a different weight factor that expresses how important the behavior is for representing the topic, so the final behavior-analysis-based internal weight is calculated as follows:
$$H(w,\theta)_{inside} = \sigma \cdot H(w,\theta,b_1)_{inside} + \mu \cdot H(w,\theta,b_2)_{inside} + \tau \cdot H(w,\theta,b_3)_{inside}$$ (formula 2)
Here $\sigma$, $\mu$ and $\tau$ are the weight factors of the different behaviors. Since no other factors are considered for now, the three weight factors sum to 1.
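Formula (2) is a convex combination of the three per-behavior weights. A minimal sketch is given below; the function name and the default factor values are hypothetical, since the description leaves $\sigma$, $\mu$ and $\tau$ for the user to choose subject to summing to 1.

```python
def combine_behavior_weights(h_publish, h_forward, h_comment,
                             sigma=0.5, mu=0.3, tau=0.2):
    """Weighted sum of the publish (b1), forward (b2) and comment (b3) weights.

    sigma, mu and tau are the behavior weight factors; the defaults are
    illustrative only. They must sum to 1, as the description requires.
    """
    if abs(sigma + mu + tau - 1.0) > 1e-9:
        raise ValueError("sigma + mu + tau must equal 1")
    return sigma * h_publish + mu * h_forward + tau * h_comment


print(combine_behavior_weights(1.0, 2.0, 3.0))
```

The same helper applies unchanged to the external weights of formula (4), which has the same structure.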
Step S4: for each lexical item in the user document set, calculate the behavior-analysis-based external weight of the lexical item.
Specifically, step S4 further comprises:
Step S400: calculate the external weight $H(w,b)_{outside}$ of the lexical item separately for each behavior type.
The external weight describes how uniformly a lexical item is distributed over all topics. The more uniform the distribution, the less suited the lexical item is to describing any one topic. In microblogs, different behaviors influence the external weight differently, so the behavior external weight of a lexical item is first discussed separately for each behavior type.
The behavior external weight describes how uniformly, within the document set, a lexical item is distributed over the topics in the documents of a given behavior type. The more uniform the distribution, the less suited the lexical item is to representing any one topic. It is calculated by formula (3), which is not reproduced in this text.
In formula (3), $H(w,b)_{outside}$ denotes the behavior external weight of lexical item $w$ over all topics under behavior type $b$; $D(b)$ denotes the set of all documents under behavior $b$; $k$ is the number of topics; $DF_{wj}$ is the number of documents containing lexical item $w$ in the document set of topic $j$; and $DF_w$ is the number of all documents in the corpus that contain lexical item $w$. It follows from formula (3) that the larger the behavior external weight, the more uniformly lexical item $w$ is distributed over the topics under that behavior. The worst case is when the number of documents containing $w$ is the same under every topic for that behavior.
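Formula (3) is likewise not available in this text. By analogy with the internal weight, one form that fits the definitions is the entropy of the ratios $DF_{wj}/DF_w$ over the $k$ topics, restricted to documents of behavior $b$. The sketch below assumes that form and a hypothetical document layout of `(lexical_items, topic, behavior)` tuples; it is not a confirmed reproduction of the patent's formula.

```python
import math
from collections import Counter


def behavior_external_weight(term, behavior, documents, k):
    """Entropy-style external weight of one lexical item (assumed form).

    documents is an iterable of (lexical_items, topic, behavior) tuples,
    where lexical_items is a set and topic is an index in range(k).
    """
    # DF_wj: number of behavior-b documents of topic j that contain the item.
    df = Counter(topic for items, topic, b in documents
                 if b == behavior and term in items)
    total = sum(df.values())  # DF_w restricted to behavior b
    if total == 0:
        return 0.0
    entropy = 0.0
    for j in range(k):
        if df[j] > 0:
            share = df[j] / total
            entropy -= share * math.log(share)
    return entropy


docs = [({"champion"}, 0, "b1"), ({"champion"}, 1, "b1"),
        ({"basketball"}, 0, "b1"), ({"basketball", "match"}, 0, "b1")]
print(behavior_external_weight("champion", "b1", docs, k=2))
print(behavior_external_weight("basketball", "b1", docs, k=2))
```

Here "champion" is spread evenly over both topics and receives the larger external weight, while "basketball" occurs in one topic only and receives 0, marking it as the more distinctive lexical item.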
As formula (3) shows, $H(w,b)_{outside}$ has to be calculated several times, once for each behavior. Real social networks have many types of behavior, such as publishing, forwarding, commenting and liking. To simplify the process, the present invention considers only three types of microblog behavior: publishing, forwarding and commenting.
Step S401: from the external weights $H(w,b)_{outside}$ of the lexical item for each behavior, calculate the behavior-analysis-based external weight $H(w)_{outside}$ of the lexical item.
The three behavior types (publishing, forwarding and commenting) are each given a different weight factor that expresses how important the behavior is for representing the topic, so the final behavior-analysis-based external weight is calculated as follows:
$$H(w)_{outside} = \sigma \cdot H(w,b_1)_{outside} + \mu \cdot H(w,b_2)_{outside} + \tau \cdot H(w,b_3)_{outside}$$ (formula 4)
Here $\sigma$, $\mu$ and $\tau$ are the weight factors of the different behaviors. Since no other factors are considered for now, the three weight factors sum to 1.
Step S5: from the behavior-analysis-based internal and external weights obtained in steps S3 and S4, calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic.
The discussion of the internal and external weights in steps S3 and S4 shows that both weights play an important role in identifying typical lexical items. Therefore, on the basis of the topic model obtained by the LDA model, the present invention combines the internal and external weights of a lexical item to measure its comprehensive weight for a specified topic. The calculation formula (formula 5) is not reproduced in this text.
Step S6: from the obtained comprehensive weights of the lexical items under each topic, calculate the behavior-analysis-based topic representation model of each topic.
In a specific embodiment of the invention, the comprehensive weights of the lexical items under a topic are normalized so that each expresses the lexical item's share of weight for the specified topic, which gives the behavior-analysis-based topic representation shown in formula (6):
$$\rho(\theta)_{behavior} = (\omega(w_1,\theta), \omega(w_2,\theta), \ldots, \omega(w_n,\theta))$$ (formula 6)
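The normalization behind formula (6) can be sketched as follows. The description does not spell out the normalization, so dividing by the sum of the weights is an assumption, and the function name and the dictionary layout are hypothetical. The comprehensive weights are taken as given inputs because formula (5) is not available in this text.

```python
def behavior_topic_representation(comprehensive_weights):
    """Normalize the comprehensive weights of one topic so they sum to 1.

    comprehensive_weights maps each lexical item w to its comprehensive
    weight under the topic; the result maps w to omega(w, theta).
    """
    total = sum(comprehensive_weights.values())
    if total == 0:
        return {w: 0.0 for w in comprehensive_weights}
    return {w: value / total for w, value in comprehensive_weights.items()}


print(behavior_topic_representation({"champion": 1.0, "match": 3.0}))
```

With the example input, "champion" receives 0.25 and "match" 0.75 of the topic's weight.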
Step S7: for each topic, calculate the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model obtained in step S6 and the topic model obtained with LDA in step S2, giving the final topic representation model.
After the behavior-analysis-based topic representation model $\rho(\theta)_{behavior}$ has been calculated, it still has to be combined with the topic model $\rho(\theta)_{LDA}$ obtained with LDA in step S2. That is, the final topic representation model takes into account both the frequency with which a lexical item occurs in the topic and the result of the behavior analysis. The behavior-sensitive LDA topic representation of topic $\theta$ is obtained from $\rho(\theta)_{behavior}$ and $\rho(\theta)_{LDA}$ as shown in formula (7):
$$\rho(\theta)_{BEH\text{-}LDA} = p \cdot \rho(\theta)_{LDA} + (1-p) \cdot \rho(\theta)_{behavior}$$ (formula 7)
Here $p \in (0,1)$ is a linear parameter that balances the weight of $\rho(\theta)_{LDA}$ against that of $\rho(\theta)_{behavior}$.
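Formula (7) is a linear interpolation between the two representations. A minimal sketch follows, with the hypothetical name `beh_lda` and with both representations stored as dictionaries from lexical item to weight. Treating a lexical item that is missing from one representation as having weight 0 is an assumption of the sketch.

```python
def beh_lda(rho_lda, rho_behavior, p=0.5):
    """Interpolate the LDA and behavior-based representations of one topic.

    p must lie strictly between 0 and 1; p near 1 favors the LDA
    distribution, p near 0 favors the behavior-based weights.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in the open interval (0, 1)")
    terms = rho_lda.keys() | rho_behavior.keys()
    return {w: p * rho_lda.get(w, 0.0) + (1.0 - p) * rho_behavior.get(w, 0.0)
            for w in terms}


rho_lda = {"champion": 0.6, "match": 0.4}
rho_behavior = {"champion": 0.2, "match": 0.8}
print(beh_lda(rho_lda, rho_behavior, p=0.5))
```

If both inputs sum to 1, so does the result, which keeps the final representation comparable across topics.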
Fig. 2 is a system architecture diagram of the device of the present invention for generating a microblog topic representation model based on behavior analysis. As shown in Fig. 2, the device comprises:
A user document set generation unit 201, configured to combine the documents that microblog users publish, forward and comment on into a user document set. In a specific embodiment of the invention, the user document set generation unit 201 collates the documents that microblog users publish, forward and comment on; a user's comment and the original post together form one document; and every document carries a behavior label $b_i$, namely $b_1$ (publish), $b_2$ (forward) or $b_3$ (comment).
An initial topic model generation unit 202, configured to generate a topic model from the generated user document set using an LDA model.
Specifically, the initial topic model generation unit 202 further comprises:
a topic model generation module, configured to generate, from the user document set, a "document-topic" model and a "topic-lexical item" model $\rho(\theta)_{LDA}$ using the LDA model;
a document topic selection unit, configured to choose, for each document in the user document set, the topic with the highest probability in the document-topic distribution generated by the LDA model as the topic of the document. Note that for short microblog text, each document corresponds to only one topic.
Behavior-analysis-based lexical item internal weight calculation unit 203, configured to calculate, for each lexical item in each topic of the user document set, the behavior-analysis-based internal weight of the lexical item.
In a specific embodiment of the invention, the behavior-analysis-based lexical item internal weight calculation unit 203 is specifically configured to:
Calculate the internal weight H(w, θ, b)_inside of the lexical item separately for each behavior type, and then calculate the behavior-analysis-based internal weight H(w, θ)_inside of the lexical item from the per-behavior internal weights H(w, θ, b)_inside, as follows:
The internal weight describes how uniformly a lexical item is distributed among the documents of a specific topic. In microblogs, different behaviors are considered to influence the internal weight differently. Therefore, the behavior internal weight of a lexical item is first discussed separately for each behavior type.
The behavior internal weight describes how uniformly a lexical item is distributed among the documents of a given behavior type within a specific topic. That is, the more uniform the distribution, the better the lexical item expresses the characteristics of that behavior under the specific topic. It is calculated by the following formula:
Here, H(w, θ, b)_inside denotes the behavior internal weight of lexical item w under topic θ for behavior type b; D(θ, b) denotes the set of documents under topic θ and behavior b; TF_wi is the frequency of lexical item w in document D_i; and TF_w is the sum of the frequencies of w over all documents in the behavior document set D(θ, b).
The larger the behavior internal weight, the more uniformly the lexical item is distributed under a given behavior of the specific topic, and the better it represents the characteristics of that behavior under the topic. The best case is that the lexical item occurs with the same frequency in every document of the behavior document set D(θ, b).
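The formula itself does not appear in this text. One form that matches the stated properties (built from the ratios TF_wi / TF_w and largest when the lexical item occurs equally often in every document) is a normalized Shannon entropy. The sketch below uses that form purely as an assumption; the function name and the normalization by the logarithm of the number of documents are not taken from the patent.

```python
import math


def internal_behavior_weight(term_freqs):
    """Normalized entropy of a lexical item's frequencies across documents.

    term_freqs[i] plays the role of TF_wi for document D_i in D(theta, b);
    their sum plays the role of TF_w. The entropy form and the normalization
    by log(number of documents) are assumptions, not the patent's formula.
    """
    total = sum(term_freqs)
    n_docs = len(term_freqs)
    if total == 0 or n_docs < 2:
        return 0.0
    entropy = 0.0
    for tf in term_freqs:
        if tf > 0:
            p = tf / total
            entropy -= p * math.log(p)
    return entropy / math.log(n_docs)
```

Under this assumed form, a lexical item that appears with equal frequency in every document scores 1, and one that appears in a single document scores 0.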
As the formula shows, H(w, θ, b)_inside must be calculated several times, once for each behavior under the specific topic. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of microblog behavior: posting, forwarding, and commenting.
The three behavior types of posting, forwarding, and commenting are each assigned a different weight, which indicates how important that behavior is for representing the topic. The final behavior-analysis-based internal weight is therefore calculated as follows:
H(w, θ)_inside = σ * H(w, θ, b1)_inside + μ * H(w, θ, b2)_inside + τ * H(w, θ, b3)_inside
Here σ, μ, and τ denote the weight factors of the different behaviors. Since no other factors are considered for now, the three weight factors sum to 1.
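The weighted sum over the three behaviors can be sketched as follows. The function name `combine_behavior_weights` and the dictionary layout are hypothetical; the check that the factors sum to 1 follows the constraint stated in the text.

```python
def combine_behavior_weights(per_behavior, factors):
    """Weighted sum of per-behavior weights.

    per_behavior maps behavior labels ('b1', 'b2', 'b3') to the per-behavior
    weight of one lexical item; factors maps the same labels to sigma, mu
    and tau, which must sum to 1.
    """
    if abs(sum(factors.values()) - 1.0) > 1e-9:
        raise ValueError("behavior weight factors must sum to 1")
    return sum(factors[b] * per_behavior[b] for b in factors)
```

The same helper applies unchanged to any set of per-behavior weights that is combined with the factors σ, μ, and τ; the factor values themselves are left to the user.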
Behavior-analysis-based lexical item external weight calculation unit 204, configured to calculate, for each lexical item in the user document set, the behavior-analysis-based external weight of the lexical item.
In a specific embodiment of the invention, the behavior-analysis-based lexical item external weight calculation unit 204 is specifically configured to:
Calculate the external weight H(w, b)_outside of the lexical item separately for each behavior type, and then calculate the behavior-analysis-based external weight H(w)_outside of the lexical item from the per-behavior external weights H(w, b)_outside, as follows:
The external weight describes how uniformly a lexical item is distributed across all topics. The more uniformly a lexical item is distributed, the less suitable it is for describing any particular topic. In microblogs, different behaviors influence the external weight differently. Therefore, the behavior external weight of a lexical item is first discussed separately for each behavior type.
The behavior external weight describes how uniformly a lexical item is distributed among the documents of a given behavior type within a document set. That is, the more uniform the distribution, the less suitable the lexical item is for expressing any particular topic. It is calculated by the following formula:
Here, H(w, b)_outside denotes the behavior external weight of lexical item w over all topics for behavior type b; D(b) denotes the set of all documents under behavior b; k is the number of topics; DF_wj is the number of documents in the document set of topic j that contain lexical item w; and DF_w is the number of all documents in the corpus that contain lexical item w. According to the formula, the larger the behavior external weight, the more uniformly lexical item w is distributed across the topics under that behavior. The worst case is that, under that behavior, every topic contains the same number of documents with lexical item w.
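This formula is also absent from the text. As with the internal weight, a normalized entropy over the ratios DF_wj / DF_w matches the described behavior (largest when every topic holds the same number of documents containing w). The sketch below rests on that assumption and on the further assumption that the total is counted within the documents of the one behavior being evaluated; the function name and input layout are hypothetical.

```python
import math
from collections import Counter


def external_behavior_weight(term, docs, n_topics):
    """Normalized entropy of a term's document frequency across topics.

    docs is a list of (topic_index, set_of_terms) pairs for the documents
    of one behavior type, i.e. D(b). The count for topic j plays the role
    of DF_wj and the total plays the role of DF_w. The entropy form and the
    normalization by log(n_topics) are assumptions.
    """
    df_per_topic = Counter(topic for topic, terms in docs if term in terms)
    df_total = sum(df_per_topic.values())
    if df_total == 0 or n_topics < 2:
        return 0.0
    entropy = -sum(
        (df / df_total) * math.log(df / df_total) for df in df_per_topic.values()
    )
    return entropy / math.log(n_topics)
```

A high value here marks a lexical item that is spread evenly over the topics and therefore says little about any one of them.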
As the formula shows, H(w, b)_outside must be calculated several times, once for each behavior. Real social networks have many types of behavior, such as posting, forwarding, commenting, and liking. To simplify the process, the present invention considers only three types of microblog behavior: posting, forwarding, and commenting.
The three behavior types of posting, forwarding, and commenting are each assigned a different weight, which indicates how important that behavior is for representing the topic. The final behavior-analysis-based external weight is therefore calculated as follows:
H(w)_outside = σ * H(w, b1)_outside + μ * H(w, b2)_outside + τ * H(w, b3)_outside
Here σ, μ, and τ denote the weight factors of the different behaviors. Since no other factors are considered for now, the three weight factors sum to 1.
Comprehensive weight calculation unit 205, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight and the lexical item external weight obtained by the behavior-analysis-based lexical item internal weight calculation unit 203 and the behavior-analysis-based lexical item external weight calculation unit 204.
The discussion of the internal weight and the external weight in connection with units 203 and 204 shows that both weights play an important role in measuring how typical a lexical item is. Therefore, on the basis of the topic model obtained by the LDA model, the present invention combines the internal weight and the external weight of a lexical item to measure its comprehensive weight for a specified topic. The calculation formula is as follows:
Behavior-analysis-based topic representation model calculation unit 206, configured to calculate, for each topic, the behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
In a specific embodiment of the invention, the proportion that each lexical item contributes to a specified topic is obtained by normalizing the weights of the lexical items under that topic, which gives the behavior-analysis-based topic representation shown in the following formula:
ρ(θ)_behavior = (ω(w1, θ), ω(w2, θ), ..., ω(wn, θ))
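The text says only that the weights are normalized. A minimal sketch, assuming normalization means dividing each comprehensive weight by the sum over the lexical items of the topic (the function name `normalize_topic_weights` is hypothetical):

```python
def normalize_topic_weights(comprehensive):
    """Scale the comprehensive weights of one topic so that they sum to 1.

    comprehensive maps each lexical item w_i to its comprehensive weight
    under topic theta; the result maps w_i to omega(w_i, theta).
    Sum normalization is an assumption about what "normalized" means here.
    """
    total = sum(comprehensive.values())
    if total == 0:
        return {w: 0.0 for w in comprehensive}
    return {w: value / total for w, value in comprehensive.items()}
```

With sum normalization the resulting vector is on the same scale as a topic-lexical item probability distribution, which is convenient when it is later mixed with the LDA output.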
Topic representation model generation unit 207, configured to calculate, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of the topic obtained by the behavior-analysis-based topic representation model calculation unit 206 and the topic model obtained by LDA, thereby obtaining the final topic representation model.
After the behavior-analysis-based topic representation model ρ(θ)_behavior has been calculated, it still has to be combined with the topic model ρ(θ)_LDA obtained by LDA. That is, the final topic representation model takes into account both the frequency with which lexical items occur in the topic and the results of the behavior analysis. The distribution-sensitive LDA topic representation of topic θ is obtained from ρ(θ)_behavior and ρ(θ)_LDA as shown in the following formula:
ρ(θ)_BEH-LDA = p * ρ(θ)_LDA + (1 - p) * ρ(θ)_behavior
Here p ∈ (0, 1) is a linear parameter that sets the relative weight of ρ(θ)_LDA and ρ(θ)_behavior.
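The final combination is a per-lexical-item linear interpolation. A short sketch, assuming both representations are given as dictionaries from lexical item to weight (the function name `combine_with_lda` is hypothetical, and treating a missing lexical item as weight 0 is an assumption):

```python
def combine_with_lda(rho_lda, rho_behavior, p):
    """Linear interpolation p * rho_LDA + (1 - p) * rho_behavior per lexical item."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    terms = set(rho_lda) | set(rho_behavior)
    return {
        w: p * rho_lda.get(w, 0.0) + (1.0 - p) * rho_behavior.get(w, 0.0)
        for w in terms
    }
```

If both inputs sum to 1 over the same lexical items, the interpolated representation also sums to 1 for any admissible p.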
In conclusion a kind of microblog topic of Behavior-based control analysis of the present invention indicates model generating method and device for micro- Rich short text feature obtains the microblogging of forwarding and its comment in LDA model included together as a document process with former blog article A topic is only chosen in multiple topics of a document out and constructs topic model as the topic of document, and according to LDA The topic that model obtains-lexical item distribution, considers distribution situation of the lexical item inside user's difference behavior, allows user to set different User behavior is to the impact factor of topic model, so that user behavior factor is combined in topic model, so that obtained topic It indicates that model is more accurate, more targetedly, improves the subsequent standard for carrying out topic discovery, EVOLUTION ANALYSIS scheduling algorithm using topic model Exactness.
The embodiments described above merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore as set out in the claims.

Claims (10)

1. A behavior-analysis-based microblog topic representation model generation method, comprising the following steps:
Step S1: combining the documents posted, forwarded, and commented on by microblog users to generate a user document set;
Step S2: generating a topic model from the generated user document set using an LDA model;
Step S3: calculating the behavior-analysis-based lexical item internal weight for each lexical item of each topic in the user document set;
Step S4: calculating the behavior-analysis-based lexical item external weight for each lexical item in the user document set;
Step S5: calculating the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight and the lexical item external weight obtained in steps S3 and S4;
Step S6: calculating, for each topic, the behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
2. The behavior-analysis-based microblog topic representation model generation method according to claim 1, characterized in that the method further comprises the following step after step S6:
Step S7: calculating, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of the topic obtained in step S6 and the topic model obtained by LDA in step S2, thereby obtaining the final topic representation model.
3. The behavior-analysis-based microblog topic representation model generation method according to claim 2, characterized in that step S2 further comprises:
Step S200: generating a document-topic model and a topic-lexical item model ρ(θ)_LDA from the user document set using the LDA model;
Step S201: for each document in the user document set, selecting the topic with the highest probability in the document-topic distribution generated by the LDA model as the topic of the document.
4. The behavior-analysis-based microblog topic representation model generation method according to claim 3, characterized in that step S3 further comprises:
Step S300: calculating the internal weight H(w, θ, b)_inside of the lexical item separately for each behavior type;
Step S301: calculating the behavior-analysis-based internal weight H(w, θ)_inside of the lexical item from the per-behavior internal weights H(w, θ, b)_inside.
5. The behavior-analysis-based microblog topic representation model generation method according to claim 4, characterized in that the behavior-analysis-based lexical item internal weight is calculated as follows:
H(w, θ)_inside = σ * H(w, θ, b1)_inside + μ * H(w, θ, b2)_inside + τ * H(w, θ, b3)_inside
wherein H(w, θ, b)_inside denotes the behavior internal weight of lexical item w under topic θ for behavior type b; D(θ, b) denotes the set of documents under topic θ and behavior b; TF_wi is the frequency of lexical item w in document D_i; TF_w is the sum of the frequencies of w over all documents in the behavior document set D(θ, b); and σ, μ, and τ denote the weight factors of the different behaviors.
6. The behavior-analysis-based microblog topic representation model generation method according to claim 4, characterized in that step S4 further comprises:
Step S400: calculating the external weight H(w, b)_outside of the lexical item separately for each behavior type;
Step S401: calculating the behavior-analysis-based external weight H(w)_outside of the lexical item from the per-behavior external weights H(w, b)_outside.
7. The behavior-analysis-based microblog topic representation model generation method according to claim 6, characterized in that the behavior-analysis-based external weight H(w)_outside of the lexical item is calculated as follows:
H(w)_outside = σ * H(w, b1)_outside + μ * H(w, b2)_outside + τ * H(w, b3)_outside
wherein H(w, b)_outside denotes the behavior external weight of lexical item w over all topics for behavior type b; D(b) denotes the set of all documents under behavior b; k is the number of topics; DF_wj is the number of documents in the document set of topic j that contain lexical item w; DF_w is the number of all documents in the corpus that contain lexical item w; and σ, μ, and τ denote the weight factors of the different behaviors.
8. The behavior-analysis-based microblog topic representation model generation method according to claim 6, characterized in that, in step S5, the behavior-analysis-based comprehensive weight of each lexical item under each topic is calculated as follows:
9. A behavior-analysis-based microblog topic representation model generation device, comprising:
a user document set generation unit, configured to combine the documents posted, forwarded, and commented on by microblog users to generate a user document set;
an initial topic model generation unit, configured to generate a topic model from the generated user document set using an LDA model;
a behavior-analysis-based lexical item internal weight calculation unit, configured to calculate the behavior-analysis-based lexical item internal weight for each lexical item of each topic in the user document set;
a behavior-analysis-based lexical item external weight calculation unit, configured to calculate the behavior-analysis-based lexical item external weight for each lexical item in the user document set;
a comprehensive weight calculation unit, configured to calculate the behavior-analysis-based comprehensive weight of each lexical item under each topic from the lexical item internal weight and the lexical item external weight obtained by the behavior-analysis-based lexical item internal weight calculation unit and the behavior-analysis-based lexical item external weight calculation unit;
a behavior-analysis-based topic representation model calculation unit, configured to calculate, for each topic, the behavior-analysis-based topic representation model from the obtained behavior-analysis-based comprehensive weights of the lexical items under that topic.
10. The behavior-analysis-based microblog topic representation model generation device according to claim 9, characterized in that the device further comprises:
a topic representation model generation unit, configured to calculate, for each topic, the behavior-analysis-based LDA topic representation from the behavior-analysis-based topic representation model of the topic obtained by the behavior-analysis-based topic representation model calculation unit and the topic model obtained by the initial topic model generation unit using LDA, thereby obtaining the final topic representation model.
CN201811315209.1A 2018-11-06 2018-11-06 A kind of the microblog topic expression model generating method and device of Behavior-based control analysis Pending CN109558546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811315209.1A CN109558546A (en) 2018-11-06 2018-11-06 A kind of the microblog topic expression model generating method and device of Behavior-based control analysis

Publications (1)

Publication Number Publication Date
CN109558546A true CN109558546A (en) 2019-04-02

Family

ID=65865953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811315209.1A Pending CN109558546A (en) 2018-11-06 2018-11-06 A kind of the microblog topic expression model generating method and device of Behavior-based control analysis

Country Status (1)

Country Link
CN (1) CN109558546A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339762A (en) * 2020-02-14 2020-06-26 广州大学 Topic representation model construction method and device based on hybrid intelligence

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101571853A (en) * 2009-05-22 2009-11-04 哈尔滨工程大学 Evolution analysis device and method for contents of network topics
CN102708153A (en) * 2012-04-18 2012-10-03 中国信息安全测评中心 Self-adaption finding and predicting method and system for hot topics of online social network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU DENG: "Topic Detection Based on User Intention", 《2015 IEEE 15TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339762A (en) * 2020-02-14 2020-06-26 广州大学 Topic representation model construction method and device based on hybrid intelligence
CN111339762B (en) * 2020-02-14 2023-04-07 广州大学 Topic representation model construction method and device based on hybrid intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190402