CN103631862A - Event characteristic evolution excavation method and system based on microblogs - Google Patents

Publication number
CN103631862A
Authority
CN
China
Prior art keywords
event
evolution
microblog
edge
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310532377.7A
Other languages
Chinese (zh)
Other versions
CN103631862B (en)
Inventor
邓镭
贾焰
邹鹏
杨树强
周斌
韩伟红
李爱平
韩毅
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310532377.7A
Publication of CN103631862A
Application granted
Publication of CN103631862B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a microblog-based method and system for mining the evolution of event features. In the method, an evolution starting document set is selected from a time-ordered sequence of microblogs, and graph models of the documents are constructed over the microblog document set from the co-occurrence of words, yielding a knowledge network structure of the event. The microblog graph models are then merged according to the literal features of the words and the compatibility of their tendency (sentiment orientation) features, producing a micro evolution graph of the event features. The micro evolution graph is pruned, segmented and transformed to form a macro evolution graph of the event features. Because the evolution of event features is mined with a graph mining method based on the event's knowledge network, the method as a whole better preserves the inheritance of knowledge, and the mining results are more interpretable.

Description

Microblog-based event feature evolution mining method and system
Technical field
The present invention relates to the fields of text mining and topic detection and tracking, and in particular to a method for mining the evolution of event features from microblog text data.
Background art
With the rapid growth of Web 2.0 technologies and applications in recent years, online microblog services have become a new information dissemination platform with a large user base that produces a large volume of information. According to the 29th China Internet report, by the end of December 2011 the number of microblog users in China had reached 250 million, an increase of 296.0% over the end of the previous year, and 48.7% of internet users used microblogs.
Unlike strong-relationship social networking services such as Facebook, the social relationships in a microblog service are usually one-directional: a user can follow other users and receive the information they produce without their authorization. The people a user follows are that user's friends; the people who follow a user are that user's followers. All posts (tweets) published by a user appear on the public timeline, and all of the user's messages are shown on the timelines of all of the user's followers.
A real-world topic or event, projected into the text space of microblogs, is the set of posts in which users discuss that topic or event. (In text analysis the concepts of topic and event are sometimes not distinguished; this document adopts that view throughout.) Real-world topics and events evolve continuously, and so do the corresponding topics and events in the microblog text space. A topic/event evolves at the moment a follower forwards or comments on information posted by a user they follow. Besides explicitly or implicitly repeating the viewpoints and narrative of the original post, a forward or comment may introduce new viewpoints and new narrative, and the topic then changes to some degree. The evolution of a topic begins the first time the original post is forwarded or commented on. As forwarding and commenting continue, the extension of the topic keeps expanding and the topic keeps evolving. Studying the evolution of a topic/event during propagation means tracking the small changes in the topic/event information at each propagation step and then examining the overall change of the topic/event at the macro level.
Current research on the propagation and evolution of topic/event information on microblogs falls into two categories. The first category analyzes the behavioral elements of topic/event propagation, builds a mathematical model of topic propagation and evolution, and simulates the propagation and evolution process in order to answer why topics/events spread. This category leans toward simulation and modeling in communication theory and has little practical value for studying the propagation and evolution of a specific topic/event. The second category combines the social network information in microblogs with traditional topic/event models to reason about how a topic/event propagates in microblogs. It yields two kinds of results: first, the explicit and implicit propagation paths of the topic/event in microblogs; second, the changes that the topic/event model undergoes during propagation. The basic steps of this category of research are:
1. Arrange the microblog texts that discuss the same topic/event in time order, keeping their explicit forwarding relations, and process the sequence from earliest to latest. If necessary, introduce the concept of a time slice and process the texts of one time slice together. If no time slices are introduced, each document can be regarded as occupying a time slice of its own.
2. Build the topic/event model of each time slice. Vector space models and probability models are the usual choices. If necessary, split the topic model of the time slice into several sub-topics that represent different aspects of the topic.
3. Taking the topic/event model at time 0 as the baseline, examine in turn the topic/event model of each text in the subsequent time slices, compare the later model with the earlier one for similarity, and infer their propagation relation. Because information flow in microblogs is local, the relation between the users who produced the two texts must be taken into account in this step: if there is no evident connection between the two users, the probability of a propagation relation between the texts is considered small.
4. After step 3, each document can be regarded as a vertex and each propagation relation between documents as an edge between vertices, so a propagation tree or propagation graph of the text information can be constructed. This graph describes the explicit and implicit propagation paths of the topic/event information in microblogs. Examining the topic/event model at each vertex along a path, the way the model changes is the evolution of the topic/event along that path.
As the above description shows, the evolution of the topic/event is examined while the propagation model is being built, so the evolution process has no independent topic/event model of its own and depends on topic models such as vector space or probability models. These topic models are effective representations of document collections but cannot express topic evolution. As a result, the evolution analysis obtained by the above methods amounts to no more than the change of word frequencies or word vectors over time: it carries no association information between words, has no inheritance with respect to the domain knowledge of the topic/event, and lacks interpretability with respect to evolution. In view of this, a new topic/event feature evolution mining method is needed.
Summary of the invention
The object of the present invention is to overcome the above defects of the prior art and to provide a new microblog-based event feature evolution mining method and system.
The object of the present invention is achieved through the following technical solutions:
In one aspect, the invention provides a microblog-based event feature evolution mining method, comprising:
Step 1: from the set of microblog texts related to the event to be analyzed, select several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set;
Step 2: construct the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph; in the graph model, a vertex is a noun/verb appearing in a microblog text of the starting-point set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is less than a predetermined threshold;
Step 3: for each remaining microblog in the set of microblog texts related to the event to be analyzed, build the graph model of the microblog and add it to the current event micro evolution graph;
Step 4: obtain the event macro evolution graph from the event micro evolution graph produced by step 3, and observe the evolution of the event features on the basis of the event macro evolution graph.
In the above method, a microblog representing the origin of the event in step 1 may have the following features: a) it was published early; b) it is an original microblog rather than a forward or comment.
In the above method, a vertex of the graph model in step 2 may be represented by a triple consisting of the noun/verb corresponding to the vertex, the set of microblog documents containing that noun/verb, and the tendency score of that noun/verb, where the tendency score of the noun/verb is the mean of the tendency scores of the adjectives and adverbs that modify it.
In the above method, step 2 may comprise:
Step 2-1) performing word segmentation and part-of-speech tagging on every microblog text in the event evolution starting-point microblog set;
Step 2-2) assigning a tendency score to each adjective and adverb obtained by segmentation;
Step 2-3) for each noun and verb obtained by segmentation, averaging the tendency scores of the adjectives and adverbs that modify the same noun/verb, and using the average as the tendency score of that noun or verb;
Step 2-4) taking nouns and verbs as vertices and creating an edge between any two vertices whose words appear together in the same microblog or whose co-occurrence distance is less than a predetermined threshold.
In the above method, adding the graph model of a microblog to the current event micro evolution graph in step 3 may comprise, for each edge in the graph model of the microblog being processed:
a) if both vertices of the edge already exist in the current event micro evolution graph and the edge already exists there, incrementing the occurrence count of the edge; if the edge does not exist in the event micro evolution graph, copying the edge into it;
b) if exactly one vertex of the edge appears in the current event micro evolution graph, copying the edge and the vertex that is not yet in the event micro evolution graph into it;
c) if neither vertex of the edge is in the current event micro evolution graph, copying the edge and both vertices into it in full.
In the above method, step 3 may further comprise a step of judging whether a given vertex in the graph model of a microblog is already in the event micro evolution graph: if the event micro evolution graph contains a vertex with the same word, the microblog has a forwarding or comment relation with a microblog text associated with that corresponding vertex, and the tendency scores of the two vertices are compatible, then the event micro evolution graph is judged to already contain the given vertex. Tendency scores are compatible when the difference between the tendency score of the corresponding vertex in the event micro evolution graph and that of the given vertex is less than a certain threshold.
In the above method, step 4 may comprise segmenting and transforming the event micro evolution graph to obtain the event macro evolution graph.
In the above method, segmenting and transforming the event micro evolution graph comprises:
Step 4-1) sorting the microblog texts related to the event to be analyzed by time and slicing the resulting sequence by time to form time slices of the desired granularity;
Step 4-2) creating a vertex in the event macro evolution graph that corresponds to the initial event micro evolution graph;
Step 4-3) for each time slice, carrying out the following steps:
4-3-a) selecting, in the event micro evolution graph, the vertices and edges corresponding to the time slice, and constructing the minimal connected subgraph that has the selected subgraph as its base;
4-3-b) creating a vertex in the event macro evolution graph that corresponds to this minimal connected subgraph and, if this subgraph intersects the subgraph corresponding to another vertex of the event macro evolution graph, creating an edge connecting the two.
In the above method, step 4-3) may further comprise assigning a weight to the created edge connecting the two subgraphs, the weight being the Jaccard coefficient of the subgraphs corresponding to the two vertices. For any two vertices v and v′ in the event macro evolution graph, the Jaccard coefficient of their corresponding subgraphs is computed as:
$$J(G_v, G_{v'}) = \frac{\#(G_v \cap G_{v'})}{\#(G_v \cup G_{v'})}$$
where G_v ∩ G_v′ and G_v ∪ G_v′ denote the intersection and the union of the vertex sets of the subgraphs corresponding to the two vertices, and the function #(·) denotes the number of elements in a set.
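The edge weighting can be illustrated with a minimal sketch, assuming each macro vertex is represented only by the set of micro-graph vertex identifiers in its subgraph. The names `jaccard` and `macro_edges`, the dictionary layout and the sample words are illustrative assumptions, not part of the patent text.

```python
from itertools import combinations

def jaccard(g_v, g_w):
    """Jaccard coefficient of two subgraphs given as vertex sets."""
    union = g_v | g_w
    if not union:
        return 0.0  # assumption: two empty subgraphs get weight 0
    return len(g_v & g_w) / len(union)

def macro_edges(subgraphs):
    """Return {(a, b): weight} for every pair of intersecting subgraphs.

    `subgraphs` maps a macro-vertex id to the set of micro-graph
    vertices in its subgraph (a hypothetical representation).
    """
    edges = {}
    for a, b in combinations(sorted(subgraphs), 2):
        if subgraphs[a] & subgraphs[b]:
            edges[(a, b)] = jaccard(subgraphs[a], subgraphs[b])
    return edges

subgraphs = {
    0: {"quake", "rescue"},
    1: {"rescue", "donate", "team"},
    2: {"rumor"},
}
print(macro_edges(subgraphs))
```

Only pairs whose vertex sets intersect receive an edge, so isolated subgraphs such as macro vertex 2 stay unconnected.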
In the above method, step 4 may further comprise a step of pruning the event micro evolution graph: deleting the edges whose occurrence count is below a given threshold, and then deleting the branches that are no longer connected to the initial event micro evolution graph. The occurrence count of an edge is the number of times the words corresponding to its two vertices appear together in the same microblog within the set of microblog texts related to the event to be analyzed.
In another aspect, the invention provides a microblog-based event feature evolution mining system, comprising:
a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set;
a device for constructing the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph; in the graph model, a vertex is a noun/verb appearing in a microblog text of the starting-point set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is less than a predetermined threshold;
a device for building, for each remaining microblog in the set of microblog texts related to the event to be analyzed, the graph model of the microblog and adding it to the current event micro evolution graph;
a device for obtaining the event macro evolution graph from the final event micro evolution graph and observing the evolution of the event features on the basis of the event macro evolution graph.
Compared with the prior art, the invention has the following advantages:
The graph model of the event is used as the basis, and constructing the knowledge structure among words yields an event evolution model that is more interpretable at the knowledge level. The event evolution graph is constructed on the event graph model with the knowledge network as the unit, which improves the inheritance of event knowledge. Taking the characteristics of microblog text into account, a statistical approach is used: the large number of texts and participating users compensates for the brevity and sparse features of an individual microblog text.
Brief description of the drawings
Embodiments of the invention are described in detail below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of the microblog-based event feature evolution mining method according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
One embodiment of the present invention provides a microblog-based event feature evolution mining method with higher resolution and interpretability, which goes beyond the boundaries of individual documents and mines and tracks the fine-grained evolution of an event at the level of event knowledge. The concrete steps of the method are described below with reference to Fig. 1.
Step 1: obtain the set of microblog texts that discuss the same event and select several evolution starting-point microblogs from it. An evolution starting-point microblog is a microblog that represents the origin of the event, and it has the following features: a) it was published early; b) it is an original microblog rather than a forward or comment. According to one embodiment of the present invention, step 1 may comprise the following steps:
Step 1-1: obtain the set of microblog texts that discuss the same event, for example by keyword search.
Step 1-2: sort the microblogs that discuss the same event in chronological order, that is, arrange the microblog texts in the set from earliest to latest by publication time, keeping the explicit forwarding and comment relations between microblogs (in this application, forwarding and commenting are treated as equivalent). The sequence can be written D = {d_1, d_2, ..., d_n}. The subscripts 1 to n also serve as the time marks of the documents; since time is infinitely divisible, it can be assumed that at most one document is produced at any moment. On this sequence, a forwarding indicator function Rt: D × D → {0, 1} is defined to represent the forwarding relation between documents: for documents d_i, d_j with 0 < i < j < n, Rt(d_i, d_j) = 1 if document d_j forwarded document d_i, and 0 otherwise. On this basis, a function isRt: D → {0, 1} can be defined to indicate whether each document is an original document (0) or a forwarded document (1). In addition, there is a version of the forwarding indicator function Rt defined on sets of documents, Rt: 2^D × 2^D → {0, 1}; for document sets D_1 and D_2:
Figure BDA0000406196070000061
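The document-level indicators Rt and isRt can be sketched as follows. This is a minimal illustration assuming the forwarding relations are available as a mapping from each document index to the index of the document it directly forwards; the name `make_indicators` and the `parent` mapping are hypothetical, and only direct forwarding is modeled.

```python
def make_indicators(parent):
    """Build Rt and isRt from `parent`, a hypothetical mapping
    document index -> index of the document it forwards (or None)."""
    def rt(i, j):
        # 1 if document j directly forwarded (or commented on) document i, i < j
        return 1 if i < j and parent.get(j) == i else 0

    def is_rt(i):
        # 0 for an original document, 1 for a forward/comment
        return 0 if parent.get(i) is None else 1

    return rt, is_rt

# d_1 and d_3 are original; d_2 forwards d_1; d_4 forwards d_2
parent = {1: None, 2: 1, 3: None, 4: 2}
rt, is_rt = make_indicators(parent)
print(rt(1, 2), rt(1, 4), [is_rt(i) for i in sorted(parent)])
```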
Step 1-3: select several evolution starting-point microblogs from this set as the initial document set D_0. Because the microblog that represents the origin of the event may not be unique, the number of forwarded documents can be used as the qualifying condition for the initial document set. The candidate range D_candidate of the initial document set D_0 is a run of consecutive documents aligned with the front of the microblog sequence D described above:
$$D_{candidate} = \{d_1, d_2, \ldots, d_k\} \ \Big|\ \sum_{i=1}^{k} isRt(d_i) \le \varepsilon_{start}$$
D_0 is the subset of D_candidate formed by the original microblogs. Here ε_start is the forwarding limit threshold, which can be set to 5, and k can be taken as the largest value that satisfies the inequality. D_0 is also called the event evolution starting-point microblog set.
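The selection rule can be sketched as below: a minimal illustration, assuming the documents are already in time order and that `is_rt_flags` (a hypothetical input) holds the isRt value of each one. Because the flags are non-negative, the running sum never decreases, so stopping at the first violation gives the largest admissible k.

```python
def select_initial_set(is_rt_flags, eps_start=5):
    """Return (candidate indices, D_0 indices), both 0-based.

    is_rt_flags[i] is 1 if the i-th document (in time order) is a
    forward/comment and 0 if it is original.
    """
    forwards = 0
    k = 0
    for flag in is_rt_flags:
        if forwards + flag > eps_start:
            break  # adding this document would exceed the forward limit
        forwards += flag
        k += 1
    candidate = list(range(k))
    d0 = [i for i in candidate if is_rt_flags[i] == 0]
    return candidate, d0

flags = [0, 1, 0, 1, 1, 0, 1]
print(select_initial_set(flags, eps_start=2))
```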
Step 2: construct the graph model of the event evolution starting-point microblog set, that is, the knowledge network of the event's origin.
For a single microblog text, according to one embodiment of the present invention, the graph model of the microblog can be built as follows:
(1) Perform word segmentation and part-of-speech tagging on the text.
(2) For the adjectives and adverbs among the resulting words, obtain the corresponding tendency score by querying a tendency database, for example a tendency dictionary. The tendency score of an adjective/adverb can be attached as a feature value to the vertex of the noun or verb it modifies. The tendency score can be, for example, a real number in [-1, 1]: the closer to -1, the stronger the negative tendency; the closer to 1, the stronger the positive tendency. More simply, the score can be restricted to the three values {-1, 0, 1}, denoting negative, neutral and positive tendency respectively.
(3) For the nouns and verbs among the resulting words, find the adjectives and adverbs that modify them and average the tendency scores of the modifiers of the same target; the average is the tendency score of that noun or verb. Nouns and verbs then take part in the construction of the graph model as vertices, optionally with the current timestamp attached. An edge is created between noun/verb vertices that satisfy a specified condition, namely that the two words appear in the same sentence or that their co-occurrence distance is less than a specified threshold. The edge can additionally carry the number of times the association has occurred and the timestamps of those occurrences. The co-occurrence distance of two words is the number of characters or words between them when both appear in the same microblog.
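The per-microblog construction can be sketched as follows. This is only an illustration under stated assumptions: segmentation, tagging and modifier attachment are taken as already done, so the input is a hypothetical list of `(word, pos, head)` tuples; the edge condition uses only the co-occurrence distance, counted as the number of words between two tokens; and a noun or verb without modifiers is given a neutral score of 0.

```python
from itertools import combinations

def build_post_graph(tokens, lexicon, max_dist=3):
    """Build (vertices, edges) for one post.

    tokens: list of (word, pos, head); pos is 'n', 'v', 'a' or 'd', and
            head is the index of the word an adjective/adverb modifies
            (None otherwise). This pre-parsed format is an assumption.
    lexicon: word -> tendency score in [-1, 1].
    """
    modifier_scores = {}  # token index -> scores of its modifiers
    for word, pos, head in tokens:
        if pos in ("a", "d") and head is not None:
            modifier_scores.setdefault(head, []).append(lexicon.get(word, 0.0))

    word_scores = {}  # noun/verb -> all modifier scores seen for it
    positions = {}    # noun/verb -> token indices where it occurs
    for i, (word, pos, _) in enumerate(tokens):
        if pos in ("n", "v"):
            word_scores.setdefault(word, []).extend(modifier_scores.get(i, []))
            positions.setdefault(word, []).append(i)

    # tendency score of a vertex = mean score of its modifiers (0 if none)
    vertices = {w: (sum(s) / len(s) if s else 0.0) for w, s in word_scores.items()}

    edges = set()
    for a, b in combinations(sorted(vertices), 2):
        # number of words between the two tokens must be below the threshold
        if any(abs(i - j) - 1 < max_dist for i in positions[a] for j in positions[b]):
            edges.add((a, b))
    return vertices, edges

tokens = [("serious", "a", 1), ("earthquake", "n", None), ("hit", "v", None),
          ("city", "n", None), ("quickly", "d", 5), ("rescue", "v", None)]
lexicon = {"serious": -0.6, "quickly": 0.4}
vertices, edges = build_post_graph(tokens, lexicon, max_dist=2)
print(vertices)
print(sorted(edges))
```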
For the event evolution starting-point microblog set, according to one embodiment of the present invention, the graph model of the set can be constructed with the following steps:
Step 2-1: perform word segmentation and part-of-speech tagging on every document in the initial document set D_0.
Step 2-2: for the adjectives and adverbs obtained by segmentation, query the tendency dictionary to obtain their tendency scores. As described above, the score can be a real number in [-1, 1]; in the simplified form it can be restricted to the three values {-1, 0, 1}, denoting negative, neutral and positive tendency respectively. The tendency scores of adjectives and adverbs are ultimately applied to nouns and verbs: the tendency score s(w) of a word w can be the mean of the tendency scores of the adjectives or adverbs that modify the noun or verb w.
Step 2-3: construct the graph model G^(0) = <V, E ∪ R, L_V, L_E> of the initial document set D_0 and use it as the starting point for constructing the event micro evolution graph. Here V is the vertex set; E and R are edge sets of different types, E holding direct connections and R holding association connections; L_V is the labeling function of the vertex set V, and L_E is the labeling function of the edge set E.
A vertex in V represents a noun or a verb and can be described by a triple consisting of the literal value of the word, the set of documents in which the word occurs, and the tendency of the word. The labeling function of a vertex is therefore written:
L_V(v) = <w_v, D_v, s(w_v)>, where w_v is the word w corresponding to vertex v, D_v is the set of microblog documents containing this word, and s(w_v) is the tendency score of the word, also called the tendency feature value of vertex v.
An edge in E indicates that a specific relation holds between two vertices, for example that the words corresponding to the two vertices appear in the same microblog or that their co-occurrence distance is less than a predetermined threshold. The labeling function of an edge can be expressed as the count of co-occurrences of the words corresponding to the two vertices together with the set of time marks of the corresponding documents, that is, the set of publication times of the microblogs that contain both words. For e = {v_1, v_2} ∈ E:
$$L_E(e) = \langle c(v_1, v_2),\ t_{v_1 v_2} \rangle$$
where c(v_1, v_2) is the number of times the words corresponding to vertices v_1 and v_2 appear together in the same microblog, and t_{v_1 v_2} is the set of publication times of the microblogs that contain both words; it contains c(v_1, v_2) document time marks.
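One way to hold these labels in code is sketched below. The class and field names (`VertexLabel`, `EdgeLabel`, `record`) are illustrative assumptions; the time marks are kept in a list so that its length always equals the co-occurrence count.

```python
from dataclasses import dataclass, field

@dataclass
class VertexLabel:
    word: str                                  # w_v
    docs: set = field(default_factory=set)     # D_v, ids of posts containing the word
    score: float = 0.0                         # s(w_v)

@dataclass
class EdgeLabel:
    count: int = 0                             # c(v1, v2)
    times: list = field(default_factory=list)  # t_{v1 v2}, one time mark per co-occurrence

    def record(self, timestamp):
        """Register one more post in which both words co-occur."""
        self.count += 1
        self.times.append(timestamp)

vertex = VertexLabel("earthquake", {1, 4}, -0.6)
edge = EdgeLabel()
edge.record(1001)
edge.record(1007)
print(vertex)
print(edge)
```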
The resulting graph model G^(0) of the event evolution starting-point microblog set is also called the micro evolution model of the event at the initial time.
Step 3: in chronological order, process the remaining microblogs one by one: build the graph model of each microblog and add it to the event model of the previous moment, until all microblogs have been processed. The result is the micro graph model of the event's evolution. According to one embodiment of the present invention, the process of adding the graph model of a pending microblog to the existing graph model may follow these steps:
For each edge in the graph model of the pending microblog:
If both vertices of the edge already exist in the existing graph model and the edge already exists there, increment the occurrence counter of the edge; if the edge does not exist in the existing graph model, copy the edge into it.
If exactly one vertex of the edge appears in the existing graph model, copy the edge and the vertex that is not yet in the existing graph model into it.
If neither vertex of the edge is in the existing graph model, copy the edge and both vertices into it in full.
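The three cases can be sketched as follows. To keep the example short, vertices are identified by their word alone, which is a simplification of the vertex-identity test used by the method; the graph layout and the name `add_post_graph` are hypothetical.

```python
def add_post_graph(graph, post_edges):
    """Merge the edges of one post's graph into `graph` in place.

    graph: {"vertices": set of words, "edges": {frozenset({a, b}): count}}
    post_edges: iterable of (a, b) word pairs from the post's graph.
    """
    for a, b in post_edges:
        key = frozenset((a, b))
        if a in graph["vertices"] and b in graph["vertices"]:
            # both endpoints known: count the edge again, or copy it with count 1
            graph["edges"][key] = graph["edges"].get(key, 0) + 1
        else:
            # one or both endpoints missing: copy them together with the edge
            graph["vertices"].update((a, b))
            graph["edges"][key] = 1
    return graph

graph = {"vertices": {"quake", "rescue"},
         "edges": {frozenset(("quake", "rescue")): 1}}
add_post_graph(graph, [("quake", "rescue"), ("rescue", "donate"), ("rumor", "police")])
print(sorted(graph["vertices"]))
print({tuple(sorted(k)): v for k, v in graph["edges"].items()})
```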
Continuing the example of the microblog set D and the initial document set D_0 above: from the remaining document sequence D − D_0, take each document d_i in turn and construct its graph model by the method described above, denoted G_i. The micro evolution model of the event at this point is denoted G^(i); G_i is merged into G^(i) by the following steps to obtain G^(i+1).
When the graph model of a pending microblog is added to the existing graph model, it must be judged whether a given vertex is already contained in the existing graph model. For a given vertex, if the graph contains a vertex with the same word, the forwarding indicator function over the document sets of the two vertices evaluates to 1 (true), and the tendency feature values of the two vertices are compatible, then the graph is judged to already contain the given vertex. Tendency feature values are compatible when the difference between the tendency feature value of the vertex in the graph and that of the given vertex is less than a certain threshold. Define a function Eq_V: V × V → {0, 1} that judges whether two vertices are equal:
$$Eq_V(v, v') = \begin{cases} 1, & w_v = w_{v'} \ \wedge\ Rt(D_{v'}, D_v) = 1 \ \wedge\ |s(w_v) - s(w_{v'})| < \varepsilon_s \\ 0, & \text{otherwise} \end{cases}$$
Further define a function Mt_V: V × V → {0, 1}. A value of 1 indicates a pair of vertices whose words are identical but whose document sets have no evolution relation (for example, forwarding or comment); the two vertices are then said to be related. The function is defined as follows:
Figure BDA0000406196070000092
Here ε_s is the tendency difference threshold specified in advance; an empirical value is 0.3.
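The vertex-equality test can be sketched as below. The set-level forwarding indicator is defined in the source by an equation image that is not reproduced, so the helper `rt_sets` is an explicit assumption: it returns 1 when any document in one set forwards or comments on a document in the other. The triple layout `(word, docs, score)` and the `forwards` set of `(earlier_id, later_id)` pairs are likewise hypothetical.

```python
def rt_sets(docs_a, docs_b, forwards):
    """Set-level forwarding indicator (ASSUMED definition, see lead-in).

    forwards: set of (earlier_id, later_id) pairs of direct forwards/comments.
    """
    return int(any((a, b) in forwards or (b, a) in forwards
                   for a in docs_a for b in docs_b))

def eq_v(v, u, forwards, eps_s=0.3):
    """1 if vertices v and u count as the same vertex, else 0.

    Each vertex is a (word, docs, score) triple.
    """
    same_word = v[0] == u[0]
    linked = rt_sets(v[1], u[1], forwards) == 1
    compatible = abs(v[2] - u[2]) < eps_s
    return int(same_word and linked and compatible)

forwards = {(1, 2)}                 # post 2 forwards post 1
old = ("rescue", {1}, 0.5)          # vertex already in the graph
new = ("rescue", {2}, 0.4)          # same word, linked post, close score
unlinked = ("rescue", {3}, 0.4)     # same word, no forwarding relation
opposite = ("rescue", {2}, -0.5)    # same word, linked, incompatible score
print(eq_v(new, old, forwards), eq_v(unlinked, old, forwards), eq_v(opposite, old, forwards))
```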
For the graph model G_i of document d_i, take each edge e = {v_1, v_2} ∈ E in it, and let v denote either of its two vertices:
(a) If there is a vertex v′ in G^(i) with Eq_V(v, v′) = 1, merge v and v′ and treat the two as the same vertex:
D_v′ ← D_v′ ∪ D_v
s(v′) ← (s(v′) + s(v)) / 2
(b) If there is a vertex v′ in G^(i) with Mt_V(v, v′) = 1, introduce v into the graph as a new vertex and add the edge r = {v′, v} to the edge set R.
(c) If neither condition (a) nor condition (b) holds, add the vertex v directly to the graph G^(i).
If G^(i) does not yet contain an edge e′ = {v_1, v_2} ∈ E, add this edge; if it does, merge e and e′:
c(v_1, v_2)_e′ ← c(v_1, v_2)_e′ + c(v_1, v_2)_e
$$t_{e'}^{v_1 v_2} \leftarrow t_{e'}^{v_1 v_2} \cup t_{e}^{v_1 v_2}$$
Repeat the above process until every document in the pending document set has been processed. The event micro evolution graph obtained at that point is denoted G.
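Cases (a) to (c) for a single vertex can be sketched as follows. The equality and relatedness tests are passed in as callables standing in for Eq_V and Mt_V, because their exact definitions are not fixed here; the toy predicates in the usage lines, the dictionary layout and the name `merge_vertex` are all illustrative assumptions.

```python
def merge_vertex(graph, v, is_equal, is_related):
    """Add vertex `v` to `graph` following cases (a), (b) and (c).

    graph: {"V": list of vertex dicts, "R": list of (existing, new) index pairs}
    v: {"word": str, "docs": set, "score": float}
    Returns the index of the vertex that now represents `v`.
    """
    for idx, u in enumerate(graph["V"]):
        if is_equal(v, u):
            # (a) merge: union the document sets, average the scores
            u["docs"] |= v["docs"]
            u["score"] = (u["score"] + v["score"]) / 2
            return idx
    related = [idx for idx, u in enumerate(graph["V"]) if is_related(v, u)]
    # (b) and (c): v enters the graph as a new vertex
    graph["V"].append(v)
    new_idx = len(graph["V"]) - 1
    for idx in related:
        graph["R"].append((idx, new_idx))  # (b) association edge r = {v', v}
    return new_idx

# Toy predicates for demonstration only; they are not the patent's definitions.
is_equal = lambda v, u: v["word"] == u["word"] and abs(v["score"] - u["score"]) < 0.3
is_related = lambda v, u: v["word"] == u["word"]

graph = {"V": [{"word": "rescue", "docs": {1}, "score": 0.5}], "R": []}
a = merge_vertex(graph, {"word": "rescue", "docs": {2}, "score": 0.25}, is_equal, is_related)
b = merge_vertex(graph, {"word": "rescue", "docs": {3}, "score": -0.75}, is_equal, is_related)
c = merge_vertex(graph, {"word": "rumor", "docs": {4}, "score": 0.0}, is_equal, is_related)
print(a, b, c, graph["R"])
```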
Step 4: prune, segment and transform the event micro evolution graph to obtain the event macro evolution graph.
Pruning the event micro evolution graph can consist of deleting the edges whose co-occurrence count is below a specified threshold and then deleting the branches that are no longer connected to the initial graph G^(0). Segmenting the event micro evolution graph can consist of dividing the initial microblog sequence mentioned in step 1 by time; depending on the requirement, it can be divided into time slices of different granularity. According to one embodiment of the present invention, transforming the event micro evolution graph means converting it into the macro evolution graph, as follows. First, create the initial vertex of the event macro evolution graph, which corresponds to the subgraph expressing the starting part of the micro evolution graph. Then examine each time slice in turn: select the vertices and edges of the micro evolution graph that correspond to the time slice, construct the minimal connected subgraph of the micro evolution graph that has the selected subgraph as its base, and add this subgraph to the macro evolution graph as one vertex. If this subgraph intersects the subgraph corresponding to another vertex, create an edge in the macro evolution graph connecting the two vertices and assign the Jaccard coefficient of the two subgraphs to the edge as its feature value.
Still with the D of microblogging set above, initial document sets D 0and remaining document sequence D-D 0for example, according to one embodiment of present invention, carry out the implementation of description of step 4.
Step 4-1: A threshold $\varepsilon_{co}$ is given that bounds the minimum co-occurrence count between words (the number of times they appear together in the same microblog); alternatively, a minimum co-occurrence frequency is given, equal to the required minimum co-occurrence count divided by the total number of documents. Scan every edge of the event micro-evolution graph $G$: for $e = \{v_1, v_2\} \in E$, if $c(v_1, v_2)_e \le \varepsilon_{co}$, remove the edge from $E$. Then, starting from the initial graph $G^{(0)}$, search the connected components of the graph and delete from the vertex set every vertex that is not connected to the initial graph.
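The pruning of step 4-1 can be sketched in Python as follows. This is an illustrative sketch only: the function name `prune` and the dictionary representation of the edge counts are assumptions, and reachability from the initial graph is found with a breadth-first search, which is one possible way to search the connected components.

```python
from collections import deque


def prune(cooc, initial_vertices, eps_co):
    """Prune a micro-evolution graph.

    cooc: {(u, v): co-occurrence count}; initial_vertices: vertices of G^(0).
    Returns (kept_edges, kept_vertices).
    """
    # Remove every edge whose count satisfies c <= eps_co.
    kept = {e: c for e, c in cooc.items() if c > eps_co}

    # Build an adjacency map over the surviving edges.
    adj = {}
    for u, v in kept:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Breadth-first search from the initial graph.
    reached = set(initial_vertices)
    queue = deque(initial_vertices)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in reached:
                reached.add(w)
                queue.append(w)

    # Drop the branches that are no longer connected to the initial graph.
    kept = {e: c for e, c in kept.items() if e[0] in reached}
    return kept, reached
```

When a frequency threshold is used instead of a count, the caller would multiply it by the total number of documents before passing it as `eps_co`.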
Step 4-2: Divide the document sequence $D - D_0$ into time slices. Different time slices can be formed as required, using methods that include the following:
(a) specify a fixed time interval, for example dividing by hour or by day;
(b) compute the time span of the initial document set $D_0$ and use it as the fixed length of each time slice;
(c) cluster the documents according to how densely they are distributed in time within the document sequence, forming time slices of unequal length.
The sequence of time slices obtained in this step is denoted $T = \{T_1, T_2, \ldots, T_m\}$; each time slice contains one or more documents.
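Methods (a) and (b) of step 4-2 can be sketched in Python as follows. The function names `slice_by_interval` and `initial_span` and the `(doc_id, timestamp)` input format are assumptions made for illustration. Empty intervals are skipped so that every returned slice contains at least one document, as the text requires. Method (c), clustering by temporal density, is not covered here.

```python
from datetime import datetime, timedelta


def initial_span(initial_docs):
    """Method (b): time span of the initial document set D_0."""
    stamps = [ts for _, ts in initial_docs]
    return max(stamps) - min(stamps)


def slice_by_interval(docs, interval):
    """Method (a): cut documents into fixed-length time slices.

    docs: list of (doc_id, datetime); interval: positive timedelta.
    Returns a list of slices, each a list of doc ids in time order.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if not docs:
        return []
    ordered = sorted(docs, key=lambda d: d[1])
    start = ordered[0][1]
    slices = {}
    for doc_id, ts in ordered:
        k = (ts - start) // interval      # index of the slice containing ts
        slices.setdefault(k, []).append(doc_id)
    return [slices[k] for k in sorted(slices)]   # empty slices are omitted
```

Passing `initial_span(d0)` as `interval` would give method (b); passing `timedelta(hours=1)` or `timedelta(days=1)` would give method (a).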
Step 4-3: Create the event macro-evolution graph
Figure BDA0000406196070000104
where $V_\Psi$ is the vertex set of the macro-evolution graph, $E_\Psi$ is the edge set, the vertex set $V_\Psi$ has a labeling function, and
Figure BDA0000406196070000102
is the labeling function of the edge set $E_\Psi$. Create a vertex $v_0 \in V_\Psi$ and set
Figure BDA0000406196070000103
Step 4-4: Examine each time slice in the time-slice set $T$ in turn. For each $T_i \in T$:
Select the vertex set and edge set of graph $G$ that fall within time slice $T_i$, denoted $V$ and $E$ respectively. To speed up the query, the two timestamps attached to each vertex and edge, as described above, can be used to select the vertices and edges within a time slice.
Mark the maximal connected components of the vertex set $V$ and, taking these maximal connected components as the base, construct the minimal connected subgraph $G_v$ of graph $G$ that contains $V$. According to one embodiment of the present invention, the method comprises:
(a) using Dijkstra's algorithm to find the shortest paths between pairs of vertices that lie in two different maximal connected components;
(b) selecting the shortest of these paths and adding all vertices and edges on the path to the subgraph;
(c) repeating steps (a) and (b) until the subgraph is fully connected.
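Steps (a) to (c) can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: all edges are given unit weight (the source does not say which edge weight Dijkstra's algorithm uses), the graph has no self-loops, vertices are mutually comparable values such as strings, and Dijkstra's algorithm is run from a whole component at once to find its nearest vertex in another component. The function names are hypothetical.

```python
import heapq


def components(vertices, adj):
    """Connected components of the subgraph described by adjacency map adj."""
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj.get(u, ()):
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def shortest_path_out(sources, targets, adj):
    """Unit-weight Dijkstra from a component to its nearest target vertex."""
    dist = {s: 0 for s in sources}
    prev = {}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        if u in targets:
            path = [u]
            while path[-1] in prev:       # walk back to a source vertex
                path.append(prev[path[-1]])
            return path[::-1]
        for w in adj.get(u, ()):
            if d + 1 < dist.get(w, float("inf")):
                dist[w], prev[w] = d + 1, u
                heapq.heappush(heap, (d + 1, w))
    return None


def minimal_connected_subgraph(adj, vertices, edges):
    """Join the components of a time-slice subgraph via shortest paths in G."""
    vertices, edges = set(vertices), {frozenset(e) for e in edges}
    while True:
        sub_adj = {}
        for u, v in map(tuple, edges):
            sub_adj.setdefault(u, set()).add(v)
            sub_adj.setdefault(v, set()).add(u)
        comps = components(vertices, sub_adj)
        if len(comps) <= 1:
            return vertices, edges
        paths = [shortest_path_out(c, vertices - c, adj) for c in comps]
        paths = [p for p in paths if p]
        if not paths:                     # G itself is disconnected here
            return vertices, edges
        best = min(paths, key=len)        # step (b): the shortest of the paths
        vertices.update(best)
        edges.update(frozenset(p) for p in zip(best, best[1:]))
```

Here `adj` is the adjacency map of the whole micro-evolution graph $G$, while `vertices` and `edges` describe the part selected for one time slice.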
Create a vertex $v \to V_\Psi$ and set
Figure BDA0000406196070000111
Enumerate every other vertex $v'$ in $V_\Psi$; if the subgraphs corresponding to $v$ and $v'$ intersect, create an edge $e = \{v, v'\} \to E_\Psi$ and label it with the Jaccard coefficient of the vertices $v$ and $v'$, which is computed as:
Figure BDA0000406196070000114
where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and union, respectively, of the vertex sets of the two micro-evolution subgraphs, and the function $\#(\cdot)$ denotes the number of elements in a set.
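Adding a time-slice subgraph to the macro-evolution graph and weighting the new edges can be sketched in Python as follows. The Jaccard coefficient is computed in the standard way, as the size of the intersection divided by the size of the union of the two vertex sets. The function names and the dictionary representation of the macro-evolution graph are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient #(a & b) / #(a | b) of two vertex sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_macro_vertex(macro_vertices, macro_edges, name, sub_vertices):
    """Add one time-slice subgraph to the macro-evolution graph.

    macro_vertices: {macro vertex name: set of micro-graph vertices}
    macro_edges:    {(name1, name2): Jaccard weight}
    """
    new = set(sub_vertices)
    for other, other_set in macro_vertices.items():
        if new & other_set:                       # the subgraphs intersect
            macro_edges[(other, name)] = jaccard(new, other_set)
    macro_vertices[name] = new
```

A caller would first register the initial subgraph as `v0` and then call `add_macro_vertex` once per time slice, in time order.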
Repeat step 4-4 until all time slices have been processed. The event macro-evolution graph is then complete, and the evolution of event features can be observed on the basis of it.
The micro-evolution graph of an event uses words as its granularity. It mainly reflects how knowledge about the event keeps expanding as the event develops and changes, and through the rules by which edges are formed it captures the inheritance and evolution of that knowledge. In interpretability it is therefore superior to traditional evolution analysis methods that rely purely on word similarity. However, the micro-evolution graph has many nodes and complex connections, which suits computation by machine but hinders observation by people. The macro-evolution graph distilled from the micro-evolution graph uses time slices as its granularity; the numbers of nodes and edges are correspondingly much smaller, which makes it suitable for human observation. In addition, the observation granularity can be changed by adjusting the time-slice size, so that the macro-evolution graph can be zoomed. Based on the event macro-evolution graph, the evolution of event features can be observed easily.
In yet another embodiment of the present invention, a microblog-based event feature evolution mining system is also provided, comprising: a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event-evolution starting-point microblog set; a device for constructing, by the method described above, the graph model of the event-evolution starting-point microblog set as the initial event micro-evolution graph; a device for building, by the method discussed above, the graph model of each remaining microblog in the set of microblog texts related to the event to be analyzed and adding it to the current event micro-evolution graph; and a device for obtaining, by the method described above, the event macro-evolution graph from the final event micro-evolution graph and observing the evolution of event features based on the event macro-evolution graph.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (11)

1. A microblog-based event feature evolution mining method, comprising the following steps:
Step 1: select, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event-evolution starting-point microblog set;
Step 2: construct the graph model of the event-evolution starting-point microblog set as the initial event micro-evolution graph; in said graph model, the vertices are the nouns/verbs appearing in the microblog texts of the event-evolution starting-point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is less than a predetermined threshold;
Step 3: for each remaining microblog in the set of microblog texts related to the event to be analyzed, build the graph model of the microblog and add it to the current event micro-evolution graph;
Step 4: obtain the event macro-evolution graph from the event micro-evolution graph produced by step 3, and observe the evolution of event features based on the event macro-evolution graph.
2. The method according to claim 1, wherein the microblogs representing the origin of the event in said step 1 have the following characteristics: a) they were posted early; b) they are original microblogs rather than forwarded microblogs or comments.
3. The method according to claim 1, wherein each vertex of the graph model in said step 2 is represented by a triple consisting of the noun/verb corresponding to the vertex, the set of microblog documents containing the noun/verb, and the sentiment-tendency score of the noun/verb, the sentiment-tendency score of the noun/verb being the mean of the sentiment-tendency scores of the adjectives and adverbs that modify the noun/verb.
4. The method according to claim 3, wherein said step 2 comprises:
Step 2-1) performing word segmentation and part-of-speech tagging on every microblog text in the event-evolution starting-point microblog set;
Step 2-2) assigning a sentiment-tendency score to each adjective and adverb obtained by word segmentation;
Step 2-3) for each noun and verb obtained by word segmentation, averaging the sentiment-tendency scores of the adjectives and adverbs that modify the same noun/verb, and using the average as the sentiment-tendency score of the noun or verb;
Step 2-4) taking the nouns and verbs as vertices and, if the words corresponding to any two vertices appear together in the same microblog or their co-occurrence distance is less than a predetermined threshold, creating an edge between the two vertices.
5. The method according to claim 1, wherein adding the graph model of the constructed microblog to the current event micro-evolution graph in said step 3 comprises, for each edge in the graph model of the pending microblog:
a) if both vertices of the edge already exist in the current event micro-evolution graph, then: if the edge already exists in the event micro-evolution graph, incrementing the occurrence count of the edge; if the edge does not exist in the event micro-evolution graph, copying the edge into the event micro-evolution graph;
b) if exactly one vertex of the edge appears in the current event micro-evolution graph, copying the vertex that is not in the event micro-evolution graph, together with the edge, into the event micro-evolution graph;
c) if neither vertex of the edge is in the current event micro-evolution graph, copying the edge and both vertices in full into the event micro-evolution graph.
6. The method according to claim 5, wherein said step 3 further comprises a step of determining whether a vertex in the graph model of a microblog is already in the event micro-evolution graph, comprising: for a given vertex in the graph model of the microblog, if the event micro-evolution graph includes a vertex whose word is identical to the word of the given vertex, the microblog has a forwarding or comment relation with a microblog text associated with that corresponding vertex in the event micro-evolution graph, and the sentiment-tendency scores of the two vertices are compatible, then the event micro-evolution graph is judged to already contain the given vertex; wherein compatible sentiment-tendency scores means that the difference between the sentiment-tendency score of the corresponding vertex in the event micro-evolution graph and that of the given vertex is less than a threshold.
7. The method according to claim 1, wherein said step 4 comprises slicing and transforming the event micro-evolution graph to obtain the event macro-evolution graph.
8. The method according to claim 7, wherein slicing and transforming said event micro-evolution graph comprises:
Step 4-1) sorting the microblog texts related to the event to be analyzed by time and slicing the microblog text sequence by time to form time slices of the desired granularity;
Step 4-2) creating a vertex in the event macro-evolution graph that corresponds to the initial event micro-evolution graph;
Step 4-3) for each time slice, performing the following steps:
4-3-a) selecting from the event micro-evolution graph the vertices and edges corresponding to the time slice, and constructing the minimal connected subgraph based on this subgraph;
4-3-b) creating a vertex in the event macro-evolution graph that corresponds to this minimal connected subgraph and, if this minimal connected subgraph intersects the subgraph corresponding to another vertex in the event macro-evolution graph, creating an edge connecting the two subgraphs.
9. The method according to claim 8, wherein said step 4-3) further comprises assigning a weight to the created edge connecting the two subgraphs, the weight of the edge being the Jaccard coefficient of the subgraphs corresponding to the two vertices; wherein, for any two vertices $v$ and $v'$ in the event macro-evolution graph, the Jaccard coefficient of their corresponding subgraphs is computed from $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$, which denote the intersection and union, respectively, of the vertex sets of the subgraphs corresponding to the two vertices, and the function $\#(\cdot)$ denotes the number of elements in a set.
10. The method according to claim 7, wherein said step 4 further comprises a step of pruning the event micro-evolution graph, comprising deleting from the event micro-evolution graph the edges whose occurrence count is below a given threshold and then deleting the branches not connected to the initial event micro-evolution graph, wherein the occurrence count of an edge is the number of times that the words corresponding to the two vertices of the edge appear together in the same microblog within the set of microblog texts related to the event to be analyzed.
11. A microblog-based event feature evolution mining system, comprising:
a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event-evolution starting-point microblog set;
a device for constructing the graph model of the event-evolution starting-point microblog set as the initial event micro-evolution graph; in said graph model, the vertices are the nouns/verbs appearing in the microblog texts of the event-evolution starting-point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices appear together in the same microblog or that their co-occurrence distance is less than a predetermined threshold;
a device for building, for each remaining microblog in the set of microblog texts related to the event to be analyzed, the graph model of the microblog and adding it to the current event micro-evolution graph;
a device for obtaining the event macro-evolution graph from the final event micro-evolution graph and observing the evolution of event features based on the event macro-evolution graph.
CN201310532377.7A 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs Active CN103631862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2012104337138 2012-11-02
CN201210433713.8 2012-11-02
CN201210433713 2012-11-02
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Publications (2)

Publication Number Publication Date
CN103631862A true CN103631862A (en) 2014-03-12
CN103631862B CN103631862B (en) 2017-01-11

Family

ID=50212904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310532377.7A Active CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Country Status (1)

Country Link
CN (1) CN103631862B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102214241A (en) * 2011-07-05 2011-10-12 清华大学 Method for detecting burst topic in user generation text stream based on graph clustering
US20110295903A1 (en) * 2010-05-28 2011-12-01 Drexel University System and method for automatically generating systematic reviews of a scientific field
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536956A (en) * 2014-07-23 2015-04-22 中国科学院计算技术研究所 A Microblog platform based event visualization method and system
CN104933129B (en) * 2015-06-12 2019-04-30 百度在线网络技术(北京)有限公司 Event train of thought acquisition methods and system based on microblogging
CN104899908A (en) * 2015-06-12 2015-09-09 百度在线网络技术(北京)有限公司 Method and device for generating evolution diagram of event group
CN104933129A (en) * 2015-06-12 2015-09-23 百度在线网络技术(北京)有限公司 Event context acquisition method and system based on micro-blogs
CN104899908B (en) * 2015-06-12 2018-09-11 百度在线网络技术(北京)有限公司 The method and apparatus for generating event group evolution diagram
US10324989B2 (en) 2015-06-12 2019-06-18 Baidu Online Network Technology (Beijing) Co., Ltd Microblog-based event context acquiring method and system
CN106708947A (en) * 2016-11-25 2017-05-24 成都寻道科技有限公司 Big data-based web article forwarding recognition method
CN106708947B (en) * 2016-11-25 2020-06-09 成都寻道科技有限公司 Web article forwarding and identifying method based on big data
CN109145224A (en) * 2018-08-20 2019-01-04 电子科技大学 Social networks event-order serie relationship analysis method
CN109145224B (en) * 2018-08-20 2021-11-23 电子科技大学 Social network event time sequence relation analysis method
CN110472105A (en) * 2019-08-06 2019-11-19 电子科技大学 A kind of social networks event evolution method for tracing divided based on the time
CN110781317A (en) * 2019-10-29 2020-02-11 北京明略软件系统有限公司 Method and device for constructing event map and electronic equipment
CN110781317B (en) * 2019-10-29 2022-03-01 北京明略软件系统有限公司 Method and device for constructing event map and electronic equipment

Also Published As

Publication number Publication date
CN103631862B (en) 2017-01-11

Similar Documents

Publication Publication Date Title
CN103631862B (en) Event characteristic evolution excavation method and system based on microblogs
Oprea et al. isarcasm: A dataset of intended sarcasm
US10162816B1 (en) Computerized system and method for automatically transforming and providing domain specific chatbot responses
CN104991956B (en) Microblogging based on theme probabilistic model is propagated group and is divided and account liveness appraisal procedure
Efron Information search and retrieval in microblogs
Amplayo et al. Cold-start aware user and product attention for sentiment classification
CN105512245A (en) Enterprise figure building method based on regression model
US20130297581A1 (en) Systems and methods for customized filtering and analysis of social media content collected over social networks
CN104536956A (en) A Microblog platform based event visualization method and system
CN105912656A (en) Construction method of commodity knowledge graph
Nasution et al. Social network extraction based on Web. A comparison of superficial methods
Ou et al. Exploiting community emotion for microblog event detection
Trung et al. Sentiment analysis based on fuzzy propagation in online social networks: a case study on TweetScope
CN106503858A (en) A kind of method that trains for predicting the model of social network user forwarding message
Moussaoui et al. A possibilistic framework for the detection of terrorism‐related Twitter communities in social media
CN104199947A (en) Important person speech supervision and incidence relation excavating method
CN104217026B (en) A kind of Chinese micro-blog tendentiousness search method based on graph model
Celli et al. Long chains or stable communities? The role of emotional stability in Twitter conversations
CN106372147B (en) Heterogeneous topic network construction and visualization method based on text network
CN105760410B (en) A kind of microblogging semanteme expansion model and method based on forwarding comment
Contreras et al. Lexicon-based Sentiment Analysis with Pattern Matching Application using Regular Expression in Automata
Feng et al. A unified microblog user similarity model for online friend recommendation
Michelle et al. Topic sensitive information diffusion modelling in online social networks
CN109522389A (en) Document method for pushing, device and storage medium
Yang et al. Comparison and modelling of country-level micro-blog user behaviour and activity in cyber-physical-social systems using weibo and twitter data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant