CN103631862B - Event characteristic evolution excavation method and system based on microblogs - Google Patents

Info

Publication number
CN103631862B
CN103631862B CN201310532377.7A CN201310532377A CN103631862B CN 103631862 B CN103631862 B CN 103631862B CN 201310532377 A CN201310532377 A CN 201310532377A CN 103631862 B CN103631862 B CN 103631862B
Authority
CN
China
Prior art keywords
event
evolution
microblogging
limit
summit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310532377.7A
Other languages
Chinese (zh)
Other versions
CN103631862A (en)
Inventor
邓镭
贾焰
邹鹏
杨树强
周斌
韩伟红
李爱平
韩毅
李莎莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201310532377.7A
Publication of CN103631862A
Application granted
Publication of CN103631862B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a microblog-based method and system for mining the evolution of event features. In the method, an evolution starting document set is selected from a time-ordered sequence of microblogs, and a graph model of the documents is constructed over the microblog document set from the co-occurrence of words, yielding the knowledge network structure of the event; the microblog graph models are merged according to the literal features of the words and the compatibility of their tendency features, producing a micro evolution graph of the event features; the micro evolution graph is then pruned, segmented and converted into a macro evolution graph of the event features. Because the evolution of event features is mined with a graph mining method based on the knowledge network of the event, the method as a whole improves the inheritance of knowledge, and the mining results are more interpretable.

Description

Event feature evolution mining method and system based on microblogs
Technical field
The present invention relates to the field of text mining and topic detection and tracking, and in particular to a method for mining the evolution of event features from microblog text data.
Background art
With the rapid development of Web 2.0 technology and applications in recent years, online microblog services have gradually become a new information dissemination platform with a large number of users and a large volume of generated information. According to the 29th China Internet report, by the end of December 2011 the number of microblog users in China had reached 250 million, an increase of 296.0% over the end of the previous year, with a usage rate among Internet users of 48.7%.
Unlike strong-relationship social networking services such as Facebook, the social relationships in a microblog service are typically unidirectional: a user can follow other users and receive the information they produce without their authorization. The people a user follows are called that user's friends; the people who follow a user are that user's followers. All posts (tweets) published by a user appear on the public timeline, and all of that user's messages are displayed on the timelines of all of the user's followers.
A real-world topic or event, projected into the text space of microblogs, is the set of posts in which users discuss that topic or event. (In text analysis the concepts of topic and event are sometimes not distinguished; this document adopts that view.) Topics and events in the real world evolve continuously, and correspondingly so do topics and events in the microblog text space. A topic/event evolves at the moment a follower forwards or comments on information sent by a user they follow. Besides explicitly or implicitly repeating the viewpoints and narration of the original post, a forward or comment may introduce new viewpoints and new narration, at which point the topic changes to some degree. The evolution of a topic begins when the original post is first forwarded or commented on. As forwarding and commenting continue, the scope of the topic keeps extending and the topic keeps evolving. Studying the evolution of a topic/event during propagation means tracking the slight changes in topic/event information at each propagation step and then examining, in aggregate, how the topic/event changes at the macro level.
Current research on the propagation and evolution of topic/event information on microblogs falls into two categories. The first category analyzes the behavioral elements of topic/event propagation and builds a mathematical model of topic propagation and evolution to simulate the process, in order to answer the question of which topics/events spread most easily. This kind of research leans toward the theory of simulating and modeling propagation and has little practical significance for studying the propagation and evolution of a particular topic/event. The second category combines the social network information in microblogs with traditional topic/event models to reason about the propagation of a topic/event in microblogs. Such research ultimately yields two kinds of results: first, the explicit and implicit propagation paths of the topic/event in microblogs; second, the changes the topic/event model undergoes during propagation. The basic steps of this kind of research are:
1. Arrange the texts that discuss the same topic/event in microblogs in chronological order, preserving their explicit forwarding relations, and process them from earliest to latest following the forwarding order. If necessary, introduce the concept of a time slice and process the texts in the same time slice together. If no time slice is introduced, each document can be regarded as occupying a time slice of its own;
2. Build a topic/event model for each time slice, typically using a vector space model or a probabilistic model. If necessary, split the topic model of the time slice into several sub-topics that represent different aspects of the topic.
3. Taking the topic/event model at time 0 as the baseline, examine in turn the topic/event model of each text in subsequent time slices, compare the similarity of the later model with the earlier one, and infer their propagation relation. Because information flow in microblogs is local, this step must also take into account the relation between the users who produced the two texts: if there is no evident connection between the two users, the probability of a propagation relation between the texts is considered small.
4. After step 3, each document can be regarded as a vertex and each propagation relation between documents as an edge between vertices, so a propagation tree or propagation graph of the text information can be constructed. This graph describes the explicit/implicit propagation paths of the topic/event information in microblogs. Examining the topic/event model at each vertex along a path, the way the model changes is the evolution of the topic/event along that path.
As the description above shows, the evolution of a topic/event is examined while the propagation model is being built, so the evolution process has no independent model of its own and relies instead on topic models such as vector space or probabilistic models. These topic models are effective representations of document collections but lack any representation of topic evolution. As a result, the evolution analysis obtained by the above methods amounts to little more than how word frequencies or term vectors change over time; it lacks information on the associations between terms, has no inheritance in terms of the domain knowledge of the topic/event, and lacks interpretability with respect to evolution. In view of this, a new topic/event feature evolution mining method is needed.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art described above and to provide a new microblog-based event feature evolution mining method and system.
The object of the present invention is achieved through the following technical solutions:
In one aspect, the invention provides a microblog-based event feature evolution mining method, comprising:
Step 1: from the set of microblog texts related to the event to be analyzed, select several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set;
Step 2: construct the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph. The vertices of the graph model are the nouns/verbs that occur in the microblog texts of the starting-point set, and an edge between two vertices indicates that the words corresponding to the two vertices co-occur in the same microblog or that their co-occurrence distance is less than a predefined threshold;
Step 3: for each remaining microblog in the set of microblog texts related to the event to be analyzed, build the graph model of that microblog and add it to the current event evolution micro graph;
Step 4: obtain the event macro evolution graph from the event micro evolution graph produced by step 3, and observe the evolution of the event features on the basis of the event macro evolution graph.
In the above method, a microblog representing the origin of the event in step 1 may have the following features: a) it was published early; b) it is an original microblog rather than a forward or a comment.
In the above method, a vertex of the graph model in step 2 may be represented by a triple consisting of the noun/verb corresponding to the vertex, the set of microblog documents containing that noun/verb, and the tendency score of that noun/verb, where the tendency score of the noun/verb is the mean of the tendency scores of the adjectives and adverbs that modify it.
In the above method, step 2 may comprise:
Step 2-1) perform word segmentation and part-of-speech tagging on every microblog text in the event evolution starting-point microblog set;
Step 2-2) assign a tendency score to each adjective and adverb obtained after segmentation;
Step 2-3) for each noun and verb obtained after segmentation, average the tendency scores of the adjectives and adverbs that modify the same noun/verb, and use the result as the tendency score of that noun or verb;
Step 2-4) take the nouns and verbs as vertices; if the words corresponding to any two vertices co-occur in the same microblog or their co-occurrence distance is less than a predefined threshold, create an edge between the two vertices.
In the above method, adding the graph model of the constructed microblog to the current event evolution micro graph in step 3 may comprise, for each edge in the graph model of the microblog being processed:
a) if both vertices of the edge are already present in the current event evolution micro graph and the edge already exists there, increment the occurrence count of the edge; if the edge does not yet exist in the event evolution micro graph, copy the edge into it;
b) if exactly one vertex of the edge appears in the current event evolution micro graph, copy the vertex that is not in the event evolution micro graph, together with the edge, into it;
c) if neither vertex of the edge is in the current event evolution micro graph, copy the edge and both vertices in full into it.
In the above method, step 3 may further comprise a step of judging whether a vertex in the graph model of a microblog is already in the event evolution micro graph: for a given vertex in the graph model of the microblog, if the event evolution micro graph contains a vertex with the same word, the microblog has a forwarding or commenting relation with the microblog texts associated with the corresponding vertex in the micro graph, and the tendency scores of the two vertices are compatible, then the event evolution micro graph is judged to already contain the given vertex. Here, compatible tendency scores means that the difference between the tendency score of the corresponding vertex in the event evolution micro graph and that of the given vertex is less than a certain threshold.
In the above method, step 4 may comprise segmenting and converting the event micro evolution graph to obtain the event macro evolution graph.
In the above method, segmenting and converting the event micro evolution graph may comprise:
Step 4-1) sort the microblog texts related to the event to be analyzed by time and slice the microblog text sequence by time to form time slices of the desired granularity;
Step 4-2) create a vertex in the event macro evolution graph corresponding to the initial event micro evolution graph;
Step 4-3) perform the following steps for each time slice:
4-3-a) in the event micro evolution graph, select the vertices and edges corresponding to the time slice, and construct the minimal connected subgraph based on this subgraph;
4-3-b) create a vertex in the event macro evolution graph corresponding to this minimal connected subgraph; if this minimal connected subgraph intersects the subgraph corresponding to another vertex in the event macro evolution graph, create an edge connecting the vertices of the two subgraphs.
In the above method, step 4-3) may further comprise assigning a weight to the created edge connecting the two subgraphs, the weight being the Jaccard coefficient of the subgraphs corresponding to the two vertices. For any two vertices $v$ and $v'$ in the event macro evolution graph, the Jaccard coefficient of their subgraphs is computed as $J(v, v') = \#(G_v \cap G_{v'}) / \#(G_v \cup G_{v'})$, where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and union, respectively, of the vertex sets of the subgraphs corresponding to the two vertices, and the function $\#(\cdot)$ gives the number of elements in a set.
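The edge weight described above can be sketched in a few lines of Python. This is only an illustration: each subgraph is represented by its vertex set as a plain `set` of words, the function name `jaccard` is hypothetical, and returning 0.0 for two empty sets is an assumption the source does not address.

```python
def jaccard(sub_a: set[str], sub_b: set[str]) -> float:
    """Jaccard coefficient of two vertex sets: #(A ∩ B) / #(A ∪ B)."""
    union = sub_a | sub_b
    if not union:
        # Assumption: two empty subgraphs are given weight 0.0.
        return 0.0
    return len(sub_a & sub_b) / len(union)


# Hypothetical vertex sets of two subgraphs in the macro evolution graph.
g_v = {"earthquake", "rescue", "donate"}
g_v2 = {"rescue", "donate", "rebuild", "school"}
weight = jaccard(g_v, g_v2)  # 2 shared vertices out of 5 distinct ones
```

A weight near 1 indicates that two time slices share most of their vocabulary, while a weight near 0 indicates that the event's vocabulary has largely changed.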
In the above method, step 4 may further comprise a step of pruning the event micro evolution graph, which includes deleting the edges in the event micro evolution graph whose occurrence count is less than a given threshold and then deleting the branches not connected to the initial event micro evolution graph, where the occurrence count of an edge is the number of times the words corresponding to its two vertices co-occur in the same microblog within the set of microblog texts related to the event to be analyzed.
In another aspect, the invention provides a microblog-based event feature evolution mining system, comprising:
a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set;
a device for constructing the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph, where the vertices of the graph model are the nouns/verbs that occur in the microblog texts of the starting-point set, and an edge between two vertices indicates that the words corresponding to the two vertices co-occur in the same microblog or that their co-occurrence distance is less than a predefined threshold;
a device for building, for each remaining microblog in the set of microblog texts related to the event to be analyzed, the graph model of that microblog and adding it to the current event evolution micro graph;
a device for obtaining the event macro evolution graph from the final event micro evolution graph and observing the evolution of the event features on the basis of the event macro evolution graph.
Compared with the prior art, the advantages of the present invention are as follows:
Based on a graph model of the event, the knowledge structure among terms is constructed, yielding an event evolution model that is more interpretable at the knowledge level. On top of the event graph model, the event evolution graph is constructed with knowledge networks as units, which improves the inheritance of event knowledge. Taking the characteristics of microblog text into account, statistical methods are used so that the large volume of text and the large number of participating users compensate for the shortness and feature sparsity of a single microblog text.
Brief description of the drawings
Embodiments of the invention are described in detail below with reference to the accompanying drawing, in which:
Fig. 1 is a schematic flow chart of the microblog-based event feature evolution mining method according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in more detail below through specific embodiments with reference to the accompanying drawing. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
In one embodiment of the invention, a microblog-based event feature evolution mining method with higher resolution and interpretability is provided. It goes beyond the boundaries of individual documents and mines and tracks the event evolution process at fine granularity from the perspective of event knowledge. The concrete steps of the method are described below with reference to Fig. 1.
Step 1: obtain the set of microblog texts that discuss the same event and select several evolution starting-point microblogs from it. An evolution starting-point microblog is a microblog that represents the origin of the event, and must have the following features: a) it was published early; b) it is an original microblog rather than a forward or comment. According to one embodiment of the invention, step 1 may comprise the following steps:
Step 1-1: obtain the set of microblog texts that discuss the same event, for example by keyword search.
Step 1-2: sort the microblogs that discuss the same event in chronological order, i.e., arrange the microblog texts in the set from earliest to latest by publication time, preserving the explicit forwarding and commenting relations between microblogs (this application treats forwarding and commenting as equivalent). The sequence is denoted $D=\{d_1,d_2,\ldots,d_n\}$. The subscripts 1 to n also serve as the time labels of the documents: because time is infinitely divisible, each moment can be assumed to produce only one document. On this sequence, define the forwarding indicator function $Rt: D \times D \to \{0,1\}$, which represents the forwarding relation between documents: for documents $d_i, d_j$ with $0<i<j<n$, $Rt(d_i,d_j)=1$ if $d_j$ forwards $d_i$, and 0 otherwise. On the basis of this relation, define the function $isRt: D \to \{0,1\}$, which indicates whether each document is an original document (0) or a forwarding document (1). In addition, a version of the forwarding indicator function is defined on document sets, $Rt: 2^D \times 2^D \to \{0,1\}$, for document sets $D_1$ and $D_2$.
Step 1-3: select several evolution starting-point microblogs from this set as the initial document set $D_0$. Because the microblog serving as the event origin may not be unique, the number of forwarding documents can be used as the limiting condition for the initial document set. The candidate range $D_{candidate}$ of the initial document set $D_0$ is a contiguous subsequence of documents aligned with the front of the microblog sequence $D$ above, with:
$D_{candidate} = \{d_1, d_2, \ldots, d_k\} \mid \sum_{i=1}^{k} isRt(d_i) \le \varepsilon_{start}$, and $D_0$ is the subset of $D_{candidate}$ consisting of original microblogs. Here $\varepsilon_{start}$ is the forwarding limit threshold, which may be set to 5, and $k$ takes the largest value satisfying the inequality. $D_0$ is also called the event evolution starting-point microblog set.
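The selection rule of step 1-3 can be sketched as a single pass over the time-ordered sequence. This is a minimal sketch under stated assumptions: the function name `select_start_set` is hypothetical, the input is a list of booleans holding isRt for each document in time order, and indices are 0-based rather than the 1-based subscripts used in the text.

```python
def select_start_set(is_forward: list[bool], eps_start: int = 5) -> tuple[int, list[int]]:
    """Return (k, indices of D_0) for a time-ordered microblog sequence.

    is_forward[i] is True when document i is a forward/comment (isRt = 1).
    k is the length of the longest prefix containing at most eps_start forwards.
    """
    forwards = 0
    k = 0
    for flag in is_forward:
        forwards += int(flag)
        if forwards > eps_start:
            break  # adding this document would violate the inequality
        k += 1
    # D_0 keeps only the original microblogs of the candidate prefix.
    d0 = [i for i in range(k) if not is_forward[i]]
    return k, d0


# Hypothetical sequence: False = original microblog, True = forward/comment.
flags = [False, False, True, False, True, True, False, True]
k, d0 = select_start_set(flags, eps_start=2)
```

Because the running count of forwards never decreases, stopping at the first document that would exceed the threshold gives the largest admissible k.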
Step 2: construct the graph model of the event evolution starting-point microblog set, i.e., the knowledge network of the event origin.
For a single microblog text, according to one embodiment of the invention, the graph model of the microblog can be built as follows:
(1) Segment the text into words and tag parts of speech.
(2) For the adjectives and adverbs among the resulting words, look up a tendency database, such as a tendency dictionary, to obtain their tendency scores. The tendency score of an adjective/adverb can be attached as a feature value to the vertex of the noun or verb it modifies. A tendency score may be, for example, a real number in [-1, 1]: the closer to -1, the stronger the negative tendency; the closer to 1, the stronger the positive tendency. More simply, the tendency score may be restricted to the three values {-1, 0, 1}, representing negative, neutral and positive tendency respectively.
(3) For the nouns and verbs among the resulting words, find the adjectives and adverbs that modify them and average the tendency scores of the modifiers of the same target to obtain the tendency score of that noun or verb. The nouns and verbs then take part in the graph model as vertices, optionally with a timestamp attached. Edges are created between noun and verb vertices that satisfy a specified condition, namely that the two words occur in the same sentence or that their co-occurrence distance is less than a specified threshold. The number of times the association occurs and the corresponding timestamps can also be attached to the edge. The co-occurrence distance of two words is the number of characters or words between them when they occur in the same microblog.
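The averaging of modifier scores onto nouns and verbs can be sketched as follows. The sketch assumes that word segmentation, part-of-speech tagging and the identification of which modifier attaches to which word have already been done elsewhere, so the input is a list of hypothetical (modifier, target) pairs; the lexicon contents and the treatment of unknown modifiers as neutral are also assumptions.

```python
from collections import defaultdict


def tendency_scores(modifier_pairs, lexicon):
    """Average the lexicon scores of the modifiers attached to each noun/verb."""
    collected = defaultdict(list)
    for modifier, target in modifier_pairs:
        # Assumption: modifiers missing from the lexicon count as neutral (0.0).
        collected[target].append(lexicon.get(modifier, 0.0))
    return {target: sum(vals) / len(vals) for target, vals in collected.items()}


# Hypothetical tendency dictionary with scores in [-1, 1].
lexicon = {"timely": 0.5, "slow": -0.5, "brave": 1.0}
# Hypothetical (adjective/adverb, modified noun/verb) pairs from one microblog.
pairs = [("timely", "rescue"), ("slow", "rescue"), ("brave", "volunteer")]
scores = tendency_scores(pairs, lexicon)
```

Nouns and verbs without any modifier do not appear in the result; a caller would need to decide on a default for them, which the text leaves open.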
For the event evolution starting-point microblog set, according to one embodiment of the invention, the graph model of the set can be constructed by the following steps:
Step 2-1: perform word segmentation and part-of-speech tagging on each document in the initial document set $D_0$.
Step 2-2: for the adjectives and adverbs obtained after segmentation, look up the tendency dictionary to obtain their tendency scores. As mentioned above, the score may be a real number in [-1, 1] or, in simplified form, restricted to the three values {-1, 0, 1}, representing negative, neutral and positive tendency respectively. The tendency scores of adjectives and adverbs are ultimately applied to nouns and verbs: the tendency score $s(w)$ of a word $w$ may be the mean of the tendency scores of the adjectives or adverbs that modify the noun or verb $w$.
Step 2-3: construct the graph model $G^{(0)} = \langle V, E \cup R, L_v, L_e \rangle$ of the initial document set $D_0$ and use it as the starting point for constructing the event micro evolution graph. Here $V$ is the vertex set; $E$ and $R$ are sets of edges of different types, $E$ being direct connections and $R$ being association connections; $L_v$ is the labeling function of the vertex set $V$, and $L_e$ is the labeling function of the edge set $E$.
A vertex in $V$ represents a noun or verb and can be represented by a triple consisting of the literal value of the word, the set of documents containing the word, and the tendency of the word; the labeling function of a vertex is therefore written as:
$L_v(v) = \langle w_v, D_v, s(w_v) \rangle$, where $w_v$ is the word $w$ corresponding to vertex $v$, $D_v$ is the set of microblog documents containing the word $w$, and $s(w_v)$ is the tendency score of the word $w$, also called the tendency feature value of the vertex $v$.
An edge in $E$ indicates a specific relation between vertices, for example that the words corresponding to the two vertices occur in the same microblog, or that their co-occurrence distance is less than a predefined threshold. The labeling function of an edge consists of the count of co-occurrences of the words of the two vertices and the set of timestamps of the corresponding documents, i.e., the set of publication times of the microblogs that contain both words. That is, for $e = \{v_1, v_2\} \in E$:
$L_e(e) = \langle c(v_1,v_2), t_{v_1 v_2} \rangle$, where $c(v_1,v_2)$ is the number of times the words corresponding to vertices $v_1$ and $v_2$ co-occur in the same microblog, and $t_{v_1 v_2}$ is the set of publication times of the microblogs that contain both words; it contains $c(v_1,v_2)$ document timestamps.
The graph model $G^{(0)}$ of the event evolution starting-point microblog set obtained in this way is also called the micro evolution model of the event at the initial time.
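The vertex and edge labels described above map naturally onto dictionaries. The following is a minimal sketch, not the patented implementation: `build_graph` is a hypothetical name, each document is assumed to arrive as a `(doc_id, timestamp, words)` tuple whose words are already restricted to nouns and verbs, the co-occurrence distance is measured in words, and tendency scores and the association edge set R are left out for brevity.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(docs, max_distance=None):
    """Build vertex and edge labels from (doc_id, timestamp, words) tuples.

    Returns (vertex_docs, edge_count, edge_times):
      vertex_docs[word]  -> set of documents containing the word (D_v)
      edge_count[edge]   -> number of microblogs in which the pair co-occurs
      edge_times[edge]   -> set of publication times of those microblogs
    With max_distance=None any two words in the same microblog are linked.
    """
    vertex_docs = defaultdict(set)
    edge_count = defaultdict(int)
    edge_times = defaultdict(set)
    for doc_id, timestamp, words in docs:
        for word in words:
            vertex_docs[word].add(doc_id)
        linked = set()
        for (i, a), (j, b) in combinations(enumerate(words), 2):
            if a == b:
                continue
            # j - i - 1 is the number of words between the two occurrences.
            if max_distance is not None and j - i - 1 >= max_distance:
                continue
            linked.add(frozenset((a, b)))
        for edge in linked:  # count each word pair once per microblog
            edge_count[edge] += 1
            edge_times[edge].add(timestamp)
    return vertex_docs, edge_count, edge_times


# Hypothetical starting-point set: two microblogs with their nouns/verbs.
docs = [
    (1, 100, ["earthquake", "rescue", "donate"]),
    (2, 105, ["rescue", "donate"]),
]
vertex_docs, edge_count, edge_times = build_graph(docs)
```

Using a `frozenset` of the two words as the edge key makes the edge undirected, matching the notation $e = \{v_1, v_2\}$.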
Step 3: in chronological order, process each remaining microblog one by one: build the graph model of the microblog and add it to the event model of the previous moment, until all microblogs have been processed. The result is the micro graph model of the event's evolution. According to one embodiment of the invention, adding the graph model of the microblog being processed to the existing graph model may follow these steps:
For each edge in the graph model of the microblog being processed:
If both vertices of the edge already exist in the existing graph model and the edge already exists there, increment the occurrence counter of the edge; if the edge does not exist in the existing graph model, copy it into the existing graph model.
If exactly one vertex of the edge appears in the existing graph model, copy the vertex that is not in the existing graph model, together with the edge, into the existing graph model.
If neither vertex of the edge is in the existing graph model, copy the edge and both vertices in full into the existing graph model.
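The three cases above collapse into very little code once vertices are kept in a set and edges in a dictionary of counters. This is a simplified sketch under assumptions: vertices are identified by their word alone (the finer equality test involving forwarding relations and tendency scores is not modelled here), and `merge_edges` is a hypothetical name.

```python
def merge_edges(graph_vertices, graph_edges, new_edges):
    """Merge a microblog's edges into the existing graph in place.

    graph_vertices: set of words already in the graph
    graph_edges:    dict mapping frozenset({a, b}) -> occurrence count
    new_edges:      iterable of (word_a, word_b) pairs from the microblog
    """
    for a, b in new_edges:
        edge = frozenset((a, b))
        # Covers all three cases: endpoints already present are unchanged,
        # missing endpoints are copied in.
        graph_vertices.update(edge)
        # Existing edge: increment its counter; new edge: copy it with count 1.
        graph_edges[edge] = graph_edges.get(edge, 0) + 1


# Hypothetical existing graph and incoming microblog graph.
vertices = {"earthquake", "rescue"}
edges = {frozenset(("earthquake", "rescue")): 3}
merge_edges(vertices, edges, [
    ("earthquake", "rescue"),  # both vertices and the edge exist
    ("rescue", "donate"),      # exactly one vertex exists
    ("school", "rebuild"),     # neither vertex exists
])
```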
Continuing with the microblog set $D$ and the initial document set $D_0$ above, for the remaining document sequence $D - D_0$, take each document $d_i$ in turn and construct its graph model by the method described above, denoted $G_i$. The micro evolution model of the event at this point is denoted $G^{(i)}$; the following steps merge $G_i$ into $G^{(i)}$ to obtain $G^{(i+1)}$.
When adding the graph model of the microblog being processed to the existing graph model, it must be judged whether a vertex is already contained in the existing graph model. For a given vertex, if the graph contains a vertex with the same word, the forwarding indicator function on the document sets of the two vertices evaluates to 1 (true), and the tendency feature values of the two vertices are compatible, then the graph is judged to already contain the given vertex. Compatible tendency feature values means that the difference between the tendency feature value of the vertex in the graph and that of the given vertex is less than a certain threshold. Define the function $Eqv: V \times V \to \{0,1\}$ to judge whether two vertices are equal in this sense.
Further define the function $Mtv: V \times V \to \{0,1\}$; a value of 1 indicates a pair of vertices whose words are identical but whose document sets have no evolution relation (e.g., forwarding or commenting), in which case the two vertices are said to be related.
Here $\varepsilon_s$ is the tendency difference threshold specified in advance; an empirical value is 0.3.
For the graph model $G_i$ of document $d_i$, take each of its edges $e = \{v_1, v_2\} \in E$ and let $v$ denote either of its vertices:
(a) If $G^{(i)}$ contains a vertex $v'$ with $Eqv(v, v') = 1$, merge $v$ with $v'$ and treat the two as the same vertex:
$D_{v'} \leftarrow D_{v'} \cup D_v$
$s(v') \leftarrow (s(v') + s(v))/2$
(b) If $G^{(i)}$ contains a vertex $v'$ with $Mtv(v, v') = 1$, introduce $v$ into the graph as a new vertex and add the edge $r = \{v', v\}$ to the edge set $R$.
(c) If neither condition (a) nor (b) holds, add vertex $v$ directly to the graph $G^{(i)}$.
If at this point $G^{(i)}$ contains no edge $e' = \{v_1, v_2\} \in E$, add the edge; if the edge exists, merge $e$ with $e'$:
$c(v_1,v_2)_{e'} \leftarrow c(v_1,v_2)_{e'} + c(v_1,v_2)_{e}$
$t^{e'}_{v_1 v_2} \leftarrow t^{e'}_{v_1 v_2} \cup t^{e}_{v_1 v_2}$
Repeat the above process until all documents in the set to be processed have been handled; the resulting event micro evolution graph is denoted $G$.
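The vertex-level decision in cases (a) to (c) can be sketched as follows. Everything here is an illustrative reading of the prose rather than the patent's exact definitions: the `Vertex` class, the `place_vertex` function and the `forwarded` callback (standing in for the forwarding indicator on document sets) are hypothetical, and when several vertices share a word the first one that satisfies a case decides the outcome.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    word: str
    docs: set = field(default_factory=set)  # D_v
    score: float = 0.0                      # s(w_v)


def place_vertex(graph, relations, v, forwarded, eps_s=0.3):
    """Insert vertex v into graph (a list of Vertex) and report which case applied."""
    for old in graph:
        if old.word != v.word:
            continue
        if not forwarded(old.docs, v.docs):
            # Same word, no forwarding relation: keep v separate, link it via R.
            graph.append(v)
            relations.append((old, v))
            return "related"
        if abs(old.score - v.score) < eps_s:
            # Same word, forwarding relation, compatible scores: merge into old.
            old.docs |= v.docs
            old.score = (old.score + v.score) / 2
            return "merged"
    graph.append(v)  # neither case applies
    return "added"


rt_pairs = {(1, 2)}  # hypothetical: microblog 2 forwards microblog 1


def forwarded(old_docs, new_docs):
    # Assumed set-level reading: some document in new_docs forwards one in old_docs.
    return any((a, b) in rt_pairs for a in old_docs for b in new_docs)


graph, relations = [Vertex("rescue", {1}, 0.5)], []
s1 = place_vertex(graph, relations, Vertex("rescue", {2}, 0.4), forwarded)
s2 = place_vertex(graph, relations, Vertex("rescue", {3}, 0.4), forwarded)
s3 = place_vertex(graph, relations, Vertex("donate", {3}, 0.0), forwarded)
```

Keeping the "related" vertices separate but linked is what lets the graph record that the same word was used in two independent strands of discussion.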
Step 4, carries out beta pruning, cutting and conversion to event microcosmic evolution diagram, finally gives Macro-event evolution diagram.
Wherein, event microcosmic evolution diagram is carried out beta pruning, can be with co-occurrence number of times in deletion event microcosmic evolution diagram less than referring to Determine the limit of threshold value, and delete therewith and initial figure G(0)Disconnected branch.Event microcosmic evolution diagram is carried out cutting can include right The initial microblogging sequence mentioned in step 1) temporally divides, and difference can be divided into the varigrained time according to demand Sheet.According to one embodiment of present invention, the conversion to event microcosmic evolution diagram refers to event microcosmic evolution diagram is converted into macroscopic view Evolution diagram, comprising: set up the initial vertex of event Macro Evolution figure, this initial vertex is expressed corresponding in microcosmic evolution diagram The subgraph of initial portion;Then, investigate each timeslice successively, choose summit corresponding with this time sheet in microcosmic evolution diagram and Limit, constructs the minimum connected subgraph with this subgraph as base in microcosmic evolution diagram, joins macroscopic view using this subgraph as a summit In evolution diagram, if this subgraph intersects with the subgraph of other vertex correspondence, then in Macro Evolution figure, construct a limit by two tops Point is connected, and gives this limit using the Jaccard coefficient of two subgraphs as eigenvalue.
Again taking the microblog set $D$, the initial document set $D_0$, and the remaining document sequence $D - D_0$ described above as an example, the execution of step 4 is illustrated below according to one embodiment of the present invention.
Step 4-1: a threshold $\varepsilon_{co}$ is given that bounds the minimum co-occurrence count between words (the number of times they appear together in the same microblog); alternatively, a minimum co-occurrence frequency is given, i.e. the required minimum co-occurrence count divided by the total number of documents. Scan each edge of the event micro evolution graph G; for an edge $e = \{v_1, v_2\} \in E$, if $c(v_1,v_2)_{e} \le \varepsilon_{co}$, remove the edge from E. Then, starting from the initial graph $G^{(0)}$, search the connected components of the graph and delete from the vertex set every vertex that is not connected to the initial graph.
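A minimal sketch of this pruning step, assuming the graph is stored as a mapping from unordered vertex pairs to co-occurrence counts. The function name `prune` and the data layout are hypothetical; the breadth-first search is one common way to find the part of the graph still connected to the initial vertices, and the source does not prescribe a particular search.

```python
from collections import defaultdict, deque


def prune(edge_counts, initial_vertices, eps_co):
    """Drop edges with count <= eps_co, then drop everything that is no
    longer connected to the initial graph's vertices."""
    # Keep only edges whose co-occurrence count exceeds the threshold.
    kept = {e: c for e, c in edge_counts.items() if c > eps_co}

    # Build an adjacency map over the surviving edges.
    adj = defaultdict(set)
    for e in kept:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    # Breadth-first search starting from the initial graph.
    reachable = set(initial_vertices)
    queue = deque(initial_vertices)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in reachable:
                reachable.add(w)
                queue.append(w)

    # Keep only edges whose two endpoints are both still reachable.
    kept = {e: c for e, c in kept.items() if e <= reachable}
    return reachable, kept
```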
Step 4-2: divide the document sequence $D - D_0$ into time slices. The time slices can be divided in different ways as required, including the following methods:
(a) specify a fixed time interval, for example dividing by hour or by day;
(b) compute the time span of the initial document set $D_0$ and use it as the fixed interval for dividing time slices;
(c) cluster the document sequence according to how densely the documents are distributed in time, forming time slices of unequal length.
The sequence of time slices obtained in this step is denoted $T = \{T_1, T_2, \ldots, T_m\}$; each time slice contains one or more documents.
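Method (a), division at a fixed interval, might look like the following sketch. The function name, the `(timestamp, document id)` tuple layout, and the use of numeric timestamps in seconds are assumptions made for illustration; empty intervals are skipped so that every returned slice contains at least one document.

```python
def slice_fixed_interval(docs, interval):
    """Group (timestamp, doc_id) pairs into consecutive fixed-width time slices."""
    docs = sorted(docs, key=lambda d: d[0])  # order documents by time
    if not docs:
        return []
    t0 = docs[0][0]  # the first document opens the first slice
    slices = {}
    for t, doc_id in docs:
        index = int((t - t0) // interval)  # which interval this document falls in
        slices.setdefault(index, []).append(doc_id)
    # Return the non-empty slices in chronological order.
    return [slices[k] for k in sorted(slices)]
```

Method (b) could reuse the same function by passing the time span of the initial document set as `interval`.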
Step 4-3: create the event macro evolution graph, which consists of the vertex set $V_\Psi$ of the macro evolution graph, the edge set $E_\Psi$, a labeling function on the vertex set $V_\Psi$, and a labeling function on the edge set $E_\Psi$. Create a vertex $v_0 \in V_\Psi$ and label it as corresponding to the initial graph $G^{(0)}$.
Step 4-4: examine each time slice in the time slice set T in turn. For $T_i \in T$:
Select the vertex set and edge set of graph G that fall within time slice $T_i$, denoted V and E respectively. To speed up this query, the timestamps attached to the vertices and edges, as described above, can be used to select the vertex set and edge set within a time slice.
Mark the maximal connected components of the vertex set V and, taking these maximal connected components as the base, construct the minimum connected subgraph $G_V$ of graph G that contains V. According to one embodiment of the present invention, the method includes:
(a) use Dijkstra's algorithm to find the shortest path between any two vertices that lie in two different maximal connected components;
(b) from these shortest paths, select the shortest one, and add all vertices and edges on that path to the subgraph;
(c) repeat steps (a) and (b) until the subgraph is fully connected.
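Steps (a) to (c) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: edges are unordered pairs (`frozenset`), every edge has unit length, and all function names are hypothetical. Instead of running Dijkstra's algorithm separately for every vertex pair, the sketch starts it from all vertices of one component at once, which finds the same minimum over pairs.

```python
import heapq
from itertools import count


def _adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def _components(vertices, edges):
    """Connected components of (vertices, edges), as a list of vertex sets."""
    adj = _adjacency(edges)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            for w in adj.get(stack.pop(), ()):
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def _dijkstra(adj, sources):
    """Dijkstra with unit edge lengths, started from every vertex in `sources`."""
    tie = count()  # tie-breaker so the heap never compares vertices directly
    dist = {s: 0 for s in sources}
    prev = {}
    heap = [(0, next(tie), s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for w in adj.get(u, ()):
            if d + 1 < dist.get(w, float("inf")):
                dist[w], prev[w] = d + 1, u
                heapq.heappush(heap, (d + 1, next(tie), w))
    return dist, prev


def min_connected_subgraph(graph_edges, sub_vertices, sub_edges):
    """Greedily connect the selected vertices/edges using shortest paths in G."""
    adj = _adjacency(graph_edges)
    verts, edges = set(sub_vertices), set(sub_edges)
    while True:
        comps = _components(verts, edges)
        if len(comps) <= 1:
            return verts, edges  # (c) the subgraph is connected
        best = None
        for comp in comps:  # (a) shortest paths leaving each component
            dist, prev = _dijkstra(adj, comp)
            for target in verts - comp:
                if target in dist and (best is None or dist[target] < len(best) - 1):
                    path = [target]
                    while path[-1] in prev:
                        path.append(prev[path[-1]])
                    best = path
        if best is None:
            raise ValueError("selected vertices cannot be connected within the graph")
        # (b) add the vertices and edges of the shortest path found
        verts.update(best)
        edges.update(frozenset(pair) for pair in zip(best, best[1:]))
```

The greedy procedure yields a connected subgraph containing V; whether it is always the smallest possible one is not claimed here.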
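Create a vertex $v$ and add it to $V_\Psi$, labeling it with the subgraph $G_V$. For every other vertex $v' \in V_\Psi - \{v\}$, if the subgraph of $v$ intersects the subgraph of $v'$, create an edge $e = \{v, v'\}$, add it to $E_\Psi$, and label it with the Jaccard coefficient of vertices $v$ and $v'$, which is computed as:
$$\frac{\#(G_v \cap G_{v'})}{\#(G_v \cup G_{v'})}$$
where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and the union, respectively, of the vertex sets of the two micro evolution subgraphs, and the function $\#(\cdot)$ returns the number of elements in a set.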
Repeat step 4-4 until all time slices have been processed, at which point the event macro evolution graph is complete. The evolution of the event's features can then be observed on the basis of the event macro evolution graph.
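The construction of the macro evolution graph from the per-slice subgraphs can be sketched as below. The sketch assumes each macro vertex is identified by its position in the input list and is labeled with the vertex set of its micro subgraph, with the initial graph's vertex set in position 0; the function name and the return layout are hypothetical.

```python
def build_macro_graph(subgraph_vertex_sets):
    """Build macro vertices and Jaccard-weighted edges.

    subgraph_vertex_sets[0] is the vertex set of the initial graph; each
    later entry is the vertex set of one time slice's connected subgraph.
    """
    subgraphs = [set(g) for g in subgraph_vertex_sets]
    edges = {}
    for i, g_v in enumerate(subgraphs):
        # Compare the new macro vertex with every macro vertex created so far.
        for j in range(i):
            common = g_v & subgraphs[j]
            if common:  # the two subgraphs intersect
                # Jaccard coefficient: #(intersection) / #(union)
                edges[(j, i)] = len(common) / len(g_v | subgraphs[j])
    return list(range(len(subgraphs))), edges
```

Subgraphs that share no vertex with any other subgraph become isolated macro vertices.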
The micro evolution graph of an event takes words as its granularity. It mainly reflects how knowledge about the event keeps expanding as the event develops and changes, and its edge construction rules capture the inheritance and evolution of that knowledge; in terms of interpretability it is therefore superior to traditional evolution analysis methods based purely on word similarity. However, the micro evolution graph has many nodes and complex connections, which makes it suitable for machine computation but ill-suited to human inspection. The macro evolution graph refined from the micro evolution graph takes time slices as its granularity, so its numbers of nodes and edges are correspondingly much smaller, which makes it suitable for human inspection. In addition, the observation granularity can be changed by adjusting the time slice size, which amounts to zooming the macro evolution graph. The evolution of the event's features can thus be observed conveniently on the basis of the event macro evolution graph.
In yet another embodiment of the present invention, a microblog-based event feature evolution mining system is also provided, comprising: a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, so as to form the event evolution starting-point microblog set; a device for constructing, by the method discussed above, the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph; a device for building, by the method discussed above, the graph model of each remaining microblog in the set of microblog texts related to the event to be analyzed and adding it to the current event micro evolution graph; and a device for obtaining, by the method discussed above, the event macro evolution graph from the final event micro evolution graph and observing the evolution of the event's features on the basis of the event macro evolution graph.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from its spirit and scope shall all be covered by the claims of the present invention.

Claims (7)

1. A microblog-based event feature evolution mining method, comprising the following steps:
Step 1: select, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set, wherein a microblog representing the origin of the event has the following characteristics: a) it was posted early; b) it is an original microblog, not a forwarded or commenting microblog;
Step 2: construct the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph; the vertices of the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting-point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices occur together in the same microblog or that their co-occurrence distance is less than a previously given threshold;
Step 3: for each remaining microblog in the set of microblog texts related to the event to be analyzed, build the graph model of the microblog and add it to the current event micro evolution graph;
Step 4: slice and transform the event micro evolution graph obtained through step 3 to obtain the event macro evolution graph, and observe the evolution of the event's features on the basis of the event macro evolution graph;
wherein adding the graph model of the constructed microblog to the current event micro evolution graph in step 3 comprises:
for each edge in the graph model of the microblog being processed:
a) if both vertices of the edge are present in the current event micro evolution graph and the event micro evolution graph already has the edge, the occurrence count of the edge is accumulated; if the event micro evolution graph does not have the edge, the edge is copied into the event micro evolution graph;
b) if one and only one vertex of the edge appears in the current event micro evolution graph, the vertex that is not in the event micro evolution graph is copied into it together with the edge;
c) if neither vertex of the edge is in the current event micro evolution graph, the edge and both of its vertices are copied in full into the event micro evolution graph;
wherein slicing and transforming the event micro evolution graph in step 4 comprises:
Step 4-1) sort the microblog texts related to the event to be analyzed by time and slice the resulting microblog text sequence by time, forming time slices of the desired granularity;
Step 4-2) create a vertex in the event macro evolution graph corresponding to the initial event micro evolution graph;
Step 4-3) perform the following steps for each time slice:
4-3-a) select, in turn, the vertices and edges of the event micro evolution graph that correspond to the time slice, and construct the minimum connected subgraph based on the subgraph they form;
4-3-b) create a vertex in the event macro evolution graph corresponding to this minimum connected subgraph; if the minimum connected subgraph intersects the subgraph corresponding to another vertex in the event macro evolution graph, create an edge connecting the two subgraphs.
2. The method according to claim 1, wherein each vertex of the graph model in step 2 is represented by a triple consisting of the noun/verb corresponding to the vertex, the set of microblog documents containing that noun/verb, and the tendency score of the noun/verb, wherein the tendency score of the noun/verb is the average of the tendency scores of the adjectives and adverbs that modify it.
3. The method according to claim 2, wherein step 2 comprises:
Step 2-1) perform word segmentation and part-of-speech tagging on every microblog text in the event evolution starting-point microblog set;
Step 2-2) assign a tendency score to each adjective and adverb obtained from word segmentation;
Step 2-3) for each noun and verb obtained from word segmentation, average the tendency scores of the adjectives and adverbs that modify the same noun/verb, and use the result as the tendency score of that noun or verb;
Step 2-4) take the nouns and verbs as vertices; if the words corresponding to any two vertices occur together in the same microblog, or their co-occurrence distance is less than a previously given threshold, create an edge between the two vertices.
4. The method according to claim 3, wherein step 3 further comprises a step of judging whether a vertex in the graph model of the microblog is present in the event micro evolution graph, comprising: for a given vertex in the graph model of the microblog, if the event micro evolution graph includes a vertex with the same word as the given vertex, the microblog has a forwarding or commenting relation with a microblog text involved in that corresponding vertex of the event micro evolution graph, and the tendency scores of the two vertices are compatible, then the event micro evolution graph is judged to already contain the given vertex, wherein compatible tendency scores means that the difference between the tendency score of the corresponding vertex in the event micro evolution graph and that of the given vertex is less than a certain threshold.
5. The method according to claim 1, wherein step 4-3) further comprises assigning a weight to the created edge connecting the two subgraphs, the weight of the edge being the Jaccard coefficient of the subgraphs corresponding to the two vertices; wherein for any two vertices $v$ and $v'$ in the event macro evolution graph, the Jaccard coefficient of their corresponding subgraphs is computed as $\frac{\#(G_v \cap G_{v'})}{\#(G_v \cup G_{v'})}$, where $G_v \cap G_{v'}$ and $G_v \cup G_{v'}$ denote the intersection and the union, respectively, of the vertex sets of the subgraphs corresponding to the two vertices, and the function $\#(\cdot)$ returns the number of elements in a set.
6. The method according to claim 1, wherein step 4 further comprises a step of pruning the event micro evolution graph, which includes deleting the edges of the event micro evolution graph whose occurrence count is less than a given threshold and then deleting the branches that are not connected to the initial event micro evolution graph, wherein the occurrence count of an edge is the number of times the words corresponding to the two vertices of the edge occur together in the same microblog within the set of microblog texts related to the event to be analyzed.
7. A microblog-based event feature evolution mining system, comprising:
a device for selecting, from the set of microblog texts related to the event to be analyzed, several microblogs that represent the origin of the event, to form the event evolution starting-point microblog set, wherein a microblog representing the origin of the event has the following characteristics: a) it was posted early; b) it is an original microblog, not a forwarded or commenting microblog;
a device for constructing the graph model of the event evolution starting-point microblog set as the initial event micro evolution graph; the vertices of the graph model are the nouns/verbs appearing in the microblog texts of the event evolution starting-point microblog set, and an edge between two vertices indicates that the words corresponding to the two vertices occur together in the same microblog or that their co-occurrence distance is less than a previously given threshold;
a device for building, for each remaining microblog in the set of microblog texts related to the event to be analyzed, the graph model of the microblog and adding it to the current event micro evolution graph;
a device for slicing and transforming the final event micro evolution graph to obtain the event macro evolution graph and observing the evolution of the event's features on the basis of the event macro evolution graph;
wherein adding the graph model of the constructed microblog to the current event micro evolution graph comprises:
for each edge in the graph model of the microblog being processed:
a) if both vertices of the edge are present in the current event micro evolution graph and the event micro evolution graph already has the edge, the occurrence count of the edge is accumulated; if the event micro evolution graph does not have the edge, the edge is copied into the event micro evolution graph;
b) if one and only one vertex of the edge appears in the current event micro evolution graph, the vertex that is not in the event micro evolution graph is copied into it together with the edge;
c) if neither vertex of the edge is in the current event micro evolution graph, the edge and both of its vertices are copied in full into the event micro evolution graph;
wherein slicing and transforming the event micro evolution graph comprises:
sorting the microblog texts related to the event to be analyzed by time and slicing the resulting microblog text sequence by time, forming time slices of the desired granularity;
creating a vertex in the event macro evolution graph corresponding to the initial event micro evolution graph; and performing the following steps for each time slice:
i) select, in turn, the vertices and edges of the event micro evolution graph that correspond to the time slice, and construct the minimum connected subgraph based on the subgraph they form;
ii) create a vertex in the event macro evolution graph corresponding to this minimum connected subgraph; if the minimum connected subgraph intersects the subgraph corresponding to another vertex in the event macro evolution graph, create an edge connecting the two subgraphs.
CN201310532377.7A 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs Active CN103631862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2012104337138 2012-11-02
CN201210433713 2012-11-02
CN201210433713.8 2012-11-02
CN201310532377.7A CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Publications (2)

Publication Number Publication Date
CN103631862A CN103631862A (en) 2014-03-12
CN103631862B true CN103631862B (en) 2017-01-11

Family

ID=50212904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310532377.7A Active CN103631862B (en) 2012-11-02 2013-10-31 Event characteristic evolution excavation method and system based on microblogs

Country Status (1)

Country Link
CN (1) CN103631862B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536956A (en) * 2014-07-23 2015-04-22 中国科学院计算技术研究所 A Microblog platform based event visualization method and system
CN104933129B (en) * 2015-06-12 2019-04-30 百度在线网络技术(北京)有限公司 Event train of thought acquisition methods and system based on microblogging
CN104899908B (en) * 2015-06-12 2018-09-11 百度在线网络技术(北京)有限公司 The method and apparatus for generating event group evolution diagram
CN106708947B (en) * 2016-11-25 2020-06-09 成都寻道科技有限公司 Web article forwarding and identifying method based on big data
CN109145224B (en) * 2018-08-20 2021-11-23 电子科技大学 Social network event time sequence relation analysis method
CN110472105A (en) * 2019-08-06 2019-11-19 电子科技大学 A kind of social networks event evolution method for tracing divided based on the time
CN110781317B (en) * 2019-10-29 2022-03-01 北京明略软件系统有限公司 Method and device for constructing event map and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102214241A (en) * 2011-07-05 2011-10-12 清华大学 Method for detecting burst topic in user generation text stream based on graph clustering
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field

Also Published As

Publication number Publication date
CN103631862A (en) 2014-03-12

Similar Documents

Publication Publication Date Title
CN103631862B (en) Event characteristic evolution excavation method and system based on microblogs
Bontcheva et al. Making sense of social media streams through semantics: a survey
Zhang et al. Event detection and popularity prediction in microblogging
Wang et al. SentiView: Sentiment analysis and visualization for internet popular topics
CN106484764A (en) User's similarity calculating method based on crowd portrayal technology
Hahmann et al. Twitter location (sometimes) matters: Exploring the relationship between georeferenced tweet content and nearby feature classes
US20130304818A1 (en) Systems and methods for discovery of related terms for social media content collection over social networks
CN103793489B (en) Method for discovering topics of communities in on-line social network
CN103279887B (en) A kind of microblogging based on information theory propagates visual analysis method
Andryani et al. Social media analytics: data utilization of social media for research
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
CN105912656A (en) Construction method of commodity knowledge graph
CN104536956A (en) A Microblog platform based event visualization method and system
CN110457404A (en) Social media account-classification method based on complex heterogeneous network
CN112771564A (en) Artificial intelligence engine that generates semantic directions for web sites to map identities for automated entity seeking
CN104765729A (en) Cross-platform micro-blogging community account matching method
CN105183717A (en) OSN user emotion analysis method based on random forest and user relationship
CN105721279A (en) Relationship circle excavation method and system of telecommunication network users
Ou et al. Exploiting community emotion for microblog event detection
CN107239512A (en) The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination
CN105378717A (en) Method for user categorization in social media, computer program, and computer
CN104346408A (en) Method and equipment for labeling network user
Jabeur et al. Uprising microblogs: A Bayesian network retrieval model for tweet search
CN109992784A (en) A kind of heterogeneous network building and distance metric method for merging multi-modal information
CN106503858A (en) A kind of method that trains for predicting the model of social network user forwarding message

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant