CN102411611B - Instant interactive text oriented event identifying and tracking method - Google Patents

Instant interactive text oriented event identifying and tracking method

Info

Publication number
CN102411611B
Authority
CN
China
Prior art keywords
words
stamp
wheel
feature
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110312540
Other languages
Chinese (zh)
Other versions
CN102411611A (en)
Inventor
Tian Feng (田锋)
Zheng Qinghua (郑庆华)
Zhang Huisan (张惠三)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN 201110312540
Publication of CN102411611A
Application granted
Publication of CN102411611B
Legal status: Expired - Fee Related (current)
Anticipated expiration


Abstract

The invention discloses an event identification and tracking method for instant interactive text, which comprises two stages. I. In the turn-level topic-category classification stage, the content of each dialogue turn is represented by an adaptive language-feature aggregation vector, and its topic category is classified with a supervised hierarchical probabilistic latent semantic analysis (PLSA) model obtained by training. II. In the turn-level event identification and tracking stage, the beginning, continuation and end of an event are judged from the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network. In particular, (1) the invention proposes adaptively adjusting the turn time-closeness threshold Th according to the fluctuation of the turn-rate time series around the current turn, and then performing the adaptive language-feature aggregation; and (2) the supervised hierarchical PLSA model is updated periodically during operation. The provided method is an online identification and tracking algorithm.

Description

Event identification and tracking method for instant interactive text
Technical field
The present invention relates to information retrieval, extraction and management and to natural language processing technology, and in particular to an event identification and tracking method for online instant interactive text.
Background art
With the increasingly widespread use of Internet technology, network applications based on interactive text, such as Internet chat rooms and microblogs, have become one of the main means by which people obtain and publish information. These texts contain abundant information resources, and searching, organizing and exploiting the events in such interactive-text applications by topic category has become a pressing task: for example, automatically recognizing a network learner's emotional change events so as to adjust learning efficiency, or identifying socially sensitive incidents and new events. A novelty search by the applicant retrieved no patents related to the present invention, but did find several similar articles:
1) Message-text clustering research based on frequent patterns. Hu Jixiang, Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology).
2) A term-weight computation method CDTF_IDF for chat vocabulary. Gao Peng, Cao Xianbin, Computer Simulation, 2007.12.
The author of article 1) observed that frequent patterns (termed key frequent patterns) are the key to feature extraction from interactive text, since they carry additional semantic information such as word order and adjacent context, and proposed an unsupervised frequent-pattern-based feature selection algorithm applied to text classification and clustering.
Article 2) mainly targets content supervision in chat rooms: it computes the term weights of chat data by offline calculation of each word's weight in different data sources, aggregation, and raising the weights of key words, so as to identify chat-room topics.
According to the above novelty search, existing similar techniques differ from the method of this invention in the following respects:
1. The research object of the prior art is a whole news item (event) or paragraph, whereas this method works at the level of dialogue turns.
2. The prior art offers offline topic identification; this method is an online event identification method.
3. The prior art only identifies which topic a whole news item (event) or paragraph belongs to and the occurrence of related news (events), i.e. identification and tracking at the topic level; this method determines whether the event discussed by the interacting parties is the same one, whether that event is complete (has begun and ended), and who participates in it, i.e. identification and tracking of a single, concrete event.
4. For the feature representation of interactive text, the prior art collects data offline and computes only the term-frequency features of the current news item (event); this method exploits the time-dependence property and aggregates the features of all dialogue turns within a time threshold before topic classification.
5. Existing methods are mainly based on unsupervised probabilistic latent semantic analysis; this method proposes a supervised, hierarchical PLSA topic-model training method for a topic hierarchy, and updates the topic model periodically.
Summary of the invention
In view of the problems in the aforementioned related art, the present invention provides an event identification and tracking method for online instant interactive text, comprising the following steps:
The first step: the turn-level topic-category classification stage:
(1) In instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group;
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) Use the supervised hierarchical probabilistic latent semantic analysis (PLSA) model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) Judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) If turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) Judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained;
The computation of the adaptive language-feature aggregation vector in step (4) of the first step is:
Step1: After the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: Adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
The supervised hierarchical PLSA model used in step (5) of the first step and step (3) of the second step is trained as follows:
Step1: Organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted:
topic_k^level(a_k)
where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, then topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; here mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0,
topic^0 = {topic_1^0, ..., topic_{a_0}^0}
denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
The training data are then organized as follows:
Step1.1: Generate the feature-word vector W, as follows:
Step1.1.1: Count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector
W~ = (w~_1, ..., w~_f, ..., w~_ñ)
where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: Apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: Generate the co-occurrence matrix N, as follows:
Step1.2.1: Collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
Step2: On this training data set, train the corresponding PLSA models layer by layer in a top-down manner, as follows:
Step2.1: Apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: Apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: Use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0.
In the above method, the procedure by which step (5) of the first step uses the supervised hierarchical PLSA model to classify the turn-level topic category is:
Step1: Compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: Use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: Use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3.
The detailed procedure of step (1) of the second step is as follows:
Step1: Retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: If T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: Judge whether T_i and T_l have the same topic category;
Step4: If T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: If l ≥ g, go to Step3; otherwise, go to Step6;
Step6: If the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: Compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: If d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: If l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
The social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
Description of drawings
Fig. 1: flow chart of the event identification and tracking of the present invention.
Fig. 2: flow chart of the incremental training process.
Fig. 3: flow chart of turn topic-category classification.
Fig. 4: principle of turn-level event identification and tracking.
Fig. 5: example of the social network closeness computation, where Fig. 5a is the raw-data graph and Fig. 5b is the directed graph after transformation.
Embodiment
For a clearer understanding of the present invention, it is described in further detail below with reference to the accompanying drawings.
1. The present invention adopts a mechanism of first identifying the turn-level topic category and then performing turn-level event identification and tracking, in order to identify and track events in the speech input by users in instant interactive text; its flow chart is shown in Fig. 1.
Research purpose: to assign the Turn input by a user to the corresponding event.
Research background: compared with single documents such as blogs, comments and novels, instant interactive text, besides inheriting the ambiguity and non-standard usage of natural language text, has its own distinctive linguistic characteristics:
(1) interactivity;
(2) time-series behaviour: most interaction is close to real time, and the topics of turns are time-dependent, i.e. the closer in time two utterances are, the more likely their topics are related;
(3) each turn has little content and short sentences, which inevitably makes features sparse;
(4) complex interaction modes, for example one-to-one, one-to-many and many-to-many;
(5) varied and terse language, with many spelling errors, non-standard terms and much noise.
All of these pose considerable challenges for interactive-text processing techniques.
In view of these particular characteristics of interactive text, a "two-step" strategy is proposed. The specific working mechanism is as follows:
The first step performs topic-category classification: in this step the topic category of each turn is identified, such as politics, economy, culture, education, science and technology; the process is as follows:
(1) In instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group; an illustrative data structure for this five-tuple is sketched below.
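The patent specifies no implementation language; purely as an illustration, the five-tuple can be carried by a small data structure whose fields mirror the tuple components defined above (the class and field names are ours, not the patent's):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One dialogue turn T_i = (i, id, role, stamp, content)."""
    i: int        # sequence number of the turn, i >= 1
    id: str       # unique identifier of the speaker
    role: str     # "speaker" or "recipient"
    stamp: float  # timestamp at which the turn occurred, in seconds
    content: str  # all the text uttered in this turn
```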
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) Use the supervised hierarchical PLSA model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) Judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) If turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) Judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained, as shown in Fig. 2.
2. The topic-category classification mechanism for turns
Research purpose: to gather the turns related to the current turn and use their language-feature aggregation vector as the language feature vector of the current turn.
Research background: in interactive text each turn has little content and short sentences, which inevitably makes features sparse; computing the language-feature aggregation vector of the current turn overcomes these shortcomings to some extent.
For the topic-category classification of turns, the present invention adopts an adaptive topic-category classification mechanism, whose flow chart is shown in Fig. 3; the specific working mechanism is as follows:
(1) Compute the language-feature aggregation vector of the current turn T_i; the process is as follows (a code sketch follows Step3):
Step1: After the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: Adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
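A minimal Python sketch of Steps 1-3 above, assuming turns arrive sorted by stamp, timestamps in seconds, and per-turn feature vectors already computed; the helper names turn_rate, update_threshold and aggregate_features are ours, not the patent's:

```python
import numpy as np

DELTA_T = 3600.0          # ΔT: fixed measurement window, initialized to 1 hour
DELTA_V = 0.3             # Δv: tolerated fluctuation of the turn rate
INITIAL_TH = 6 * 3600.0   # Th: time-closeness threshold, initially 6 hours

def turn_rate(turns, i):
    """Step1: V(T_i), number of turns per ΔT in [T_i.stamp - ΔT, T_i.stamp]."""
    t_i, t_1 = turns[i].stamp, turns[0].stamp
    lo = t_1 if t_i - t_1 < DELTA_T else t_i - DELTA_T
    return sum(1 for t in turns[:i + 1] if lo <= t.stamp <= t_i) / DELTA_T

def update_threshold(th, v_now, v_prev):
    """Step2: shrink Th when the turn rate surges, grow it when the rate falls."""
    if v_now >= (1 + DELTA_V) * v_prev:
        return (1 - DELTA_V) * th
    if v_now <= (1 - DELTA_V) * v_prev:
        return (1 + DELTA_V) * th
    return th

def aggregate_features(turns, i, th, features):
    """Step3: W_{T_i} = sum of feature vectors of turns in [T_i.stamp - Th, T_i.stamp]."""
    t_i = turns[i].stamp
    return np.sum([features[j] for j in range(i + 1)
                   if t_i - th <= turns[j].stamp <= t_i], axis=0)
```

Shrinking Th when the turn rate surges keeps the aggregation window from mixing many fast-moving discussions, while growing it when traffic is sparse retains enough context to offset feature sparsity.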
(2) Training the supervised hierarchical PLSA model
Research purpose: to organize and train the training data hierarchically according to the topic hierarchy.
Research background: topics are hierarchical in nature, and such hierarchies abound in real applications, for example the Ministry of Education subject classification used in book cataloguing. Organizing and training the data with this model, from abstract topics down to fine-grained core events, effectively mitigates the imbalance of the data.
The detailed process is as follows:
Step1: Organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted:
topic_k^level(a_k)
where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, then topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; here mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0,
topic^0 = {topic_1^0, ..., topic_{a_0}^0}
denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
The training data are then organized as follows:
Step1.1: Generate the feature-word vector W, as follows:
Step1.1.1: Count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector
W~ = (w~_1, ..., w~_f, ..., w~_ñ)
where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: Apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: Generate the co-occurrence matrix N, as follows (a sketch using standard text utilities follows Step1.2.2):
Step1.2.1: Collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
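For concreteness, the counting and TFIDF weighting of Steps 1.1-1.2 (and of Step 2.1 below) can be sketched with scikit-learn's standard text utilities; the library choice, function name and caller-supplied stop-word list are our assumptions, not the patent's:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def build_cooccurrence(documents, stop_words):
    """Build the word-document count matrix N and its TFIDF-weighted version."""
    vec = CountVectorizer(stop_words=stop_words)
    counts = vec.fit_transform(documents)                # c(w_r, d_s), docs x words
    weighted = TfidfTransformer().fit_transform(counts)  # Step 2.1 weighting
    # transpose to the n x m_k (words x documents) layout used in the text
    return counts.T, weighted.T, vec.get_feature_names_out()
```

The 0.1 weight cut-off of Step1.1.2 would be applied afterwards by dropping the columns of low-weight words.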
Step2: On this training data set, train the corresponding PLSA models layer by layer in a top-down manner; the process is as follows:
The probabilistic latent semantic analysis model is a topic model, and its principle is as follows:
Let D = {d_1, d_2, ..., d_m} denote the document collection and W = {w_1, w_2, ..., w_n} the feature-word set, where m is the number of documents and n the number of feature words; ignoring the order in which words occur in documents, one can generate a co-occurrence matrix N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document. The joint density model is defined as:
p(d, w) = p(d) p(w|d),   p(w|d) = Σ_{z∈Z} p(w|z) p(z|d)
where z ∈ Z = (z_1, z_2, ..., z_Q) is the latent semantic space and Q is its size.
The model is interpreted as follows: p(d) is the probability that document d occurs in the data set; p(w|z) gives, once the latent semantic z is fixed, the probabilities with which the related words w occur; and p(z|d) is the distribution of latent semantics within document d. With these definitions one obtains a generative model that can produce new data (an EM fitting sketch follows the three sampling steps below):
(1) first select a document d by random sampling from the distribution p(d);
(2) with the document selected, sample the latent semantic z expressed by the document according to p(z|d);
(3) with the latent semantic selected, select the words of the document according to p(w|z).
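The patent does not spell out how the PLSA probabilities are fitted; the standard approach is expectation-maximization, so the following is a sketch under that assumption. N is the n × m word-document count matrix and Q the size of the latent semantic space; the returned matrices correspond, up to the joint/conditional convention, to WZ and DZ:

```python
import numpy as np

def plsa(N, Q, iters=50, seed=0):
    """Standard EM for PLSA on an n-words x m-docs count matrix N."""
    rng = np.random.default_rng(seed)
    n, m = N.shape
    p_w_z = rng.random((n, Q)); p_w_z /= p_w_z.sum(axis=0)  # p(w|z), columns sum to 1
    p_z_d = rng.random((Q, m)); p_z_d /= p_z_d.sum(axis=0)  # p(z|d), columns sum to 1
    for _ in range(iters):
        # E-step: posterior p(z|d,w) proportional to p(w|z) p(z|d)
        post = np.einsum('nq,qm->qnm', p_w_z, p_z_d)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts c(w,d) p(z|d,w)
        ec = N[None, :, :] * post                           # shape (Q, n, m)
        p_w_z = ec.sum(axis=2).T                            # sum over d -> (n, Q)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = ec.sum(axis=1)                              # sum over w -> (Q, m)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d                                     # roughly WZ and DZ
```

The dense (Q, n, m) posterior is fine for small matrices; a production implementation would iterate only over the nonzero entries of N.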
According to the above theory, the process of training the PLSA topic model is:
Step2.1: Apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: Apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: Use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0. The experiments use the LIBSVM multi-class classifier written by Professor Chih-Jen Lin of National Taiwan University; a training sketch follows.
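A sketch of Step3, training one layer's classifier on the columns of DZ (each document's distribution over latent semantics) with that layer's topic labels; we use scikit-learn's SVC, which is backed by the same libsvm library as the LIBSVM tool named above, and the variable names are ours:

```python
from sklearn.svm import SVC

def train_layer_classifier(DZ, labels):
    """Supervised-PLSA classifier for one layer: fit an SVM on p(z|d) features."""
    clf = SVC(kernel='linear')  # libsvm-backed; multi-class via one-vs-one
    clf.fit(DZ.T, labels)       # one row per document: its Q-dim p(z|d) vector
    return clf
```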
(3) The hierarchical classification mechanism
Research purpose: to classify dialogue turns using the supervised hierarchical PLSA model.
Research background: turns are classified hierarchically; if the turn's category belongs to a mother-node topic, classification continues downward, otherwise classification stops and the turn's topic category is marked. The hierarchical classification process is as follows (a sketch of the descent follows Step4):
Step1: Compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: Use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: Use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3.
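A sketch of this hierarchical descent, assuming a dict classifiers mapping each mother node to its trained SVM, a dict wz mapping each mother node to its layer's WZ matrix, and the set mon_topics of non-leaf categories (all of these names are our assumptions):

```python
import numpy as np

def classify_turn(w_ti, classifiers, wz, mon_topics, root='top'):
    """Descend the topic tree: project, classify, repeat until a leaf topic."""
    node = root
    while True:
        z_ti = w_ti @ wz[node]                    # Z_{T_i} = W_{T_i} x WZ
        topic = classifiers[node].predict(z_ti.reshape(1, -1))[0]
        if topic not in mon_topics:               # leaf reached: final category
            return topic
        node = topic                              # level += 1, re-classify below
```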
3. The identification and splicing mechanism for turn-level events
Research purpose: to assign turns to the corresponding events.
Research background: for the identification and tracking of concrete events, the present invention adopts a mechanism that combines the topic category of a turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network to judge the beginning, continuation and end of an event; its principle is shown in Fig. 4, and the specific working mechanism is as follows (a code sketch follows Step9):
Step1: Retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: If T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: Judge whether T_i and T_l have the same topic category;
Step4: If T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: If l ≥ g, go to Step3; otherwise, go to Step6;
Step6: If the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: Compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: If d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: If l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
The social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
To better understand the computation of social network closeness, an example is given here; referring to Fig. 5, the raw data are converted into a directed graph, and the closeness of A and B on the social network is then:
d(A, B) = 5 / (5 + 5 + 3 + 4) = 0.294
A sketch of this computation follows.
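A sketch of the closeness computation itself, keeping the directed interaction graph as a nested dict talk[a][b] counting how many times a has spoken to b over the history (this representation is our assumption); with counts matching Fig. 5 it reproduces d(A, B) = 5/17 ≈ 0.294:

```python
def closeness(talk, a, b):
    """d(a, b) = IO(a, b) / (I(a) + O(a) + I(b) + O(b)) on the talk graph."""
    out_deg = lambda x: sum(talk.get(x, {}).values())            # O(x)
    in_deg = lambda x: sum(m.get(x, 0) for m in talk.values())   # I(x)
    io = talk.get(a, {}).get(b, 0) + talk.get(b, {}).get(a, 0)   # IO(a, b)
    return io / (in_deg(a) + out_deg(a) + in_deg(b) + out_deg(b))
```

Binding talk with functools.partial yields the two-argument closeness(a, b) used in the event-assignment sketch above.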

Claims (2)

1. An event identification and tracking method for instant interactive text, characterized in that it comprises the following steps:
The first step: the turn-level topic-category classification stage:
(1) in instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group;
(2) preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) if turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) use the supervised hierarchical probabilistic latent semantic analysis (PLSA) model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) if turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained;
the computation of the adaptive language-feature aggregation vector in step (4) of the first step being:
Step1: after the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
The supervised hierarchical PLSA model used in step (5) of the first step and step (3) of the second step is trained as follows:
Step1: organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted topic_k^level(a_k), where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0, topic^0 = {topic_1^0, ..., topic_{a_0}^0} denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
the training data are then organized as follows:
Step1.1: generate the feature-word vector W, as follows:
Step1.1.1: count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector W~ = (w~_1, ..., w~_f, ..., w~_ñ), where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: generate the co-occurrence matrix N, as follows:
Step1.2.1: collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: the word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
Step2: on this training data set, train the corresponding PLSA models layer by layer in a top-down manner, as follows:
Step2.1: apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0;
The procedure by which step (5) of the first step uses the supervised hierarchical PLSA model to classify the turn-level topic category is:
Step1: compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: if the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3;
The detailed procedure of step (1) of the second step is as follows:
Step1: retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: if T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: judge whether T_i and T_l have the same topic category;
Step4: if T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: if l ≥ g, go to Step3; otherwise, go to Step6;
Step6: if the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: if d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: if l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
2. The event identification and tracking method for instant interactive text of claim 1, characterized in that the social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
CN 201110312540 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method Expired - Fee Related CN102411611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110312540 CN102411611B (en) 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method


Publications (2)

Publication Number Publication Date
CN102411611A CN102411611A (en) 2012-04-11
CN102411611B (en) 2013-01-02

Family

ID=45913682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110312540 Expired - Fee Related CN102411611B (en) 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method

Country Status (1)

Country Link
CN (1) CN102411611B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156228B (en) * 2014-04-01 2017-11-10 兰州工业学院 A kind of embedded feature database of client filtering short message and update method
US9547471B2 (en) * 2014-07-03 2017-01-17 Microsoft Technology Licensing, Llc Generating computer responses to social conversational inputs
US10460720B2 (en) 2015-01-03 2019-10-29 Microsoft Technology Licensing, Llc. Generation of language understanding systems and methods
CN104881399B (en) * 2015-05-15 2017-10-27 中国科学院自动化研究所 Event recognition method and system based on probability soft logic PSL
CN106021508A (en) * 2016-05-23 2016-10-12 武汉大学 Sudden event emergency information mining method based on social media
CN106844765B (en) * 2017-02-22 2019-12-20 中国科学院自动化研究所 Significant information detection method and device based on convolutional neural network
CN107145516B (en) * 2017-04-07 2021-03-19 北京捷通华声科技股份有限公司 Text clustering method and system
CN107862081B (en) * 2017-11-29 2021-07-16 四川无声信息技术有限公司 Network information source searching method and device and server
CN110246049A (en) * 2018-03-09 2019-09-17 北大方正集团有限公司 Topic detecting method, device, equipment and readable storage medium storing program for executing
CN108427752A (en) * 2018-03-13 2018-08-21 浙江大学城市学院 A kind of article meaning of one's words mask method using event based on isomery article
CN113626573B (en) * 2021-08-11 2022-09-27 北京深维智信科技有限公司 Sales session objection and response extraction method and system


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6424971B1 (en) * 1999-10-29 2002-07-23 International Business Machines Corporation System and method for interactive classification and analysis of data
CN1535433A (en) * 2001-07-04 2004-10-06 库吉萨姆媒介公司 Category based, extensible and interactive system for document retrieval
CN1403959A (en) * 2001-09-07 2003-03-19 联想(北京)有限公司 Content filter based on text content characteristic similarity and theme correlation degree comparison
JP2009146397A (en) * 2007-11-19 2009-07-02 Omron Corp Important sentence extraction method, important sentence extraction device, important sentence extraction program and recording medium

Also Published As

Publication number Publication date
CN102411611A (en) 2012-04-11


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130102

Termination date: 20151015

EXPY Termination of patent right or utility model