CN102411611B - Instant interactive text oriented event identifying and tracking method - Google Patents

Instant interactive text oriented event identifying and tracking method

Info

Publication number
CN102411611B
Authority
CN
China
Prior art keywords
words
stamp
wheel
feature
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110312540
Other languages
Chinese (zh)
Other versions
CN102411611A (en)
Inventor
Tian Feng (田锋)
Zheng Qinghua (郑庆华)
Zhang Huisan (张惠三)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN 201110312540
Publication of CN102411611A
Application granted
Publication of CN102411611B
Legal status: Expired - Fee Related (current)
Anticipated expiration


Abstract

The invention discloses an event identification and tracking method for instant interactive text, which comprises two stages. I. In the turn-level topic-category classification stage, the content of each dialogue turn is represented by an adaptive language-feature aggregation vector, and its topic category is classified with a supervised hierarchical probabilistic latent semantic analysis (PLSA) model obtained by training. II. In the turn-level event identification and tracking stage, the beginning, continuation and end of an event are judged from the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network. In particular, (1) the invention proposes adaptively adjusting the turn time-closeness threshold Th according to the fluctuation of the turn-rate time series around the current turn, and then performing the adaptive language-feature aggregation; and (2) the supervised hierarchical PLSA model is updated periodically during operation. The provided method is an online identification and tracking algorithm.

Description

Event identification and tracking method for instant interactive text
Technical field
The present invention relates to information retrieval, extraction and management and to natural language processing technology, and in particular to an event identification and tracking method for online instant interactive text.
Background art
With the increasingly widespread use of Internet technology, network applications based on interactive text, such as Internet chat rooms and microblogs, have become one of the main means by which people obtain and publish information. These texts contain abundant information resources, and searching, organizing and exploiting the events in such interactive-text applications by topic category has become a pressing task: for example, automatically recognizing a network learner's emotional change events so as to adjust learning efficiency, or identifying socially sensitive incidents and new events. A novelty search by the applicant retrieved no patents related to the present invention, but did find several similar articles:
1) Message-text clustering research based on frequent patterns. Hu Jixiang, Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology).
2) A term-weight computation method CDTF_IDF for chat vocabulary. Gao Peng, Cao Xianbin, Computer Simulation, 2007.12.
The author of article 1) observed that frequent patterns (termed key frequent patterns) are the key to feature extraction from interactive text, since they carry additional semantic information such as word order and adjacent context, and proposed an unsupervised frequent-pattern-based feature selection algorithm applied to text classification and clustering.
Article 2) mainly targets content supervision in chat rooms: it computes the term weights of chat data by offline calculation of each word's weight in different data sources, aggregation, and raising the weights of key words, so as to identify chat-room topics.
According to the above novelty search, existing similar techniques differ from the method of this invention in the following respects:
1. The research object of the prior art is a whole news item (event) or paragraph, whereas this method works at the level of dialogue turns.
2. The prior art offers offline topic identification; this method is an online event identification method.
3. The prior art only identifies which topic a whole news item (event) or paragraph belongs to and the occurrence of related news (events), i.e. identification and tracking at the topic level; this method determines whether the event discussed by the interacting parties is the same one, whether that event is complete (has begun and ended), and who participates in it, i.e. identification and tracking of a single, concrete event.
4. For the feature representation of interactive text, the prior art collects data offline and computes only the term-frequency features of the current news item (event); this method exploits the time-dependence property and aggregates the features of all dialogue turns within a time threshold before topic classification.
5. Existing methods are mainly based on unsupervised probabilistic latent semantic analysis; this method proposes a supervised, hierarchical PLSA topic-model training method for a topic hierarchy, and updates the topic model periodically.
Summary of the invention
In view of the problems in the aforementioned related art, the present invention provides an event identification and tracking method for online instant interactive text, comprising the following steps:
The first step: the turn-level topic-category classification stage:
(1) In instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group;
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) Use the supervised hierarchical probabilistic latent semantic analysis (PLSA) model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) Judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) If turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) Judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained;
The computation of the adaptive language-feature aggregation vector in step (4) of the first step is:
Step1: After the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: Adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
The supervised hierarchical PLSA model used in step (5) of the first step and step (3) of the second step is trained as follows:
Step1: Organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted:
topic_k^level(a_k)
where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, then topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; here mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0,
topic^0 = {topic_1^0, ..., topic_{a_0}^0}
denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
The training data are then organized as follows:
Step1.1: Generate the feature-word vector W, as follows:
Step1.1.1: Count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector
W~ = (w~_1, ..., w~_f, ..., w~_ñ)
where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: Apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: Generate the co-occurrence matrix N, as follows:
Step1.2.1: Collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
Step2: On this training data set, train the corresponding PLSA models layer by layer in a top-down manner, as follows:
Step2.1: Apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: Apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: Use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0.
In the above method, the procedure by which step (5) of the first step uses the supervised hierarchical PLSA model to classify the turn-level topic category is:
Step1: Compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: Use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: Use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3.
The detailed procedure of step (1) of the second step is as follows:
Step1: Retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: If T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: Judge whether T_i and T_l have the same topic category;
Step4: If T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: If l ≥ g, go to Step3; otherwise, go to Step6;
Step6: If the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: Compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: If d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: If l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
The social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
Description of drawings
Fig. 1: flow chart of the event identification and tracking of the present invention.
Fig. 2: flow chart of the incremental training process.
Fig. 3: flow chart of turn topic-category classification.
Fig. 4: principle of turn-level event identification and tracking.
Fig. 5: example of the social network closeness computation, where Fig. 5a is the raw-data graph and Fig. 5b is the directed graph after transformation.
Embodiment
For a clearer understanding of the present invention, it is described in further detail below with reference to the accompanying drawings.
1. The present invention adopts a mechanism of first identifying the turn-level topic category and then performing turn-level event identification and tracking, in order to identify and track events in the speech input by users in instant interactive text; its flow chart is shown in Fig. 1.
Research purpose: to assign the Turn input by a user to the corresponding event.
Research background: compared with single documents such as blogs, comments and novels, instant interactive text, besides inheriting the ambiguity and non-standard usage of natural language text, has its own distinctive linguistic characteristics:
(1) interactivity;
(2) time-series behaviour: most interaction is close to real time, and the topics of turns are time-dependent, i.e. the closer in time two utterances are, the more likely their topics are related;
(3) each turn has little content and short sentences, which inevitably makes features sparse;
(4) complex interaction modes, for example one-to-one, one-to-many and many-to-many;
(5) varied and terse language, with many spelling errors, non-standard terms and much noise.
All of these pose considerable challenges for interactive-text processing techniques.
In view of these particular characteristics of interactive text, a "two-step" strategy is proposed. The specific working mechanism is as follows:
The first step performs topic-category classification: in this step the topic category of each turn is identified, such as politics, economy, culture, education, science and technology; the process is as follows:
(1) In instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group; an illustrative data structure for this five-tuple is sketched below.
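The patent specifies no implementation language; purely as an illustration, the five-tuple can be carried by a small data structure whose fields mirror the tuple components defined above (the class and field names are ours, not the patent's):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One dialogue turn T_i = (i, id, role, stamp, content)."""
    i: int        # sequence number of the turn, i >= 1
    id: str       # unique identifier of the speaker
    role: str     # "speaker" or "recipient"
    stamp: float  # timestamp at which the turn occurred, in seconds
    content: str  # all the text uttered in this turn
```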
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) Use the supervised hierarchical PLSA model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) Judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) If turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) Judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained, as shown in Fig. 2.
2. The topic-category classification mechanism for turns
Research purpose: to gather the turns related to the current turn and use their language-feature aggregation vector as the language feature vector of the current turn.
Research background: in interactive text each turn has little content and short sentences, which inevitably makes features sparse; computing the language-feature aggregation vector of the current turn overcomes these shortcomings to some extent.
For the topic-category classification of turns, the present invention adopts an adaptive topic-category classification mechanism, whose flow chart is shown in Fig. 3; the specific working mechanism is as follows:
(1) Compute the language-feature aggregation vector of the current turn T_i; the process is as follows (a code sketch follows Step3):
Step1: After the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: Adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
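A minimal Python sketch of Steps 1-3 above, assuming turns arrive sorted by stamp, timestamps in seconds, and per-turn feature vectors already computed; the helper names turn_rate, update_threshold and aggregate_features are ours, not the patent's:

```python
import numpy as np

DELTA_T = 3600.0          # ΔT: fixed measurement window, initialized to 1 hour
DELTA_V = 0.3             # Δv: tolerated fluctuation of the turn rate
INITIAL_TH = 6 * 3600.0   # Th: time-closeness threshold, initially 6 hours

def turn_rate(turns, i):
    """Step1: V(T_i), number of turns per ΔT in [T_i.stamp - ΔT, T_i.stamp]."""
    t_i, t_1 = turns[i].stamp, turns[0].stamp
    lo = t_1 if t_i - t_1 < DELTA_T else t_i - DELTA_T
    return sum(1 for t in turns[:i + 1] if lo <= t.stamp <= t_i) / DELTA_T

def update_threshold(th, v_now, v_prev):
    """Step2: shrink Th when the turn rate surges, grow it when the rate falls."""
    if v_now >= (1 + DELTA_V) * v_prev:
        return (1 - DELTA_V) * th
    if v_now <= (1 - DELTA_V) * v_prev:
        return (1 + DELTA_V) * th
    return th

def aggregate_features(turns, i, th, features):
    """Step3: W_{T_i} = sum of feature vectors of turns in [T_i.stamp - Th, T_i.stamp]."""
    t_i = turns[i].stamp
    return np.sum([features[j] for j in range(i + 1)
                   if t_i - th <= turns[j].stamp <= t_i], axis=0)
```

Shrinking Th when the turn rate surges keeps the aggregation window from mixing many fast-moving discussions, while growing it when traffic is sparse retains enough context to offset feature sparsity.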
(2) Training the supervised hierarchical PLSA model
Research purpose: to organize and train the training data hierarchically according to the topic hierarchy.
Research background: topics are hierarchical in nature, and such hierarchies abound in real applications, for example the Ministry of Education subject classification used in book cataloguing. Organizing and training the data with this model, from abstract topics down to fine-grained core events, effectively mitigates the imbalance of the data.
The detailed process is as follows:
Step1: Organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted:
topic_k^level(a_k)
where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, then topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; here mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0,
topic^0 = {topic_1^0, ..., topic_{a_0}^0}
denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
The training data are then organized as follows:
Step1.1: Generate the feature-word vector W, as follows:
Step1.1.1: Count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector
W~ = (w~_1, ..., w~_f, ..., w~_ñ)
where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: Apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: Generate the co-occurrence matrix N, as follows (a sketch using standard text utilities follows Step1.2.2):
Step1.2.1: Collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
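For concreteness, the counting and TFIDF weighting of Steps 1.1-1.2 (and of Step 2.1 below) can be sketched with scikit-learn's standard text utilities; the library choice, function name and caller-supplied stop-word list are our assumptions, not the patent's:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def build_cooccurrence(documents, stop_words):
    """Build the word-document count matrix N and its TFIDF-weighted version."""
    vec = CountVectorizer(stop_words=stop_words)
    counts = vec.fit_transform(documents)                # c(w_r, d_s), docs x words
    weighted = TfidfTransformer().fit_transform(counts)  # Step 2.1 weighting
    # transpose to the n x m_k (words x documents) layout used in the text
    return counts.T, weighted.T, vec.get_feature_names_out()
```

The 0.1 weight cut-off of Step1.1.2 would be applied afterwards by dropping the columns of low-weight words.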
Step2: On this training data set, train the corresponding PLSA models layer by layer in a top-down manner; the process is as follows:
The probabilistic latent semantic analysis model is a topic model, and its principle is as follows:
Let D = {d_1, d_2, ..., d_m} denote the document collection and W = {w_1, w_2, ..., w_n} the feature-word set, where m is the number of documents and n the number of feature words; ignoring the order in which words occur in documents, one can generate a co-occurrence matrix N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document. The joint density model is defined as:
p(d, w) = p(d) p(w|d),   p(w|d) = Σ_{z∈Z} p(w|z) p(z|d)
where z ∈ Z = (z_1, z_2, ..., z_Q) is the latent semantic space and Q is its size.
The model is interpreted as follows: p(d) is the probability that document d occurs in the data set; p(w|z) gives, once the latent semantic z is fixed, the probabilities with which the related words w occur; and p(z|d) is the distribution of latent semantics within document d. With these definitions one obtains a generative model that can produce new data (an EM fitting sketch follows the three sampling steps below):
(1) first select a document d by random sampling from the distribution p(d);
(2) with the document selected, sample the latent semantic z expressed by the document according to p(z|d);
(3) with the latent semantic selected, select the words of the document according to p(w|z).
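The patent does not spell out how the PLSA probabilities are fitted; the standard approach is expectation-maximization, so the following is a sketch under that assumption. N is the n × m word-document count matrix and Q the size of the latent semantic space; the returned matrices correspond, up to the joint/conditional convention, to WZ and DZ:

```python
import numpy as np

def plsa(N, Q, iters=50, seed=0):
    """Standard EM for PLSA on an n-words x m-docs count matrix N."""
    rng = np.random.default_rng(seed)
    n, m = N.shape
    p_w_z = rng.random((n, Q)); p_w_z /= p_w_z.sum(axis=0)  # p(w|z), columns sum to 1
    p_z_d = rng.random((Q, m)); p_z_d /= p_z_d.sum(axis=0)  # p(z|d), columns sum to 1
    for _ in range(iters):
        # E-step: posterior p(z|d,w) proportional to p(w|z) p(z|d)
        post = np.einsum('nq,qm->qnm', p_w_z, p_z_d)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts c(w,d) p(z|d,w)
        ec = N[None, :, :] * post                           # shape (Q, n, m)
        p_w_z = ec.sum(axis=2).T                            # sum over d -> (n, Q)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = ec.sum(axis=1)                              # sum over w -> (Q, m)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d                                     # roughly WZ and DZ
```

The dense (Q, n, m) posterior is fine for small matrices; a production implementation would iterate only over the nonzero entries of N.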
According to the above theory, the process of training the PLSA topic model is:
Step2.1: Apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: Apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: Use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0. The experiments use the LIBSVM multi-class classifier written by Professor Chih-Jen Lin of National Taiwan University; a training sketch follows.
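A sketch of Step3, training one layer's classifier on the columns of DZ (each document's distribution over latent semantics) with that layer's topic labels; we use scikit-learn's SVC, which is backed by the same libsvm library as the LIBSVM tool named above, and the variable names are ours:

```python
from sklearn.svm import SVC

def train_layer_classifier(DZ, labels):
    """Supervised-PLSA classifier for one layer: fit an SVM on p(z|d) features."""
    clf = SVC(kernel='linear')  # libsvm-backed; multi-class via one-vs-one
    clf.fit(DZ.T, labels)       # one row per document: its Q-dim p(z|d) vector
    return clf
```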
(3) The hierarchical classification mechanism
Research purpose: to classify dialogue turns using the supervised hierarchical PLSA model.
Research background: turns are classified hierarchically; if the turn's category belongs to a mother-node topic, classification continues downward, otherwise classification stops and the turn's topic category is marked. The hierarchical classification process is as follows (a sketch of the descent follows Step4):
Step1: Compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: Use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: Use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3.
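A sketch of this hierarchical descent, assuming a dict classifiers mapping each mother node to its trained SVM, a dict wz mapping each mother node to its layer's WZ matrix, and the set mon_topics of non-leaf categories (all of these names are our assumptions):

```python
import numpy as np

def classify_turn(w_ti, classifiers, wz, mon_topics, root='top'):
    """Descend the topic tree: project, classify, repeat until a leaf topic."""
    node = root
    while True:
        z_ti = w_ti @ wz[node]                    # Z_{T_i} = W_{T_i} x WZ
        topic = classifiers[node].predict(z_ti.reshape(1, -1))[0]
        if topic not in mon_topics:               # leaf reached: final category
            return topic
        node = topic                              # level += 1, re-classify below
```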
3. The identification and splicing mechanism for turn-level events
Research purpose: to assign turns to the corresponding events.
Research background: for the identification and tracking of concrete events, the present invention adopts a mechanism that combines the topic category of a turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network to judge the beginning, continuation and end of an event; its principle is shown in Fig. 4, and the specific working mechanism is as follows (a code sketch follows Step9):
Step1: Retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: If T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: Judge whether T_i and T_l have the same topic category;
Step4: If T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: If l ≥ g, go to Step3; otherwise, go to Step6;
Step6: If the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: Compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: If d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: If l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
The social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
To better understand the computation of social network closeness, an example is given here; referring to Fig. 5, the raw data are converted into a directed graph, and the closeness of A and B on the social network is then:
d(A, B) = 5 / (5 + 5 + 3 + 4) = 0.294
A sketch of this computation follows.
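A sketch of the closeness computation itself, keeping the directed interaction graph as a nested dict talk[a][b] counting how many times a has spoken to b over the history (this representation is our assumption); with counts matching Fig. 5 it reproduces d(A, B) = 5/17 ≈ 0.294:

```python
def closeness(talk, a, b):
    """d(a, b) = IO(a, b) / (I(a) + O(a) + I(b) + O(b)) on the talk graph."""
    out_deg = lambda x: sum(talk.get(x, {}).values())            # O(x)
    in_deg = lambda x: sum(m.get(x, 0) for m in talk.values())   # I(x)
    io = talk.get(a, {}).get(b, 0) + talk.get(b, {}).get(a, 0)   # IO(a, b)
    return io / (in_deg(a) + out_deg(a) + in_deg(b) + out_deg(b))
```

Binding talk with functools.partial yields the two-argument closeness(a, b) used in the event-assignment sketch above.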

Claims (2)

1. An event identification and tracking method for instant interactive text, characterized in that it comprises the following steps:
The first step: the turn-level topic-category classification stage:
(1) in instant interactive text, a single speech (Speech) input by a user is taken as one dialogue turn (Turn), represented by a five-tuple:
T_i=(i,id,role,stamp,content)
where T_i denotes the i-th turn and i ∈ N+, the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, which has two classes, speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn;
thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is the content of the i-th turn; the interactive text consists of turns coming from the same chat room or discussion group;
(2) preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector
W_i = (w_{i1}, w_{i2}, ..., w_{ih}, ..., w_{in})
where w_{ih}, 0 < h ≤ n, is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words; the feature lexicon is extracted from the training data;
(3) if turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, perform (4);
(4) compute the adaptive language-feature aggregation vector of turn T_i,
W_{T_i} = (w_{T_i,1}, w_{T_i,2}, ..., w_{T_i,h'}, ..., w_{T_i,n})
where w_{T_i,h'}, 0 < h' ≤ n, is the number of occurrences of the h'-th feature word in this language-feature aggregation and n is the number of feature words;
(5) use the supervised hierarchical probabilistic latent semantic analysis (PLSA) model to classify the turn-level topic category;
The second step: the turn-level event identification and tracking stage:
(1) judge whether the current turn T_i begins, continues or ends an event, according to the topic category of the turn, the time difference between successive turns, and the closeness of the speakers of successive turns on the social network;
(2) if turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise mark it as a turn that does not end an event;
(3) judge whether the periodic update time has arrived; if it has, update the supervised hierarchical PLSA model; otherwise, end the algorithm; the periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained;
the computation of the adaptive language-feature aggregation vector in step (4) of the first step being:
Step1: after the current turn T_i occurs, compute the rate V(T_i) at which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]:
V(T_i) = C(T_1.stamp, T_i.stamp) / ΔT,        if T_i.stamp - T_1.stamp < ΔT
V(T_i) = C(T_i.stamp - ΔT, T_i.stamp) / ΔT,   if T_i.stamp - T_1.stamp ≥ ΔT
where C(T_1.stamp, T_i.stamp) is the total number of turns occurring in the time interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) is the total number of turns occurring in the time interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step2: adaptively determine the size of the time-closeness threshold Th: first compute Th':
Th' = (1 - Δv) × Th,   if V(T_i) ≥ (1 + Δv) × V(T_{i-1})
Th' = Th,              if (1 - Δv) × V(T_{i-1}) < V(T_i) < (1 + Δv) × V(T_{i-1})
Th' = (1 + Δv) × Th,   if V(T_i) ≤ (1 - Δv) × V(T_{i-1})
then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the adaptive adjustment of the time threshold Th;
Step3: let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T, that is:
W_{T_i} = W_{T_g} + W_{T_{g+1}} + ... + W_{T_i}
The supervised hierarchical PLSA model used in step (5) of the first step and step (3) of the second step is trained as follows:
Step1: organize the training data set into a hierarchical classification according to the hierarchical nature of topics; the topic hierarchy forms a tree whose nodes are denoted topic_k^level(a_k), where level is the level at which the topic category resides, k indicates that the current topic is the k-th sub-topic of its parent topic, and a_k is the number of sub-topics contained by the current topic topic_k^level; if a_k = 0, topic_k^level is a leaf node of the topic tree, written topic_k^level ∈ leaf_topics; otherwise topic_k^level is a mother node containing sub-topics, written topic_k^level ∈ mon_topics; mon_topics is the set of nodes that contain sub-topics and leaf_topics is the set of leaf nodes; when level = 0, topic^0 = {topic_1^0, ..., topic_{a_0}^0} denotes the top-level topic categories, where a_0 is the number of top-level topic categories;
the training data are then organized as follows:
Step1.1: generate the feature-word vector W, as follows:
Step1.1.1: count the distinct words occurring in the training data set; after deleting stop words this forms a feature-word vector W~ = (w~_1, ..., w~_f, ..., w~_ñ), where w~_f is the number of times the f-th feature word occurs in the training data and ñ is the number of feature words; the stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step1.1.2: apply the TFIDF algorithm to W~ to compute term weights, sort in descending order of weight, and delete the feature words whose weight is below 0.1, obtaining the feature-word vector W = {w_1, w_2, ..., w_{f'}, ..., w_n}, where w_{f'} is the f'-th feature word by weight in the training data and n is the number of feature words;
Step1.2: generate the co-occurrence matrix N, as follows:
Step1.2.1: collect all documents in the training data that belong to topic topic_k^level into a document collection D_k = {d_1, ..., d_{m_k}}, where m_k is the number of documents;
Step1.2.2: the word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_{rs}, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document;
Step2: on this training data set, train the corresponding PLSA models layer by layer in a top-down manner, as follows:
Step2.1: apply the TFIDF algorithm to the elements of matrix N to compute weights, generating a new co-occurrence matrix N~;
Step2.2: apply the PLSA algorithm to the co-occurrence matrix N~ to learn two matrices, WZ = (p(w_r, z_q))_{rq} of size n × Q and DZ = (p(z_q, d_s))_{qs} of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on latent semantic z_q, and p(z_q, d_s) is the probability of the s-th document on latent semantic z_q;
Step3: use a multi-class support vector machine SVM (Support Vector Machine) classifier to train on the DZ matrix produced by each layer's PLSA model, generating the supervised-PLSA classifier svm_k^level corresponding to each layer; when level = 0, the classifier is svm^0;
The procedure by which step (5) of the first step uses the supervised hierarchical PLSA model to classify the turn-level topic category is:
Step1: compute the language-feature aggregation vector W_{T_i} of the current turn T_i; using the WZ matrix learned by the supervised hierarchical PLSA algorithm, map W_{T_i} onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z:
Z_{T_i} = W_{T_i} × WZ
where Z_{T_i} is the probability distribution of the language-feature aggregation of the current turn T_i over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimension reduction;
Step2: use the trained supervised hierarchical PLSA classifier svm_k^level to classify the topic category of T_i;
Step3: if the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step4: use the corresponding svm_k^level to classify the topic category of T_i, then go to Step3;
The detailed procedure of step (1) of the second step is as follows:
Step1: retrieve the set of turns T = {T_g, ..., T_i | 0 < g ≤ i} that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and are not ends of events;
Step2: if T contains only the element T_i, mark T_i as the initial sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and perform Step3;
Step3: judge whether T_i and T_l have the same topic category;
Step4: if T_i and T_l have the same topic category, assign T_i to the event to which T_l belongs and the algorithm finishes; otherwise set l = l - 1 and perform Step5;
Step5: if l ≥ g, go to Step3; otherwise, go to Step6;
Step6: if the event to which T_i belongs is empty, set l' = i - 1 and go to Step7; otherwise, finish the algorithm;
Step7: compute the closeness d between T_i.id and T_{l'}.id on the social network;
Step8: if d > 0.5, assign T_i to the event to which T_{l'} belongs and the algorithm finishes; otherwise set l' = l' - 1 and perform Step9;
Step9: if l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish the algorithm.
2. The event identification and tracking method for instant interactive text of claim 1, characterized in that the social network closeness is computed as:
d(T_i.id, T_{i-1}.id) = IO(T_i.id, T_{i-1}.id) / (I(T_i.id) + O(T_i.id) + I(T_{i-1}.id) + O(T_{i-1}.id))
where I(T_i.id) is the total in-degree of T_i.id, O(T_i.id) is the total out-degree of T_i.id, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) is the number of times T_i.id has spoken to T_{i-1}.id plus the number of times T_{i-1}.id has spoken to T_i.id; the out-degree and in-degree statistics are summed over the historical data, and the social network closeness is updated once per month.
CN 201110312540 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method Expired - Fee Related CN102411611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110312540 CN102411611B (en) 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method


Publications (2)

Publication Number Publication Date
CN102411611A CN102411611A (en) 2012-04-11
CN102411611B (en) 2013-01-02

Family

ID=45913682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110312540 Expired - Fee Related CN102411611B (en) 2011-10-15 2011-10-15 Instant interactive text oriented event identifying and tracking method

Country Status (1)

Country Link
CN (1) CN102411611B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156228B (en) * 2014-04-01 2017-11-10 兰州工业学院 A kind of embedded feature database of client filtering short message and update method
US9547471B2 (en) * 2014-07-03 2017-01-17 Microsoft Technology Licensing, Llc Generating computer responses to social conversational inputs
US10460720B2 (en) 2015-01-03 2019-10-29 Microsoft Technology Licensing, Llc. Generation of language understanding systems and methods
CN104881399B (en) * 2015-05-15 2017-10-27 中国科学院自动化研究所 Event recognition method and system based on probability soft logic PSL
CN106021508A (en) * 2016-05-23 2016-10-12 武汉大学 Sudden event emergency information mining method based on social media
CN106844765B (en) * 2017-02-22 2019-12-20 中国科学院自动化研究所 Significant information detection method and device based on convolutional neural network
CN107145516B (en) * 2017-04-07 2021-03-19 北京捷通华声科技股份有限公司 Text clustering method and system
CN107862081B (en) * 2017-11-29 2021-07-16 四川无声信息技术有限公司 Network information source searching method and device and server
CN110246049A (en) * 2018-03-09 2019-09-17 北大方正集团有限公司 Topic detecting method, device, equipment and readable storage medium storing program for executing
CN108427752A (en) * 2018-03-13 2018-08-21 浙江大学城市学院 A kind of article meaning of one's words mask method using event based on isomery article
CN113626573B (en) * 2021-08-11 2022-09-27 北京深维智信科技有限公司 Sales session objection and response extraction method and system


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6424971B1 (en) * 1999-10-29 2002-07-23 International Business Machines Corporation System and method for interactive classification and analysis of data
CN1535433A (en) * 2001-07-04 2004-10-06 库吉萨姆媒介公司 Category based, extensible and interactive system for document retrieval
CN1403959A (en) * 2001-09-07 2003-03-19 联想(北京)有限公司 Content filter based on text content characteristic similarity and theme correlation degree comparison
JP2009146397A (en) * 2007-11-19 2009-07-02 Omron Corp Important sentence extraction method, important sentence extraction device, important sentence extraction program and recording medium

Also Published As

Publication number Publication date
CN102411611A (en) 2012-04-11


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130102

Termination date: 20151015

EXPY Termination of patent right or utility model