CN102411611B - Instant interactive text oriented event identifying and tracking method - Google Patents
- Publication number: CN102411611B (application CN201110312540A)
- Authority: CN (China)
- Prior art keywords: words, stamp, wheel, feature, event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses an event identification and tracking method for instant interactive text, comprising two stages. I. In the turn-level topic-category classification stage, the content of each talk turn is represented by an adaptive language-feature aggregation vector and its topic category is classified with a supervised hierarchical probabilistic latent semantic analysis (PLSA) model obtained by training. II. In the turn-level event identification and tracking stage, the beginning, continuation and end of an event are judged from the topic category of the turn, the time difference between successive turns, and the tightness between the speakers of successive turns at the social-network level. The method (1) proposes adaptively adjusting the turn-density time threshold Th according to the fluctuation of the time-series data around the current turn before computing the adaptive language-feature aggregation, and (2) periodically updates the supervised hierarchical PLSA model during operation. The result is an online identification and tracking algorithm.
Description
Technical field
The present invention relates to information retrieval, extraction and management and to natural language processing, and in particular to an event identification and tracking method for online real-time interactive text.
Background technology
With the increasingly wide use of Internet technology, network applications based on interactive text have become one of the main means by which people obtain and publish information; typical interactive-text applications include Internet chatrooms and microblogs. These texts contain a large amount of rich information, so finding how to search, organize and exploit the events in such interactive-text applications by topic category has become a pressing task. Examples include automatically recognizing a network learner's emotion-change events in order to regulate learning efficiency, or identifying socially sensitive incidents and new events. A novelty search by the applicant retrieved no patent related to the present invention, but did find several similar articles:
1) Message-text clustering research based on frequent patterns. Hu Jixiang, Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology).
2) A weight computation method, CDTF_IDF, for chat vocabulary. Gao Peng, Cao Xianbin, Computer Simulation, 2007.12.
The author of article 1) found that frequent patterns (called key frequent patterns) are key to feature extraction from interactive text, since they contain additional semantic information such as word order and adjacent context, and proposed an unsupervised frequent-pattern-based feature selection algorithm applied to text classification and clustering.
Article 2) mainly targets content supervision in chatrooms: it computes the term weights of chat data by calculating vocabulary weights offline in different data sources and by raising the weights of aggregated key vocabulary, thereby identifying chatroom topics.
According to the above novelty search, existing similar techniques differ from the present method in the following respects:
1. The research object of the prior art is a whole news item (event) or paragraph, whereas this method works at the talk-turn level.
2. The prior art uses offline topic identification, whereas this method is an online event recognition method.
3. The prior art only identifies which topic a whole news item (event) or paragraph belongs to and whether related news (events) have occurred, i.e. recognition and tracking at the topic level; this method instead determines whether the event discussed by the interacting parties is consistent, whether the event is complete (has begun and ended), and who participated, i.e. recognition and tracking of single, concrete events.
4. For the feature representation of interactive text, the prior art only computes offline word-frequency features of the current news item (event); this method exploits the time-dependence property and aggregates the features of all talk turns within a time threshold before topic classification.
5. Existing methods mainly use unsupervised probabilistic latent semantic analysis; for the topic hierarchy, this method proposes a supervised, hierarchical PLSA topic-model training method and updates the topic model periodically.
Summary of the invention
To address the problems in the aforementioned related art, the invention provides an event identification and tracking method for online real-time interactive text, comprising the following steps:
The first step: the turn-level topic-category classification stage:
(1) In the real-time interactive text, each utterance (Speech) entered by a user is treated as a talk turn (Turn), represented as a five-tuple:
T_i = (i, id, role, stamp, content)
where T_i denotes the i-th turn and i belongs to the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, with two classes: speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn. Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is its content; the interactive text here consists of turns coming from the same chatroom or discussion group;
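The five-tuple above maps naturally onto a small record type. The following sketch mirrors the patent's five fields; the class name `Turn` and the concrete field types are illustrative assumptions, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    i: int        # sequence number of the turn (positive integer)
    id: str       # unique identifier distinguishing the speaker
    role: str     # speaker's role: "speaker" or "recipient"
    stamp: float  # timestamp at which the turn occurs (e.g. Unix time)
    content: str  # all text uttered in this turn

# a turn as it might arrive from a chatroom
t1 = Turn(1, "user42", "speaker", 1317600000.0, "Did you see the news today?")
```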
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector, whose component w_ih (0 < h ≤ n) is the number of times the h-th feature word occurs in T_i.content, n being the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, go to (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i, whose h'-th component (0 < h' ≤ n) is the number of times the h'-th feature word occurs in this language-feature aggregation, n being the number of feature words;
(5) Classify the turn-level topic category with the supervised hierarchical probabilistic latent semantic analysis model;
Second step: the turn-level event identification and tracking stage:
(1) According to the topic category to which the turn belongs, the time difference between successive turns, and the tightness between the speakers of successive turns at the social-network level, judge whether the current turn T_i is the beginning, continuation or end of an event;
(2) If turn T_i is an event-ending statement, a complete event has been formed, so mark T_i as an event-ending turn; otherwise, mark it as a turn of an unfinished event;
(3) Judge whether the periodic update time has arrived; if so, update the supervised hierarchical probabilistic latent semantic analysis model; otherwise, end the algorithm. The periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained;
The computation of the adaptive language-feature aggregation vector in step (4) of the first step is:
Step 1: After the current turn T_i occurs, compute the frequency V(T_i) with which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]; here C(T_1.stamp, T_i.stamp) denotes the total number of turns occurring in the interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) denotes the total number of turns occurring in the interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step 2: Adaptively determine the size of the time-density threshold Th according to the fluctuation of the turn frequency: first compute Th', then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold to Th = 6 hours. This achieves the goal of adaptively changing the size of the time threshold Th;
Step 3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T.
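Steps 1-3 above can be sketched as follows. Note that the patent gives the exact formulas for V(T_i) and Th' only as figures, so the ratio used for V and the shrink/grow rule for Th below are placeholder assumptions; only the windowed summation of Step 3 follows the text directly:

```python
from bisect import bisect_left

def aggregate_features(turns, i, delta_t=3600.0, th=6 * 3600.0):
    """Sum the feature vectors of all turns in [T_i.stamp - Th, T_i.stamp].

    `turns` is a list of (stamp, feature_vector) pairs sorted by stamp.
    The adaptive update of `th` from the recent turn frequency is only
    sketched: V is assumed to be the share of turns falling in the last
    delta_t, and a busy channel shrinks the window toward delta_t.
    """
    stamps = [s for s, _ in turns]
    cur = stamps[i]
    # assumed form of V(T_i): recent turn count over total turn count
    recent = i + 1 - bisect_left(stamps, cur - delta_t, 0, i + 1)
    v = recent / (i + 1)
    # assumed adaptation rule: high frequency -> shorter window
    th = th * (1.0 - v) + delta_t * v
    # Step 3: sum the feature vectors of all turns inside the window
    lo = bisect_left(stamps, cur - th, 0, i + 1)
    n = len(turns[0][1])
    agg = [0] * n
    for _, vec in turns[lo:i + 1]:
        for h in range(n):
            agg[h] += vec[h]
    return agg, th
```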
The supervised hierarchical probabilistic latent semantic analysis model used in step (5) of the first step and step (3) of the second step is trained as follows:
Step 1: According to the hierarchical nature of topics, organize the training data set into a hierarchical classification; after the topic layering this forms a tree structure of topic nodes, in which level denotes the layer at which a topic category sits, k indexes the current topic as the k-th sub-topic category under its parent topic, and a_k denotes the number of sub-topics the current topic contains. If a_k = 0, the current topic is a leaf node of the topic tree (a member of leaf_topics); otherwise it is a mother node containing sub-topics (a member of mon_topics). Here mon_topics refers to the set of nodes that contain sub-topics and leaf_topics to the set of leaf nodes; when level = 0, the node denotes a top-layer topic category, and a_0 denotes the number of top-layer topic categories;
The organization of the training data then proceeds as follows:
Step 1.1: Generate the feature-word vector W, as follows:
Step 1.1.1: Count the total number of distinct words occurring in the training data set; after deleting stop words, this forms a feature-word vector whose f-th component records the number of times the f-th feature word occurs in the training data. The stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step 1.1.2: Apply the TFIDF algorithm to this vector to compute the term weights, sort the words by descending weight, and delete the feature words whose weight is below 0.1; the resulting feature-word vector is W = {w_1, w_2, ..., w_f', ..., w_n}, where w_f' denotes the feature word ranked f'-th by weight in the training data and n denotes the number of feature words;
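Step 1.1 can be sketched in Python as follows. The whitespace tokenization and the exact TF-IDF form are assumptions; the stop-word deletion, the descending sort and the 0.1 weight cutoff follow the text:

```python
import math
from collections import Counter

def select_feature_words(docs, min_weight=0.1, stop_words=frozenset()):
    """Build the feature-word vector: count words, drop stop words,
    weight by TF-IDF, sort by descending weight, drop weight < min_weight.

    `docs` is a list of document strings; returns (kept_words, weights).
    """
    m = len(docs)
    tf = Counter()   # total occurrences of each word
    df = Counter()   # number of documents containing each word
    for doc in docs:
        words = [w for w in doc.split() if w not in stop_words]
        tf.update(words)
        df.update(set(words))
    total = sum(tf.values())
    # assumed TF-IDF form: corpus-level term frequency times log(m / df)
    weights = {w: (tf[w] / total) * math.log(m / df[w]) for w in tf}
    kept = sorted((w for w in weights if weights[w] >= min_weight),
                  key=lambda w: -weights[w])
    return kept, weights
```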
Step 1.2: Generate the co-occurrence matrix N, as follows:
Step 1.2.1: Gather all documents in the training data that belong to the current topic into a document collection, where m_k denotes the number of documents;
Step 1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_rs, where c(w_r, d_s) denotes the number of times the r-th feature word occurs in the s-th document;
Step 2: On this training data set, train the corresponding probabilistic latent semantic analysis models layer by layer in a top-down manner, as follows:
Step 2.1: Apply the TFIDF algorithm to weight the elements of matrix N, generating a new co-occurrence matrix;
Step 2.2: Apply the probabilistic latent semantic analysis algorithm to the new co-occurrence matrix to learn two matrices: WZ = (p(w_r, z_q))_rq of size n × Q and DZ = (p(z_q, d_s))_qs of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z denotes the latent semantic space and Q its size; p(w_r, z_q) denotes the probability of the r-th feature word on the latent semantic z_q, and p(z_q, d_s) denotes the probability of the s-th document on the latent semantic z_q;
Step 3: Train a multi-class support vector machine (SVM, Support Vector Machine) classifier on the DZ produced by the probabilistic latent semantic analysis model of each layer, generating the supervised probabilistic latent semantic analysis model classifier of each layer; when level = 0, this is the top-layer classifier.
In the above method, the procedure by which step (5) of the first step classifies the turn-level topic category with the supervised hierarchical probabilistic latent semantic analysis model is:
Step 1: Compute the language-feature aggregation vector of the current turn T_i and, using the WZ learned by the supervised hierarchical probabilistic latent semantic analysis algorithm, map it onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z. The result is the probability distribution of T_i's aggregated language features over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimensionality reduction;
Step 2: Classify the topic category of T_i with the trained supervised hierarchical probabilistic latent semantic analysis model classifier;
Step 3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step 4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step 4: Classify the topic category of T_i with the corresponding lower-layer classifier, then go to Step 3.
The detailed procedure of step (1) of the second step is as follows:
Step 1: Find the set T = {T_g, ..., T_i | 0 < g ≤ i} of turns that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and whose events have not yet ended;
Step 2: If T contains only the element T_i, mark T_i as the starting sentence of a new event and the algorithm finishes; otherwise, set l = i - 1 and go to Step 3;
Step 3: Judge whether T_i and T_l have the same topic category;
Step 4: If T_i and T_l have the same topic category, attach T_i to the event to which T_l belongs and the algorithm finishes; otherwise, set l = l - 1 and go to Step 5;
Step 5: If l ≥ g, go to Step 3; otherwise, go to Step 6;
Step 6: If the event of T_i is still empty, set l' = i - 1 and go to Step 7; otherwise, end the algorithm;
Step 7: Compute the tightness d between T_i.id and T_l'.id at the social-network level;
Step 8: If d > 0.5, attach T_i to the event to which T_l' belongs and the algorithm finishes; otherwise, set l' = l' - 1 and go to Step 9;
Step 9: If l' ≥ g, go to Step 7; otherwise, mark T_i as the starting sentence of a new event and end the algorithm.
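Steps 1-9 above can be sketched as a single attachment function. The dictionary-based turn records and event labels are illustrative assumptions, but the control flow (topic match searched backwards through the window first, then tightness > 0.5, otherwise a new event) follows the text:

```python
def assign_event(turns, i, th, tightness, d_min=0.5):
    """Attach turn i to an open event, or start a new one.

    Each turn is a dict with keys 'stamp', 'topic', 'id', 'event',
    'ended' (a hypothetical record layout); `tightness(a, b)` returns
    the social-network tightness between two speaker ids.
    """
    cur = turns[i]
    # Step 1: open (not ended) earlier turns within [stamp - th, stamp]
    window = [l for l in range(i)
              if turns[l]['stamp'] >= cur['stamp'] - th
              and not turns[l]['ended']]
    if not window:                       # Step 2: nothing to attach to
        cur['event'] = ('new', i)
        return cur['event']
    for l in reversed(window):           # Steps 3-5: same topic, backwards
        if turns[l]['topic'] == cur['topic']:
            cur['event'] = turns[l]['event']
            return cur['event']
    for l in reversed(window):           # Steps 6-9: tight enough speakers
        if tightness(cur['id'], turns[l]['id']) > d_min:
            cur['event'] = turns[l]['event']
            return cur['event']
    cur['event'] = ('new', i)            # Step 9: start a new event
    return cur['event']
```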
The social-network tightness is computed as follows: I(T_i.id) denotes the in-degree sum of T_i.id and O(T_i.id) its out-degree sum, and similarly for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) denotes the number of times T_i.id speaks to T_{i-1}.id plus the number of times T_{i-1}.id speaks to T_i.id. The out-degree and in-degree statistics are summed over the historical data, and the social-network tightness is updated once per month.
Description of drawings
Fig. 1 is the event identification and tracking flowchart of the invention.
Fig. 2 is the incremental training flowchart.
Fig. 3 is the turn topic-category classification flowchart.
Fig. 4 is the schematic diagram of turn-level event identification and tracking.
Fig. 5 is an example of the social-network tightness computation, in which Fig. 5a is the raw-data graph and Fig. 5b is the directed graph after transformation.
Embodiment
For a clearer understanding of the present invention, it is described in further detail below with reference to the accompanying drawings.
1. The invention adopts a mechanism of first identifying the turn topic category and then performing turn-level event identification and tracking, so as to identify and track events in the talk turns entered by users in real-time interactive text; its flowchart is shown in Fig. 1.
Research purpose: attribute each Turn entered by a user to the corresponding event.
Research background: compared with single documents such as blogs, comments and novels, real-time interactive text inherits the ambiguity and non-standardness of natural language text and additionally has its own linguistic characteristics:
(1) interactivity;
(2) time-series character: most interactions are close to real time, and the topics of turns are time-dependent, i.e. the closer in time two talks are, the more likely they are related;
(3) each turn has little content and short sentences, which inevitably makes features sparse;
(4) complex interaction patterns, e.g. one-to-one, one-to-many and many-to-many;
(5) varied and terse language, with many spelling errors, non-standard terms and much noise.
All of these pose considerable challenges to the processing of interactive text.
In view of these specific characteristics of interactive text, a "two-step" strategy is proposed. The specific working mechanism is as follows:
The first step performs topic-category classification: in this step, the topic category to which each turn belongs is identified, e.g. politics, economy, culture, education, science and technology. The process is as follows:
(1) In the real-time interactive text, each utterance (Speech) entered by a user is treated as a talk turn (Turn), represented as a five-tuple:
T_i = (i, id, role, stamp, content)
where T_i denotes the i-th turn and i belongs to the set of positive integers; id is the unique identifier distinguishing the speaker; role is the speaker's role, with two classes: speaker (Speaker) and recipient (Recipient); stamp is the timestamp at which the turn occurs; and content is all the text uttered in the turn. Thus T_i.stamp is the time at which the i-th turn occurs and T_i.content is its content; the interactive text here consists of turns coming from the same chatroom or discussion group;
(2) Preprocess the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector, whose component w_ih (0 < h ≤ n) is the number of times the h-th feature word occurs in T_i.content, n being the number of feature words; the feature lexicon is extracted from the training data;
(3) If turn T_i is the first turn to occur in the system, i.e. T_1, go to (5); otherwise, go to (4);
(4) Compute the adaptive language-feature aggregation vector of turn T_i, whose h'-th component (0 < h' ≤ n) is the number of times the h'-th feature word occurs in this language-feature aggregation, n being the number of feature words;
(5) Classify the turn-level topic category with the supervised hierarchical probabilistic latent semantic analysis model;
Second step: the turn-level event identification and tracking stage:
(1) According to the topic category to which the turn belongs, the time difference between successive turns, and the tightness between the speakers of successive turns at the social-network level, judge whether the current turn T_i is the beginning, continuation or end of an event;
(2) If turn T_i is an event-ending statement, a complete event has been formed, so mark T_i as an event-ending turn; otherwise, mark it as a turn of an unfinished event;
(3) Judge whether the periodic update time has arrived; if so, update the supervised hierarchical probabilistic latent semantic analysis model; otherwise, end the algorithm. The periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained; see Fig. 2.
2. The topic-category classification mechanism of turns
Research purpose: gather together the turns relevant to the current turn and use the language-feature aggregation vector of the turns as the language feature vector of the current turn.
Research background: in interactive text each turn has little content and short sentences, which inevitably makes features sparse; computing the language-feature aggregation vector of the current turn overcomes these shortcomings to some extent.
For the topic-category classification of turns, the invention adopts an adaptive topic-category classification mechanism, whose flowchart is shown in Fig. 3. The concrete working mechanism is as follows:
(1) Compute the language-feature aggregation vector of the current turn T_i, as follows:
Step 1: After the current turn T_i occurs, compute the frequency V(T_i) with which turns occur in the time interval [T_i.stamp - ΔT, T_i.stamp]; here C(T_1.stamp, T_i.stamp) denotes the total number of turns occurring in the interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) denotes the total number of turns occurring in the interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval, initialized to ΔT = 1 hour;
Step 2: Adaptively determine the size of the time-density threshold Th according to the fluctuation of the turn frequency: first compute Th', then set Th = Th', i.e. update the time threshold; at initialization Δv is set to 0.3 and the threshold to Th = 6 hours. This achieves the goal of adaptively changing the size of the time threshold Th;
Step 3: Let T = {T_g, ..., T_i | 0 < g ≤ i} denote the set of turns occurring in the time interval [T_i.stamp - Th, T_i.stamp]; the language-feature aggregation vector of T_i is then the sum of the language feature vectors of all turns in T.
(2) Training the supervised hierarchical probabilistic latent semantic analysis model
Research purpose: organize and train the training data hierarchically according to the layered topics.
Research background: topics have a hierarchical nature, and such hierarchies abound in real applications, for example the Ministry of Education discipline classification used in book classification. Moving from abstract topics down to fine-grained core events, organizing and training the data according to this model can effectively alleviate the imbalance of the data.
The detailed process is as follows:
Step 1: According to the hierarchical nature of topics, organize the training data set into a hierarchical classification; after the topic layering this forms a tree structure of topic nodes, in which level denotes the layer at which a topic category sits, k indexes the current topic as the k-th sub-topic category under its parent topic, and a_k denotes the number of sub-topics the current topic contains. If a_k = 0, the current topic is a leaf node of the topic tree (a member of leaf_topics); otherwise it is a mother node containing sub-topics (a member of mon_topics). Here mon_topics refers to the set of nodes that contain sub-topics and leaf_topics to the set of leaf nodes; when level = 0, the node denotes a top-layer topic category, and a_0 denotes the number of top-layer topic categories;
The organization of the training data then proceeds as follows:
Step 1.1: Generate the feature-word vector W, as follows:
Step 1.1.1: Count the total number of distinct words occurring in the training data set; after deleting stop words, this forms a feature-word vector whose f-th component records the number of times the f-th feature word occurs in the training data. The stop words comprise: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia and numerals;
Step 1.1.2: Apply the TFIDF algorithm to this vector to compute the term weights, sort the words by descending weight, and delete the feature words whose weight is below 0.1; the resulting feature-word vector is W = {w_1, w_2, ..., w_f', ..., w_n}, where w_f' denotes the feature word ranked f'-th by weight in the training data and n denotes the number of feature words;
Step 1.2: Generate the co-occurrence matrix N, as follows:
Step 1.2.1: Gather all documents in the training data that belong to the current topic into a document collection, where m_k denotes the number of documents;
Step 1.2.2: The word-document co-occurrence matrix N, of dimension n × m_k, is then N = (c(w_r, d_s))_rs, where c(w_r, d_s) denotes the number of times the r-th feature word occurs in the s-th document;
Step 2: On this training data set, train the corresponding probabilistic latent semantic analysis models layer by layer in a top-down manner, as follows:
The probabilistic latent semantic analysis model is a topic model whose principle is as follows:
Let D = {d_1, d_2, ..., d_m} denote the document collection and W = {w_1, w_2, ..., w_n} the feature-word set, where m denotes the number of documents and n the number of feature words. Ignoring the order in which words occur in documents, an m × n co-occurrence matrix N = (c(w_r, d_s))_rs can be generated, where c(w_r, d_s) denotes the number of times the r-th feature word occurs in the s-th document. The joint density model is defined as:
p(d, w) = p(d) p(w|d), with p(w|d) = Σ_{z ∈ Z} p(w|z) p(z|d),
where z ∈ Z = (z_1, z_2, ..., z_Q) is the latent semantic space and Q is the size of the latent semantic space;
Being interpreted as of model so:
The probability that p (d) expression document occurs in data centralization, p (w|z) expression is when having determined semantic z, the probability that relevant word w occurs is respectively much, semantic distribution situation in document of p (z|d) expression, utilize above these definition, just can form a generation model, utilize it to produce new data:
(1) at first selects a document d according to distribution p (d) random sampling;
(2) behind the selected document, the semantic z that sampling selects document to express according to p (z|d);
(3) behind the selected semanteme, select the word of document according to p (w|z);
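The generative model above is fitted in practice with the EM algorithm. A minimal pure-Python sketch of PLSA's E and M steps (the random initialization and the iteration count are arbitrary choices, not from the patent):

```python
import random

def plsa(counts, Q, iters=30, seed=0):
    """EM for PLSA.  counts[s][r] = c(w_r, d_s) for document s, word r.

    Returns p_wz[q][r] ~ p(w_r | z_q) and p_zd[s][q] ~ p(z_q | d_s).
    """
    rng = random.Random(seed)
    m, n = len(counts), len(counts[0])
    norm = lambda v: [x / sum(v) for x in v]
    # random positive initialization of the two conditional distributions
    p_wz = [norm([rng.random() + 0.1 for _ in range(n)]) for _ in range(Q)]
    p_zd = [norm([rng.random() + 0.1 for _ in range(Q)]) for _ in range(m)]
    for _ in range(iters):
        new_wz = [[1e-12] * n for _ in range(Q)]
        new_zd = [[1e-12] * Q for _ in range(m)]
        for s in range(m):
            for r in range(n):
                c = counts[s][r]
                if not c:
                    continue
                # E-step: p(z | d_s, w_r) proportional to p(w_r|z) p(z|d_s)
                post = norm([p_wz[q][r] * p_zd[s][q] for q in range(Q)])
                for q in range(Q):
                    # M-step accumulation, weighted by the count c(w_r, d_s)
                    new_wz[q][r] += c * post[q]
                    new_zd[s][q] += c * post[q]
        p_wz = [norm(row) for row in new_wz]
        p_zd = [norm(row) for row in new_zd]
    return p_wz, p_zd
```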
Based on the above theory, the process of training the probabilistic latent semantic analysis topic model is:
Step 2.1: Apply the TFIDF algorithm to weight the elements of matrix N, generating a new co-occurrence matrix;
Step 2.2: Apply the probabilistic latent semantic analysis algorithm to the new co-occurrence matrix to learn two matrices: WZ = (p(w_r, z_q))_rq of size n × Q and DZ = (p(z_q, d_s))_qs of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z denotes the latent semantic space and Q its size; p(w_r, z_q) denotes the probability of the r-th feature word on the latent semantic z_q, and p(z_q, d_s) denotes the probability of the s-th document on the latent semantic z_q;
Step 3: Train a multi-class support vector machine (SVM, Support Vector Machine) classifier on the DZ produced by the probabilistic latent semantic analysis model of each layer, generating the supervised probabilistic latent semantic analysis model classifier of each layer; when level = 0, this is the top-layer classifier.
The experiments use the LIBSVM multi-class classifier written by Professor Chih-Jen Lin of National Taiwan University.
(3) Hierarchical classification mechanism
Research purpose: classify the turns using the supervised hierarchical probabilistic latent semantic analysis model.
Research background: the turns are classified hierarchically; if the predicted category of a turn belongs to a mother-node topic, classification continues; otherwise, classification stops and the turn's topic category is marked. The hierarchical classification process is as follows:
Step 1: Compute the language-feature aggregation vector of the current turn T_i and, using the WZ learned by the supervised hierarchical probabilistic latent semantic analysis algorithm, map it onto the latent semantic space Z, i.e. represent the aggregated language-feature content of T_i in terms of Z. The result is the probability distribution of T_i's aggregated language features over the latent semantic space Z; in other words, the feature-word vector W undergoes feature dimensionality reduction;
Step 2: Classify the topic category of T_i with the trained supervised hierarchical probabilistic latent semantic analysis model classifier;
Step 3: If the topic category of T_i belongs to mon_topics, increase level by 1 and go to Step 4; otherwise, mark the topic category of T_i as the identified topic category and finish;
Step 4: Classify the topic category of T_i with the corresponding lower-layer classifier, then go to Step 3.
3. Identification and attachment mechanism of turn-level events
Research purpose: attribute the turns to the corresponding events.
Research background: for the identification and tracking of concrete events, the invention adopts a mechanism that judges the beginning, continuation and end of an event by combining the topic category to which the turn belongs, the time difference between successive turns, and the tightness between the speakers of successive turns at the social-network level; its principle is shown in Fig. 4, and the concrete working mechanism is as follows:
Step1: search for the set T = {T_g, ..., T_i | 0 < g ≤ i} of turns that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and whose events have not ended;
Step2: if T contains only the element T_i, mark T_i as the initial sentence of a new event and finish; otherwise, let l = i - 1 and go to Step3;
Step3: judge whether T_i and T_l have the same subject category;
Step4: if T_i and T_l have the same subject category, assign T_i to the event that T_l belongs to and finish; otherwise, let l = l - 1 and go to Step5;
Step5: if l ≥ g, go to Step3; otherwise, go to Step6;
Step6: if the event of T_i is still empty, let l' = i - 1 and go to Step7; otherwise, finish;
Step7: compute the tightness d between T_i.id and T_l'.id at the social-network level;
Step8: if d > 0.5, assign T_i to the event that T_l' belongs to and finish; otherwise, let l' = l' - 1 and go to Step9;
Step9: if l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish.
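The Step1-Step9 mechanism above can be sketched as follows. The turn dictionaries, the tightness callable, and the event bookkeeping are illustrative assumptions, not the patent's data structures:

```python
# Turn-to-event splicing (sketch of Step1-Step9).
# turns: list of dicts with 'stamp', 'topic', 'id', 'event' (None = unassigned).

def assign_event(turns, i, Th, tightness, next_event_id):
    ti = turns[i]
    # Step1: earlier turns in [ti.stamp - Th, ti.stamp] whose events have not ended
    cand = [l for l in range(i)
            if ti["stamp"] - Th <= turns[l]["stamp"] <= ti["stamp"]
            and not turns[l].get("ended", False)]
    if not cand:                              # Step2: only T_i itself -> new event
        ti["event"] = next_event_id
        return ti["event"]
    g = cand[0]
    # Step3-Step5: scan backwards for a turn with the same subject category
    for l in range(i - 1, g - 1, -1):
        if l in cand and turns[l]["topic"] == ti["topic"]:
            ti["event"] = turns[l]["event"]   # Step4: join that turn's event
            return ti["event"]
    # Step6-Step9: fall back to speaker tightness on the social network
    for l in range(i - 1, g - 1, -1):
        if l in cand and tightness(ti["id"], turns[l]["id"]) > 0.5:  # Step8
            ti["event"] = turns[l]["event"]
            return ti["event"]
    ti["event"] = next_event_id               # Step9: start a new event
    return ti["event"]

turns = [
    {"stamp": 0, "topic": "a", "id": "u1", "event": 1},
    {"stamp": 5, "topic": "b", "id": "u2", "event": 2},
    {"stamp": 8, "topic": "a", "id": "u3", "event": None},
]
result = assign_event(turns, 2, Th=10, tightness=lambda a, b: 0.0, next_event_id=99)
```

Here the current turn shares a topic with the first turn inside the window, so it joins event 1 without ever consulting the tightness fallback.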
The social-network tightness d is computed from the following quantities: I(T_i.id) denotes the total in-degree of T_i.id, O(T_i.id) denotes the total out-degree of T_i.id, and likewise for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) denotes the number of times T_i.id spoke to T_{i-1}.id plus the number of times T_{i-1}.id spoke to T_i.id. In-degree and out-degree are accumulated over all historical data, and the social-network tightness is updated once per month.
To make the computation easier to understand, consider the example in Figure 5: the raw data is converted into a directed graph, from which the tightness of the social network between A and B is computed.
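The quantities above can be computed from a directed speaking graph as sketched below. The normalization used here, mutual talk count over the sum of both users' total degrees, is an assumed reconstruction; the patent presents the exact formula only as a figure:

```python
# Social-network tightness from a directed speaking graph (sketch).
# The denominator (sum of both users' in- and out-degrees) is an assumption.

def tightness(edges, a, b):
    """edges: dict (speaker, listener) -> count of speaking turns."""
    out_a = sum(c for (s, t), c in edges.items() if s == a)  # O(a)
    in_a = sum(c for (s, t), c in edges.items() if t == a)   # I(a)
    out_b = sum(c for (s, t), c in edges.items() if s == b)  # O(b)
    in_b = sum(c for (s, t), c in edges.items() if t == b)   # I(b)
    io_ab = edges.get((a, b), 0) + edges.get((b, a), 0)      # IO(a, b)
    denom = out_a + in_a + out_b + in_b
    return io_ab / denom if denom else 0.0

edges = {("A", "B"): 3, ("B", "A"): 1, ("A", "C"): 2}
d_ab = tightness(edges, "A", "B")
```

With these counts A and B exchanged 4 turns out of a combined degree of 10, so the assumed measure gives 0.4.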
Claims (2)
1. An event identification and tracking method oriented to instant interactive text, characterized by comprising the following steps:
First step: turn-level subject category classification stage:
(1) In instant interactive text, each single user input (a speech) is taken as a turn, represented as a five-tuple:
T_i = (i, id, role, stamp, content)
where T_i denotes the i-th turn and i is a positive integer; id is the unique identifier distinguishing the speaker; role is the speaker's role, with two classes: speaker and recipient; stamp is the timestamp at which the turn occurred; content is all the text spoken in the turn.
Thus T_i.stamp denotes the time at which the i-th turn occurred, and T_i.content denotes the content of the i-th turn. The interactive text consists of turns coming from the same chatroom or discussion group.
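The five-tuple above translates directly into a small data structure; field names follow the patent's notation, and the sample values are illustrative:

```python
# The turn five-tuple T_i = (i, id, role, stamp, content) as a Python structure.
from dataclasses import dataclass

@dataclass
class Turn:
    i: int          # sequence number of the turn (positive integer)
    id: str         # unique identifier of the speaker
    role: str       # "speaker" or "recipient"
    stamp: float    # timestamp at which the turn occurred
    content: str    # all text spoken in this turn

t = Turn(i=1, id="u42", role="speaker", stamp=1318665600.0, content="hello world")
```

Attribute access then matches the patent's dotted notation, e.g. `t.stamp` for T_i.stamp and `t.content` for T_i.content.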
(2) Pre-process the content T_i.content of the current turn T_i, extract the feature words in it according to the feature lexicon, and compute the language feature vector, where w_ih (0 < h ≤ n) is the number of times the h-th feature word occurs in T_i.content and n is the number of feature words. The feature lexicon is extracted from the training data.
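Step (2) is a simple occurrence count against the lexicon. The toy lexicon and whitespace tokenization below are illustrative stand-ins for the lexicon extracted from training data and the text pre-processing:

```python
# Language feature vector W_i = (w_i1, ..., w_in): w_ih counts how often
# the h-th feature word occurs in T_i.content.

feature_lexicon = ["game", "goal", "song"]   # hypothetical lexicon

def language_feature_vector(content, lexicon):
    tokens = content.lower().split()
    return [tokens.count(w) for w in lexicon]

vec = language_feature_vector("goal goal game tonight", feature_lexicon)
```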
(3) If turn T_i is the first turn to appear in the system, i.e. T_1, go to (5); otherwise, go to (4).
(4) Compute the adaptive language feature aggregation vector of turn T_i, whose h'-th element (0 < h' ≤ n) is the number of times the h'-th feature word occurs in this language feature aggregation, with n the number of feature words.
(5) Classify the turn-level subject category using the supervised hierarchical probabilistic latent semantic analysis model.
Second step: turn-level event identification and tracking stage:
(1) Judge whether the current turn T_i is the beginning, continuation, or end of an event, based on the subject category the turn belongs to, the time difference between successive turns, and the tightness, at the social-network level, between the speakers of successive turns.
(2) If turn T_i is an event-ending statement, i.e. a complete event has been formed, mark T_i as a turn that ends an event; otherwise, mark it as a turn that does not end an event.
(3) Judge whether the periodic update time has arrived; if it has, update the supervised hierarchical probabilistic latent semantic analysis model; otherwise, finish. The periodic update means that at the end of each month the newly identified complete events are added to the training set and the model is retrained.
The adaptive language feature aggregation vector of step (4) of the first step is computed as follows:
Step1: after the current turn T_i occurs, compute the frequency V(T_i) of turns appearing in the time interval [T_i.stamp - ΔT, T_i.stamp], where C(T_1.stamp, T_i.stamp) denotes the total number of turns occurring in the interval [T_1.stamp, T_i.stamp], C(T_i.stamp - ΔT, T_i.stamp) denotes the total number of turns occurring in the interval [T_i.stamp - ΔT, T_i.stamp], and ΔT is a fixed time interval initialized to ΔT = 1 hour.
Step2: determine the size of the adaptive time threshold Th: first compute Th' from the fluctuation of the turn-frequency series, then set Th = Th', i.e. update the time threshold. At initialization Δv is set to 0.3 and the threshold Th = 6 hours; this scheme achieves the goal of adaptively changing the size of the time threshold Th.
Step3: the language feature aggregation vector of T_i is then the sum of the language feature vectors of all turns occurring in the time interval [T_i.stamp - Th, T_i.stamp], that is:
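Step1 and Step3 can be sketched as below. V(T_i) is taken here as the fraction of all turns so far that fall in the last ΔT window, and the threshold-update rule (halve or double Th when V moves by more than Δv) is an assumed reconstruction, since the patent gives the Th' formula only as a figure:

```python
# Adaptive language feature aggregation (sketch).

DELTA_T = 3600.0          # fixed interval, initialized to 1 hour
DELTA_V = 0.3             # initial fluctuation tolerance (delta v)
TH_INIT = 6 * 3600.0      # initial time threshold Th = 6 hours

def turn_frequency(stamps, i, delta_t=DELTA_T):
    """V(T_i): turns in [stamp_i - delta_t, stamp_i] over all turns up to i."""
    s_i = stamps[i]
    recent = sum(1 for s in stamps[: i + 1] if s_i - delta_t <= s <= s_i)
    return recent / (i + 1)

def update_threshold(th, v_prev, v_curr, delta_v=DELTA_V):
    """Assumed rule: shrink Th when turn frequency jumps, grow it when it drops."""
    if v_curr - v_prev > delta_v:
        return th / 2
    if v_prev - v_curr > delta_v:
        return th * 2
    return th

def aggregate_features(turn_vecs, stamps, i, th):
    """Step3: sum the feature vectors of all turns in [stamp_i - th, stamp_i]."""
    s_i = stamps[i]
    window = [v for v, s in zip(turn_vecs[: i + 1], stamps[: i + 1])
              if s_i - th <= s <= s_i]
    return [sum(col) for col in zip(*window)]

stamps = [0.0, 1800.0, 3000.0, 3600.0]
v = turn_frequency(stamps, 3)
agg = aggregate_features([[1, 0], [0, 1], [2, 2], [1, 1]], stamps, 3, th=1000.0)
```

With `th=1000.0` only the last two turns fall in the window, so their vectors sum element-wise.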
The supervised hierarchical probabilistic latent semantic analysis model of step (5) of the first step and step (3) of the second step is trained as follows:
Step1: organize the training data set into a hierarchical classification according to the hierarchical nature of the topics, forming a tree structure after topic layering, where level denotes the layer a subject category resides in, k indexes the k-th sub-category within its parent topic, and a_k denotes the number of sub-topics the current topic contains. If a_k = 0, the topic is a leaf node of the topic tree; otherwise it is a mother node containing sub-topics. Here mon_topics denotes the set of nodes that contain sub-topics, and leaf_topics denotes the set of leaf nodes. When level = 0, the notation denotes the top-layer subject categories, with a_0 the number of top-layer subject categories.
The training data is then organized as follows:
Step1.1: generate the feature term vector W, as follows:
Step1.1.1: count the distinct words occurring in the training data set; after deleting stop words, this forms a feature word vector in which each element records the number of times the f-th feature word occurs in the training data. The stop words include: symbols, auxiliary words, prepositions, conjunctions, interjections, onomatopoeia, and numbers.
Step1.1.2: use the TF-IDF algorithm to compute term weights, sort the words by weight in descending order, and delete the feature words with weight less than 0.1, obtaining the feature word vector W = {w_1, w_2, ..., w_f', ..., w_n}, where w_f' is the weight of the f'-th feature word in the training data and n is the number of feature words.
Step1.2: generate the co-occurrence matrix N, as follows:
Step1.2.1: form a document collection from all documents in the training data that belong to the current topic, with m_k the number of documents.
Step1.2.2: the word-document co-occurrence matrix N of dimension n × m_k is then N = (c(w_r, d_s))_rs, where c(w_r, d_s) is the number of times the r-th feature word occurs in the s-th document.
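Step1.2.2 is a straightforward count matrix; whitespace tokenization here is an illustrative assumption:

```python
# Word-by-document co-occurrence matrix N: N[r][s] = c(w_r, d_s),
# the count of feature word r in document s of the current topic.

def cooccurrence_matrix(feature_words, docs):
    docs_tok = [d.lower().split() for d in docs]
    return [[d.count(w) for d in docs_tok] for w in feature_words]

N = cooccurrence_matrix(["goal", "song"], ["goal goal song", "song"])
```

Rows index the n feature words and columns index the m_k documents, matching the n × m_k dimension stated above.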
Step2: on this training data set, train the corresponding probabilistic latent semantic analysis models layer by layer in a top-down manner, as follows:
Step2.1: use the TF-IDF algorithm to weight the elements of matrix N, generating a new co-occurrence matrix.
Step2.2: apply the probabilistic latent semantic analysis algorithm to the weighted co-occurrence matrix, generating two matrices: WZ = (p(w_r, z_q))_rq of size n × Q and DZ = (p(z_q, d_s))_qs of size Q × m_k, where z_q ∈ Z = (z_1, z_2, ..., z_Q), Z is the latent semantic space and Q is its size; p(w_r, z_q) is the probability of the r-th feature word on the latent semantic factor z_q, and p(z_q, d_s) is the probability of the s-th document on the latent semantic factor z_q.
Step3: use a multi-class support vector machine (SVM) classifier to train, for each layer, on the DZ matrix produced by that layer's probabilistic latent semantic analysis model, generating the supervised probabilistic latent semantic analysis classifier for each layer; the level = 0 classifier is the top-layer classifier.
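The PLSA factorization of Step2.2 can be sketched with a minimal EM loop over the count matrix, yielding the WZ and DZ factors whose document columns would then feed the per-layer SVM of Step3. Random initialization, the fixed iteration count, and the absence of smoothing are simplifications, not the patent's exact training setup:

```python
# Minimal PLSA via EM on a word-by-document count matrix N (n x m).
# Returns p_wz (n x Q, columns p(w|z)) and p_zd (Q x m, columns p(z|d)).
import random

def plsa(N, Q, iters=20, seed=0):
    rng = random.Random(seed)
    n, m = len(N), len(N[0])
    p_wz = [[rng.random() for _ in range(Q)] for _ in range(n)]
    p_zd = [[rng.random() for _ in range(m)] for _ in range(Q)]
    for mat, rows, cols in ((p_wz, n, Q), (p_zd, Q, m)):
        for c in range(cols):                 # normalize each column to sum to 1
            s = sum(mat[r][c] for r in range(rows))
            for r in range(rows):
                mat[r][c] /= s
    for _ in range(iters):
        nw = [[0.0] * Q for _ in range(n)]
        nd = [[0.0] * m for _ in range(Q)]
        for r in range(n):
            for s in range(m):
                if N[r][s] == 0:
                    continue
                post = [p_wz[r][q] * p_zd[q][s] for q in range(Q)]  # E-step
                tot = sum(post) or 1.0
                for q in range(Q):            # M-step expected counts
                    c = N[r][s] * post[q] / tot
                    nw[r][q] += c
                    nd[q][s] += c
        for q in range(Q):
            sw = sum(nw[r][q] for r in range(n)) or 1.0
            for r in range(n):
                p_wz[r][q] = nw[r][q] / sw    # re-estimate p(w|z)
            for s in range(m):
                sd = sum(nd[qq][s] for qq in range(Q)) or 1.0
                p_zd[q][s] = nd[q][s] / sd    # re-estimate p(z|d)
    return p_wz, p_zd

WZ, DZ = plsa([[2, 0], [0, 3]], Q=2)
```

Each column of DZ is a document's distribution over the latent factors Z; collecting these columns, labeled by topic, gives the training set for the multi-class SVM of each layer.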
The process by which step (5) of the first step classifies the turn-level subject category using the supervised hierarchical probabilistic latent semantic analysis model is:
Step1: compute the language feature aggregation vector of the current turn T_i; using the WZ matrix learned by the supervised hierarchical probabilistic latent semantic analysis algorithm, map this vector onto the latent semantic space Z, i.e. represent the aggregated language feature content of T_i by its probability distribution on Z. In other words, the feature word vector W undergoes feature dimensionality reduction.
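The mapping of Step1 can be sketched as a fold-in: project the aggregated word counts through WZ and renormalize to a distribution over Z. This simple linear projection is an assumed reading of the patent's mapping, which it does not spell out:

```python
# Map a turn's aggregated feature vector onto latent space Z via WZ = p(w, z).

def fold_in(feature_vec, WZ):
    """feature_vec: length-n counts; WZ: n x Q matrix."""
    Q = len(WZ[0])
    z = [sum(feature_vec[r] * WZ[r][q] for r in range(len(WZ)))
         for q in range(Q)]
    s = sum(z)
    return [v / s for v in z] if s else z    # normalize to a distribution on Z

zvec = fold_in([2, 0, 1], [[0.5, 0.0], [0.2, 0.3], [0.3, 0.7]])
```

The result is the turn's probability distribution over the Q latent factors, i.e. the dimensionality-reduced representation fed to the classifiers.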
Step2: use the trained supervised hierarchical probabilistic latent semantic analysis classifier to classify the subject category of T_i;
Step3: if the subject category of T_i belongs to mon_topics, increase level by 1 and go to Step4; otherwise, mark the subject category of T_i as the identified subject category and finish;
Step4: use the corresponding lower-level classifier to classify the subject category of T_i, then go to Step3.
The detailed process of step (1) of the second step is as follows:
Step1: search for the set of turns that occurred in the time interval [T_i.stamp - Th, T_i.stamp] and whose events have not ended;
Step2: if the set contains only the element T_i, mark T_i as the initial sentence of a new event and finish; otherwise, let l = i - 1 and go to Step3;
Step3: judge whether T_i and T_l have the same subject category;
Step4: if T_i and T_l have the same subject category, assign T_i to the event that T_l belongs to and finish; otherwise, let l = l - 1 and go to Step5;
Step5: if l ≥ g, go to Step3; otherwise, go to Step6;
Step6: if the event of T_i is still empty, let l' = i - 1 and go to Step7; otherwise, finish;
Step7: compute the tightness d between T_i.id and T_l'.id at the social-network level;
Step8: if d > 0.5, assign T_i to the event that T_l' belongs to and finish; otherwise, let l' = l' - 1 and go to Step9;
Step9: if l' ≥ g, go to Step7; otherwise, mark T_i as the initial sentence of a new event and finish.
2. The event identification and tracking method oriented to instant interactive text of claim 1, characterized in that the social-network tightness is computed from the following quantities: I(T_i.id) denotes the total in-degree of T_i.id, O(T_i.id) denotes the total out-degree of T_i.id, and likewise for T_{i-1}.id; IO(T_i.id, T_{i-1}.id) denotes the number of times T_i.id spoke to T_{i-1}.id plus the number of times T_{i-1}.id spoke to T_i.id. In-degree and out-degree are accumulated over all historical data, and the social-network tightness is updated once per month.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110312540 CN102411611B (en) | 2011-10-15 | 2011-10-15 | Instant interactive text oriented event identifying and tracking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102411611A CN102411611A (en) | 2012-04-11 |
CN102411611B true CN102411611B (en) | 2013-01-02 |
Family
ID=45913682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110312540 Expired - Fee Related CN102411611B (en) | 2011-10-15 | 2011-10-15 | Instant interactive text oriented event identifying and tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102411611B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156228B (en) * | 2014-04-01 | 2017-11-10 | 兰州工业学院 | A kind of embedded feature database of client filtering short message and update method |
US9547471B2 (en) * | 2014-07-03 | 2017-01-17 | Microsoft Technology Licensing, Llc | Generating computer responses to social conversational inputs |
US10460720B2 (en) | 2015-01-03 | 2019-10-29 | Microsoft Technology Licensing, Llc. | Generation of language understanding systems and methods |
CN104881399B (en) * | 2015-05-15 | 2017-10-27 | 中国科学院自动化研究所 | Event recognition method and system based on probability soft logic PSL |
CN106021508A (en) * | 2016-05-23 | 2016-10-12 | 武汉大学 | Sudden event emergency information mining method based on social media |
CN106844765B (en) * | 2017-02-22 | 2019-12-20 | 中国科学院自动化研究所 | Significant information detection method and device based on convolutional neural network |
CN107145516B (en) * | 2017-04-07 | 2021-03-19 | 北京捷通华声科技股份有限公司 | Text clustering method and system |
CN107862081B (en) * | 2017-11-29 | 2021-07-16 | 四川无声信息技术有限公司 | Network information source searching method and device and server |
CN110246049A (en) * | 2018-03-09 | 2019-09-17 | 北大方正集团有限公司 | Topic detecting method, device, equipment and readable storage medium storing program for executing |
CN108427752A (en) * | 2018-03-13 | 2018-08-21 | 浙江大学城市学院 | A kind of article meaning of one's words mask method using event based on isomery article |
CN113626573B (en) * | 2021-08-11 | 2022-09-27 | 北京深维智信科技有限公司 | Sales session objection and response extraction method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6424971B1 (en) * | 1999-10-29 | 2002-07-23 | International Business Machines Corporation | System and method for interactive classification and analysis of data |
CN1403959A (en) * | 2001-09-07 | 2003-03-19 | 联想(北京)有限公司 | Content filter based on text content characteristic similarity and theme correlation degree comparison |
CN1535433A (en) * | 2001-07-04 | 2004-10-06 | 库吉萨姆媒介公司 | Category based, extensible and interactive system for document retrieval |
JP2009146397A (en) * | 2007-11-19 | 2009-07-02 | Omron Corp | Important sentence extraction method, important sentence extraction device, important sentence extraction program and recording medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102411611B (en) | Instant interactive text oriented event identifying and tracking method | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN106294593B (en) | In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study | |
CN114064918B (en) | Multi-modal event knowledge graph construction method | |
CN110413986A (en) | A kind of text cluster multi-document auto-abstracting method and system improving term vector model | |
CN104679738B (en) | Internet hot words mining method and device | |
CN103984681A (en) | News event evolution analysis method based on time sequence distribution information and topic model | |
CN103226580A (en) | Interactive-text-oriented topic detection method | |
CN109670039A (en) | Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering | |
CN111723295B (en) | Content distribution method, device and storage medium | |
CN112307351A (en) | Model training and recommending method, device and equipment for user behavior | |
Huang et al. | A topic BiLSTM model for sentiment classification | |
CN105760499A (en) | Method for analyzing and predicting online public opinion based on LDA topic models | |
CN109063147A (en) | Online course forum content recommendation method and system based on text similarity | |
CN110457711A (en) | A kind of social media event topic recognition methods based on descriptor | |
CN104881399A (en) | Event identification method and system based on probability soft logic PSL | |
CN103412878A (en) | Document theme partitioning method based on domain knowledge map community structure | |
CN116578705A (en) | Microblog emotion classification method based on pre-training language model and integrated neural network | |
Vlad et al. | UPB@ DANKMEMES: Italian memes analysis-employing visual models and graph convolutional networks for meme identification and hate speech detection | |
Huang et al. | Multi-granular document-level sentiment topic analysis for online reviews | |
CN110795533A (en) | Long text-oriented theme detection method | |
CN103744958A (en) | Webpage classification algorithm based on distributed computation | |
CN113869054A (en) | Deep learning-based electric power field project feature identification method | |
Pathuri et al. | Feature based sentimental analysis for prediction of mobile reviews using hybrid bag-boost algorithm | |
CN113435190B (en) | Chapter relation extraction method integrating multilevel information extraction and noise reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20130102 Termination date: 20151015 |
EXPY | Termination of patent right or utility model |