CN105354186A

CN105354186A - News event extraction method and system

Info

Publication number: CN105354186A
Application number: CN201510749707.7A
Authority: CN
Inventors: 蒋昌俊; 闫春钢; 陈闳中; 丁志军; 吴亚光
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2015-11-05
Filing date: 2015-11-05
Publication date: 2016-02-24
Also published as: WO2017075912A1

Abstract

A news event extraction method and system are provided. The news event extraction method comprises: according to a query word, acquiring a news sentence set comprising the query word from a news corpus; for news sentences with accurate time, extracting the time thereof; classifying news sentences with same time into a same time stamp container; for each time stamp container, collecting statistics on frequency of occurrence of each word in news sentences in the time stamp container, and establishing a corresponding feature vector; for a news sentence without accurate time, and for different time stamp containers, establishing a phrase vector of same dimensions as the feature vectors of the time stamp containers, and calculating similarity between the phrase vector and the feature vectors of the time stamp containers; and if a greatest value of the calculated similarity is larger than a set threshold, adding the news sentence without accurate time to a time stamp container corresponding to the highest similarity. According to the method and system provided by the present invention, sentences without accurate time can be correctly classified.

Description

News event extraction method and system

Technical field

The present invention relates to data processing technology, and in particular to a news event extraction method and system.

Background art

News reports are factual, fresh, important, and highly time-sensitive, and they convey a large amount of information in a short space. Because the Internet is open, online news is heterogeneous, redundant, and constantly changing: information describing the same news item is usually scattered across different websites and presented in different forms. News event extraction is one of the most important tools for finding the information a user needs quickly and accurately in this disordered flood of data. Existing unsupervised news event extraction methods generally discard news sentences that contain no time, and determine the importance of an event from the frequency of the extracted news events. However, a considerable share of news sentences omit the specific time because they refer by default to the most recent news. These news events are therefore not extracted by existing techniques, which easily biases the extraction of major events and reduces the accuracy of the event importance ranking.

In view of this, how to include news without a time in news event extraction, so as to reduce extraction bias, has become an urgent problem for those skilled in the art.

Summary of the invention

In view of the above shortcomings of the prior art, the object of the present invention is to provide a news event extraction method and system that solve the prior-art problem of inaccurate event importance ranking caused by excluding news without a time during news event extraction.

To achieve the above object and other related objects, the present invention provides a news event extraction method comprising: obtaining, according to a query word, a set of news sentences containing the query word from a news corpus; dividing the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time; for the news sentences containing an exact time, extracting the time; establishing a plurality of timestamp containers for the different times, and placing news sentences with the same time into the same timestamp container; for each timestamp container, counting the frequency of occurrence of each word in its news sentences and establishing a corresponding feature vector; for a news sentence not containing an exact time, establishing from the segmented words of that sentence, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container, and calculating the similarity between the phrase vector and the feature vector of the container; and if the maximum of the calculated similarities is greater than a set threshold, adding the news sentence not containing an exact time to the timestamp container with the highest similarity.

Optionally, the similarity comprises cosine similarity.

Optionally, the news event extraction method further comprises: for each timestamp container, counting the number of sentences in the timestamp container that contain the query word.

Optionally, the news event extraction method further comprises: applying the above news event extraction method to different query words, counting the number of sentences for the different query words in each timestamp container, and obtaining a ranking of the query words.

Optionally, the news event extraction method further comprises: modifying the threshold.

Optionally, the data in the news corpus is obtained by splitting the content of collected news documents into news sentences and storing the news sentences in the news corpus.

Optionally, the timestamp container is

$$V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$$

where $t_i$ is a time variable; $C(q)$ denotes the set of sentences in the news corpus $C$ that match the query word $q$; and $s.t$ denotes the time label of sentence $s$.

Optionally, the feature phrase is $W = \{w_1, \ldots, w_k\}$, where $w_j$ denotes each word in the phrases related to $q$; the feature vector $\vec{W} = (F_{w_1}, \ldots, F_{w_k})$ gives the document term frequency of each word $w_j$, where the document term frequency $F_{w_i}$ is the frequency with which the i-th word occurs in the document and $k$ is the number of words contained in the document.

Optionally, the phrase vector is $\vec{s}_i = (a_1, \ldots, a_k)$, the similarity comprises cosine similarity, and the cosine similarity is

$$\mathrm{Similarity}(\vec{s}_i, \vec{W}) = \frac{\sum_{i=1}^{k} a_i \times F_{w_i}}{\left(\sum_{i=1}^{k} a_i^{2}\right)^{1/2} \times \left(\sum_{i=1}^{k} F_{w_i}^{2}\right)^{1/2}}$$

Optionally, the query word is determined according to a news event.

The present invention also provides a news event extraction system comprising: a news sentence acquisition module, configured to obtain, according to a query word, a set of news sentences containing the query word from a news corpus, and to divide the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time; a timed news processing module, configured to extract the time from the news sentences containing an exact time, to establish a plurality of timestamp containers for the different times and place news sentences with the same time into the same timestamp container, and, for each timestamp container, to count the frequency of occurrence of each word in its news sentences and establish a corresponding feature vector; and an untimed news processing module, configured to establish from the segmented words of a news sentence not containing an exact time, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container, to calculate the similarity between the phrase vector and the feature vector of the container, and, if the maximum of the calculated similarities is greater than a set threshold, to add the news sentence not containing an exact time to the timestamp container with the highest similarity.

Optionally, the news sentence acquisition module is further configured to split the content of collected news documents into news sentences and store the news sentences in the news corpus.

Optionally, the similarity comprises cosine similarity.

Optionally, the news event extraction system further comprises a news event statistics module, configured to count the number of sentences for the different query words in each timestamp container and obtain a ranking of the query words.

As described above, the news event extraction method and system of the present invention have the following beneficial effects: (1) sentences that contain no time element can be correctly classified, making the ranking of news event importance more accurate; (2) the number of extracted sentences is increased, making the differences in importance between news events more pronounced; (3) the timestamp containers are used to reject unrelated sentences, reducing interference from other news when ranking.

Brief description of the drawings

Fig. 1 is a flow chart of an embodiment of the news event extraction method of the present invention.

Fig. 2 is an extraction flow chart of an embodiment of the news event extraction method of the present invention.

Fig. 3 is a flow chart of classifying sentences without an exact time in an embodiment of the news event extraction method of the present invention.

Fig. 4 is a module diagram of an embodiment of the news event extraction system of the present invention.

Description of reference numerals

1 News event extraction system

11 News sentence acquisition module

12 Timed news processing module

13 Untimed news processing module

S1 ~ S3 Steps

Detailed description

Embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention.

It should be noted that the drawings provided with the embodiments illustrate the basic concept of the present invention only schematically. They show only the components related to the present invention and are not drawn according to the number, shape, and size of the components in an actual implementation. In an actual implementation the form, quantity, and proportion of each component may vary freely, and the component layout may be more complex.

The present invention provides a news event extraction method. In one embodiment, as shown in Fig. 1, the news event extraction method comprises:

Step S1: obtain, according to a query word, a set of news sentences containing the query word from a news corpus, and divide the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time. The data in the news corpus is obtained by splitting the content of collected news documents into news sentences and storing the news sentences in the news corpus. The query word may be determined according to a news event. The symbol "q" denotes a query word, the symbol "C" denotes a corpus, and the symbol "s" denotes a sentence. In one embodiment, the query word may be determined from events that receive wide attention in reports from many sources and are mentioned most often.

Step S2: for the news sentences containing an exact time, extract the time; establish a plurality of timestamp containers for the different times, and place news sentences with the same time into the same timestamp container; for each timestamp container, count the frequency of occurrence of each word in its news sentences and establish a corresponding feature vector. In one embodiment, the timestamp container is $V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; $C(q)$ denotes the set of sentences in the news corpus $C$ that match the query word $q$; and $s.t$ denotes the time label of sentence $s$. The feature phrase is $W = \{w_1, \ldots, w_k\}$, where $w_j$ denotes each word in the phrases related to $q$; the feature vector $\vec{W} = (F_{w_1}, \ldots, F_{w_k})$ gives the document term frequency of each word $w_j$, where the document term frequency $F_{w_i}$ is the frequency with which the i-th word occurs in the document and $k$ is the number of words contained in the document.
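Step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the sentences have already been segmented into word lists and tagged with a date string, and the function names `build_containers` and `feature_vector` are hypothetical.

```python
from collections import Counter, defaultdict


def build_containers(dated_sentences):
    """Group (date, tokens) pairs into timestamp containers keyed by date."""
    containers = defaultdict(list)
    for day, tokens in dated_sentences:
        containers[day].append(tokens)
    return dict(containers)


def feature_vector(container):
    """Count how often each word occurs across all sentences of one container."""
    counts = Counter()
    for tokens in container:
        counts.update(tokens)
    return counts


# Toy input: English tokens stand in for segmented Chinese words (assumption).
dated = [
    ("2013-05-12", ["earthquake", "hits", "city"]),
    ("2013-05-12", ["earthquake", "relief", "arrives"]),
    ("2013-05-13", ["president", "visits", "city"]),
]
containers = build_containers(dated)
features = {day: feature_vector(c) for day, c in containers.items()}
```

Each entry of `features` plays the role of the feature vector of one timestamp container, with one dimension per distinct word.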

Step S3: for a news sentence not containing an exact time, establish from the segmented words of that sentence, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container, and calculate the similarity between the phrase vector and the feature vector of the container; if the maximum of the calculated similarities is greater than a set threshold, add the news sentence not containing an exact time to the timestamp container with the highest similarity. The similarity comprises cosine similarity. The news event extraction method further comprises modifying the threshold; the user can modify the threshold according to actual conditions. In one embodiment, the phrase vector is $\vec{s}_i = (a_1, \ldots, a_k)$, the similarity comprises cosine similarity, and the cosine similarity is

$$\mathrm{Similarity}(\vec{s}_i, \vec{W}) = \frac{\sum_{i=1}^{k} a_i \times F_{w_i}}{\left(\sum_{i=1}^{k} a_i^{2}\right)^{1/2} \times \left(\sum_{i=1}^{k} F_{w_i}^{2}\right)^{1/2}}$$

The maximum of the similarity, that is, the maximum similarity, is

$$\mathrm{MaxSimilarity}_{V_{t_i}} = \max\left(\mathrm{Similarity}(\vec{s}_i, \vec{W})\right), \quad s_i \in (V_{t_i}, V_{t_0})$$

If the maximum of the calculated similarities is greater than the set threshold, the news sentence not containing an exact time is added to the timestamp container with the highest similarity. A sentence added to a timestamp container is called a valid sentence; for a valid sentence, the time of sentence $s$ is $t_i$, and the threshold is adjusted according to actual conditions. Each timestamp container may have corresponding valid sentences, and all valid sentences of the timestamp container with time $t_i$ are called the valid sentence set.
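The assignment in step S3 can be sketched as follows. This is a sketch under stated assumptions: sentences are pre-segmented word lists, each feature vector is a word-to-frequency mapping, and the names `cosine_similarity` and `assign` as well as the threshold value 0.5 are illustrative choices that do not come from the source.

```python
import math
from collections import Counter


def cosine_similarity(tokens, feature):
    """Cosine between a sentence's word counts and a container's feature vector."""
    sent = Counter(tokens)
    # Project the sentence onto the container's vocabulary (same dimension k).
    dot = sum(sent[w] * f for w, f in feature.items())
    norm_s = math.sqrt(sum(sent[w] ** 2 for w in feature))
    norm_f = math.sqrt(sum(f ** 2 for f in feature.values()))
    if norm_s == 0 or norm_f == 0:
        return 0.0
    return dot / (norm_s * norm_f)


def assign(tokens, features, threshold):
    """Return the date of the most similar container, or None if not above threshold."""
    best_day, best_sim = None, 0.0
    for day, feature in features.items():
        sim = cosine_similarity(tokens, feature)
        if sim > best_sim:
            best_day, best_sim = day, sim
    return best_day if best_sim > threshold else None


features = {
    "2013-05-12": Counter({"earthquake": 2, "relief": 1, "city": 1}),
    "2013-05-13": Counter({"president": 1, "visits": 1, "city": 1}),
}
placed = assign(["earthquake", "relief"], features, threshold=0.5)
unplaced = assign(["weather", "report"], features, threshold=0.5)
```

The denominator multiplies the two vector norms, which is the standard definition of cosine similarity. A sentence that shares no word with any container gets similarity 0 and stays unassigned.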

In one embodiment, the news event extraction method further comprises: for each timestamp container, counting the number of sentences in the timestamp container that contain the query word. In one embodiment, the news event extraction method further comprises: applying the above news event extraction method to different query words, counting the number of sentences for the different query words in each timestamp container, and obtaining a ranking of the query words.
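The counting and ranking described above might look like the following sketch. It assumes one dictionary of timestamp containers per query word and ranks the query words within each date by sentence count; the function name `rank_queries_by_date` and the sample data are hypothetical, and the source does not specify the exact ranking procedure.

```python
def rank_queries_by_date(containers_by_query):
    """For each date, list (query, sentence_count) pairs, largest count first."""
    per_date = {}
    for query, containers in containers_by_query.items():
        for day, sentences in containers.items():
            per_date.setdefault(day, []).append((query, len(sentences)))
    return {
        day: sorted(pairs, key=lambda p: p[1], reverse=True)
        for day, pairs in per_date.items()
    }


# Placeholder sentence ids; real containers would hold sentences.
containers_by_query = {
    "Obama": {"2013-05-12": ["s1", "s2"], "2013-05-13": ["s3"]},
    "earthquake": {"2013-05-12": ["s4", "s5", "s6"]},
}
ranking = rank_queries_by_date(containers_by_query)
```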

In one embodiment, the overall framework of the news event extraction method is shown in Fig. 2, and the process of classifying sentences without an exact time is shown in Fig. 3. The process comprises: storing the collected news corpus in a database in the form (title, time, content). The content of each document is then split into sentences at Chinese sentence-ending punctuation such as "。", "！", and "？", and the sentences are likewise stored in the form (title, time, content (sentence)). Sentences in the corpus fall into three classes:
(1) Sentences containing a precise date: AD (Absolute Date) denotes a complete time expression accurate to the day, such as 2010.10.1 or May 12, 2008, which can be converted directly to the form YYYY-MM-DD.
(2) Sentences containing a date relative to the publication time: DCT-RD (date of creation, relative date) denotes an expression that has no precise date of its own but whose date can be obtained from the document publication time through semantic analysis and then converted to the form YYYY-MM-DD.
(3) Sentences containing no precise date: UD (Underspecified Date) denotes an expression from which no precise date can be obtained and which cannot be converted to the form YYYY-MM-DD.
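The sentence-splitting step can be sketched as follows. This is a minimal illustration: the set of terminators (full-width 。！？ plus ASCII ! and ?) and the record layout with the keys `title`, `time`, and `sentence` are assumptions made for the example.

```python
import re

# A sentence is a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(content):
    """Split document content into sentences, keeping each terminator."""
    return [s.strip() for s in _SENTENCE.findall(content) if s.strip()]


def to_sentence_records(document):
    """Turn one (title, time, content) record into per-sentence records."""
    return [
        {"title": document["title"], "time": document["time"], "sentence": s}
        for s in split_sentences(document["content"])
    ]


# Sample content: "An earthquake occurred today. The rescue team has arrived!"
doc = {
    "title": "地震",
    "time": "2013-05-12",
    "content": "今天发生地震。救援队已经到达！",
}
records = to_sentence_records(doc)
```

The ASCII period is deliberately left out of the terminator set so that dates written like 2010.10.1 are not split.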

Sentence-level material is then retrieved with the query word, and the following algorithm extracts the sentence times and classifies the sentences by timestamp:
(1) Establish a timestamp container $V_0$ for sentences without a precise date.
(2) Match each $s_i \in S(q|c)$ against regex (a regular expression for times) to obtain the precise date contained in sentence $s_i$. If no such date exists, match $s_i$ against R-Words (for example "a year ago" or "a week later") to obtain DateDistance, which denotes the distance of the date from the DCT. If DateDistance does not exist, put $s_i$ into $V_0$. If DateDistance exists, compute the date from DateDistance and the DCT (for example, if the report is dated May 12, 2013, then "a year ago" is May 12, 2012).
(3) If $V_t$ (the timestamp container whose date is $t$) exists, put $s_i$ into $V_t$; if it does not exist, create $V_t$ and put $s_i$ into it.
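The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the pattern `ABSOLUTE` and the relative-word table `R_WORDS` (with English phrases standing in for the Chinese ones) are hypothetical, and a real system would need a much richer set of time expressions.

```python
import re
from datetime import date, timedelta

# Absolute dates such as 2010.10.1, 2010-10-01 or 2008年5月12日 (assumed pattern).
ABSOLUTE = re.compile(r"(\d{4})[-./年](\d{1,2})[-./月](\d{1,2})")
# Hypothetical relative words: phrase -> (year offset, day offset) from the DCT.
R_WORDS = {"a year ago": (-1, 0), "yesterday": (0, -1), "a week later": (0, 7)}


def extract_date(sentence, dct):
    """Return YYYY-MM-DD for the sentence, or None if the date is underspecified."""
    m = ABSOLUTE.search(sentence)
    if m:
        year, month, day = map(int, m.groups())
        return date(year, month, day).isoformat()
    for phrase, (years, days) in R_WORDS.items():
        if phrase in sentence:
            # Note: replace() raises ValueError when the DCT is February 29.
            shifted = dct.replace(year=dct.year + years) + timedelta(days=days)
            return shifted.isoformat()
    return None


def sort_into_containers(sentences, dct):
    """Place dated sentences into containers V_t and the rest into V_0."""
    containers, v0 = {}, []
    for s in sentences:
        t = extract_date(s, dct)
        if t is None:
            v0.append(s)
        else:
            containers.setdefault(t, []).append(s)
    return containers, v0


dct = date(2013, 5, 12)
containers, v0 = sort_into_containers(
    ["Talks were held on 2010.10.1", "The quake struck a year ago", "Officials commented"],
    dct,
)
```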

Next, the similarity between each sentence and the feature phrase is calculated. This serves two purposes: first, to place some of the sentences without an exact time into the correct timestamp container; second, to reject sentences in each timestamp container that are unrelated to its features. The algorithm steps are as follows:
(1) Segment all sentences $s_i \in V_t$ into words, count the frequency of occurrence of each word $w_i$, and establish the feature vector $\vec{W}$.
(2) For each sentence $s_i \in V_t$, establish a vector $\vec{s}_i$ of dimension $k$ (the same as the dimension of the feature vector of the timestamp container).
(3) Calculate the cosine similarity between each sentence $s_i \in V_t$ and the feature vector $\vec{W}$.
(4) Find the sentence $s_w$ with the highest similarity to the feature vector, and record this similarity as

$$\mathrm{MaxSimilarity} = \max\left(\mathrm{Similarity}(\vec{s}_i, \vec{W})\right)$$

(5) Set a threshold. For $s_i \in V_t$, if the similarity of $s_i$ to the feature vector is below the threshold, remove $s_i$ from $V_t$; for $s_i \in V_0$, if the similarity is above the threshold, put $s_i$ into $V_t$.

In practice, experience shows that when MaxSimilarity is close to 1, a sentence whose similarity to the feature phrase is considerably lower may still describe the same event, so the threshold can be set lower. When MaxSimilarity is far from 1, the threshold needs to be set higher so that rejection is more accurate. The similarity threshold is obtained by repeated experiment and manual inspection, and can be modified as the situation requires.
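Step (5) can be sketched for a single container as follows. This is a simplified illustration under several assumptions: sentences are pre-segmented word lists, the feature vector is built before pruning, the function names `cosine` and `refine` and the threshold 0.6 are hypothetical, and the competition between several containers for the same undated sentence is left out.

```python
import math
from collections import Counter


def cosine(tokens, feature):
    """Cosine similarity of a sentence against a container's feature vector."""
    sent = Counter(tokens)
    dot = sum(sent[w] * f for w, f in feature.items())
    norm_s = math.sqrt(sum(sent[w] ** 2 for w in feature))
    norm_f = math.sqrt(sum(f * f for f in feature.values()))
    return dot / (norm_s * norm_f) if norm_s and norm_f else 0.0


def refine(container, undated, threshold):
    """Drop weakly related sentences from V_t and pull related ones from V_0."""
    feature = Counter(w for tokens in container for w in tokens)
    kept = [t for t in container if cosine(t, feature) >= threshold]
    added = [t for t in undated if cosine(t, feature) > threshold]
    remaining = [t for t in undated if cosine(t, feature) <= threshold]
    return kept + added, remaining


v_t = [
    ["earthquake", "hits", "city"],
    ["earthquake", "relief", "arrives"],
    ["stock", "market", "falls"],
]
v_0 = [["earthquake", "relief"], ["weather", "report"]]
new_v_t, new_v_0 = refine(v_t, v_0, threshold=0.6)
```

In this toy data the stock-market sentence scores about 0.52 and is removed, while the undated sentence about earthquake relief scores about 0.64 and is added.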

Finally, the number of sentences is counted. The query words are sorted by their corresponding sentence counts, so that the events corresponding to the query words are shown on a time axis in order of importance. For example, searching the database for documents whose text contains "Obama" yields 6,418 records. Splitting these records into sentences yields 20,468 distinct sentences containing "Obama". Extracting the time of the sentences then yields 3,209 sentences with an exact time. Comparing the times of these 3,209 sentences finally yields 158 distinct timestamps, and the sentences are placed into the corresponding timestamp containers. The same is done for the keyword "earthquake", with the results shown in the table below. It can be seen that after sentences without a time element are correctly classified, on average about 14.6% of the events (the mean of the two proportions in the table) obtain a corrected rank.

Keyword	Obama	Earthquake
Total number of events	158	197
Number of events whose rank changed	22	30
Proportion	13.9%	15.2%
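The proportions in the table and the 14.6% average quoted in the text can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are taken from the table.

```python
# keyword -> (total number of events, number of events whose rank changed)
results = {"Obama": (158, 22), "Earthquake": (197, 30)}

proportions = {k: changed / total for k, (total, changed) in results.items()}
mean_proportion = sum(proportions.values()) / len(proportions)

for keyword, p in proportions.items():
    print(f"{keyword}: {100 * p:.1f}%")
print(f"mean: {100 * mean_proportion:.1f}%")
```

The average here is the unweighted mean of the two per-keyword proportions.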

The present invention also provides a news event extraction system, which can use the news event extraction method described above. In one embodiment, as shown in Fig. 4, the news event extraction system 1 comprises a news sentence acquisition module 11, a timed news processing module 12, and an untimed news processing module 13, wherein:

The news sentence acquisition module 11 is configured to obtain, according to a query word, a set of news sentences containing the query word from a news corpus, and to divide the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time. The query word may be determined according to a news event. In one embodiment, the news sentence acquisition module 11 is further configured to split the content of collected news documents into news sentences and store the news sentences in the news corpus.

The timed news processing module 12 is connected to the news sentence acquisition module 11 and is configured to extract the time from the news sentences containing an exact time; to establish a plurality of timestamp containers for the different times and place news sentences with the same time into the same timestamp container; and, for each timestamp container, to count the frequency of occurrence of each word in its news sentences and establish a corresponding feature vector. In one embodiment, the timestamp container is $V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; $C(q)$ denotes the set of sentences in the news corpus $C$ that match the query word $q$; and $s.t$ denotes the time label of sentence $s$. The feature phrase is $W = \{w_1, \ldots, w_k\}$, where $w_j$ denotes each word in the phrases related to $q$; the feature vector $\vec{W} = (F_{w_1}, \ldots, F_{w_k})$ gives the document term frequency of each word $w_j$, where the document term frequency $F_{w_i}$ is the frequency with which the i-th word occurs in the document and $k$ is the number of words contained in the document.

The untimed news processing module 13 is connected to the timed news processing module 12 and the news sentence acquisition module 11 and is configured to establish from the segmented words of a news sentence not containing an exact time, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container; to calculate the similarity between the phrase vector and the feature vector of the container; and, if the maximum of the calculated similarities is greater than a set threshold, to add the news sentence not containing an exact time to the timestamp container with the highest similarity. The similarity comprises cosine similarity. The threshold can be modified, and the user can modify it according to actual conditions. In one embodiment, the phrase vector is $\vec{s}_i = (a_1, \ldots, a_k)$, the similarity comprises cosine similarity, and the cosine similarity is

$$\mathrm{Similarity}(\vec{s}_i, \vec{W}) = \frac{\sum_{i=1}^{k} a_i \times F_{w_i}}{\left(\sum_{i=1}^{k} a_i^{2}\right)^{1/2} \times \left(\sum_{i=1}^{k} F_{w_i}^{2}\right)^{1/2}}$$

In one embodiment, the news event extraction system 1 further comprises a news event statistics module, configured to count the number of sentences for the different query words in each timestamp container and obtain a ranking of the query words. The news event statistics module is also configured to count, for each timestamp container, the number of sentences in the timestamp container that contain the query word.

In summary, the news event extraction method and system of the present invention can correctly classify sentences that contain no time element, placing sentences that describe a news event but contain no time into the correct timestamp container. This increases the number of extracted news events and improves the accuracy of the event importance ranking. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial utility.

The above embodiments merely illustrate the principle and effects of the present invention and are not intended to limit it. Anyone skilled in the art can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims

1. A news event extraction method, characterized in that the news event extraction method comprises:

obtaining, according to a query word, a set of news sentences containing the query word from a news corpus; dividing the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time;

for the news sentences containing an exact time, extracting the time; establishing a plurality of timestamp containers for the different times, and placing news sentences with the same time into the same timestamp container; for each timestamp container, counting the frequency of occurrence of each word in its news sentences and establishing a corresponding feature vector;

for a news sentence not containing an exact time, establishing from the segmented words of that sentence, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container, and calculating the similarity between the phrase vector and the feature vector of the container; and if the maximum of the calculated similarities is greater than a set threshold, adding the news sentence not containing an exact time to the timestamp container with the highest similarity.

2. The news event extraction method according to claim 1, characterized in that the similarity comprises cosine similarity.

3. The news event extraction method according to claim 1, characterized in that the news event extraction method further comprises: applying the above news event extraction method to different query words, counting the number of sentences for the different query words in each timestamp container, and obtaining a ranking of the query words.

4. The news event extraction method according to claim 1, characterized in that the news event extraction method further comprises: modifying the threshold.

5. The news event extraction method according to claim 1, characterized in that the timestamp container is $V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; $C(q)$ denotes the set of sentences in the news corpus $C$ that match the query word $q$; and $s.t$ denotes the time label of sentence $s$.

6. The news event extraction method according to claim 5, characterized in that the feature phrase is $W = \{w_1, \ldots, w_k\}$, where $w_j$ denotes each word in the phrases related to $q$; the feature vector $\vec{W} = (F_{w_1}, \ldots, F_{w_k})$ gives the document term frequency of each word $w_j$, where the document term frequency $F_{w_i}$ is the frequency with which the i-th word occurs in the document and $k$ is the number of words contained in the document.

7. The news event extraction method according to claim 6, characterized in that the phrase vector is $\vec{s}_i = (a_1, \ldots, a_k)$, the similarity comprises cosine similarity, and the cosine similarity is

$$\mathrm{Similarity}(\vec{s}_i, \vec{W}) = \frac{\sum_{i=1}^{k} a_i \times F_{w_i}}{\left(\sum_{i=1}^{k} a_i^{2}\right)^{1/2} \times \left(\sum_{i=1}^{k} F_{w_i}^{2}\right)^{1/2}}$$

8. A news event extraction system, characterized in that the news event extraction system comprises:

a news sentence acquisition module, configured to obtain, according to a query word, a set of news sentences containing the query word from a news corpus, and to divide the set of news sentences into news sentences containing an exact time and news sentences not containing an exact time;

a timed news processing module, configured to extract the time from the news sentences containing an exact time; to establish a plurality of timestamp containers for the different times and place news sentences with the same time into the same timestamp container; and, for each timestamp container, to count the frequency of occurrence of each word in its news sentences and establish a corresponding feature vector;

an untimed news processing module, configured to establish from the segmented words of a news sentence not containing an exact time, for each timestamp container, a phrase vector with the same dimension as the feature vector of that container; to calculate the similarity between the phrase vector and the feature vector of the container; and, if the maximum of the calculated similarities is greater than a set threshold, to add the news sentence not containing an exact time to the timestamp container with the highest similarity.

9. The news event extraction system according to claim 8, characterized in that the news sentence acquisition module is further configured to split the content of collected news documents into news sentences and store the news sentences in the news corpus.

10. The news event extraction system according to claim 8, characterized in that the news event extraction system further comprises a news event statistics module, configured to count the number of sentences for the different query words in each timestamp container and obtain a ranking of the query words.