AU2018100678A4 - News events extracting method and system

Info

Publication number: AU2018100678A4
Application number: AU2018100678A
Authority: AU
Inventors: Hongzhong CHEN; Zhijun Ding; Changjun JIANG; Yaguang WU; Chungang YAN
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2015-11-05
Filing date: 2018-05-18
Publication date: 2018-06-14
Anticipated expiration: 2024-01-15

Abstract

The present invention provides a news event extraction method and system. The news event extraction method comprises: acquiring a news sentence set containing a query word from a news corpus according to the query word; for news sentences containing accurate time, extracting the time in the news sentences; classifying news sentences with the same time into the same timestamp container; for each timestamp container, statistically collecting the occurrence frequency of each word in the news sentences in the timestamp container, and establishing a corresponding characteristic vector; for each news sentence containing no accurate time, establishing, for each of the different timestamp containers, a phrase vector with the same dimensions as the characteristic vector of that timestamp container, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if the maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into the timestamp container corresponding to the highest similarity. The present invention can correctly classify sentences containing no time element.
FIG. 1 (flowchart of the news event extraction method):
S1: Acquire a news sentence set containing a query word from a news corpus according to the query word; and divide the news sentence set into news sentences containing accurate time and news sentences containing no accurate time.
S2: Extract time in the news sentences containing accurate time; establish a plurality of timestamp containers for different times, and classify news sentences with the same time into the same timestamp container; and for each timestamp container, statistically collect the occurrence frequency of each word in the news sentences in the timestamp container, and establish a corresponding characteristic vector.
S3: For the news sentences containing no accurate time, establish, for each of the different timestamp containers, a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time, and calculate a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, add the news sentence containing no accurate time into the timestamp container corresponding to the highest similarity.

Description

NEWS EVENT EXTRACTION METHOD AND SYSTEM

Background of the Present Invention

Field of Invention

The present invention relates to a data processing technology, in particular to a news event extraction method and system.

Description of Related Arts

News reports are truthful, fresh, important and highly time-sensitive, and they can convey a great amount of information in a short length. Because the Internet is open, online news is heterogeneous, redundant and dynamically changing; information describing the same news is usually scattered across different websites and presented in different patterns. News event extraction technology is one of the most important tools for rapidly and accurately finding the information users need in disordered floods of data. Existing unsupervised-learning news event extraction methods usually discard news sentences containing no time and determine the importance of events according to the frequency of the extracted news events. However, a considerable share of news sentences refer to the latest news by default and contain no specific time, so these news events cannot be extracted by the existing news extraction technologies. As a result, the extraction of major events is easily biased and the accuracy of event importance ranking is decreased.

In view of this, how to include news containing no time during news event extraction to reduce extraction deviation becomes a problem which needs to be urgently solved by one skilled in the art.

Summary of the Present Invention

In view of the above-mentioned disadvantages of the prior art, the purpose of the present invention is to provide a news event extraction method and system, which are used for solving the problem that the accuracy of event importance ranking is not high since news containing no time are not included during news event extraction in the prior art.

In order to realize the above-mentioned purpose and other related purposes, the present invention provides a news event extraction method. The news event extraction method comprises: acquiring a news sentence set containing a query word from a news corpus according to the query word; dividing the news sentence set into news sentences containing accurate time and news sentences containing no accurate time; aiming at the news sentence containing accurate time, extracting time in the news sentences; establishing a plurality of timestamp containers aiming at different time, and classifying news sentences with the same time into the same timestamp container; aiming at each timestamp container, statistically collecting occurrence frequency of each word in the news sentences in the timestamp container, and establishing a corresponding characteristic vector; for the news sentences containing no accurate time, establishing a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time respectively aiming at different timestamp containers, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into a timestamp container corresponding to the highest similarity.

Alternatively, the similarity comprises cosine similarity.

Alternatively, the news event extraction method further comprises: for each timestamp container, statistically collecting a number of sentences containing the query word in the timestamp container.

Alternatively, the news event extraction method further comprises: processing different query words according to the news event extraction method, statistically collecting a number of sentences containing different query words in each timestamp container and obtaining a ranking result of the query words.

Alternatively, the news event extraction method further comprises: modifying the threshold.

Alternatively, a mode of acquiring data in the news corpus comprises: dividing contents of a collected news document into news sentences and storing the news sentences into the news corpus.

Alternatively, the timestamp container is

$V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$,

where $t_i$ is a time variable; C(q) denotes a set of sentences matched with a query word q in a news corpus C; and s.t denotes a time label of a sentence s.

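Alternatively, a characteristic phrase is

$W = (w_1, w_2, \ldots, w_k)$, where $w_j$ denotes each word in sentences related to q; and the characteristic vector is

$\vec{F}_{t_i} = (F_{w_1}, F_{w_2}, \ldots, F_{w_k})$, where $F_{w_j}$ denotes the document word frequency of each word $w_j$, the document word frequency is

$F_{w_j} = N_{w_j} \big/ \sum_{i=1}^{k} N_{w_i}$, where $N_{w_j}$ represents the number of occurrences of the jth word in a document and k denotes the number of words contained in the document.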

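Alternatively, the phrase vector is

$\vec{S} = (S_{w_1}, S_{w_2}, \ldots, S_{w_k})$, the similarity comprises cosine similarity and the cosine similarity is

$\cos(\vec{S}, \vec{F}_{t_i}) = \dfrac{\sum_{j=1}^{k} S_{w_j} F_{w_j}}{\sqrt{\sum_{j=1}^{k} S_{w_j}^{2}}\ \sqrt{\sum_{j=1}^{k} F_{w_j}^{2}}}$, where $\vec{F}_{t_i}$ is the characteristic vector of the timestamp container.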

Alternatively, the query word is determined according to news events.

The present invention further provides a news event extraction system. The news event extraction system comprises: a news sentence acquisition module used for acquiring a news sentence set containing a query word from a news corpus according to the query word; and dividing the news sentence set into news sentences containing accurate time and news sentences containing no accurate time; a time-containing news processing module used for, aiming at the news sentence containing accurate time, extracting time in the news sentences; establishing a plurality of timestamp containers aiming at different time, and classifying news sentences with the same time into the same timestamp container; and aiming at each timestamp container, statistically collecting occurrence frequency of each word in the news sentences in the timestamp container, and establishing a corresponding characteristic vector; and a no-time-containing news processing module used for establishing a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time respectively aiming at different timestamp containers, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into a timestamp container corresponding to the highest similarity.

Alternatively, the news sentence acquisition module is further used for dividing contents of a collected news document into news sentences and storing the news sentences into the news corpus.

Alternatively, the similarity comprises cosine similarity.

Alternatively, the news event extraction system further comprises a news event statistics module used for statistically collecting a number of sentences containing different query words in each timestamp container and obtaining a ranking result of the query words.

As described above, the news event extraction method and system provided by the present invention have the following beneficial effects: (1) the sentences containing no time element can be correctly classified such that the accuracy of news event importance ranking is higher; (2) the number of extracted sentences is increased such that the difference in the importance of different news events is more obvious; and (3) the unrelated sentences are removed by using the timestamp containers such that the interference of other news with the ranking of important news is decreased.

Brief Description of the Drawings

FIG. 1 illustrates a flowchart of one embodiment of the news event extraction method of the present invention.

FIG. 2 illustrates an extraction flowchart of one embodiment of the news event extraction method of the present invention.

FIG. 3 illustrates a flowchart of classifying sentences containing no accurate time according to one embodiment of the news event extraction method of the present invention.

FIG. 4 illustrates a modular schematic diagram of one embodiment of the news event extraction system of the present invention.

Description of component mark numbers:

1 News event extraction system
11 News sentence acquisition module
12 Time-containing news processing module
13 No-time-containing news processing module
S1-S3 Steps

Detailed Description of the Preferred Embodiments

The implementation modes of the present invention will be described below through specific embodiments. One skilled in the art can easily understand other advantages and effects of the present invention according to contents disclosed by the description. The present invention may also be implemented or applied through other different specific implementation modes. Various modifications or changes may also be made to all details in the description based on different points of view and applications without departing from the spirit of the present invention.

It should be noted that the drawings provided in the following embodiments only schematically describe the basic concept of the present invention. They therefore illustrate only the components related to the present invention and are not drawn according to the numbers, shapes and sizes of the components in actual implementation; the configuration, number and scale of each component in actual implementation may be freely changed, and the component layout may be more complex.

The present invention provides a news event extraction method. In one embodiment, as illustrated in FIG. 1, the news event extraction method comprises the following steps:

In step S1, a news sentence set containing a query word is acquired from a news corpus according to the query word; and the news sentence set is divided into news sentences containing accurate time and news sentences containing no accurate time. A mode of acquiring data in the news corpus comprises the following operations: dividing contents of a collected news document into news sentences, and storing the news sentences into the news corpus. The query word may be determined according to news events. A symbol “q” may be used for representing a query word, a symbol “C” may be used for representing a corpus and a symbol “s” may be used for representing a sentence. In one embodiment, the query word may be determined according to events which attract wide attention and are the most reported and mentioned from all aspects.
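Step S1 can be illustrated with a minimal Python sketch. It assumes the corpus is already a list of sentence strings, uses plain substring matching for the query word, and treats a sentence as containing accurate time when a hypothetical regular expression `DATE_RE` finds a full numeric date. None of these choices is prescribed by the description.

```python
import re

# Hypothetical pattern for an "accurate time": a full date such as 2010.10.1 or 2010-10-01.
DATE_RE = re.compile(r"\d{4}[.\-/]\d{1,2}[.\-/]\d{1,2}")

def acquire(corpus, query):
    """Return (dated, undated) sentences of `corpus` that contain `query`."""
    matched = [s for s in corpus if query in s]            # plays the role of C(q)
    dated = [s for s in matched if DATE_RE.search(s)]
    undated = [s for s in matched if not DATE_RE.search(s)]
    return dated, undated

corpus = [
    "An earthquake struck the coast on 2010.10.1.",
    "Rescue teams reached the earthquake zone.",
    "Markets closed higher on 2010.10.1.",
]
dated, undated = acquire(corpus, "earthquake")
print(dated)
print(undated)
```

The third sentence is dropped because it does not contain the query word, even though it carries a date.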

In step S2, for the news sentences containing accurate time, the time in the news sentences is extracted; a plurality of timestamp containers are established for different times, and news sentences with the same time are classified into the same timestamp container; and for each timestamp container, the occurrence frequency of each word in the news sentences in the timestamp container is statistically collected, and a corresponding characteristic vector is established. In one embodiment, the timestamp container is

$V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; C(q) denotes a set of sentences matched with a query word q in a news corpus C; and s.t denotes a time label of a sentence s. A characteristic phrase is

$W = (w_1, w_2, \ldots, w_k)$, where $w_j$ denotes each word in sentences related to q; and the characteristic vector is

$\vec{F}_{t_i} = (F_{w_1}, F_{w_2}, \ldots, F_{w_k})$, where $F_{w_j}$ denotes the document word frequency of each word $w_j$, the document word frequency is

$F_{w_j} = N_{w_j} \big/ \sum_{i=1}^{k} N_{w_i}$, where $N_{w_j}$ represents the number of occurrences of the jth word in a document and k denotes the number of words contained in the document.
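As an illustration of step S2, the following Python sketch groups dated sentences into timestamp containers and builds a word-frequency characteristic vector for one container. Whitespace tokenization stands in for real word segmentation, and dividing each count by the total word count is an assumed normalization; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_containers(dated_sentences):
    """Group (time_label, sentence) pairs so that equal time labels share a container."""
    containers = defaultdict(list)
    for time_label, sentence in dated_sentences:
        containers[time_label].append(sentence)
    return dict(containers)

def characteristic_vector(sentences):
    """Return the characteristic phrase (sorted words) and their relative frequencies."""
    # Assumption: whitespace tokenization stands in for real word segmentation.
    counts = Counter(word for s in sentences for word in s.lower().split())
    words = sorted(counts)
    total = sum(counts.values())
    # Assumption: frequency = occurrences of the word / occurrences of all words.
    return words, [counts[w] / total for w in words]

dated = [
    ("2013-05-12", "quake hits city"),
    ("2013-05-12", "quake damage"),
    ("2013-05-13", "aid arrives"),
]
containers = build_containers(dated)
words, freqs = characteristic_vector(containers["2013-05-12"])
print(words)
print(freqs)
```

The container for 2013-05-12 holds two sentences with five word occurrences in total, so "quake" (two occurrences) receives twice the weight of each other word.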

In step S3, for each news sentence containing no accurate time, a phrase vector with the same dimensions as the characteristic vector of the timestamp container is established for each of the different timestamp containers according to word segmentation of the news sentence, and a similarity between the phrase vector and the characteristic vector of the timestamp container is calculated; and if a maximum value of the calculated similarities is greater than a set threshold, the news sentence containing no accurate time is added into the timestamp container corresponding to the highest similarity. The similarity comprises cosine similarity. The news event extraction method further comprises the following operation: modifying the threshold. The user may modify the threshold according to the actual situation during use. In one embodiment, the phrase vector is

$\vec{S} = (S_{w_1}, S_{w_2}, \ldots, S_{w_k})$, the similarity comprises cosine similarity, and the cosine similarity is

$\cos(\vec{S}, \vec{F}_{t_i}) = \dfrac{\sum_{j=1}^{k} S_{w_j} F_{w_j}}{\sqrt{\sum_{j=1}^{k} S_{w_j}^{2}}\ \sqrt{\sum_{j=1}^{k} F_{w_j}^{2}}}$, where $\vec{F}_{t_i}$ is the characteristic vector of the timestamp container $V_{t_i}$.

A maximum value of the similarities, i.e., the maximum similarity, is

$\mathrm{MaxSimilarity} = \max_{t_i} \cos(\vec{S}, \vec{F}_{t_i})$.

If the maximum value of the calculated similarities is greater than a preset threshold, the news sentence containing no accurate time is added into the timestamp container corresponding to the highest similarity. A sentence which is added into the timestamp container is called an effective sentence, and the effective sentence is

$s_{t_i}$ with $\cos(\vec{S}, \vec{F}_{t_i}) = \mathrm{MaxSimilarity} > d$,

where $s_{t_i}$ denotes that the time of a sentence s is $t_i$, and d is a threshold adjusted according to the actual situation. Each timestamp container may have corresponding effective sentences, and all effective sentences of the timestamp container corresponding to time $t_i$ are called an effective sentence set $S'_{t_i}$.
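The assignment rule of step S3 can be sketched in Python as follows. The sketch assumes that the phrase vector holds the count of each container word in the undated sentence, that words are separated by whitespace, and that each container is represented by its word list and frequency vector; `assign` and `phrase_vector` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def phrase_vector(sentence, words):
    """Assumption: one count per container word, so dimensions match the characteristic vector."""
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in words]

def assign(sentence, vectors, threshold):
    """vectors: {time_label: (words, frequencies)}. Return the best time label or None."""
    best_label, best_sim = None, 0.0
    for label, (words, freqs) in vectors.items():
        sim = cosine(phrase_vector(sentence, words), freqs)
        if sim > best_sim:
            best_label, best_sim = label, sim
    # The sentence is added only if the maximum similarity exceeds the threshold.
    return best_label if best_sim > threshold else None

vectors = {
    "2013-05-12": (["city", "damage", "quake"], [0.25, 0.25, 0.5]),
    "2013-05-13": (["aid", "arrives"], [0.5, 0.5]),
}
print(assign("quake damage reported", vectors, 0.5))
print(assign("stocks rally", vectors, 0.5))
```

In this example the first sentence scores about 0.87 against the 2013-05-12 container and 0 against the other, so it is assigned to 2013-05-12 at a threshold of 0.5 but would stay unassigned at a threshold of 0.9.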

In one embodiment, the news event extraction method further comprises the following operation: for each timestamp container, statistically collecting a number of sentences containing the query word in the timestamp container. In one embodiment, the news event extraction method further comprises the following operations: processing different query words according to the above news event extraction method, statistically collecting a number of sentences containing different query words in each timestamp container and obtaining a ranking result of the query words.
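A minimal Python sketch of this counting and ranking step is given below. It assumes the containers of each query word are available as a dictionary from time label to sentence list and ranks query words by their total sentence count; the description does not fix the ranking criterion in more detail, so this ordering is an assumption.

```python
def count_per_container(containers):
    """Number of sentences in each timestamp container of one query word."""
    return {label: len(sentences) for label, sentences in containers.items()}

def rank_queries(containers_by_query):
    """Rank query words by the total number of sentences across their containers."""
    totals = {
        query: sum(count_per_container(containers).values())
        for query, containers in containers_by_query.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

containers_by_query = {
    "election": {"2013-05-12": ["s5"], "2013-05-14": ["s6"]},
    "earthquake": {"2013-05-12": ["s1", "s2", "s3"], "2013-05-13": ["s4"]},
}
ranking = rank_queries(containers_by_query)
print(ranking)
```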

In one embodiment, an integral framework included in the news event extraction method is as illustrated in FIG. 2, and a process of classifying sentences containing no accurate time is as illustrated in FIG. 3. A processing process thereof comprises the following operations: storing collected news corpora in a database according to titles, time and contents. Then, contents of each document are divided into sentences according to sentence end symbols of Chinese such as “。”, “！” and “？”, and are also stored according to titles, time and contents (sentences). Sentences in a corpus may be divided into the following three classes: 1) sentences containing accurate date: AD (Absolute Date) denotes an expression of time which is complete and accurate to the day, e.g., 2010.10.1 and May 12, 2008, and the time may be directly processed into the form of YYYY-MM-DD; 2) sentences containing date related to release time: DCT-RD (date of creation-relative date) denotes that the sentences themselves do not contain accurate date, but the date may be obtained through semantic analysis according to document release time and may be processed into the form of YYYY-MM-DD; and 3) sentences containing no accurate date: UD (Underspecified Date) denotes that accurate date cannot be obtained and the time cannot be processed into the form of YYYY-MM-DD.
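The sentence-splitting and storage step can be sketched in Python as follows. The record layout (title, time, sentence) mirrors the storage scheme described above, while the field names and the helper functions `split_sentences` and `to_records` are assumptions made for illustration.

```python
import re

# One sentence = a run of non-terminator characters plus an optional Chinese end symbol.
SENTENCE_RE = re.compile(r"[^。！？]+[。！？]?")

def split_sentences(content):
    """Split document content at the Chinese sentence end symbols 。！？."""
    return [s.strip() for s in SENTENCE_RE.findall(content) if s.strip()]

def to_records(title, release_time, content):
    """Build one record per sentence, keeping the title and release time of the document."""
    return [
        {"title": title, "time": release_time, "sentence": s}
        for s in split_sentences(content)
    ]

records = to_records("地震快讯", "2013-05-12", "地震发生了。救援开始了吗？快！")
for record in records:
    print(record["time"], record["sentence"])
```

Keeping the release time on every record matters for the DCT-RD class, where the date of a sentence is derived from the document release time.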

Then, sentence-level corpora are acquired through query words, time in sentences is extracted step by step by adopting the following algorithm and the sentences are classified according to time: (1) a timestamp container $V_0$ for sentences containing no accurate date is established; (2) matching is performed on each sentence $s_i \in C(q)$ by using regex (regular expression of time) to obtain $s_i.t$ ($s_i.t$ denotes the accurate date contained in a sentence $s_i$); if $s_i.t$ does not exist, $s_i$ is matched with R-Words (e.g., “one year ago” and “one week later”) to obtain Date Distance (Date Distance denotes a distance in date from DCT); if Date Distance does not exist, $s_i$ is put into $V_0$; and if Date Distance exists, Date Distance and DCT are calculated to obtain $s_i.t$ (e.g., if a reported date is May 12, 2013, one year ago is May 12, 2012); and (3) if $V_t$ ($V_t$ denotes a timestamp container corresponding to date t) has already existed, $s_i$ is put into $V_t$; and if $V_t$ does not exist, $V_t$ is created and $s_i$ is put into $V_t$.
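The three steps above can be sketched in Python. The regular expression `ABS_DATE`, the small `R_WORDS` table and its day-based offsets are hypothetical stand-ins for the regex and R-Words of the description, and the key `None` plays the role of the container for sentences without an accurate date.

```python
import re
from datetime import date, timedelta

# Hypothetical absolute-date pattern, e.g. 2013.5.10 or 2013-05-10.
ABS_DATE = re.compile(r"(\d{4})[.\-/](\d{1,2})[.\-/](\d{1,2})")
# Hypothetical R-Words table: relative expression -> Date Distance in days from the DCT.
R_WORDS = {"one week ago": -7, "one week later": 7}

def extract_date(sentence, dct):
    """Return the date of a sentence, or None if no accurate date can be derived."""
    match = ABS_DATE.search(sentence)
    if match:
        year, month, day = map(int, match.groups())
        return date(year, month, day)
    lowered = sentence.lower()
    for phrase, offset_days in R_WORDS.items():
        if phrase in lowered:
            return dct + timedelta(days=offset_days)
    return None

def classify_by_time(sentences, dct):
    """Put each sentence into the container of its date; create containers on demand."""
    containers = {None: []}          # None stands for the no-accurate-date container
    for sentence in sentences:
        containers.setdefault(extract_date(sentence, dct), []).append(sentence)
    return containers

dct = date(2013, 5, 12)              # document creation (release) time
containers = classify_by_time(
    [
        "The quake hit on 2013.5.10.",
        "One week ago the alert was raised.",
        "Officials gave no details.",
    ],
    dct,
)
for label, group in containers.items():
    print(label.isoformat() if label else "no accurate date", group)
```

Month- and year-based expressions such as "one year ago" would need calendar-aware arithmetic rather than a fixed number of days, which this sketch leaves out.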

Then, a similarity between each sentence and each characteristic phrase is calculated. There are two purposes in calculating the similarity between each sentence and each characteristic phrase: one is to classify some of the sentences containing no accurate time into the correct timestamp containers, and the other is to remove sentences which are not correlated with the characteristic in each timestamp container. Specific algorithm steps are as follows: (1) word segmentation is performed on all sentences $s_i \in V_t$, the occurrence frequency $F_{w_j}$ of each word $w_j$ is statistically collected and a characteristic vector

$\vec{F}_t = (F_{w_1}, F_{w_2}, \ldots, F_{w_k})$

is established; (2) a vector

$\vec{S}_i = (S_{i,w_1}, S_{i,w_2}, \ldots, S_{i,w_k})$

with k dimensions (which are the same as the dimensions of the characteristic vector of the timestamp container) is established for each sentence $s_i$; (3) the cosine similarity

$\cos(\vec{S}_i, \vec{F}_t) = \dfrac{\sum_{j=1}^{k} S_{i,w_j} F_{w_j}}{\sqrt{\sum_{j=1}^{k} S_{i,w_j}^{2}}\ \sqrt{\sum_{j=1}^{k} F_{w_j}^{2}}}$

between each sentence $s_i \in V_t$ and the characteristic vector $\vec{F}_t$ is calculated; (4) the sentence having the highest similarity to the characteristic vector is found and the similarity is denoted as

$\mathrm{MaxSimilarity} = \max_{s_i \in V_t} \cos(\vec{S}_i, \vec{F}_t)$;

and (5) a threshold d is set; for $s_i \in V_t$, if the similarity of $s_i$ does not reach the threshold d, $s_i$ is removed from $V_t$; and for $s_j \in V_0$, if the similarity of $s_j$ is greater than the threshold d, $s_j$ is put into $V_t$.

In practice, according to experience, when MaxSimilarity is closer to 1, the threshold may be set lower, since the events may be the same event even though the difference between the sentence and the characteristic phrase is relatively great. When MaxSimilarity is far from 1, the threshold needs to be set higher to achieve more accurate removal. The similarity threshold is obtained through repeated tests and manual observation and may be modified according to actual needs.
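The removal and addition of step (5) can be sketched for a single container in Python. The exact threshold test is not recoverable from the text, so the sketch assumes a direct comparison of each cosine similarity with d; whitespace tokenization, count-based sentence vectors and the function name `refine` are also assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def refine(container, undated, d):
    """Drop weakly related sentences from one container and pull in related undated ones."""
    counts = Counter(w for s in container for w in s.lower().split())
    words = sorted(counts)
    total = sum(counts.values())
    feature = [counts[w] / total for w in words]     # characteristic vector of the container

    def similarity(sentence):
        tokens = sentence.lower().split()
        return cosine([tokens.count(w) for w in words], feature)

    kept = [s for s in container if similarity(s) >= d]   # assumed removal test
    added = [s for s in undated if similarity(s) > d]     # assumed addition test
    return kept + added

container = ["quake hits city", "quake damage city", "stocks rally"]
undated = ["city quake response", "weather is mild"]
refined = refine(container, undated, 0.5)
print(refined)
```

With d = 0.5, "stocks rally" (similarity about 0.41) is removed, while the undated sentence "city quake response" (about 0.82) is added.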

Finally, the number of sentences is statistically collected. Ranking is performed according to the number of sentences corresponding to the query words, such that the events corresponding to the query words are ranked according to importance and a time axis is presented. For example, documents with texts containing “奥巴马” (Obama) in a database are searched to obtain 6418 records. Sentence segmentation is performed on these records to obtain a total of 20468 different sentences containing “奥巴马” (Obama). Further, time in the sentences is extracted to obtain 3209 sentences containing accurate time. The time of the 3209 sentences is compared to finally obtain 158 different timestamps, and these sentences are put into corresponding timestamp containers. For a keyword “地震” (earthquake), the same operations are performed to obtain the results shown in the following table. Accordingly, it can be seen that, after sentences containing time elements are correctly classified, on average about 14.6% of sentences can be correctly ranked.

The present invention further provides a news event extraction system. The news event extraction system may adopt the above-mentioned news event extraction method. In one embodiment, as illustrated in FIG. 4, the news event extraction system 1 comprises a news sentence acquisition module 11, a time-containing news processing module 12 and a no-time-containing news processing module 13, wherein:

The news sentence acquisition module 11 is used for acquiring a news sentence set containing a query word from a news corpus according to the query word; and dividing the news sentence set into news sentences containing accurate time and news sentences containing no accurate time. A mode of acquiring data in the news corpus comprises the following operations: dividing contents of a collected news document into news sentences and storing the news sentences into the news corpus. The query word may be determined according to news events. In one embodiment, the news sentence acquisition module 11 is further used for dividing contents of a collected news document into news sentences and storing the news sentences into the news corpus.

The time-containing news processing module 12 is connected with the news sentence acquisition module 11 and is used for extracting time from the news sentences containing accurate time; establishing a plurality of timestamp containers for different times, and classifying news sentences with the same time into the same timestamp container; and statistically collecting the occurrence frequency of each word in the news sentences in each timestamp container, and establishing a corresponding characteristic vector. In one embodiment, the timestamp container is

$V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; C(q) denotes a set of sentences matched with a query word q in a news corpus C; and s.t denotes a time label of a sentence s. A characteristic phrase is

$W = (w_1, w_2, \ldots, w_k)$, where $w_j$ denotes each word in sentences related to q; the characteristic vector is $\vec{F}_{t_i} = (F_{w_1}, F_{w_2}, \ldots, F_{w_k})$, where $F_{w_j}$ denotes the document word frequency of each word $w_j$; and the document word frequency is $F_{w_j} = N_{w_j} \big/ \sum_{i=1}^{k} N_{w_i}$, where $N_{w_j}$ represents the number of occurrences of the jth word in a document, and k denotes the number of words contained in the document.

The no-time-containing news processing module 13 is connected with the time-containing news processing module 12 and the news sentence acquisition module 11, and is used for establishing, for each of the different timestamp containers, a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into the timestamp container corresponding to the highest similarity. The similarity comprises cosine similarity. The news event extraction method further comprises the following operation: modifying the threshold. The user may modify the threshold according to the actual situation during use. In one embodiment, the phrase vector is $\vec{S} = (S_{w_1}, S_{w_2}, \ldots, S_{w_k})$, the similarity comprises cosine similarity and the cosine similarity is

$\cos(\vec{S}, \vec{F}_{t_i}) = \dfrac{\sum_{j=1}^{k} S_{w_j} F_{w_j}}{\sqrt{\sum_{j=1}^{k} S_{w_j}^{2}}\ \sqrt{\sum_{j=1}^{k} F_{w_j}^{2}}}$, where $\vec{F}_{t_i}$ is the characteristic vector of the timestamp container.

In one embodiment, the news event extraction system 1 further comprises a news event statistics module used for statistically collecting a number of sentences containing different query words in each timestamp container and obtaining a ranking result of the query words. The news event statistics module is further used for statistically collecting a number of sentences containing the query word in each timestamp container.

To sum up, the news event extraction method and system provided by the present invention can correctly classify sentences containing no time element and put sentences which do not contain time but themselves express news events into correct timestamp containers, whereby the number of extracted news events is increased and the accuracy of event importance ranking is improved. Therefore, the present invention effectively overcomes various disadvantages in the prior art and thus has a great industrial utilization value.

The above-mentioned embodiments are just used for exemplarily describing the principle and effects of the present invention instead of limiting the present invention. One skilled in the art may make modifications or changes to the above-mentioned embodiments without departing from the spirit and the scope of the present invention. Therefore, all equivalent modifications or changes made by those who have common knowledge in the art without departing from the spirit and technical concept disclosed by the present invention shall be still covered by the claims of the present invention.

Claims

What is claimed is:

1. A news event extraction method, characterized in that the method comprises: acquiring a news sentence set containing a query word from a news corpus according to the query word; and dividing the news sentence set into news sentences containing accurate time and news sentences containing no accurate time; for the news sentence containing accurate time, extracting time in the news sentences; establishing a plurality of timestamp containers for different time, and classifying news sentences with the same time into the same timestamp container; and for each timestamp container, statistically collecting occurrence frequency of each word in the news sentences in the timestamp container, and establishing a corresponding characteristic vector; and for the news sentences containing no accurate time, establishing a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time respectively aiming at different timestamp containers, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into a timestamp container corresponding to the highest similarity.
2. The news event extraction method according to claim 1, characterized in that the similarity comprises cosine similarity.
3. The news event extraction method according to claim 1, characterized in that the method further comprises: processing different query words according to the news event extraction method, statistically collecting a number of sentences corresponding to different query words in each timestamp container and obtaining a ranking result of the query words.
4. The news event extraction method according to claim 1, characterized in that the method further comprises: modifying the threshold.
5. The news event extraction method according to claim 1, characterized in that the timestamp container is

$V_{t_i} = \{\, s \mid s \in C(q),\ s.t = t_i \,\}$, where $t_i$ is a time variable; C(q) denotes a set of sentences matched with a query word q in a news corpus C; and s.t denotes a time label of a sentence s.
6. The news event extraction method according to claim 5, characterized in that the characteristic phrase is

$W = (w_1, w_2, \ldots, w_k)$, where $w_j$ denotes each word in sentences related to q; and the characteristic vector is

$\vec{F}_{t_i} = (F_{w_1}, F_{w_2}, \ldots, F_{w_k})$, where $F_{w_j}$ denotes the document word frequency of each word $w_j$, the document word frequency is

$F_{w_j} = N_{w_j} \big/ \sum_{i=1}^{k} N_{w_i}$, where $N_{w_j}$ represents the number of occurrences of the jth word in a document and k denotes the number of words contained in the document.
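7. The news event extraction method according to claim 6, characterized in that the phrase vector is

$\vec{S} = (S_{w_1}, S_{w_2}, \ldots, S_{w_k})$, the similarity comprises cosine similarity and the cosine similarity is $\cos(\vec{S}, \vec{F}_{t_i}) = \dfrac{\sum_{j=1}^{k} S_{w_j} F_{w_j}}{\sqrt{\sum_{j=1}^{k} S_{w_j}^{2}}\ \sqrt{\sum_{j=1}^{k} F_{w_j}^{2}}}$.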
8. A news event extraction system, characterized in that the system comprises: a news sentence acquisition module used for acquiring a news sentence set containing a query word from a news corpus according to the query word; and dividing the news sentence set into news sentences containing accurate time and news sentences containing no accurate time; a time-containing news processing module used for extracting time in the news sentences from news sentence containing accurate time; establishing a plurality of timestamp containers aiming at different time, and classifying news sentences with the same time into the same timestamp container; and aiming at each timestamp container, statistically collecting occurrence frequency of each word in the news sentences in the timestamp container, and establishing a corresponding characteristic vector; and a no-time-containing news processing module used for establishing a phrase vector with the same dimensions as the characteristic vector of the timestamp container according to word segmentation of the news sentence containing no accurate time respectively aiming at different timestamp containers, and calculating a similarity between the phrase vector and the characteristic vector of the timestamp container; and if a maximum value of the calculated similarities is greater than a set threshold, adding the news sentence containing no accurate time into a timestamp container corresponding to the highest similarity.
9. The news event extraction system according to claim 8, characterized in that the news sentence acquisition module is further used for dividing contents of a collected news document into news sentences and storing the news sentences into the news corpus.
10. The news event extraction system according to claim 8, characterized in that the news event extraction system further comprises a news event statistics module used for statistically collecting a number of sentences containing different query words in each timestamp container and obtaining a ranking result of the query words.